I have 2 clusters (1 has 3 hosts, 1 has 2 hosts) running ESX 3.0.2. Several times a day, I receive the following error:
HA agent has an error
HA always recovers, but I'd like to know what is causing the error. Here's my /opt/LGTOaam512/log/agent/autoRecover.log:
Running auto-recover script.
Backbone has failed. Restarting Agent.
Legato Automated Availability Manager startup script.
Setting environment from /opt/LGTOaam512/config/agent_env.Linux
Starting agent for domain vmware
Backbone started successfully.
Agent started successfully.
Has anyone seen this issue before? Do you know what causes this? Is it something I should be concerned with?
Check that you have no file-locking contention in the vmkernel log. We had a similar issue with a particular VM that was running on a different host in the cluster. One host was trying to put a lock on the vswap file for that VM, and HA kept dropping out.
Migrate the images off the host and give it a reboot.
This is occurring on all 5 hosts. No entry in vmkernel.log leads me to believe it is a locking problem. I will try as you suggested, reboot all the hosts, and see if it continues.
Do the hosts in each cluster completely mirror each other for datastores and Networks/Port groups?
We've had errors like this when 1 host had an extra network or different volume names. It has also happened when we had a host in the cluster that could see a SAN disk that the other 2 could not.
In all cases, resolving the discrepancies removed the errors.
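If it helps, here is a minimal sketch of that comparison in Python. It assumes you have already collected each host's port group or datastore names by hand (from the VI Client, for example); the function name `find_mismatches` and the sample inventory are made up for illustration and are not part of any VMware tool. Names are compared exactly, so a difference in case or spelling shows up as a mismatch.

```python
def find_mismatches(inventory):
    """Map each item that is not on every host to the hosts that lack it."""
    hosts = set(inventory)
    mismatches = {}
    # Walk over every name seen on any host in the cluster.
    for item in set().union(*inventory.values()):
        lacking = sorted(h for h in hosts if item not in inventory[h])
        if lacking:
            mismatches[item] = lacking
    return mismatches


# Hypothetical inventory: esx3 has one port group the other hosts do not.
port_groups = {
    "esx1": {"Service Console", "VM Network"},
    "esx2": {"Service Console", "VM Network"},
    "esx3": {"Service Console", "VM Network", "DMZ"},
}
print(find_mismatches(port_groups))  # {'DMZ': ['esx1', 'esx2']}
```

An empty result means every host in the cluster reports the same set of names. Run it once per cluster and once per resource type (networks, datastores).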
Have you checked the physical switch the service consoles connect to? It may be suffering network glitches. Have you recently updated to 3.0.2? If so, was it all working prior to the update? This version introduced the uppercase issue with the hosts file; the Update 1 patch for 3.0.2 resolved it.
When you originally added your hosts to VirtualCenter, did you use fully qualified domain names? Have you verified your DNS is working for every host?
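One rough way to verify DNS is to resolve each FQDN forward and then back again and confirm the name round-trips. The sketch below uses only Python's standard `socket` module and assumes it is run with Python 3 from a management workstation that uses the same DNS servers as the hosts, not from the service console. The function name `check_dns` is hypothetical. The comparison is deliberately exact, so a name that comes back in a different case is reported.

```python
import socket


def check_dns(fqdn, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Return a list of problems found resolving fqdn forward and back."""
    try:
        ip = forward(fqdn)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return [f"{fqdn}: forward lookup failed ({exc})"]
    try:
        name = reverse(ip)[0]  # gethostbyaddr returns (name, aliases, addresses)
    except OSError as exc:  # socket.herror is a subclass of OSError
        return [f"{fqdn}: reverse lookup of {ip} failed ({exc})"]
    if name != fqdn:
        return [f"{fqdn}: reverse lookup of {ip} returned {name!r}"]
    return []
```

Calling `check_dns("esx1.domain.com")` for each host returns an empty list when forward and reverse lookups agree. The `forward` and `reverse` parameters exist only so the lookups can be swapped out for testing.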
No issues on the pSwitches. My SC is running on a pair of teamed vNics connected to 2 separate pSwitches. I'm not aware of the uppercase issue in the hosts file. My hosts file contains all lowercase entries.
Yes, when I added the hosts to VC, I used FQDNs. DNS is working on all hosts. I've also added all partner hosts to the /etc/hosts files on each host. Below is an example of one of my hosts files:
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
10.xxx.xxx.xxx esx1.domain.com esx1
10.xxx.xxx.xxx esx2.domain.com esx2
I'm not sure if this was an issue in 3.0.1. I just started to notice the errors.
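For anyone who wants to check a hosts file for the uppercase issue mentioned earlier in the thread without eyeballing every host, here is a small Python sketch. It assumes you have copied the file contents to a workstation with Python 3; `check_hosts_file` is a made-up helper name, and the sample below is the file above with one entry changed to uppercase purely to show what a warning looks like.

```python
def check_hosts_file(text):
    """Return warnings for host names that contain uppercase letters."""
    warnings = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        if not names:
            warnings.append(f"line {lineno}: {address} has no host name")
            continue
        for name in names:
            if name != name.lower():
                warnings.append(f"line {lineno}: {name} contains uppercase letters")
    return warnings


# Hypothetical sample: the esx2 FQDN is uppercased on purpose.
sample = """\
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
10.xxx.xxx.xxx esx1.domain.com esx1
10.xxx.xxx.xxx ESX2.domain.com esx2
"""
print(check_hosts_file(sample))
# ['line 5: ESX2.domain.com contains uppercase letters']
```

An all-lowercase file like the one posted above produces an empty list.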