HA agent has an error

ldornak · ‎11-08-2007

I have 2 clusters (1 has 3 hosts, 1 has 2 hosts) running ESX 3.0.2. Several times a day, I receive the following error: "HA agent has an error".

HA always recovers, but I'd like to know what is causing the error. Here's my /opt/LGTOaam512/log/agent/autoRecover.log:

Running auto-recover script.
Backbone has failed. Restarting Agent.
Legato Automated Availability Manager startup script.
Setting environment from /opt/LGTOaam512/config/agent_env.Linux
Starting agent for domain vmware
Starting Backbone...
..
Backbone started successfully.
Starting Agent...
Agent started successfully.
Complete.
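
To see how often the agent is actually being restarted, you could count the failure marker in that log. Here is a minimal Python sketch, assuming every restart writes the exact line "Backbone has failed. Restarting Agent." as in the excerpt above. The function name and the sample text are made up for illustration.

```python
# Assumption: each auto-recover run logs this exact line once per restart.
FAILURE_MARKER = "Backbone has failed. Restarting Agent."

# Shortened copy of the excerpt above, used only as sample input.
SAMPLE_LOG = """Running auto-recover script.
Backbone has failed. Restarting Agent.
Starting Backbone...
Backbone started successfully.
Starting Agent...
Agent started successfully.
Complete.
"""


def count_agent_restarts(log_text: str, marker: str = FAILURE_MARKER) -> int:
    """Return the number of lines in log_text that equal the failure marker."""
    return sum(1 for line in log_text.splitlines() if line.strip() == marker)


print(count_agent_restarts(SAMPLE_LOG))
```

Running it against a copy of autoRecover.log from each host would show whether all hosts restart at a similar rate.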

Has anyone seen this issue before? Do you know what causes this? Is it something I should be concerned with?

Natsidan · ‎11-09-2007

Check that you have no file locking contention in the vmkernel log. We had a similar issue where a particular VM was running on a different host in the cluster. One host was trying to put a lock on the vswap file for that VM, and HA kept dropping out.

Migrate the images off and give the host a reboot.

ldornak · ‎11-09-2007

This is occurring on all 5 hosts. There is no entry in the vmkernel log that leads me to believe this is a locking problem. I will do as you suggested, reboot all the hosts, and see if it continues.

ldornak · ‎11-09-2007

Rebooting all hosts did not resolve the issue. I'm still receiving the "HA agent has an error" message. This happens 2-3 times a day on all my hosts.

TCronin · ‎11-09-2007

Do the hosts in each cluster completely mirror each other for datastores and Networks/Port groups?

We've had errors like this when 1 host had an extra network or different volume names. It has also happened when we had a host in the cluster that could see a SAN disk that the other 2 could not.

In all cases, resolving the discrepancies removed the errors.
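
If you want to check this mechanically rather than by eye, one option is to list the datastore and port group names each host sees and diff them. A rough Python sketch follows. It assumes you have already collected the names per host by hand or from your own tooling. The host names and network labels in the example are hypothetical.

```python
from typing import Dict, Set


def find_discrepancies(inventory: Dict[str, Set[str]]) -> Dict[str, Dict[str, Set[str]]]:
    """Compare the names each host sees against the rest of the cluster.

    For every host that differs, report:
      "missing": names some other host has but this host lacks
      "extra":   names only this host has
    """
    if len(inventory) < 2:
        return {}  # nothing to compare against

    all_names = set().union(*inventory.values())
    report = {}
    for host, names in inventory.items():
        others = set().union(*(n for h, n in inventory.items() if h != host))
        missing = all_names - names
        extra = names - others
        if missing or extra:
            report[host] = {"missing": missing, "extra": extra}
    return report


# Hypothetical example: esx2 has one port group the other host lacks.
example = {
    "esx1": {"Service Console", "VM Network"},
    "esx2": {"Service Console", "VM Network", "Test Network"},
}
print(find_discrepancies(example))
```

An empty result means every host reports the same set of names. Anything else points at the host to fix.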

Tom Cronin, VCP, VMware vExpert 2009 - 2021, Co-Leader Buffalo, NY VMUG

ldornak · ‎11-09-2007

Yes, each host in each cluster is identical to its partner host(s). I verified twice.

Natsidan · ‎11-09-2007

Have you checked the physical switch the service consoles connect to? It may be suffering network glitches. Have you recently updated to 3.0.2? If so, was it all working prior to the update? This version introduced the uppercase issue with the hosts file; patch Update 1 for 3.0.2 resolved this issue.

TCronin · ‎11-09-2007

When you originally added your hosts to VirtualCenter, did you use fully qualified domain names? Have you verified your DNS is working for every host?
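
One way to verify DNS for every host is to do a forward lookup and then a reverse lookup on the result, and check that you get the same name back. Below is a small Python sketch using the standard `socket` module. The function name is made up, and flagging case-only differences is my assumption, not a documented HA requirement.

```python
import socket
from typing import Callable, Iterable, List, Tuple


def check_forward_reverse(
    hosts: Iterable[str],
    resolve: Callable[[str], str] = socket.gethostbyname,
    reverse: Callable[[str], tuple] = socket.gethostbyaddr,
) -> List[Tuple[str, str]]:
    """Return (hostname, problem) pairs for names that fail the round trip."""
    problems = []
    for name in hosts:
        try:
            ip = resolve(name)  # forward lookup: name -> IP
        except OSError as exc:
            problems.append((name, f"forward lookup failed: {exc}"))
            continue
        try:
            back = reverse(ip)[0]  # reverse lookup: IP -> primary name
        except OSError as exc:
            problems.append((name, f"reverse lookup failed for {ip}: {exc}"))
            continue
        if back == name:
            continue
        if back.lower() == name.lower():
            problems.append((name, f"case differs: {ip} resolves back to {back}"))
        else:
            problems.append((name, f"{ip} resolves back to {back}"))
    return problems
```

Call it with the FQDNs exactly as they were entered when the hosts were added, for example `check_forward_reverse(["esx1.domain.com", "esx2.domain.com"])`, from each machine that needs to resolve them. The `resolve` and `reverse` parameters exist only so the check can be exercised without a live DNS server.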


ldornak · ‎11-09-2007

No issues on the pSwitches. My SC is running on a pair of teamed vNics connected to 2 separate pSwitches. I'm not aware of the uppercase issue in the hosts file. My hosts file contains all lowercase entries.

ldornak · ‎11-09-2007

Yes, when I added the hosts to VC, I used FQDNs. DNS is working on all hosts. I've also added all partner hosts to the /etc/hosts files on each host. Below is an example of one of my hosts files:

# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
10.xxx.xxx.xxx esx1.domain.com esx1
10.xxx.xxx.xxx esx2.domain.com esx2

I'm not sure if this was an issue in 3.0.1. I just started to notice the errors.
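
Since the uppercase issue came up, here is a quick Python sketch that scans hosts-file text for any name containing uppercase characters. It assumes the usual layout of an address followed by one or more names, with `#` starting a comment. The function name is hypothetical.

```python
from typing import List, Tuple

# The hosts file from the post above, with the masked addresses left as posted.
SAMPLE_HOSTS = """# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
10.xxx.xxx.xxx esx1.domain.com esx1
10.xxx.xxx.xxx esx2.domain.com esx2
"""


def find_uppercase_entries(hosts_text: str) -> List[Tuple[int, str]]:
    """Return (line number, name) for every host name that is not all lowercase."""
    flagged = []
    for lineno, raw in enumerate(hosts_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        for name in fields[1:]:  # fields[0] is the address
            if name != name.lower():
                flagged.append((lineno, name))
    return flagged


print(find_uppercase_entries(SAMPLE_HOSTS))
```

For the file as posted this finds nothing, which matches the all-lowercase entries described earlier in the thread.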
