VMware Cloud Community
titn003
Contributor

HA agent on server.domain in cluster has an error

"HA agent on server.domain in cluster has an error."

What does this mean, and how do I troubleshoot this error message to fix it?

0 Kudos
22 Replies
conyards
Expert

To be honest, there is probably no real fault, just a heartbeat error. The logs can be found at /opt/LGTOaam512/log. Right-click the host in question and select 'Reconfigure for HA'; that should resolve the problem.

https://virtual-simon.co.uk/
0 Kudos
titn003
Contributor

I cannot see the option you mention:

"Right mouse click the host in question and 'reconfigure for HA'"

I am running ESX 3.0.1 and VC 2.

Please help.

0 Kudos
conyards
Expert

I meant it literally: right-click the host that has the error within your HA cluster, and from that context menu select 'Reconfigure for HA'. It's the bottom option on that menu.

https://virtual-simon.co.uk/
0 Kudos
mkopenski
Contributor

I am having this issue repeatedly with one host in the cluster. It keeps going into "Insufficient resources to satisfy HA failover level on cluster".

Just before that it says "HA agent on ***** in Cluster **** in ***** has an error."

Is there more information in other logs where I can find the answer?

0 Kudos
conyards
Expert

The logs are located at:

/opt/LGTOaam512/log

on the ESX host.
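
For anyone who wants to skim that directory quickly, here is a minimal sketch that lists every line mentioning an error or failure. It assumes the log files have been copied to a machine with Python 3; the search pattern (`error|fail`) and the function name `find_error_lines` are illustrative choices, not anything defined by the HA agent, and the real log wording may differ.

```python
import os
import re


def find_error_lines(log_dir, pattern=r"error|fail"):
    """Return (filename, line_number, text) for each matching line in log_dir.

    The pattern is a guess at useful keywords, not a documented log format.
    """
    matcher = re.compile(pattern, re.IGNORECASE)
    hits = []
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        # errors="replace" keeps the scan going if a log has odd bytes in it
        with open(path, "r", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if matcher.search(line):
                    hits.append((name, number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    log_dir = "/opt/LGTOaam512/log"  # path given in this thread
    if os.path.isdir(log_dir):
        for name, number, text in find_error_lines(log_dir):
            print(f"{name}:{number}: {text}")
    else:
        print(f"{log_dir} not found")
```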

https://virtual-simon.co.uk/
0 Kudos
kegwell
Enthusiast

If I had to guess without looking at your logs, I'd say it was a DNS issue. HA requires your ESX hosts to have a FQDN registered with DNS. If you don't have access to a DNS server, you can manually add the entries in /etc/hosts. You will need to include an entry for each ESX host in the HA cluster on each ESX host in the HA cluster.
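
As a rough illustration of the /etc/hosts approach, the sketch below prints one line per ESX host in the usual `IP  FQDN  alias` layout, so the same block can be pasted onto every host in the cluster. The IP addresses and host names are hypothetical placeholders, and the short-name alias column is an addition (a later reply in this thread reports that short names helped).

```python
def hosts_entries(hosts):
    """Build /etc/hosts lines from (ip, fqdn) pairs.

    Each line carries the FQDN plus its short name as an alias.
    """
    lines = []
    for ip, fqdn in hosts:
        short = fqdn.split(".", 1)[0]  # text before the first dot
        lines.append(f"{ip}\t{fqdn}\t{short}")
    return lines


if __name__ == "__main__":
    # Hypothetical cluster members; replace with your own hosts.
    cluster = [
        ("192.0.2.11", "esx01.example.com"),
        ("192.0.2.12", "esx02.example.com"),
        ("192.0.2.13", "esx03.example.com"),
    ]
    print("\n".join(hosts_entries(cluster)))
```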

0 Kudos
MrPhoenix
Contributor

Hi!

I have the same error on one of my ESX 3.0.1 hosts.

The ESX servers have working DNS names and working DNS resolution.

According to "aam_config_util_addnode.log", the HA agent can't use its TCP ports. A few lines below this error message, it states that the ports are free...

Rebooting the ESX host doesn't help. Dissolving the HA cluster doesn't help either.

So, what to do now? Any ideas?

Regards,

Philipp

0 Kudos
mkopenski
Contributor

What worked for me was adding entries in /etc/hosts for the short names of each ESX server, then dissolving HA and re-enabling it. I have not had the issue in a couple of days now.
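
To confirm that both the short names and the FQDNs actually resolve after editing /etc/hosts, something like the following sketch could be run on a machine with Python 3. The helper names (`names_to_check`, `check_resolution`) and any host names passed in are hypothetical; the only library call involved is the standard `socket.gethostbyname`.

```python
import socket


def names_to_check(fqdns):
    """Return each FQDN followed by its short name (text before the first dot)."""
    names = []
    for fqdn in fqdns:
        names.append(fqdn)
        short = fqdn.split(".", 1)[0]
        if short != fqdn:
            names.append(short)
    return names


def check_resolution(names, resolver=socket.gethostbyname):
    """Map each name to its resolved address, or None if the lookup fails."""
    results = {}
    for name in names:
        try:
            results[name] = resolver(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = None
    return results
```

Calling `check_resolution(names_to_check(["esx01.example.com"]))` on each host and looking for `None` values would show which names are still missing there.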

0 Kudos
MrPhoenix
Contributor

Wow... a "couple of days now".

Shouldn't HA be a solution for redundancy, to protect the virtual machines from going offline in case of a hardware or ESX error? ;)

Thanks for the tip about the entries in the hosts file!

Regards,

Philipp

0 Kudos
mkopenski
Contributor

Actually only one ESX server was showing errors with HA and would come back on its own, the other 3 in the cluster were fine. Not sure why the fourth one was doing it, but adding the host entries seems to work.

0 Kudos
kegwell
Enthusiast

"What worked for me was, adding entries in /etc/hosts for the short names of each ESX server, dissolving HA and re-enabling. Have not had the issue in a couple of days now."

Exactly...

0 Kudos
pmorrison
Enthusiast

This is throwing me for a loop. I have 5 nodes in a cluster and all are working well except one.

I have added FQDNs, short names, and IP addresses to the /etc/hosts file. I have looked at the logs and found much of the same information you guys have already talked about. I am hesitant to disable HA and DRS since this is a production environment, and I would like to find a root cause before I make a change.

Has anyone heard from VMware as to what is causing this?

0 Kudos
pmorrison
Enthusiast

Here is a weird one: I added a 6th host to the cluster, and since then host 5 has not had a single HA error...

0 Kudos
postfixreload
Hot Shot

/opt/LGTOaam512/config/vmware-sites

This file lists all your primary hosts.

/opt/LGTOaam512/log/aam_config_util_listnodes

This file shows how many hosts are in your cluster, which ones are primaries, and whether the agent is running.

/opt/LGTOaam512/log/aam_config_util_addnodes

This file shows whether any host had a problem being added (the bad thing about it is that you won't see a timestamp).

HA uses the FQDN to connect to a host; however, it then discards the domain name and tries to use the short name to communicate between servers. Only primary hosts have a DB with all the HA cluster info in it. When you add a new host to an HA cluster, it is added as a secondary (except the first node). It then looks for a primary host in the same HA domain (always "vmware" for HA). The first 5 hosts will be primary hosts (1 added as primary and the other 4 promoted to primary). When you add a host that cannot talk to a primary host, the agent will fail.
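
The role assignment described above can be pictured with a toy model. This is only a sketch of the behaviour as explained in this post, not the actual agent logic: the `add_host` function, the `can_reach` callback, and the host names are all hypothetical, and the model records a host's final role rather than the add-as-secondary-then-promote sequence.

```python
MAX_PRIMARIES = 5  # per the post: the first 5 hosts become primaries


def add_host(cluster, name, can_reach):
    """Add a host to the toy cluster and return its role.

    cluster   -- dict mapping host name to "primary" or "secondary"
    can_reach -- callable(new_host, primary) -> bool, a stand-in for
                 short-name connectivity between the two hosts
    Returns "primary", "secondary", or "error" (host is not added).
    """
    primaries = [h for h, role in cluster.items() if role == "primary"]
    if not primaries:
        cluster[name] = "primary"  # the first node is added as primary
        return "primary"
    if not any(can_reach(name, p) for p in primaries):
        return "error"  # no primary reachable, so the agent fails
    role = "primary" if len(primaries) < MAX_PRIMARIES else "secondary"
    cluster[name] = role
    return role


if __name__ == "__main__":
    cluster = {}
    for i in range(1, 7):
        host = f"esx{i:02d}"
        print(host, add_host(cluster, host, lambda new, primary: True))
```

In this model a host whose short-name lookups fail would be represented by a `can_reach` that returns False, which is one way to read why the /etc/hosts fixes in this thread help.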

0 Kudos
admin
Immortal

Was it always that host that was having problems, or is it any one of the 4 hosts?

0 Kudos
msimpkin
Contributor

Another thing to note:

Make sure you are using NTP and that you have allowed the NTP client through the firewall.

0 Kudos
admin
Immortal

One thing I ran across that might affect service console networking, and thus have a negative impact on HA, is having "Serial over LAN" enabled in your environment:

http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=1627&slice...

0 Kudos
m_d_sella
Enthusiast

I have recently seen a recurrence of this problem in our environment. I have 4 hosts (ESX 3.0.1) running in a cluster with DRS and HA enabled. When I first configured the cluster, about 2 months ago, I started seeing frequent "Operation Timed Out" messages and the HA agent errors. I remedied the problem by making the appropriate entries in the /etc/hosts file to allow FQDN and short-name resolution to all hosts and the VC server. Up until now, I had not seen a single error in this regard.

I recently applied the new ESX patches (released 5/15/07) to one of my hosts. Since then, that host has been indicating HA agent errors and is seeing the timeout issues frequently. The HA error mysteriously disappears for periods of time, without any intervention, and then returns. There is no noticeable pattern to this, and the time with and without the error varies significantly.

All other hosts in the farm are fine, and the "Operation Timed Out" messages, which are most often seen when using VMotion, usually appear while the HA agent error messages are present. Has anyone else seen this problem after applying the new patches?

Thanks in advance,

Mike

0 Kudos
walchst
Contributor

You've precisely described my exact situation, though I've updated to 3.0.1 Update 1 plus hotfixes. Dissolving HA, restarting the host, etc. has not made any difference. Alarms still occur sporadically on the one particular host.

0 Kudos