VMware Cloud Community
kreisbote
Contributor

Getting HA error on one ESX node (no clue why)

Hello,

For the past few days, one of the ESX nodes in my HA cluster has been logging:

"HA agent has an error"

I tried reconfiguring HA, but this seems to "fix" the problem for only a few hours.

Where should I start looking for the cause?

How can I debug HA?

DNS seems fine; all servers resolve correctly, both forward and reverse.
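
For anyone wanting to repeat the forward and reverse check, here is a minimal sketch in Python. It assumes Python 3 on a workstation that uses the same DNS servers as the hosts. The function name `check_forward_reverse` is made up for this example, and the resolver functions are passed in as parameters only so the logic can be tried without a network.

```python
import socket


def check_forward_reverse(hostname,
                          forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Return (ok, detail) for a forward lookup followed by a reverse lookup."""
    try:
        ip = forward(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return False, f"forward lookup failed: {exc}"
    try:
        primary, aliases, _addresses = reverse(ip)
    except OSError as exc:  # socket.herror is a subclass of OSError
        return False, f"reverse lookup of {ip} failed: {exc}"

    # Accept a match on either the full name or the short (first label) name.
    names = {primary.lower(), *(alias.lower() for alias in aliases)}
    short_names = {name.split(".")[0] for name in names}
    wanted = hostname.lower()
    if wanted in names or wanted.split(".")[0] in short_names:
        return True, f"{hostname} -> {ip} -> {primary}"
    return False, f"{hostname} -> {ip} -> {primary} (name mismatch)"
```

Calling `check_forward_reverse("esx1")` for each host name in the cluster returns a flag plus a one-line description of what was resolved. It only tests the resolver of the machine it runs on, so a clean result here does not prove that every ESX host sees the same answers.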

9 Replies
dblake15
Enthusiast

How many hosts are in the cluster? Are they all on the same version of VI3? Is there any info in the Tasks & Events of that host?

You can also try enabling "Allow Virtual Machines to be powered on even if they violate availability constraints" under the HA settings, just to see whether that gives you a different error or works.

kreisbote
Contributor

There are four hosts in the cluster, "esx1" through "esx4".

All are the same version.

There is no further info in the events (as seen in VirtualCenter).

"Allow Virtual Machines..." is already checked.

dblake15
Enthusiast

Do you see any errors on the hosts themselves? Does it actually let you configure HA and then give you the error afterwards, or does it not let you configure HA at all?

jdaunt
Enthusiast

You can check in /opt/LGTOaam512/log/ and see if you can spot any errors in the logs.
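
If the directory holds a lot of files, a small script can narrow things down. The following is only a sketch in Python 3, assuming the logs have been copied to a machine where Python 3 is available. The keywords "error" and "failed" are guesses at what is worth looking for, not an official list, and `find_suspect_lines` is a name invented for this example.

```python
from pathlib import Path

KEYWORDS = ("error", "failed")  # assumed search terms, adjust as needed


def find_suspect_lines(log_dir, keywords=KEYWORDS, max_hits_per_file=5):
    """Yield (file name, line number, line) for lines containing a keyword.

    Files are visited most recently modified first, and at most
    max_hits_per_file lines are reported per file.
    """
    files = [path for path in Path(log_dir).iterdir() if path.is_file()]
    files.sort(key=lambda path: path.stat().st_mtime, reverse=True)
    for path in files:
        hits = 0
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if any(keyword in line.lower() for keyword in keywords):
                    yield path.name, number, line.rstrip()
                    hits += 1
                    if hits >= max_hits_per_file:
                        break
```

Looping over `find_suspect_lines("/path/to/copied/logs")` and printing each tuple gives a short list of places to start reading, newest files first.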

kreisbote
Contributor

It lets me reconfigure HA without any problem (the warning sign disappears).

This state lasts for a while, then the error comes back for no apparent reason.

kreisbote
Contributor

"You can check in /opt/LGTOaam512/log/ and see if you can spot any errors in the logs."

Phew, there are about a zillion files in there. Where do I start?

aam_config_util_listnodes.log:

    KEY: -z VAL: 1
    KEY: domain VAL: vmware
    KEY: cmd VAL: listnodes
    CMD: hostname -s
    RESULT:
    ----
    esx1
    CMD: /opt/LGTOaam512/bin/ft_gethostbyname esx1 |grep FAILED
    RESULT:
    ----
    list_nodes
    CMD: /opt/LGTOaam512/bin/ftcli -domain vmware -connect esx4 -port 8042 -timeout 60 -cmd listnodes
    RESULT:
    ----
    Node    Type       State
    ----    ----       -----
    esx1    Primary    Agent Failed
    esx2    Primary    Agent Running
    esx3    Primary    Agent Running
    esx4    Primary    Agent Running
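
To keep an eye on this without reading the table by hand each time, the listnodes result can be parsed with a few lines of Python. This is a rough sketch that assumes the output always has the layout shown above (node, type, then a two-word state starting with "Agent"); the function names are made up for the example.

```python
def parse_listnodes(output):
    """Return {node: (type, state)} from a listnodes result table."""
    nodes = {}
    for line in output.splitlines():
        parts = line.split()
        # Data rows look like "<node> <type> Agent <state>"; skip all others.
        if len(parts) == 4 and parts[2] == "Agent":
            node, node_type, _, state = parts
            nodes[node] = (node_type, "Agent " + state)
    return nodes


def failed_nodes(output):
    """Return the nodes whose agent is in any state other than Running."""
    return [node for node, (_, state) in parse_listnodes(output).items()
            if state != "Agent Running"]


SAMPLE = """\
Node Type State
esx1 Primary Agent Failed
esx2 Primary Agent Running
esx3 Primary Agent Running
esx4 Primary Agent Running
"""

print(failed_nodes(SAMPLE))  # prints ['esx1']
```

With the table from this log it reports esx1 as the only node whose agent is not running, which matches the host showing the HA error.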

CWedge
Enthusiast

Do you have your hosts files set up with both the short name and the FQDN?

Even if your DNS works, HA for some odd reason needs the hosts files on all ESX servers to be set up for it to work properly.

Also, Patch 2 of VC 2.0.1 *seemed* to clear up some other HA problems.
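
To check the hosts files for the short name and FQDN entries mentioned above, something like the following Python 3 sketch could be run against a copy of each host's /etc/hosts. The domain "example.local" and the IP addresses are placeholders invented for the example, as are the function names.

```python
def parse_hosts(text):
    """Map each name found in hosts-file text to its IP address."""
    names = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        ip, *hostnames = line.split()
        for name in hostnames:
            names[name.lower()] = ip
    return names


def missing_entries(text, short_names, domain):
    """List the short names and FQDNs that the hosts file does not define."""
    known = parse_hosts(text)
    missing = []
    for short in short_names:
        for name in (short, f"{short}.{domain}"):
            if name.lower() not in known:
                missing.append(name)
    return missing


# Placeholder content; the addresses and domain are not from the thread.
HOSTS = """\
127.0.0.1   localhost.localdomain localhost
10.0.0.11   esx1.example.local esx1
10.0.0.12   esx2.example.local
"""

print(missing_entries(HOSTS, ["esx1", "esx2", "esx3"], "example.local"))
# prints ['esx2', 'esx3', 'esx3.example.local']
```

An empty list for every host in the cluster would mean each hosts file carries both forms of every name. It does not check that the addresses themselves are correct.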

Kindred_VMSuppo
Contributor

I just had a similar problem. There were two things I had to do:

1. Reboot the offending server (this fixed my HA agent continually disabling itself).

2. Create a new HA cluster group and move all ESX servers into it.

kreisbote
Contributor

"2. Create a new HA cluster group and move all ESX servers into it."

This seems to work. For about two hours now I haven't had any HA-related errors.

Thank you very much!
