Just curious if anyone else has been seeing their cluster nodes randomly report an HA agent error. As a result the cluster is degraded. Right now I have multiple nodes with the same error in a three-node cluster, which effectively nullifies HA for the entire cluster. Ironically, DRS still appears to work for the nodes experiencing HA problems. The unfortunate part is that the interface simply states the agent has an error but doesn't actually provide any information beyond that.
Thus far I have been unable to correct the issue (other than rebuilding the node).
I have an open case with VMware Support but we haven't found a resolution and I have just been handed off to another technician.
On the esx host:
tail -f /var/log/vmware/vpx/vpxa.log
In VC, right click on the host that shows the HA problem and click reconfigure for HA.
Watch the log for the error message it posts.
The usual problems with HA configuration are:
1) DNS issues - make sure you can resolve the short hostname (without domain name) of each ESX host from each other ESX host in the cluster.
2) fqdn too long - make sure the fully qualified domain name of all hosts is less than 39 characters.
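Both checks above can be scripted from each ESX host. This is only a sketch: the host names in the usage comment are examples, and the length limit is hedged because posts in this thread cite both 39 and 29 characters.

```shell
# Sketch: verify a peer's short hostname resolves, and report the FQDN
# length, since the HA agent is said to choke on long names.
check_peer() {
    local name=$1 out fqdn
    out=$(getent hosts "$name") || {
        echo "FAIL: $name does not resolve"
        return 1
    }
    fqdn=$(printf '%s\n' "$out" | awk '{print $2; exit}')
    if [ "${#fqdn}" -ge 39 ]; then
        echo "WARN: $fqdn is ${#fqdn} chars (posts here cite a 29-39 char limit)"
    fi
    echo "OK: $name -> $fqdn"
}

# Run against every OTHER host in the cluster, e.g. (example names):
# for h in esx01 esx02 esx03; do check_peer "$h"; done
```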
Could you look at the /opt/LGTOaam512/log/aam_config_util_addnode.log. That may give an indication about what went wrong.
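To skim that log for the failing step, a small helper like the following can be used. This is a sketch, not VMware tooling; the log path is the one given above, so adjust it if your LGTOaam version directory differs.

```shell
# Sketch: pull recent error/failure lines out of a log file; intended for
# the AAM add-node log mentioned above.
recent_errors() {
    tail -n 200 "$1" | grep -iE 'error|fail' | tail -n 20
}

# On the ESX host:
# recent_errors /opt/LGTOaam512/log/aam_config_util_addnode.log
```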
I think that is shorter than 29 characters. That issue, though, should be fixed in the first patch, according to an SR I filed on it.
In case anyone searches and finds this.
I was having the same issue, and I tried restarting VC and the affected ESX boxes, to no avail. For some reason I couldn't even put the hosts in maintenance mode. Reconfiguring HA on each host didn't work either.
However, before calling support, I decided to uncheck HA and DRS on this cluster, and then check them again, and shockingly, everything was happy again.
When I did the tail -f /blahblahblah what I was seeing was a VMap error - not sure if that's generic or what, but since this worked for me, I figured I'd post it.
- Keith
Same issue with one of my ESX 3.0.1 (35804) hosts.
I've tried a lot of things with no results. I read your post and unchecked HA on the cluster, then added it again after the reconfiguration. HA is working again.
Somehow, HA seems a wild child.
I have seen the same issue a couple of times. If I turn HA off then back on everything magically works again...
Yeah, most of the time it's DNS issues.
Thank you for your suggestion. It is so simple - just click reconfigure for HA!
Much better than disabling and enabling HA on the whole cluster.
It seems that after a host reboot it does not automatically rejoin the HA cluster.
Thanks again
This is a ridiculous problem, to be honest. I see it all the time, and we do NOT have any problems with our DNS setup. I have hosts drop off HA quite frequently. Sure, reconfiguring for HA seems to resolve the issue, but this isn't how it should be. I shouldn't have to reconfigure every time one host decides to take an HA nap.
We had this error when the DNS server entry on one of our ESX hosts was not set correctly. I then put the FQDNs and short names of our ESX hosts into the /etc/hosts files of all hosts. After that, everything worked fine.
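For reference, entries along these lines on every host in the cluster cover both the FQDN and the short name. The names and IPs below are made-up examples; substitute your own cluster members.

```shell
# Sketch: stage example /etc/hosts entries in a temp file for review before
# appending them to /etc/hosts on each ESX host. Names/IPs are hypothetical.
cat > /tmp/ha-hosts-entries <<'EOF'
192.168.10.11   esx01.example.local   esx01
192.168.10.12   esx02.example.local   esx02
192.168.10.13   esx03.example.local   esx03
EOF

# Review the staged entries, then on each host:
# cat /tmp/ha-hosts-entries >> /etc/hosts
cat /tmp/ha-hosts-entries
```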
I agree, our DNS is working just fine and this Legato agentry is buggy. We *just* put this cluster into production and these agents are already creating problems; great feature, but I'm not thrilled to be beta-testing it in production. I've just disabled it for now.
I can recreate and break this error.
Error configuring HA.
" Internalerror : vmap_HSLESX01 process failed to stop "
Not much help is available for this, but it consistently occurs when the public (data) network cable is unplugged. I tried to reconfigure HA and it fails at 72% with the error above.
Remove HA and DRS from the cluster (takes forever...), then add the services back in.
This has worked consistently for me. I can recreate the issue and fix it again this way.
Crazy...
We had a planned network outage last week. When I checked on the cluster after the outage 3 of 4 hosts in the cluster were showing the same HA Configuration Error or similar generic issue. Uncheck and recheck of HA did the trick, but still doesn't seem reliable. I mean, if a network cable being unplugged causes the cluster to fail it's pretty useless.
We are running ESX 3.0.1, 34176 on 12 hosts and we also see this issue quite often. In our environment it is definitely not a name resolution issue. To fix it I usually just reconfigure HA on the host it has failed on and wait; the host eventually (4 hrs or so) clears the issue itself, like magic.
Anyway, while we are waiting for VMware to fix this issue, it would be nice if they would give us the ability to set up an alarm to notify us when HA fails on a given host. Currently, I do not see a trigger for HA.
Set up the hosts file and you'll be all set.
I have a 12-node cluster and I haven't had a blip since I set that up many months ago.
Thanks for the reply, but the hosts files are already populated with the FQDNs and short names of all the other ESX servers, as well as the DNS servers' FQDNs and short names.
And /etc/nsswitch.conf is set up correctly.
So it is definitely not a name resolution issue.
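One quick sanity check along the same lines, as a sketch: confirm that "files" comes before "dns" on the hosts: line of an nsswitch.conf-style file, so the /etc/hosts entries win even when DNS misbehaves.

```shell
# Sketch: report whether "files" precedes "dns" in the hosts: lookup order
# of the given nsswitch.conf-style file.
files_before_dns() {
    awk '/^hosts:/ {
        for (i = 2; i <= NF; i++) {
            if ($i == "files") { print "ok";        exit }
            if ($i == "dns")   { print "dns-first"; exit }
        }
    }' "$1"
}

# Usage: files_before_dns /etc/nsswitch.conf
```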
OK, my DR site had problems I couldn't resolve either.
I just installed VC Patch 2 (although it sucked to install), and my HA errors in the DR site went away.
I'll upgrade to VMware VirtualCenter 2.0.1 Patch 2 and see what happens.
thanks ...
I just set up a new production cluster and was seeing a lot of the same issues discussed here (e.g. HA activation fails with AAM agent unable to start, unable to reach primary cluster server, VMAP errors, etc.)
I configured the /etc/hosts file with VIS, DNS and cluster members to no avail.
I seem to have the problem isolated at this point. I loaded DNS on my VIS, set that as the primary DNS on the cluster members, rebooted the members, and added them to the HA cluster.
I had updated to Patch 2 earlier, and was still seeing the problem until this point. I don't know if it was a coincidence, but there it is. Hope it helps.