Got a small twin-host setup here running ESX 3i 3.5 servers connected over iSCSI to an MSA disk shelf. Both hosts are clustered with HA & DRS enabled, and all was well for a number of months since its installation ... until Sunday night. Our UPS decided to have a hernia and stopped passing power to our server/comms cabinets, resulting in the inevitable instant power-off of all hardware. Great way to start the year off!
Anyway, 99% of all systems came back up without a worry once we bypassed the dodgy UPS (waiting for that to be fixed), but ever since we have had an "HA Agent has an error" message with the red alert symbol on one host only. The second host has no problems. I have tried a number of things to repair the HA error from the VI console, including:
a) Right-clicking the dodgy host and selecting "Reconfigure for VMware HA" <-- This results in a ": cmd remove failed:" error in the events list for the host just after the message "enabling HA agent"
b) Removing the dodgy host from the cluster and then re-joining it
c) Dropping the dodgy host to maintenance mode and then re-enabling it
d) Restarting the dodgy host
OK, now my Linux skills can easily be classed as newbie, but I knew enough to get the following console info for troubleshooting:
1) I logged into the VMware OS and ran a Test Management Network. Gateway, 2x DNS servers and hostname resolution were all successful
2) Based on some other threads I have seen here, I then logged into the console and did a test ping by hostname to the other host and the VirtualCenter server. Both were successful, so I doubt I have a DNS or hostname problem
3) Did an Alt-F12 to view the logging screen and noticed something that may be related. Every 10 minutes or so I get the following entries:
WARNING: UserThread: 406: peer table full for sfcbd
WARNING: World: vm 204457: 910: init fn user failed with: Out of resources!
WARNING: World: vm 20457: 1775: WorldInit failed: trying to cleanup.
That may or may not be related, but I have a gut feeling it is.
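For anyone wanting to see how often these warnings recur, here is a minimal Python sketch that parses lines of the shape shown above and tallies them per subsystem. The regex and the field names (`subsystem`, `world`, `number`, `message`) are assumptions based on the three sample lines only, not a documented VMkernel log format, and the log text would have to be copied off the host first.

```python
import re
from collections import Counter

# Assumed line shape, inferred from the three sample lines only:
#   WARNING: <Subsystem>: [vm <id>: ]<number>: <message>
WARNING_RE = re.compile(
    r"^WARNING:\s+(?P<subsystem>\w+):\s+"
    r"(?:vm\s+(?P<world>\d+):\s+)?"
    r"(?P<number>\d+):\s+(?P<message>.+)$"
)


def parse_warning(line):
    """Return a dict of fields for a WARNING line, or None if it does not match."""
    match = WARNING_RE.match(line.strip())
    return match.groupdict() if match else None


def tally_subsystems(lines):
    """Count matching WARNING lines per subsystem name."""
    counts = Counter()
    for line in lines:
        parsed = parse_warning(line)
        if parsed is not None:
            counts[parsed["subsystem"]] += 1
    return counts


# The three lines quoted in the post, unchanged.
SAMPLE = [
    "WARNING: UserThread: 406: peer table full for sfcbd",
    "WARNING: World: vm 204457: 910: init fn user failed with: Out of resources!",
    "WARNING: World: vm 20457: 1775: WorldInit failed: trying to cleanup.",
]

if __name__ == "__main__":
    for subsystem, count in tally_subsystems(SAMPLE).items():
        print(subsystem, count)
```

This only counts and groups the warnings; it says nothing about whether they are the cause of the HA agent error.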
Any ideas what I need to check next in order to get HA functioning again on this host?
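The by-hostname checks from steps 1 and 2 can also be repeated from a management workstation. A minimal sketch, assuming Python is available there: the hostnames below are hypothetical placeholders, and this only tests forward lookups from the machine it runs on, not from the ESXi host itself.

```python
import socket


def check_names(names, resolver=socket.gethostbyname):
    """Resolve each name; map it to its IPv4 address, or None if lookup fails."""
    results = {}
    for name in names:
        try:
            results[name] = resolver(name)
        except OSError:
            # socket.gaierror (failed lookup) is a subclass of OSError.
            results[name] = None
    return results


if __name__ == "__main__":
    # Hypothetical names: substitute both hosts and the VirtualCenter server.
    hosts = ["esx1.example.local", "esx2.example.local", "vc.example.local"]
    for name, address in check_names(hosts).items():
        print(name, "->", address if address else "FAILED")
```

The `resolver` parameter exists only so the function can be exercised without a real DNS server; in normal use the default is fine.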
We had exactly the same error on one of our ESXi 3.5 hosts that boots off a USB key drive.
It turned out that the USB key drive was defective, causing inconsistencies in the internal file system (it's FAT, by the way) that in turn led to failures of the HA agent.
Try to rewrite your boot media. If you have an HP server with a USB key drive, there is a recovery CD available from HP that you can use for that. This helped us for a while, but the problem then re-appeared because the key drive was defective.
Maybe in your case there is just a file corruption caused by the power outage that can be repaired by rewriting the media.
Yeah, I am keeping that option as a last resort. We went through that whole drama during the installation of the system. HP had a bad batch of USB keys supplied to customers and both of our keys were from that batch. Both have been replaced and have been functioning well for a few months now. I have that image CD from HP (dated around July 08) if I need to go down that path.
I tend to agree that there is some corruption there, but I am still hoping that I can maybe delete all the HA files from the host and try adding it to the cluster again. Just not sure where VMware stores those files within the host OS.
Have given that a go now, still no good.
Also tried removing the host from the cluster, rebooting, and adding it back into the cluster, without HA success.
Looking at the Linux console, it says the vmware-aam service timed out at startup. Other posts here say this is effectively the HA service. Is there a way to RESET HA on the host from within the Linux console? I really want to avoid rebuilding the USB key, patching etc. if I can.
We had exactly the same drama here with the bad green keys, which were then replaced by good black ones ...
I definitely recommend rewriting the key instead of fiddling around with the HA agent installation. It's not that hard:
You can save all the configuration of the host (networking etc.) using the Remote CLI interface (esxcfg-cfgbackup.pl) and restore it later using the same command after you have rewritten the key. Then you only need to re-add it to VirtualCenter and patch it to the current patchlevel.
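A rough sketch of what the save and restore invocations might look like, written as a small Python helper that only builds the command line and does not run anything. It assumes the `-s` (save) and `-l` (load) options and the `--server`/`--username` connection options as I recall them from the Remote CLI; please check the script's `--help` output before relying on them. The hostname and backup file name are hypothetical.

```python
def cfgbackup_command(server, backup_file, action,
                      username="root", script="esxcfg-cfgbackup.pl"):
    """Build (but do not run) a Remote CLI config backup/restore command line.

    The flag names are assumptions to be verified against the script's
    --help output: -s saves the host configuration, -l loads it back.
    """
    flags = {"save": "-s", "load": "-l"}
    if action not in flags:
        raise ValueError("action must be 'save' or 'load'")
    return [script, "--server", server, "--username", username,
            flags[action], backup_file]


if __name__ == "__main__":
    # Hypothetical host and file names, for illustration only.
    host = "esx1.example.local"
    print(" ".join(cfgbackup_command(host, "esx1-config.bak", "save")))
    print(" ".join(cfgbackup_command(host, "esx1-config.bak", "load")))
```

Building the argument list rather than executing it keeps the sketch safe to run anywhere; the printed line can be reviewed and then pasted into the Remote CLI shell.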
I'm having the same error on the hosts that I want to add to my HA cluster on my VMware ESXi 3.5 Update 2 platform.
Any thoughts on this? (I'm using a local install and no USB disks.)
I ended up rebuilding the USB key from the HP restore CD, repatching, and rejoining the cluster. All back to 100%.
A couple of hours' work, but at least I have a working system now.