Earlier this week I logged into Virtual Center (2.01 + patches) and I saw a red flag on an ESX (3.01) host.
It turns out that earlier there was a failure of the HA agent.
Furthermore, the failure of this agent caused all VMs on that host to be hard-rebooted on different ESX hosts.
The worst of it is that we had no idea that we were running without redundancy.
There are alarms in Virtual Center, but I can't find any that have to do with the health of the HA agent.
We have HP OpenView, so I looked through the event logs, the text logs, and anything else on the Virtual Center server that could serve as a "trigger" for such an event, but I could not find anything.
This is obviously a big problem, because we need to be notified when we have lost redundancy (i.e. when the HA agent has failed).
Does anyone know how this can be done (if it can be done at all) without buying a third-party ESX monitoring package?
Well, it's not all that bad. At least HA worked and your VMs were up and available.
There are actually a couple of options available to you. Remember that the HA component is really a piece of software that EMC acquired through Legato, called Fulltime (ft) Autostart. Anything that can monitor ft can be used to monitor HA. This isn't well documented on the VMware side, but it has been documented by Legato.
The first is to check for the existence of the EMC AAM agent's key processes. There are several processes that make up the HA component. They are:
ftbb and ftAgent are the primary ones, though, and if one or both of them fail, a failover will take place. Note: ftbb is actually the process that maintains the "heartbeat" with the other nodes in the cluster over TCP ports 8042-8045. You can check for their existence with a simple "ps -ef | grep /opt/LGTOaam512". Parse the output, and if you don't see all of the processes, throw up the red flag.
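Here is a minimal sketch of that parsing step, assuming a POSIX shell with GNU grep on the ESX service console. Only the two process names ftbb and ftAgent come from the description above; the function name missing_aam_processes and its output strings are made up for illustration.

```shell
#!/bin/sh
# Sketch: report which of the primary AAM processes are missing.
# REQUIRED holds only the two process names named in this thread.
REQUIRED="ftbb ftAgent"

# Reads a process listing (for example from "ps -ef") on stdin.
# Prints "OK" and returns 0 if every required name appears in it,
# otherwise prints "MISSING: <names>" and returns 1.
missing_aam_processes() {
    listing=$(cat)
    missing=""
    for name in $REQUIRED; do
        # -w matches whole words only, -q suppresses grep's own output.
        if ! printf '%s\n' "$listing" | grep -qw -- "$name"; then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        echo "MISSING:$missing"
        return 1
    fi
    echo "OK"
    return 0
}
```

You could call it as "ps -ef | grep /opt/LGTOaam512 | grep -v grep | missing_aam_processes" and raise your alarm whenever it returns non-zero.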
You can also monitor the log file for AAM. The main log file to focus on is vmware_yourESXservername.log, replacing "yourESXservername" with the local ESX host name. Any events are logged here.
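Since any event ends up in that file, one simple approach is to alert on every line appended since the last check. The following is only a sketch under that assumption: the function name new_log_lines and the state file are hypothetical, and it does not know anything about the format of the AAM log itself.

```shell
#!/bin/sh
# Sketch: print the lines appended to a log file since the previous call.
# Usage: new_log_lines LOG_FILE STATE_FILE
# STATE_FILE is a hypothetical scratch file that remembers the line count.
new_log_lines() {
    file=$1
    state=$2
    last=0
    [ -f "$state" ] && last=$(cat "$state")
    # Arithmetic expansion strips any padding that wc adds to the count.
    total=$(( $(wc -l < "$file") ))
    # If the log was rotated or truncated, start again from the top.
    [ "$total" -lt "$last" ] && last=0
    tail -n +"$((last + 1))" "$file"
    echo "$total" > "$state"
}
```

Run from cron, something like new_log_lines "$AAM_LOG_DIR/vmware_$(hostname -s).log" /tmp/aam_log.state would print only the new entries. AAM_LOG_DIR is a placeholder here; point it at wherever the AAM log lives on your hosts.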
The best option (saved the best for last) is to run the following command:
This command will either return output like this if the node is online from an AAM perspective:
...or the following if the AAM agent is not available:
Put together a script on a monitoring node, such as your HP OpenView server, that SSHes in to each ESX host and runs a script there that grabs the host name (hostname -s) and then runs ft_gethostbyname. The result will tell you whether the agent is online or not. Of course, if the SSH connection itself fails, then the ESX host is probably down altogether.
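A rough sketch of the monitoring-node side might look like the following. It relies on the fact that OpenSSH's ssh exits with status 255 when the connection itself fails. Everything else is an assumption you need to fill in: REMOTE_CMD is the exact ft_gethostbyname invocation for your installation, and ONLINE_PATTERN is a piece of text that only shows up in the output when the agent is online. Neither value is supplied here, and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch of a remote HA agent check, run from the monitoring server.
# ASSUMPTIONS (set these yourself, they are not defined in this thread):
#   REMOTE_CMD     - the ft_gethostbyname command line to run on the ESX host
#   ONLINE_PATTERN - text that appears in its output only when the agent is online

# classify_ha_status RC OUTPUT PATTERN
# Maps the ssh exit status and the command output to one status word.
classify_ha_status() {
    rc=$1
    output=$2
    pattern=$3
    if [ "$rc" -eq 255 ]; then
        # ssh itself failed, so the host is probably down or unreachable.
        echo "HOST_UNREACHABLE"
    elif printf '%s\n' "$output" | grep -q -- "$pattern"; then
        echo "AGENT_ONLINE"
    else
        echo "AGENT_OFFLINE"
    fi
}

# check_host HOSTNAME
# BatchMode stops ssh from prompting for a password inside a cron job.
check_host() {
    host=$1
    output=$(ssh -o BatchMode=yes -o ConnectTimeout=10 "$host" "$REMOTE_CMD" 2>&1)
    classify_ha_status "$?" "$output" "$ONLINE_PATTERN"
}
```

Anything other than AGENT_ONLINE from check_host is worth forwarding to OpenView as an alert.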
Hope that helps. Good luck!