I have a three-node DRS/HA cluster where one host failed over the weekend. An important VM guest didn't fail over.
The host was down for three hours.
Hosts:
Failed host ESX 3.5.0 build 95350.
Two other hosts are both ESX 3.5.0 build 110268 (Update 2).
VC Server is 2.5.0 Build 104215 (Update 2).
We hadn't yet patched the failed host to the same level as the other two, so it was only running one VM guest, but it was an important one. If I attempt a manual vMotion from the failed (but now recovered) host to the others, I see a warning: "Migration from host X to host Y will cause the virtual machine's configuration to be modified to preserve the CPU feature requirements for its guest OS." If I click through the warning, the vMotion succeeds.
Is the patch-level mismatch between my hosts to blame for the failed HA move? In the past we've tested host isolation and watched the VM guests shut down on the failed host and come back up on another host.
HA settings:
Host failures the cluster can tolerate: 1
Allow VMs to be powered on even if they violate availability constraints
Default cluster settings: medium restart priority; isolation response: power off VM (these are iSCSI-connected).
Enable virtual machine monitoring: disabled
Advanced settings: das.failureInterval 30, das.maxFailureWindow 3600, das.maxfailures 3, das.minUptime 120
The important guest is set: high restart priority, Isolation: use cluster settings.
Any suggestions on finding out more?
Thank you!
Scott
Did you look through the VC's Events tab? It should also have data on when/if a host went into isolation mode, as reported from the live hosts.
There should be agent stdout and stderr files in that location; they should record issues with the agent. I've had isolation kill a server before, and I'll try to find the log directly on ESX, but I know it shows up in VC as well. You should at least be able to check those logs in the aam directory to see if the agent was having issues during that time.
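The exact agent log file names escape me at the moment, so here is just the kind of sweep I mean, demonstrated against a throwaway directory standing in for the agent's log directory (all file names and contents below are made up):

```shell
# Demonstration only: build a stand-in log directory, then sweep it for
# errors the way you would the real aam directory. File names here are
# fabricated; the real agent log names vary by ESX build.
LOGDIR=$(mktemp -d)
printf 'agent started ok\n'              > "$LOGDIR/agent.out"
printf 'ERROR: lost contact with node\n' > "$LOGDIR/agent.err"

# List every file in the directory that mentions an error, case-insensitively
HITS=$(grep -ril 'error' "$LOGDIR")
echo "$HITS"

rm -rf "$LOGDIR"
```

Run the same grep against the real directory and you'll see at a glance which agent files were complaining during the outage.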
-KjB
The difference in patch level is to blame for your warning message, and could have caused a vMotion to fail, but HA failover should still have occurred. The caveat would be if the HA agent on the failed host had an issue, or if there was an HA problem in the timeframe when your host crashed. There should be evidence of this in your VC's Events tab; you should be able to see errors or warnings if HA did indeed fail.
Also check your vmkernel, ha, and vmware.log files during that timeframe to find more information.
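For narrowing one of those logs to the outage window, something like this works. It's demonstrated against a fabricated sample file since I don't have your logs, and real vmkernel line formats differ a bit by build:

```shell
# Demonstration: pull only the lines stamped inside a time window from a
# syslog-style log. The sample lines are fabricated, not real vmkernel output.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
Nov 15 02:10:01 esx1 vmkernel: ALERT: heartbeat lost
Nov 15 05:45:12 esx1 vmkernel: Info: link state up
EOF

# Keep Nov 15 entries between 02:00 and 05:00
# (plain string comparison works on zero-padded HH:MM:SS stamps)
WINDOW=$(awk '$1=="Nov" && $2=="15" && $3>="02:00:00" && $3<"05:00:00"' "$LOG")
echo "$WINDOW"

rm -f "$LOG"
```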
-KjB
Where is the HA log? I can see the vmware and vmkernel logs on the host, but no HA logs on either the host or VC.
Thanks.
Look under /var/log/vmware/aam
-KjB
Over the weekend I patched the third host up to 3.5.0 Update 2, so it now matches the other two hosts in the cluster. On the troublesome host the HA service would not start; the error was something like "HA agent has an error", which is pretty generic.
I tried a couple of things: restarting the host, running Reconfigure for HA, and checking that host names were entered in lower case in the hosts file, in DNS resolution from VC, and elsewhere. In the end I had to disable HA on the cluster, wait for the change to propagate to the hosts, then enable HA again. This time it worked.
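For the hosts-file check, I just looked for anything upper-case, since the HA agent is picky about name case. Here's the gist against a throwaway copy (the entries below are made up, not our real ones):

```shell
# Demonstration: flag hosts-file entries containing upper-case characters.
# The sample entries are fabricated.
HOSTS=$(mktemp)
cat > "$HOSTS" <<'EOF'
10.0.0.1  ESXHost1.example.com  ESXHost1
10.0.0.2  esxhost2.example.com  esxhost2
EOF

# Any line with an upper-case letter is a candidate for fixing
BAD=$(grep -E '[A-Z]' "$HOSTS" || true)
echo "$BAD"

rm -f "$HOSTS"
```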
I guess the HA agent was buggered on the troublesome host the whole time, which I would have discovered had I kept it at a patch level equal to the other hosts in the cluster.
re: Logging
I reviewed many logs in that directory but didn't see any errors. Is there a certain log that captures events such as an attempted failover?