I have 5 hosts running ESX 4.0.0. All VMs reside on shared storage via iSCSI. I have plenty of CPU and RAM resources available in case I need to load up a host or two. vMotion runs without a problem, and I have been able to migrate the VMs across all 5 hosts in the cluster. Okay... so a couple of nights ago, the A/C in the data center died. The storage managed to stay up, but three of the hosts shut down due to temperature problems. None of the VMs restarted on my other two hosts. Is this a configuration issue? Do I need to upgrade to 4.1? Any help would be appreciated, thanks!
Welcome to the Forums. How many host failures are you configured for? If it is 1, which is the default, it is entirely possible that two of the hosts that failed were the 'master nodes' that control HA restarts. The number of 'master nodes' is always N+1, where N is the number of host failures supported.
Admission control is disabled. My failover capacity is 4 hosts, and the configured failover capacity is N/A (is this the part I need to fix?)
Also, the Advanced Runtime Info is as follows:
Slot size: 256 MHz, 2 virtual CPUs, 256 MB
Total slots in cluster: 348
Used slots: 31
Available slots: 242
Total powered-on VMs in cluster: 31
Total hosts in cluster: 5
Total good hosts in cluster: 5
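For what it's worth, the slot numbers above can be sanity-checked with a little arithmetic. This is only a sketch, and it assumes that HA reports available slots as total minus used minus whatever it holds back for failover. The variable names are just illustrative.

```python
# Values copied from the Advanced Runtime Info above.
total_slots = 348
used_slots = 31
available_slots = 242

# Assumption: available = total - used - slots held back for failover.
reserved_for_failover = total_slots - used_slots - available_slots

print(f"Slots held back for failover: {reserved_for_failover}")
print(f"Used share of total slots: {used_slots / total_slots:.1%}")
```

Under that assumption, 75 slots are being held back, and the 31 powered-on VMs use roughly 9% of the slots counted across all five hosts.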
What do you have configured for Host Isolation Response in your virtual machine options?
Leave powered on
You use iSCSI and ESX? Do you have a second service console port that would have stayed pingable? Are you sure there was an HA event?
Good point! I am not sure that HA was even attempted. I do use ESX and iSCSI (through a QLogic HBA) for my shared storage. Do I need a second service console port on each failed host?
No, just a question. I would look through your logs to see whether an HA event was generated. You can start at the cluster level in the vSphere Client by going to Tasks & Events.
Also, check
HA agent logs: /var/log/vmware/aam
Configuration files: /etc/opt/vmware/aam
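If there are a lot of files in that log directory, a small script can save some scrolling. This is only a sketch: `scan_logs` is a made-up helper name, and the keywords are guesses at what might be worth searching for, not known AAM log strings. It is written for Python 3, which the service console may not have, so it is probably easiest to copy the logs off the host and point the script at the copy.

```python
import os


def scan_logs(log_dir, keywords):
    """Return (path, line_number, text) for every line that mentions a keyword."""
    lowered = [k.lower() for k in keywords]
    hits = []
    for root, _dirs, files in os.walk(log_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            try:
                # errors="replace" keeps the scan going over odd bytes.
                with open(path, "r", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if any(k in line.lower() for k in lowered):
                            hits.append((path, number, line.rstrip("\n")))
            except OSError:
                # Skip anything that cannot be opened or read.
                continue
    return hits


if __name__ == "__main__":
    # Hypothetical keywords; adjust to whatever the logs actually contain.
    for path, number, text in scan_logs("/var/log/vmware/aam", ["fail", "isolat"]):
        print(f"{path}:{number}: {text}")
```

The match is case-insensitive and on substrings, so a short stem like "isolat" catches both "isolated" and "isolation".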
Nothing in the logs or the event viewer shows an HA event occurring...
My assumption is that either something kept the heartbeat alive between the ESX hosts, or, with such a huge outage, HA just didn't do anything, which is possible. We had 6 out of 8 clustered hosts go down in an HA environment, and with an outage that large HA just won't do anything. It's kind of a fail-safe: it helps prevent split-brain scenarios and guests ending up partially registered throughout your environment.
However, if you can, I would open an SR. VMware Support may be able to give you a definitive answer as to what did or did not happen, and why.
Thanks. The logs do show that the cluster saw the hosts go down. I will open an SR. Thanks again!