VMware Cloud Community
gardengrove
Contributor

multiple host failure, failed HA

I have 5 hosts running ESX 4.0.0. All VMs reside on shared storage via iSCSI. I have plenty of CPU and RAM resources available in case I need to load up a host or two. vMotion runs without a problem, and I have been able to migrate the VMs across all 5 hosts in the cluster. Okay, so a couple of nights ago, the A/C in the data center died. The storage managed to stay up, but three of the hosts shut down due to temperature problems. None of the VMs restarted on my other two hosts. Is this a configuration issue? Do I need to upgrade to 4.1? Any help would be appreciated, thanks!

10 Replies
weinstein5
Immortal

Welcome to the forums. How many host failures are you configured for? If it is 1, which is the default, it is entirely possible that 2 of the hosts that failed were the 'master nodes', which control HA restarts. The number of 'master nodes' is always N+1, where N is the number of host failures supported.

If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful

gardengrove
Contributor

Admission control is disabled, my current failover capacity is 4 hosts, and my configured failover capacity is N/A (is this the part I need to fix?)

Also, the Advanced Runtime Info is as follows:

Slot size: 256 MHz, 2 virtual CPUs, 256 MB
Total slots in cluster: 348
Used slots: 31
Available slots: 242
Total powered-on VMs in cluster: 31
Total hosts in cluster: 5
Total good hosts in cluster: 5
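
As a rough sanity check on the numbers above, the sketch below estimates whether the two surviving hosts would have had enough slots for all 31 powered-on VMs. It assumes the 348 slots are spread evenly across the 5 hosts, which may not hold if the hosts differ in size, and the function names are made up for illustration. It says nothing about why HA did not act; it only suggests that raw slot capacity was probably not the limiting factor.

```python
def surviving_slots(total_slots, total_hosts, failed_hosts):
    """Estimate the slots left after host failures.

    Assumption: slots are distributed evenly across hosts.
    """
    if total_hosts <= 0 or not 0 <= failed_hosts <= total_hosts:
        raise ValueError("host counts are out of range")
    surviving_hosts = total_hosts - failed_hosts
    # Integer division rounds down, giving a slightly conservative estimate.
    return total_slots * surviving_hosts // total_hosts


def can_restart_all(total_slots, total_hosts, failed_hosts, used_slots):
    """Return True if the surviving hosts have at least as many slots as are in use."""
    return surviving_slots(total_slots, total_hosts, failed_hosts) >= used_slots


if __name__ == "__main__":
    # Figures taken from the Advanced Runtime Info in this post.
    remaining = surviving_slots(total_slots=348, total_hosts=5, failed_hosts=3)
    print("Estimated slots on surviving hosts:", remaining)
    print("Enough for 31 used slots:", can_restart_all(348, 5, 3, 31))
```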

Troy_Clavell
Immortal

What do you have configured under your virtual machine options for Host Isolation Response?

gardengrove
Contributor

Leave powered on

Troy_Clavell
Immortal

You use iSCSI and ESX? Do you have a second service console port that would have stayed pingable? Are you sure there was an HA event?

gardengrove
Contributor

Good point! I am not sure that HA was even attempted. I do use ESX and iSCSI (through a QLogic HBA) for my shared storage. Do I need a second service console port on each failed host? I have uploaded an image of the network configuration of one of my hosts.

Troy_Clavell
Immortal

No, you don't necessarily need a second service console port; it was just a question. I would look through your logs to see if an HA event was generated. You can start at the cluster level by opening the vSphere Client and going to Tasks & Events.

Also, check

HA agent logs: /var/log/vmware/aam

Configuration files: /etc/opt/vmware/aam
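
If the log directory is large, a small script can help narrow things down. The sketch below walks a directory and prints every line containing any of a set of keywords, ignoring case. It is written for Python 3 and is probably easiest to run against a copy of the logs on a workstation, since the service console may only have an older Python. The keywords are placeholders, not strings the HA agent is known to write, so adjust them to whatever you actually see in your logs.

```python
import os


def find_matching_lines(log_dir, keywords):
    """Return (path, line_number, text) for each line containing any keyword.

    Matching is a case-insensitive substring test.
    """
    lowered = [keyword.lower() for keyword in keywords]
    matches = []
    for root, _dirs, files in os.walk(log_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            try:
                # errors="replace" keeps the scan going on odd byte sequences.
                with open(path, "r", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if any(k in line.lower() for k in lowered):
                            matches.append((path, number, line.rstrip("\n")))
            except OSError:
                # Skip files that cannot be read (permissions, sockets, etc.).
                continue
    return matches


if __name__ == "__main__":
    # Placeholder keywords; these are guesses, not known HA agent messages.
    guesses = ["fail", "isolat", "restart"]
    for path, number, text in find_matching_lines("/var/log/vmware/aam", guesses):
        print("%s:%d: %s" % (path, number, text))
```

Point the first argument at wherever the copied logs live; a missing directory simply produces no output.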

gardengrove
Contributor

Nothing in the logs or the event viewer shows an HA event occurring...

Troy_Clavell
Immortal

My assumption is that something is keeping the heartbeat alive between the ESX hosts, or that with such a huge outage HA just didn't do anything, which is possible. We had 6 out of 8 clustered hosts go down in an HA environment, and with an outage that large HA just won't do anything. It's kind of a fail-safe: it helps prevent split-brain scenarios and guests ending up partially registered throughout your environment.

However, if you can, I would open an SR. VMware Support may be able to give you a definitive answer as to what did or did not happen, and why.

gardengrove
Contributor

Thanks. The logs do show that the cluster saw the hosts go down. I will open an SR. Thanks again!
