VMware Cloud Community
Kermitdafwog
Contributor

HA and VM restarts...

Hi

Our 4 node cluster had networking issues last night which resulted in isolated hosts and VMs getting shut down.  The hosts are running ESX 4.1 and have the default HA cluster settings defined (medium restart priority and VM shut down).  DRS is set to manual and we have lots of DRS rules keeping clustered VMs apart etc., nothing major.

I came in this morning to find that all the VMs in the estate (about 200) were powered off, including vCenter.

Only after we managed to locate and power on the vCenter database and vCenter server did the VMs begin powering back on automatically and moving hosts (all done by vpxuser).  The network issue appears to have been sorted during the night (possibly a switch reboot or something, still trying to find out from the comms team), so why were the VMs only restarted now, when the hosts are no longer complaining about being isolated?

I was under the impression that HA works *without* vCenter being available (i.e. VMs will restart on other hosts), but this doesn't appear to have been the case here.

There are a fair few events of "Virtual machine was restarted on host-esx03.fqdn.int since host-esx01.fqdn.int failed" followed immediately by "Failover unsuccessful for this virtual machine", even though the VM *was* moved from esx01 to esx03, just not powered on.

Seems very odd...

Can anyone shed any light on this?

thanks

17 Replies
MKguy
Virtuoso

HA works completely independently of vCenter. Hosts only need access to vCenter for configuring HA, not for handling restarts.

There seems to be a fair few events of "Virtual machine was restarted on host-esx03.fqdn.int since host-esx01.fqdn.int failed" followed immediately with "Failover unsuccessful for this virtual machine" even though the VM *was* moved from esx01 to esx03 but not powered on

I suppose you use FC storage, or your NFS/iSCSI network was not affected by these networking issues?

Then I can think of the following scenario as to what might have happened: The hosts were partitioned at first (not isolated, i.e. they could still reach their default gateway) and tried to restart the VMs of the unreachable hosts. However, they couldn't restart those VMs because the VMFS file locks were still active, as the VMs hadn't been shut down. Later the issue expanded and all the hosts actually became isolated, triggering the isolation response of shutting down VMs. This could have happened through a switch reboot, for example, causing (non-rapid) spanning tree to block all ports for a while until the tree is rebuilt. You may want to set the HA advanced setting das.failuredetectiontime to a larger value like 60 seconds (default 15) to guard against this.
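For reference, das.failuredetectiontime takes its value in milliseconds, so for a 60-second window the cluster's HA advanced options (Cluster Settings > VMware HA > Advanced Options) would look something like this (sketch, check against your own environment before applying):

```
das.failuredetectiontime = 60000    # default is 15000 (15 seconds)
```

Note that hosts need HA reconfigured (or the cluster's HA disabled/re-enabled) for the new value to take effect.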

To really tell what was going on we'd need detailed aam/hostd/vmkernel logs and descriptions of what precisely went wrong on the networking side.

Anyway, I generally recommend using the "Leave powered on" isolation policy to avoid such issues. VMFS file locking will do the job in 4.x, and in vSphere 5 you have datastore heartbeating adding even more resiliency.

Check the awesome HA deepdive and books by Duncan and Frank:

http://www.yellow-bricks.com/vmware-high-availability-deepdiv/#HA-41

-- http://alpacapowered.wordpress.com
jdptechnc
Expert

If you look at the Events of one of the actual VMs, it may give you more intel as to why it couldn't power on. Perhaps the networking issues took down your storage? Perhaps you didn't have enough resources to satisfy admission control requirements?

Were all hosts in your cluster network isolated last night (entire network went down)?

Please consider marking as "helpful", if you find this post useful. Thanks!... IT Guy since 12/2000... Virtual since 10/2006... VCAP-DCA #2222
JCOEN
Enthusiast

As you know, when a host fails, its VMs will try to restart on another host within the cluster. However, if the surviving hosts in the HA cluster think they, themselves, are isolated because they cannot ping their isolation address (by default, the default gateway), then the surviving hosts WILL NOT restart the failed VMs.

You say there were "networking issues which resulted in isolated hosts". Well, you said the isolation response is set to Shut Down. Therefore, since the hosts were isolated, the VMs were powered off, and since every surviving host in the cluster was also isolated (this is an assumption on my part), those hosts would not have restarted the VMs that were powered off.

Since you are running 4.1 you don't have the option of using datastore heartbeats to help. If you want to avoid this in the future you can specify multiple isolation addresses with the HA advanced option 'das.isolationaddress#'. If you don't have other network addresses that are pingable when default gateway connectivity is lost, then I would suggest re-evaluating your host isolation response.
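As a sketch (the addresses below are hypothetical; substitute pingable IPs on your management network, e.g. a second router or a storage interface), the HA advanced options could look like:

```
das.isolationaddress1 = 192.168.1.1       # hypothetical: secondary pingable address
das.isolationaddress2 = 192.168.1.2       # hypothetical: tertiary pingable address
das.usedefaultisolationaddress = false    # optional: stop using the default gateway as an isolation address
```

A host only declares itself isolated when it can't reach *any* of the configured isolation addresses, so spreading them across different network devices guards against a single gateway blip.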

-Josh

Kermitdafwog
Contributor

Hi

There are no clues in the VM logs as to why the VMs didn't power on. Like I mentioned, as soon as vCenter came back up they began to power on.

We use FC storage so there were no issues with this, and we have plenty of resources available in the cluster and no CPU or memory reservations to satisfy.

thanks in advance

Kerm

JCOEN
Enthusiast

Kerm,

If you were having networking problems and the hosts determined they were isolated, then HA will not power the VMs back on to an isolated host.

-Josh

Kermitdafwog
Contributor

Hi Josh

I don't have confirmation yet, but we have been told that the gateway had a very brief outage at 1.18am. The aam logs for all hosts tally with this, as there are isolation events in there for each server. I realise that if everything was isolated then HA wouldn't restart the VMs on other isolated hosts, but why did the restarts only happen when vCenter was brought back online, when comms had been back up for approx 8 hours?

Thanks for your input so far!

Kerms

jdptechnc
Expert

Grasping here... maybe your cluster was set to use the vCenter server IP as the host isolation address? Check the HA advanced settings on your cluster for that.

Kermitdafwog
Contributor

Nah, it was the network's default gateway, a port on a Cisco switch that responds to ICMP.

Cheers

Kerms

jdptechnc
Expert

I'm stumped then. I know DRS will not function without vCenter, but HA should be fully functional.

Were there any DRS rules in play that are restrictive as to where VMs can be placed?

What is your HA Admission Control policy?

depping
Leadership

Which version of vCenter Server are you using?

Kermitdafwog
Contributor

Hi

Version 5

I didn't mention earlier that we also use dvSwitches, as I didn't think it was relevant, but perhaps it is?

Thanks

Kerms

depping
Leadership

Check the FDM log files, but it sounds like, for whatever reason, the host compatibility list was updated and stated that no hosts were compatible for a restart. When vCenter came back online it probably pushed a new update with a full list of all compatible hosts, and hence restarts started working again.

The FDM log files should reveal this. Maybe you can attach them here.

Kermitdafwog
Contributor

Hi

Yes, all hosts went down.  The culprit was the default gateway, which is a VIP of a clustered firewall. Still not sure what happened, but it started dropping ICMP, so all hosts declared themselves isolated.  The issue was fixed during the night, but no VM restarts occurred until vCenter and the vCenter SQL box were brought back online (both of which are VMs).

We use FC storage, and there were no issues on that side of things.

thanks

Kerms

Kermitdafwog
Contributor

Hi Duncan

We are using vCenter 4.1.0.14766

I have looked for the FDM logs and they don't seem to exist on any of our hosts.  The hosts are running ESX 4.1.0, build 502767.

Any more ideas?  I am about to log this with VMware, so I'll post back any results they give in case it helps others...

cheers

Kerms

depping
Leadership

Huh, now I am lost; you answered my question about which version of vCenter you are using with "5".

If you are using vCenter 4.x look at the AAM log files.

Kermitdafwog
Contributor

Hi

Yes, sorry, thinking of two different things at the same time!  It is vCenter 4.x.

What specifically am I looking for in the aam log folder?  I have looked at the log file with the same name as the host, and that does say that isolation occurred at the time I thought it did. What else would be helpful, as there are a lot of logs in there!

Thanks in advance

kerms

depping
Leadership

I would be looking for a fail-over attempt and then seeing if you can find messages around why it failed. But in these cases, contacting support is probably 200 times faster.
