Re: Fault Tolerance

john_lauro · ‎03-22-2013

I am attempting to test fault tolerance and ran into a slight problem in a disaster test case.

2 ESXi 5.1 servers in a cluster.

VM running under fault tolerance on both and is set to autostart.

Power off both boxes ungracefully (i.e., pull the plug). Pretend someone short-circuited a rack and blew a breaker.

(If both are turned back on, there is no problem recovering.)

Turn only one box on (it doesn't appear to matter whether it was running the primary or the secondary; it definitely doesn't like it if only the secondary is turned on).

Also, VC is unreachable (remote); the only access is a shell on the console.

Is there any way to start the VM back up? I did try to use vim-cmd power.on, but that fails. Even connecting with the vSphere Client gives only greyed-out options on the VM.

I know, the chances of both going down at the same time followed by only one coming back online should be rare, and you wouldn't want an autostart in case of split brain (although VMware is pretty good about avoiding that with locks in VMFS). That said, it should be possible to manually start the VM.

If not directly, does anyone know if it can be cloned, or FT / HA disabled from the command line, or some other trick? I'm not really familiar with using the CLI for VMware (I used it a little back on ESX 3).

I haven't tried plain HA. Is it any easier to restart a VM from the secondary node (also assuming communication to VC is down) after a major cluster failure?

weinstein5 · ‎03-22-2013

Welcome to the Community. The first thing to understand is how FT works: when the primary VM of an FT pair is powered on, a shadow VM is started on another host in the FT cluster. In your example you only have two hosts, so the shadow VM will always be running on the second host. This explains the behavior you are seeing: you need to start the primary host for the FT pair to function properly.

So the question is: what are you trying to accomplish? If you are looking for a DR solution, FT or HA will not suffice, as they are both typically designed for a single failure.

If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful

john_lauro · ‎03-22-2013

The main thing I am attempting to accomplish is contingency planning and to have recovery procedures planned out before an actual event.

Does HA make it any easier to power on a secondary node on a two node cluster?

If there is no easy workaround, I can always do two VMs, heartbeat between them, and have them switch the service. More setup work, but better for autorecovery...

weinstein5 · ‎03-22-2013

I think HA is what you want. In the event of a host failure, it will allow the VMs to restart on the remaining host or hosts.

john_lauro · ‎03-22-2013

I'll test HA Monday. FT works fine in the event of a single host failure. The problem is a failure of the whole two-node cluster where only one of the nodes comes back. FT is preferred due to lower downtime in the more likely failure cases, but having an FT cluster where you can't easily bring up only one node after a power failure wouldn't be good.

weinstein5 · ‎03-22-2013

Remember, if you want to protect against all nodes of the cluster failing, you have moved from HA/FT to DR. In the case of a total cluster failure, both HA and FT will require some type of human intervention to power on the hosts.

I am also going to move this thread to a more appropriate forum.

kfarkas · ‎03-25-2013

I may be able to shed some additional light on this topic. But first, to ensure we have a common understanding, I'll mention that vSphere HA is a prerequisite for FT. HA is responsible for restarting the secondary VM of an FT VM pair after a single VM failure, and both VMs of the FT pair after a dual failure. If two hosts fail at the same time and then one comes back up, HA should restart the primary VM on this host. Once the second host comes up, HA will then restart the secondary VM.

For HA to perform these actions, HA must be enabled on the cluster, the HA host monitoring feature must be enabled, and the cluster default restart priority must not be "disabled". In addition, the VM must be reported as "HA protected" before the failure occurs. The HA protection state is reported on the summary pane of a VM. HA protection means that the HA agents have persisted on disk the fact that they must restart the FT VM after a failure.
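As a rough illustration of the checklist above, here is a small Python sketch that models the four conditions as plain data and reports which ones would block a restart. The `HaState` class, its field names, and `restart_blockers` are hypothetical names made up for this example. They are not part of any vSphere API, and the values would have to be read by hand from the cluster settings and the VM summary pane.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HaState:
    """Hand-entered snapshot of the settings listed above (hypothetical model)."""
    ha_enabled: bool
    host_monitoring_enabled: bool
    default_restart_priority: str  # e.g. "disabled", "low", "medium", "high"
    vm_ha_protected: bool


def restart_blockers(state: HaState) -> List[str]:
    """Return the unmet preconditions; an empty list means none were found."""
    blockers = []
    if not state.ha_enabled:
        blockers.append("HA is not enabled on the cluster")
    if not state.host_monitoring_enabled:
        blockers.append("HA host monitoring is not enabled")
    if state.default_restart_priority.strip().lower() == "disabled":
        blockers.append("cluster default restart priority is 'disabled'")
    if not state.vm_ha_protected:
        blockers.append("VM was not reported as HA protected before the failure")
    return blockers


if __name__ == "__main__":
    # Example values only; substitute what your own cluster reports.
    state = HaState(True, True, "medium", False)
    for reason in restart_blockers(state):
        print("blocked:", reason)
```

An empty result only means that none of these four conditions is the obstacle; it does not prove that HA will restart the VM.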

In your experiments, was HA configured in this way before you failed the two hosts? BTW, what vSphere HA version are you using?

Regarding manually restarting the FT VM after a failure: if HA is configured as described, this should rarely be necessary, because HA will attempt multiple times to restart a failed VM. But if these attempts fail, you need to manually power on the primary VM. In your experiments, after one host came up, did you try powering on both of the VMs of the FT pair or only the one registered to the host? Note that an FT role switch could have occurred just as the hosts went down, so the VM that was originally the primary could have become the secondary. You can, of course, register the second VM of the pair using the command line.

john_lauro · ‎03-25-2013

HA was enabled and working.

Most of the HA settings are default, so I'm not sure about the "HA host monitoring feature" and "cluster default restart priority". It was HA protected, but I think that gets lost after one of the two nodes goes down?

Ok, doing a check...

HA monitoring is enabled. Looks like Cluster-default VM-restart priority is set to medium.

I only attempted to power on the one registered to the rebooted host. I was not able to attempt to power up the other, as routing to vCenter is down in this failure scenario (and the host it was running on is also down, so I only have access to the secondary host). Not sure how to register the 2nd VM of the pair using the command line.

Not sure how to tell what vSphere HA version I am using? I have 5.1 with the 3 patches for it.

kfarkas · ‎03-25-2013

Hmm. HA should have restarted the primary VM when you rebooted one of the hosts.

Before powering down the hosts, please make sure that the UI reports the FT VM as HA protected. HA reports a VM as protected after the master agent has saved that the VM needs to be restarted after a failure. This information is saved in a file on the datastore containing the VM's configuration file. We save the information in the file so it is not lost after a master fails or the entire cluster goes down.

In your test, after you power up one of the hosts, the HA agent on that host should restart, elect itself master, and should read this file. It then should attempt to restart both VMs in the pair (since we don't know which VM is the primary), and one power on should succeed.
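The "try both, one should succeed" reasoning can be sketched in a few lines of Python. This is only a toy model of the behavior described above, not how the HA agent is implemented; `restart_ft_pair`, the VM names, and the `try_power_on` callback are all hypothetical.

```python
from typing import Callable, Iterable, List, Tuple


def restart_ft_pair(
    vms: Iterable[str], try_power_on: Callable[[str], bool]
) -> List[Tuple[str, bool]]:
    """Attempt a power-on for every VM in the pair and record each outcome.

    The caller does not need to know which VM currently holds the primary
    role; in this model only the current primary accepts the power-on.
    """
    return [(vm, try_power_on(vm)) for vm in vms]


if __name__ == "__main__":
    # Assume a role switch happened just before the outage, so the VM that
    # was originally the secondary ("vm-b") is now the primary.
    current_primary = "vm-b"
    results = restart_ft_pair(["vm-a", "vm-b"], lambda vm: vm == current_primary)
    for vm, ok in results:
        print(vm, "powered on" if ok else "power-on refused")
```

The point of the model is that attempting both VMs removes the need to know about a last-moment role switch.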

If the VM is reported as HA protected and HA does not restart it after you powered up one host, please try the following:

- after you power on one of the hosts, wait for five minutes

- if the primary VM has not powered on at this point, bring up VC

- once VC is running, wait for VC to report the HA state of the host that is up as "master"

- then search the cluster event history for events with the phrase "vSphere HA"

- also check if the VM is powered on at this point

If HA was unable to restart the VM for any reason, there should be one or more events reporting the restart failure. Please let us know what events you see, and whether the VM was restarted after VC was brought back up.

> Not sure how to register the 2nd VM of the pair using the command line.

See KB http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100616... for information on how to register a VM from the command line.
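For orientation, here is a minimal sketch of the kind of `vim-cmd` sequence commonly used for this from the ESXi shell, written as a small Python helper that only prints the commands and does not run them. The datastore path and the `<vmid>` placeholder are assumptions for illustration; `registervm` prints the ID of the newly registered VM, which the later commands take as their argument. Whether the power-on then succeeds for an FT VM in this state is not confirmed anywhere in this thread.

```python
import shlex
from typing import List


def register_and_power_on_commands(vmx_path: str, vm_id: str = "<vmid>") -> List[str]:
    """Build (but do not execute) the ESXi shell commands for a manual restart.

    vmx_path is the full path to the VM's .vmx file on a datastore.
    vm_id is the numeric ID printed by registervm or listed by getallvms.
    """
    return [
        "vim-cmd vmsvc/getallvms",                           # list registered VMs and their IDs
        "vim-cmd solo/registervm " + shlex.quote(vmx_path),  # register the .vmx, prints the new ID
        "vim-cmd vmsvc/power.on " + vm_id,                   # attempt the power-on
        "vim-cmd vmsvc/power.getstate " + vm_id,             # confirm the resulting power state
    ]


if __name__ == "__main__":
    # Hypothetical path; replace with the real location of the .vmx file.
    for cmd in register_and_power_on_commands("/vmfs/volumes/datastore1/myvm/myvm.vmx"):
        print(cmd)
```

Follow the KB article for the authoritative procedure; the sketch is only meant to show the order of the steps.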

> Not sure how to tell what vSphere HA version I am using?

The HA version is the same as the VC version.

john_lauro · ‎03-25-2013

Did more testing. Switched to just testing HA.

Waited over 5 minutes after the host was fully up and some VMs autostarted.

With HA, the VM in question doesn't show in the list. That said, I could add the host to inventory and start it, which is at least better than the situation with FT. VC and the hosts seem to figure out the duplicate OK after some warnings, and a couple of migrations back and forth clear up the alerts.

Tried a different test, and after waiting 5 minutes after one of the hosts was fully up, reproduced the same issue. This time I restored connectivity to Virtual Center. After that, the host that was brought up did everything it should. However, that means if Virtual Center is on the cluster, or as in my case behind a VPN that is routed through a VM on the cluster, things might not start back up automatically if the entire cluster goes down.

It should be rare to lose the entire cluster and then have only one host come back. That said, I'll set up the most critical network infrastructure VMs as heartbeat pairs that share a resource, with one VM of each pair on each of the two hosts. That way, routing back to Virtual Center should be redundant and each host can always start up a critical redundant VM. Plain HA or FT should suffice for most hosts.
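A minimal sketch of the takeover decision in such a heartbeat pair, in Python. The `HeartbeatMonitor` class, the 3-second timeout, and the explicit `now` timestamps are assumptions for illustration; how heartbeats are actually sent and how the shared resource (for example a service IP) is moved are left out.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Tracks the peer's last heartbeat and decides when the standby takes over."""
    timeout_s: float
    last_seen: float = 0.0

    def record_heartbeat(self, now: float) -> None:
        # Call whenever a heartbeat arrives from the peer VM.
        self.last_seen = now

    def peer_alive(self, now: float) -> bool:
        # The peer counts as alive until the timeout has fully elapsed.
        return (now - self.last_seen) <= self.timeout_s

    def should_take_over(self, now: float, i_am_active: bool) -> bool:
        # Only a standby node takes over, and only once the peer has gone silent.
        return (not i_am_active) and (not self.peer_alive(now))


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_s=3.0)
    monitor.record_heartbeat(now=10.0)
    print(monitor.should_take_over(now=12.0, i_am_active=False))  # peer still within timeout
    print(monitor.should_take_over(now=14.0, i_am_active=False))  # peer silent past timeout
```

A timeout alone cannot tell a dead peer from a broken link between the two VMs, so a real setup would still need some guard against both nodes going active at once.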

admin · ‎03-26-2013

Hi John,

I work in GSS here at VMware and have been reading over your thread. There are a few things that could be going on here, including how you are reproducing the host failure.

To get to the bottom of this and ensure your configuration provides the best availability for your cluster, I would suggest that you open a support request with GSS, provide a log bundle capturing all of the steps, and work with a TSE to investigate. Please reference this community post in the ticket.

Thanks
