cabraun
Enthusiast

Restarting datacenter after major outage

Last week (after everyone had left for the day, of course) our primary datacenter experienced a power failure.

Here is a bit of info about our setup:

6 ESXi 4.1 hosts in the production cluster, with Fibre Channel storage on an HP EVA6000

2 ESXi 4.1 hosts in the test cluster, with Fibre Channel storage on an HP EVA3000

1 vCenter Server managing both clusters. vCenter itself is a VM.

The vCenter DB is on a separate physical SQL server, and that database is located on the EVA6000 storage as well.

Enterprise licensing across the board, so HA is in full effect here.

So from the time we got notice that the power had failed in the building (no generator), it took 15-20 minutes for someone to arrive on location and start shutting things down as quickly as possible. Unfortunately, time ran out, the limited UPS power was exhausted, and most systems in the datacenter went down HARD. Up to that point, none of the VMs, ESXi hosts, or the SAN had been shut down. So in the end, the entire VM infrastructure (VMs, hosts, storage, network) died simultaneously.

After several hours we were able to get a generator delivered and start bringing some systems back online. At one point there was a miscommunication that resulted in the hard shutdown of some systems a second time, including the EVA6000, though no ESXi hosts were running yet at that point. Shortly after that SNAFU, datacenter power was restored, and then came the job of getting things running again.

We got the EVA6000 running, powered up our hosts, and then something I was not expecting happened.

After the first host was running, I connected directly to it with the vSphere Client and saw a list of VMs, all showing as Powered Off. Good, I thought to myself. Then I connected to my second host and all the VMs were listed as "Unknown". Ohhhh, that's not good, I thought.

I tried to power on a critical VM on host 1, but I got an error that there was a problem with the vmx file. I tried a couple of others and got the same thing. I immediately suspected corruption on the EVA6000 as a result of two very abrupt power losses. The other thing I noticed was that I could not add VMs to inventory on my host; the option was greyed out. So I opened a critical support call with VMware, thinking I was in for a very long day. But as we were looking into things directly on my hosts, I got word that some VMs were actually running.

How could that be, when I was looking at my VMs directly on the host and it clearly showed them all as powered off?

I then tried to open a Remote Desktop session to my vCenter VM, and there it was! I was able to log on, and although I had to restart the vCenter services first, once I did I found that all the VMs in my production cluster were running. None of the VMs in my test cluster were running, and nothing showed as running when I logged directly into any host. Of course, that was in great part because the EVA3000, where all the test VMs live, was still shut down.

After a while, and a couple of reboots of a few hosts in the prod cluster, everything started reporting fine, and things looked good both from vCenter and when logged directly onto a host with the vSphere Client.

Then we powered on the EVA3000 for the test cluster and brought those two hosts online, and they behaved exactly as I expected: no VMs powered on automatically, and I was able to start things in the order I wanted.

I assume the reason I was not able to power on VMs directly from my production hosts, or add VMs to inventory there, is that in reality they were already running, even though the hosts showed them as powered off.
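In case it helps anyone else in the same spot, here is a rough way to ask a host directly what it thinks each VM's power state is, instead of trusting the client UI. This is only a sketch using the pyVmomi Python bindings; the host name and credentials are placeholders for your own environment.

# Sketch: connect straight to an ESXi host (not vCenter) and print
# the power state the host itself reports for each registered VM.
# "esx-prod-01" and the credentials below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab use only; skips cert validation
si = SmartConnect(host="esx-prod-01", user="root", pwd="changeme", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    for vm in view.view:
        # runtime.powerState is poweredOn, poweredOff, or suspended
        print(f"{vm.name}: {vm.runtime.powerState}")
finally:
    Disconnect(si)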

But my real question is this: my understanding of HA is that it can tolerate up to four simultaneous host failures, since there are five "primary nodes" in an HA cluster. In our case, we lost all eight hosts simultaneously across both clusters (six in prod and two in test), meaning we lost every HA primary node in both clusters. My expectation was therefore that once we got our hosts powered on, we would have to manually power on all the VMs. That was the behaviour I got in my two-node test cluster, but not in my six-node production cluster.

Any thoughts on why?

7 Replies
Techstarts
Expert

But my real question is this: my understanding of HA is that it can tolerate up to four simultaneous host failures, since there are five "primary nodes" in an HA cluster. In our case, we lost all eight hosts simultaneously across both clusters (six in prod and two in test), meaning we lost every HA primary node in both clusters. My expectation was therefore that once we got our hosts powered on, we would have to manually power on all the VMs. That was the behaviour I got in my two-node test cluster, but not in my six-node production cluster.

Any thoughts on why?

Check your HA isolation response.
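If you want to check that quickly across your clusters, here is a rough pyVmomi sketch that prints each cluster's HA defaults. Just a sketch; the vCenter name and credentials are placeholders.

# Sketch: print each cluster's HA (DAS) defaults, including the
# isolation response. Connection details are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter", user="administrator", pwd="changeme", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    for cluster in view.view:
        das = cluster.configuration.dasConfig
        defaults = das.defaultVmSettings  # may be None if never customized
        print(cluster.name, "HA enabled:", das.enabled)
        if defaults:
            print("  restart priority:", defaults.restartPriority)
            print("  isolation response:", defaults.isolationResponse)
finally:
    Disconnect(si)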

With Great Regards,
cabraun
Enthusiast

Thanks for the reply.

The Host Isolation Response is set to "Shut Down" the VMs.

Would isolation response even matter here, though, since all the hosts went down at the same moment, along with all the network equipment? I don't think isolation response would come into play at all, would it?

weinstein5
Immortal

It is not an HA event that would have started the VMs, but the VM autostart feature on your production hosts, which is probably configured to start the production VMs, while the test cluster hosts are not set that way.
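You can see those per-host settings without clicking through every host. A rough pyVmomi sketch (connection details are placeholders):

# Sketch: dump each host's "Virtual Machine Startup/Shutdown" (autostart)
# configuration. Connection details are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter", user="administrator", pwd="changeme", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        cfg = host.configManager.autoStartManager.config
        print(host.name, "autostart enabled:", cfg.defaults.enabled)
        for entry in cfg.powerInfo or []:
            # entry.key is the VM; startAction is e.g. "powerOn" or "None"
            print("  ", entry.key.name, "order:", entry.startOrder,
                  "action:", entry.startAction)
finally:
    Disconnect(si)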

If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful
Walfordr
Expert

But my real question is this: my understanding of HA is that it can tolerate up to four simultaneous host failures, since there are five "primary nodes" in an HA cluster. In our case, we lost all eight hosts simultaneously across both clusters (six in prod and two in test), meaning we lost every HA primary node in both clusters. My expectation was therefore that once we got our hosts powered on, we would have to manually power on all the VMs. That was the behaviour I got in my two-node test cluster, but not in my six-node production cluster.

Any thoughts on why?

Check your VM restart priority settings on the cluster. Best practice is to set vCenter to the highest priority. Also check the VM Startup/Shutdown settings under each host's Configuration tab; VMs can be set to start automatically when the host is powered on.
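To see which VMs override the cluster default, something like this rough pyVmomi sketch works (names and credentials are placeholders):

# Sketch: list per-VM HA restart-priority overrides on each cluster.
# Only VMs that differ from the cluster default appear in dasVmConfig.
# Connection details are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter", user="administrator", pwd="changeme", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    for cluster in view.view:
        defaults = cluster.configuration.dasConfig.defaultVmSettings
        print(cluster.name, "default priority:",
              defaults.restartPriority if defaults else "(unset)")
        for override in cluster.configuration.dasVmConfig or []:
            if override.dasSettings:
                print("  ", override.key.name,
                      "->", override.dasSettings.restartPriority)
finally:
    Disconnect(si)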

Robert -- BSIT, VCP3/VCP4, A+, MCP (Wow I haven't updated my profile since 4.1 days) -- Please consider awarding points for "helpful" and/or "correct" answers.
cabraun
Enthusiast

The cluster setting for "VM Restart Priority" is set to Medium for both production and test.

In prod, I have a small number of VMs configured as High priority (domain controllers, DNS and DHCP servers, two SQL DB servers, and vCenter). All other VMs are set to use the cluster default, which is Medium. In test, everything uses the cluster default of Medium.

So maybe my question should be: why didn't the VMs in the test cluster auto-start, rather than why did the VMs in the prod cluster auto-start?

My individual hosts are all configured to have all VMs start up "Manually" (Host > Configuration tab > Virtual Machine Startup/Shutdown). Nothing is set to auto-start in either test or prod.

weinstein5
Immortal

This is not a feature of the HA cluster but a setting for the VM that is configured through the host's Configuration tab - check out http://www.vmware.com/pdf/vsphere4/r41/vsp_41_dc_admin_guide.pdf, page 243.

If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful
cabraun
Enthusiast

I understand that Host > Configuration tab > VM Startup/Shutdown has no bearing on HA. It also has no bearing on why my VMs started automatically, since everything there is set to Manual startup. So, as has been established, that setting is not what auto-started my VMs.

When I view my cluster HA settings, the cluster default for "Restart Priority" is "Medium" and the cluster default for "Host Isolation Response" is "Shut Down". All my VMs use the cluster default for isolation response. For restart priority, I have a small number of VMs configured as "High" (including vCenter), but the other 98% use the cluster default of "Medium".

But these settings only come into play when HA can actually act, right? Or am I wrong?

Since I lost every host simultaneously, which of course includes all five HA primary nodes, how could HA still start all the VMs when datacenter power was restored and I got my hosts turned on again? Obviously it did, but I don't understand why, unless the hosts retain their HA agent configuration and remember that they were primary nodes before everything was shut off, without being put into maintenance mode or anything.

And if they do retain their HA information when the plug is pulled on everything at the same time, why, once vCenter was determined to be up and running again, did I have to reconfigure all my hosts for HA? Obviously that information was lost, as it should have been.
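(As an aside, the per-host "Reconfigure for HA" step I ended up doing can apparently be scripted as well. A rough pyVmomi sketch, assuming the ReconfigureHostForDAS_Task API call is what the client's "Reconfigure for HA" maps to; connection details are placeholders.)

# Sketch: kick off "Reconfigure for HA" on every host in every cluster.
# Assumes the ReconfigureHostForDAS_Task call; connection details are
# placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter", user="administrator", pwd="changeme", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    for cluster in view.view:
        for host in cluster.host:
            print("Reconfiguring HA agent on", host.name)
            host.ReconfigureHostForDAS_Task()  # returns a Task object
finally:
    Disconnect(si)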

In the end, it is not a big deal that things happened the way they did; I am just trying to understand why. My thought is that without any HA primary nodes, none of my VMs should have auto-started when power was restored. That was the behaviour in my test cluster, but not in my prod cluster, and the HA settings are identical between the two.
