dnetz
Hot Shot

HA restarts powered off machines?

Hi,

We have a test cluster of three hosts (one host with ESX 4.1 U1, two hosts with ESXi 5.0 U1) and a vCenter 5.0 U1. HA and DRS are enabled.

This past Sunday, in preparation for scheduled network tests, one administrator shut down all but two machines on this cluster, including the virtual vCenter server, the MSSQL server hosting the vCenter database and two domain controllers. On Monday, the network tests were carried out, which led to hosts becoming isolated from each other, which in turn triggered HA restarts of the two powered-on machines. So far, nothing out of the ordinary. But apparently HA also restarted the two domain controllers, the MSSQL server and the vCenter server, even though they were in a powered-off state. All other VMs in the cluster were left in their already powered-off state.

Now to my question: under what circumstances does HA decide to restart a powered-off VM? As far as I know, the only obvious difference between the HA-restarted VMs and the untouched VMs is that the restarted ones were not shut down via vSphere Client but rather through a Remote Desktop session. Is there perhaps a difference in how HA registers a VM's power state depending on how you shut it off?

Thanks in advance,

Daniel

6 Replies
depping
Leadership

I tested it in my lab, but I cannot reproduce your problem, at least not with vSphere 5.1. When you shut down a VM, an entry is updated in the VMX file which records whether the VM was shut down cleanly or not. By default this is set to "FALSE", meaning that if the VM crashes or we can't write to disk, this VM will be restarted.

I tried using the vCenter "shutdown" button:

Before "vCenter Guest initiated shutdown" = cleanShutdown = "FALSE"
After "vCenter Guest initiated shutdown" = cleanShutdown = "TRUE"

I tried it using the VM remote console:

Before "MKS initiated shutdown" = cleanShutdown = "FALSE"
After "MKS initiated shutdown" = cleanShutdown = "TRUE"

I tried it using RDP:
Before "RDP initiated shutdown" = cleanShutdown = "FALSE"
After "RDP initiated shutdown" = cleanShutdown = "TRUE"
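
If you want to check the same flag on your own VMs, here is a minimal Python sketch. It assumes the VMX file consists of simple key = "value" lines and that a missing cleanShutdown entry should be treated like "FALSE"; the function names and the sample text are made up for illustration and are not part of any VMware tool.

```python
import re

# Matches VMX-style lines such as: cleanShutdown = "TRUE"
_VMX_LINE = re.compile(r'^\s*([^#\s=]+)\s*=\s*"(.*)"\s*$')


def parse_vmx(text):
    """Return a dict of key -> value for every key = "value" line in the text."""
    settings = {}
    for line in text.splitlines():
        match = _VMX_LINE.match(line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


def was_cleanly_shut_down(vmx_text):
    """True only if cleanShutdown is present and set to TRUE (case-insensitive).

    Assumption: a missing entry is treated the same as "FALSE".
    """
    settings = {key.lower(): value for key, value in parse_vmx(vmx_text).items()}
    return settings.get("cleanshutdown", "FALSE").upper() == "TRUE"


# Hypothetical VMX content, for illustration only.
sample_vmx = 'displayName = "example-vm"\ncleanShutdown = "TRUE"\n'
print(was_cleanly_shut_down(sample_vmx))
```

To run it against a real VM, read the contents of the VM's .vmx file from the datastore into a string and pass that string to was_cleanly_shut_down.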

Not sure what you encountered to be honest.

dnetz
Hot Shot

Hi Duncan,

Thanks for taking the time to test this out. I haven't seen any indication that cleanShutdown would be set to FALSE on any powered-off VMs. Is this flag used by HA to determine the status of a VM?

In my attempts to understand the chain of events, I've gone through the host logs under /var/log/vmware/fdm, and there are lines stating "Writing power-on-list @ /vmfs/volumes/<volume-id>/.vSphere-HA/FDM-<id>-TU-VCENTER01/host-28-power-on", which suggest that HA keeps its own record of VM power states in a file on the datastore. As far as I can see, the event logs on the hosts and on vCenter all state that the VMs were powered off before the network failures and that, once datastores and vmnics were available again, HA restarted those four VMs.

The attached picture is the event view of one of the hosts. It starts off with the powering off of the VMs, followed by the various network outages and the restarting of the powered-off VMs, so it probably paints a better picture of what happened. I still don't understand how HA decided to restart powered-off machines. We do have support from VMware, so perhaps I should create a case, send my system log bundles and let them find out what really happened.

depping
Leadership

That would be easiest indeed, and I suggest dropping the SR number here so I can have a look at it when I find the time.

HA does indeed keep track of powered-on VMs in the power-on list, but this should have reflected reality at that point. Combined with the VMX entry, there should not have been a restart.

dnetz
Hot Shot

SR number is 12249799911. I'll try to update this thread when I've received an update on the case.

dnetz
Hot Shot

Update: the support case is now closed. VMware tried to replicate the exact problem but was unable to. When they read the logs carefully, it appeared that the four VMs that were shut down from within the guest OS were never actually registered by HA as cleanly shut down, and therefore HA thought that they should be powered on during the host isolation events. The vmware.log file never contained "VMX has left the building". It's unknown why this happened: whether the VM's operating system hung on the way down or the VM process never cleanly exited.
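
For anyone who wants to check their own VMs for the same symptom, here is a minimal Python sketch that looks for that marker in the text of a vmware.log file. The function name and the sample log lines are made up for illustration (real log lines carry more detail), and finding the marker only tells you the VMX process logged its exit, not what HA recorded.

```python
# Marker string that support looked for in vmware.log.
EXIT_MARKER = "VMX has left the building"


def vmx_exit_logged(log_text):
    """Return True if any line of the log text contains the exit marker."""
    return any(EXIT_MARKER in line for line in log_text.splitlines())


# Hypothetical log excerpts, for illustration only.
clean_log = "vmx| Powering off\nvmx| VMX has left the building: 0.\n"
hung_log = "vmx| Powering off\n"

print(vmx_exit_logged(clean_log))
print(vmx_exit_logged(hung_log))
```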

I guess my lesson from all of this is that a) you should shut down VMs from the vSphere Client when possible, and b) you should always disable HA, or at least host monitoring, during any kind of network maintenance, even when VMs are being shut down as a precaution.

depping
Leadership

Thanks for informing us!
