VMware Cloud Community
Cougar281
Enthusiast

Cluster and VCSA restart after power loss

Let me start by saying I know how to have guests automatically start on a single host, and everything I've read so far is related to that: go to the host, go to the startup/shutdown settings, and adjust them accordingly.

The problem is you can't really do that with an HA cluster. In some cases you won't know which host the machines are on, so setting that on an individual host is impossible.

So where an HA cluster is concerned, how would one go about setting it up to automatically recover from a power loss? I recently had a power failure at work and when power was restored, the hosts came up just fine, as did all other physical devices, but the guests did not, because it's an HA cluster controlled by a VCSA (this is the main drawback to vCenter as a VM, IMO). In addition to that, I have found that APC has a network shutdown appliance that can interface with the UPS and vCenter and gracefully shut down or suspend VMs in the event of a power failure, which would further complicate things.

Maybe it's a feature I've missed or overlooked, but something that I think would be a GREAT feature, especially with the VCSA appliance being (IMO) quite good these days, is the ability to set auto start options for VMs, pretty much the same as you can do on the individual hosts, but in vCenter, not (or in lieu of) on the hosts, and some way for the VCSA appliance to have auto start settings in a HA environment, where it could be on any host at any given time. Yes, you could set rules so the VCSA appliance is only allowed to run on a specific host, but that's not without its own issues. This way, it doesn't matter where the VCSA appliance has migrated to, after a power failure it powers back on, and when it's all back up and operational, it proceeds to power on the machines that should be powered on. In the case of an automated process such as APC's network shutdown shutting down or suspending machines, a simple 'if it was powered on power it back on' wouldn't do.

So does anyone have any thoughts? Is it something I've missed, or something VMware could look into developing?

8 Replies
daphnissov
Immortal

So firstly, the fact that vCenter is on a VM has absolutely no bearing on how HA is configured or operates. In fact, HA is in no way reliant upon vCenter being available to conduct failovers. vCenter can be powered off entirely and an ESXi host can still fail over VMs to surviving hosts. If you have HA that's configured on such a cluster and a host abruptly goes down but its VMs aren't restarted on other hosts, then there's a potential configuration issue with those VMs. One thing I routinely see is the VM has some host-local hardware attachment that makes powering up elsewhere impossible.

StephenMoll
Expert

We have managed to do something very similar to what you describe. We did have the advantage that the VMs we needed to start automatically on the cluster were in failover pairs only, and we kept them all to two hosts.

It's not trivial though, as it has to be done by a shell script on each of the two hosts (/etc/rc.local.d/local.sh).

In brief:

  1. The script examines the /vmfs/volumes directory to make sure the required datastores are there. If the host has booted too quickly, and the SAN is not ready, the host reboots itself to give the SAN more time.
  2. The script then examines its own host to see what VMs are registered there.
  3. It then examines the VMX files of those VMs looking for a custom advanced value we place there.
  4. If it finds the custom value, the VM is powered on.
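
The four steps above could be sketched roughly as follows. This is not the poster's actual script, only a minimal illustration under stated assumptions: the datastore name `datastore1` and the VMX key `autostart.enabled` are hypothetical placeholders, the `VIM_CMD` and `REBOOT_CMD` variables exist only so the commands can be swapped out for testing, and the parsing of `vim-cmd vmsvc/getallvms` output assumes datastore names without spaces.

```shell
#!/bin/sh
# Sketch of an auto-start routine meant to be called from
# /etc/rc.local.d/local.sh on each host. Names marked "hypothetical"
# are placeholders, not values from the original post.

REQUIRED_DATASTORES="${REQUIRED_DATASTORES:-datastore1}"  # hypothetical names
VOLUMES_DIR="${VOLUMES_DIR:-/vmfs/volumes}"
AUTOSTART_KEY="${AUTOSTART_KEY:-autostart.enabled}"        # hypothetical VMX key
VIM_CMD="${VIM_CMD:-vim-cmd}"
REBOOT_CMD="${REBOOT_CMD:-reboot}"

# Step 1: succeed only if every required datastore directory exists.
datastores_ready() {
    for ds in $REQUIRED_DATASTORES; do
        [ -d "$VOLUMES_DIR/$ds" ] || return 1
    done
    return 0
}

# Step 3: succeed if the given VMX file sets the custom key to "TRUE".
vmx_wants_autostart() {
    grep -qi "^$AUTOSTART_KEY *= *\"TRUE\"" "$1" 2>/dev/null
}

# Steps 2 and 4: list the VMs registered on this host, and power on
# those whose VMX file carries the custom key.
autostart_registered_vms() {
    # Reduce each row to "<vmid> <datastore> <relative path to .vmx>".
    $VIM_CMD vmsvc/getallvms |
    sed -n 's/^\([0-9][0-9]*\) .*\[\([^]]*\)\] \(.*\.vmx\).*/\1 \2 \3/p' |
    while read -r vmid ds path; do
        if vmx_wants_autostart "$VOLUMES_DIR/$ds/$path"; then
            # </dev/null keeps the command from consuming the loop's input.
            $VIM_CMD vmsvc/power.on "$vmid" </dev/null
        fi
    done
}

# Entry point: reboot if storage is not ready yet, otherwise start VMs.
autostart_main() {
    if ! datastores_ready; then
        $REBOOT_CMD   # give the SAN more time, as in step 1
        return 1
    fi
    autostart_registered_vms
}
```

The block only defines functions; a call to `autostart_main` would go at the end of local.sh. Note that if a datastore never appears, this sketch reboots indefinitely, so a real script would probably want a retry limit.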
0 Kudos
Cougar281
Enthusiast

Yes, in the event of a single host failure, you are correct - HA will bring the machines that were up on the failed host back up on a surviving host. But that's not what I'm asking about. In the event of a total failure of the entire cluster due to a power failure, nothing but the hosts comes back. No VMs power back on automatically. To the best of my knowledge, you can't set the auto start options on the individual host when the host is part of a vCenter HA cluster and have it (reliably) work due to the machine moving around. As far as I know, if set to auto start on a host, that setting doesn't follow the machine if/when it migrates over to another host, and the setting might even be removed from the host when the VM is migrated elsewhere.

Cougar281
Enthusiast

That sounds quite interesting, and like something that VMware should think about implementing natively so that it's easier to set up.

daphnissov
Immortal

There may be something going on in your case because this is exactly what happens in my lab. I have a 3-node vSAN cluster (also using NFS storage) and have had several extended power outages that depleted my UPS. All hosts shut down at the same time. When power was restored and they powered up, HA began to power up VMs even before vCenter was available.

Cougar281
Enthusiast

Interesting - what license level are you running? At my office where this has been an issue on the rare occasion power fails, we're running Essentials Plus - perhaps that has something to do with it? You'd expect it to recover to previous states after a total outage, but that's not what I've seen with our environment. With our last outage, everything except the VMs was up and running happily. Although I can't rule out the possibility of there being a setting somewhere that I missed - but all the same, more control like you have when you're talking about a single, standalone host, would be nice - it's not uncommon to want machines to start up in a specific order, rather than a massive boot storm that has machines essentially racing to see who can start first - In my case, I'd want vCenter to be up and operational first, and from there, domain controllers, then file servers, followed by SQL servers, and then anyone else.

StephenMoll
Expert

That is interesting. That detail isn't mentioned in the "vSphere 6.7 Clustering Deep Dive". The whole of chapter 4 has been written on the basis that at least some hosts have survived the failure event.

So you're suggesting that the hosts' HA agents are 'remembering' the restart retry count for all VMs in their recovery lists across a power cycle. Assuming that is true, it has some serious implications, particularly for us, because it would imply that during a period of shutdown, none of the retry attempts would be counting down. In normal circumstances with a 5 retry maximum, attempts to restart a VM cease after 30 minutes. However, if a system is knocked out severely and it is hours (or days) before a clean restart is attempted, then this clean, controlled system start might be hampered by HA suddenly attempting to restart VMs when power is restored to the hosts.

I'm sorry, but I find this hard to believe. Surely there must be some other factor at work on your system?

StephenMoll
Expert

We have a strange use-case. I think it is worth bearing in mind that vSphere is really designed around high availability of data centres that need to be running 24/7. Our systems however need to be powered off and on possibly on a daily basis. To do this whilst retaining a certain level of turn-key operation was quite a challenge where we needed VMs to restart automatically, but also needed the hosts to be clustered.

Putting some script into local.sh was the method we discovered and developed first. If there was a better way, believe me it would be preferable. Extending the capabilities of the HA agents on the hosts to restart VMs on host power-up if the VM is not found running elsewhere in the cluster would be a very useful technique for systems that have a regular power cycle requirement.
