VMware Cloud Community
ncolt
Contributor

HA settings for both host outage and site outage

I have always set my Host Isolation Response to "Leave powered on". This has served me well in the event of a PSOD of a host for instance.

But if something like a power cut were to occur, this would power on all the VMs at the same time, adversely affecting the storage, and HA might try to put all the VMs onto the first host that came up, which would then cause the VMs to run very slowly indeed.

Is there a way for me to allow for a complete power outage but also a single host outage?

Accepted Solutions
daphnissov
Immortal

A couple of points here:

"This has served me well in the event of a PSOD of a host for instance."

I'm not sure how a host isolation response of "leave powered on" would help or hurt you in the case of a PSOD. When a host crashes, this option does not apply to that host; it applies only to a host that is isolated from the rest of the HA cluster.

"But if something like a power cut were to occur, this would power on all the VMs at the same time, adversely affecting the storage, and HA might try to put all the VMs onto the first host that came up, which would then cause the VMs to run very slowly indeed."

This is not necessarily the case. HA will not attempt to power on VMs if there is not sufficient capacity for them. So it won't be a case of you having an 8-node cluster, all 8 hosts fail, one comes up, and all the VMs get piled on top of that one host. HA will simply fail to power on some of the VMs until capacity returns.
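
To make that concrete, here is a toy Python sketch of the idea. It is not HA's actual placement algorithm, and how HA measures "capacity" is not modelled; the host name, VM names, and memory figures are all made up for illustration. It only shows the behaviour described above: VMs that do not fit stay pending instead of being piled onto the one host that is up.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    capacity_mb: int  # hypothetical usable memory on the host
    placed: list = field(default_factory=list)
    used_mb: int = 0


def restart_vms(hosts, vms):
    """Place each (vm_name, demand_mb) on the first host with room.

    Returns the names of the VMs that could not be placed and stay pending.
    """
    pending = []
    for vm_name, demand_mb in vms:
        for host in hosts:
            if host.used_mb + demand_mb <= host.capacity_mb:
                host.placed.append(vm_name)
                host.used_mb += demand_mb
                break
        else:
            # No host had enough free capacity, so this VM is not powered on.
            pending.append(vm_name)
    return pending


# Only one host of the cluster is back; the others are still down.
hosts = [Host("esx01", capacity_mb=16_000)]
vms = [(f"vm{i:02d}", 4_000) for i in range(1, 11)]  # ten 4 GB VMs

pending = restart_vms(hosts, vms)
print(hosts[0].placed)  # the VMs that fit on esx01
print(pending)          # the VMs left waiting for capacity to return
```

With these made-up numbers, four VMs fit on the single host and the other six wait. Calling `restart_vms` again with more hosts in the list would place the pending ones.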

"Is there a way for me to allow for a complete power outage but also a single host outage?"

You'll have to define what you want to have happen in the case of a complete power outage. In a total site failure, what hardware is available to run anything? Will storage also be available? Or are you talking about failing over to a disparate site entirely?

depping
Leadership

So what you would like is for HA to space out the restart of VMs after a power outage. In 6.5 you can set the restart priority of VMs. If you group VMs and then, for instance, specify that the next batch should start only once the first has "powered on" or a "guest heartbeat" has been detected, you stage the restart of all VMs.
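
As a rough illustration of that staging, here is a small Python sketch. It is a simulation only, not a vSphere API call: the function name `staged_restart`, the `is_ready` callback, and the VM names and priority numbers are all hypothetical. `is_ready` stands in for whichever condition you pick ("powered on" or "guest heartbeat" detected). As a simplification, this toy version holds later batches back indefinitely when the condition is not met.

```python
def staged_restart(vms, is_ready):
    """Restart VMs in priority batches, earliest priority first.

    vms:      list of (vm_name, priority); a lower number restarts earlier.
    is_ready: callable(vm_name) -> bool, the condition that gates the next batch.
    Returns the batches that were started, in order.
    """
    batches = {}
    for name, priority in vms:
        batches.setdefault(priority, []).append(name)

    started = []
    for priority in sorted(batches):
        batch = batches[priority]
        started.append(batch)
        if not all(is_ready(name) for name in batch):
            # Gate not met: hold back every later batch.
            break
    return started


vms = [("vcenter", 1), ("dc01", 1), ("sql01", 2), ("web01", 3), ("web02", 3)]

# Every VM meets the condition, so all three batches start in order.
all_ready = staged_restart(vms, is_ready=lambda name: True)
print(all_ready)

# sql01 never meets the condition, so the web tier is held back.
db_stuck = staged_restart(vms, is_ready=lambda name: name != "sql01")
print(db_stuck)
```

The point of grouping is that the storage sees one batch of boots at a time rather than every VM at once.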

Still, you could end up in a scenario where all VMs are restarted on the same host if only 1 host comes up and the remaining hosts take a long time.

Other than that, there aren't many other options I can think of.

ncolt
Contributor

Thanks for your responses. To be clearer: we had a power outage in a datacentre. Every piece of hardware went down and then came back a short time later, all at the same time. Most of our VMs run in one blade chassis of 16 hosts. The vCenter and about a quarter of the other VMs came back on 1 host, even though the other hosts would not have started much later. The VMs on host 1 showed as up but could do no more than respond to a ping. On the other hosts, the remaining VMs showed as "inaccessible" and were greyed out. In the end, the fix was to remove the vCenter from inventory and add it to another host. I need to improve the vSphere response in case this happens again.

You are right, daphnissov: the host isolation response is not relevant here, and HA should not attempt to power on VMs on a host that does not have the capacity for them.

This feature seems to have malfunctioned, in that it overprovisioned the memory on 1 host by a large margin. I'll log a call with VMware to see whether the logs show why.

Going forward, I think vCenter HA in 6.5 would have helped in this case, and setting the VM restart priority for each VM in Host Monitoring is another thing to configure with this incident in mind. Is this related to the restart priority of VMs you mentioned, Duncan?