VMware Cloud Community
Useless1
Contributor

Host Isolation = Powered Off - Strange HA process?

Hi,

We have a three-node ESXi 3.5 cluster in which the host isolation response is set to "Shut Down"...

Each host has 4 NICs...

Two are dedicated to the Service Console and vMotion (one active and one standby for each, using opposite NICs)...

Two are dedicated to the production network...

All of these NICs are plugged into a large stackable switch...

On the weekend this switch device locked up...

All hosts went into isolation mode and, as per the response setting, shut down all VMs...

However, when the switch issue was resolved, none of the VMs powered up... until one of the network guys actually restarted a host...

Why does VMware do this? I am assuming that as part of the isolation, each host disabled its HA agent... and therefore when the network was restored no host had any HA agent to do any work?

Also what should the host isolation response be? I have left it at "Shut Down" but specifically set the vCenter VM to "Leave Powered On"...

7 Replies
Troy_Clavell
Immortal

Per your configuration, the environment acted as it should. If the isolation response is to shut down and there are no hosts available, where can the guests be powered on? We prefer to set our isolation response to "Leave Powered On", for a number of reasons, the most important to us being false positives. As long as your virtual machine port groups are not affected, an isolation of the management network won't affect the ability of your guests to run.

Now in a true HA situation (failed host), your guests will be powered down and restarted on the remaining hosts in the cluster.

Useless1
Contributor

Hi,

Yes, I agree that since each host was set to "Shut Down" and each host was isolated, each one shut down all of its VMs...

However... what about after the network event...

When the network came back up...

Why didn't the VMware hosts simply keep checking for the isolation addresses, realize they were available again and begin to power all the VMs back on?

I am not sure of the factors that determine how to make the decision for the isolation response...

I am thinking I should set it to "Leave Powered On"... simply because I have a single switching device, and if this fails then all hosts are going to be isolated, therefore do not do anything?

Troy_Clavell
Immortal

HA will not restart the guests if an isolation event caused them to be powered down. The only way HA will restart the guests is if there was an actual HA event. So, your environment acted as it should. That is another reason, in my opinion, to set the isolation response to "Leave Powered On".

Useless1
Contributor

Hi,

I am not sure if we agree on this point?

As a test, I vMotioned all of my VMs off one host and left one test VM on it...

I then literally pulled out its Service Console NICs...

The host detected it was isolated, and shut down the VM...

Then the VM was restarted on another host...

This all happened in well under a minute...

I then plugged the Service Console NICs back in, and the host reconfigured itself for HA and was fine...

Troy_Clavell
Immortal

What you saw was an HA event, not just an isolation response. The isolation response triggers before the 15-second mark; after that, an HA event kicks in and restarts the guests.
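
To make the difference between the two cases concrete, here is a rough sketch of the behaviour as described in this thread. It is a toy model only, not VMware's implementation: the `Host` class, the `simulate_isolation` function and the outcome strings are all made up for illustration, and timing is ignored entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical stand-in for an ESX host in an HA cluster."""
    name: str
    isolated: bool = False
    vms: list = field(default_factory=list)


def simulate_isolation(hosts, response="shutdown"):
    """Return {vm_name: outcome} once isolated hosts apply their response.

    Assumption taken from the replies above: a guest shut down by the
    isolation response is only restarted if a non-isolated host exists.
    """
    healthy = [h for h in hosts if not h.isolated]
    outcome = {}
    for host in hosts:
        for vm in host.vms:
            if not host.isolated or response == "leave_powered_on":
                # Guest is never touched, so it keeps running where it is.
                outcome[vm] = f"running on {host.name}"
            elif healthy:
                # One host isolated, others fine: the guest fails over.
                outcome[vm] = f"restarted on {healthy[0].name}"
            else:
                # Every host isolated: nowhere to restart the guest.
                outcome[vm] = "powered off, no host available"
    return outcome


# The pulled-NIC test: one isolated host, two healthy ones.
test_case = [Host("esx1", True, ["test-vm"]), Host("esx2"), Host("esx3")]
print(simulate_isolation(test_case))

# The weekend switch lock-up: every host isolated at once.
weekend = [Host("esx1", True, ["vm-a"]), Host("esx2", True, ["vm-b"])]
print(simulate_isolation(weekend))
print(simulate_isolation(weekend, response="leave_powered_on"))
```

In this model the test VM fails over in the first case, while in the second case the guests stay powered off unless the response is "Leave Powered On".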

Useless1
Contributor

Hi,

Thanks for the clarification! I understand this a little better now...

However it still leads me back to my question...

If host isolation leads to an HA event, and there are no hosts currently available (as in my situation)...

Is that it? Or does it keep trying?

I am trying to understand what actually fixed the issue in reality. Was it the fact that one of the network admin guys restarted one of the physical hosts, and this host's HA agent was re-enabled at reboot, which started some VMs and allowed the others to come back online...

Or was it that vCenter (which seems to have been restarted as part of the box reboot) is the part responsible for bringing it all back?

During host isolation, does the HA agent on the isolated server get disabled? (And therefore would require connectivity back to vCenter again later to re-configure it again?)

Troy_Clavell
Immortal

> However it still leads me back to my question...
>
> If host isolation leads to a HA event, and there are no hosts currently available (As in my situation)...
>
> Is that it? Or does it keep trying?

If there are no hosts available, then HA is smart enough to know that and therefore does nothing. That said, your isolation response will still come into effect.

> I am trying to understand what in reality actually fixed the issue, was it the fact that one of the network admin guys restarted one of the physical hosts, and this host's HA agent was re-enabled at reboot, which started some VMs and allowed the others to come back online...

Are you sure HA actually restarted guests on the host that came back online? If there were no hosts available in the cluster, the guests would have remained on the hosts in a powered-off state. Once the host was restarted, some sort of manual intervention would have been required to start the guests. Can you check the logs to see whether the admin who restarted the host also restarted the guests? vCenter is not needed for HA, except for the initial configuration.
