VMware Cloud Community
vmproteau
Enthusiast

Host Isolation Response

I currently have Host Isolation Response set to "Power off VM" on all Hosts. In my mind, it just doesn't seem wise to leave VMs powered on if there is a Host failure. Am I being paranoid?

What are arguments for/against setting this to "Shut Down VM" instead?

What are arguments for/against setting this to "Leave VM powered On" instead?

I've been toying with changing these because I've had HA failovers triggered by mere Service Console network problems, but I may be missing glaring reasons not to.

weinstein5
Immortal

The thing with isolation response is that there is not a host failure but a network failure - a host has lost network communication - this is determined by the host not being able to reach the service console gateway - the remaining nodes of the cluster assume a node failure and begin trying to start the VMs from the isolated host - and are not able to because the VMDK is locked open by the VM running on the isolated host - the isolation response determines what these isolated VMs will do:

  • Power Off - allowing the HA cluster to restart the VMs on the remaining nodes, causing a brief outage

  • Remain Powered On - users will continue accessing the VMs with no outage - the cluster will still try to start them but will receive an error

So the question is whether, in an isolation event, you want any outage - from what you describe I would leave them powered on.

If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful
jmcdonald1
VMware Employee

It is honestly up to you. The isolation response is only acted upon if a network failure occurs. By default, it is initiated after 15 seconds of the host not being able to ping the other nodes or the isolation address.

The reason we have this option is to provide the flexibility to account for an ESX Server that may not have crashed but has for some reason lost network connectivity. A common problem that we see is that a network admin unplugs a cable (or performs other such maintenance), so the ESX Server loses network connectivity and thus none of the VM's are able to communicate. Effectively, from a user perspective this is "down" even though the VM's may not actually be down. The problem with this is that the server did not CRASH, and therefore the VM's would still have a lock on their disks and would fail to start on the remaining servers in the cluster.

The isolation response addresses this problem and says what you want to happen if you lose network connectivity.

- By powering off the VM's the remaining nodes of the cluster will be able to power them up on hosts that have not failed.

- By leaving the VM's on, you disable the failover from taking place if there is a network connectivity problem. If the host actually fails, ie power is lost to the system, all VM's will be powered off already and therefore they will be able to be started up on the remaining nodes of the cluster.
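As a rough illustration of the two cases above (a toy model, not VMware code), the outcome can be reduced to one question: is the VM still holding its disk lock? The `Scenario` class, the function names, and the response strings `"power_off"` and `"leave_powered_on"` are all made up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    host_crashed: bool        # True if the host itself went down (e.g. lost power)
    isolation_response: str   # hypothetical labels: "power_off" or "leave_powered_on"


def vm_disk_locked(s: Scenario) -> bool:
    """The VM keeps its disk lock only while it is still running on the original host."""
    if s.host_crashed:
        # The VMs went down with the host, so their locks are released.
        return False
    # Host is alive but isolated: the lock survives only if the VMs are left running.
    return s.isolation_response == "leave_powered_on"


def ha_restart_succeeds(s: Scenario) -> bool:
    """Other nodes can power the VM on only once the disk lock is gone."""
    return not vm_disk_locked(s)


for crashed in (False, True):
    for response in ("power_off", "leave_powered_on"):
        s = Scenario(host_crashed=crashed, isolation_response=response)
        print(f"crashed={crashed!s:5} response={response:16} restart_ok={ha_restart_succeeds(s)}")
```

The only combination where the restart fails is an isolated but still running host with the VMs left powered on, which is exactly the case where the VMs are still up anyway.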

From my experience, the following are the options that you have:

  • Redundant physical switches for the Service Console network connection. This can be accomplished by having a bond of two different NICs going to two different physical switches, so that if one fails, the other will still be able to talk to the isolation address (by default the Default Gateway for the system).

  • Add a second isolation address. More than one isolation address can be specified to provide additional redundancy. To do this select a cluster > VMware HA > Advanced Options, and add the das.isolationaddress2 = <value2> option/value pair to the cluster's settings, where <value2> represents the secondary IP address to use. There can be a maximum of 10 different isolation addresses per cluster.

    Note: This is a feature of VirtualCenter 2.0.2 and above, and the default timeout value should also be increased to 20 seconds (20000 ms) or greater when a secondary isolation address has been specified.

  • Change the default timeout from 15 seconds so that there is a greater amount of time before a failure is declared. In general 60 seconds (60000 ms) is a commonly used alternative. To do this select a cluster > VMware HA > Advanced Options and add the das.failuredetectiontime = <value> option/value pair to the cluster's settings, where <value> represents the desired timeout value in milliseconds. VMware HA will not declare a host failure nor initiate an isolation response until the specified timeout value has been exceeded without heartbeats being received.

    Note: This is a feature of VirtualCenter 2.0.2 and above.

  • Set the isolation response to leave powered on, and the VMs will not be powered off when a host becomes isolated.
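To make the option/value pairs above concrete, here is a small sketch that only builds and prints the two Advanced Options entries; it does not talk to VirtualCenter. The helper name `build_ha_advanced_options` is hypothetical, and `192.0.2.10` is a placeholder documentation address, not a recommendation. The sketch also treats the "20000 ms or greater" note as a hard check, which is a choice made for the example.

```python
from typing import Dict, Optional


def build_ha_advanced_options(
    failure_detection_ms: int,
    secondary_isolation_address: Optional[str] = None,
) -> Dict[str, str]:
    """Build the option/value pairs to enter under cluster > VMware HA > Advanced Options."""
    if failure_detection_ms <= 0:
        raise ValueError("timeout must be a positive number of milliseconds")

    options = {"das.failuredetectiontime": str(failure_detection_ms)}

    if secondary_isolation_address is not None:
        # Per the note above, the timeout should be 20000 ms or greater
        # once a secondary isolation address is specified.
        if failure_detection_ms < 20000:
            raise ValueError("use at least 20000 ms with a secondary isolation address")
        options["das.isolationaddress2"] = secondary_isolation_address

    return options


# Example: 60 second timeout plus a placeholder secondary isolation address.
for name, value in build_ha_advanced_options(60000, "192.0.2.10").items():
    print(f"{name} = {value}")
```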

Cheers,

/Jon

vmproteau
Enthusiast

OK, this has always confused me, but I think I'm getting there. Check my logic:

It seems the "Power Off" isolation response is most useful in environments where the Service Console and VM networks might share the same vSwitch? Or maybe an environment that doesn't have redundant physical switches, where a down switch means all network interfaces are down. Basically, an environment where a Service Console network failure typically indicates a VM network failure as well.

In our environment, we have 2 pNICs per vSwitch and a separate vSwitch each for the SC, VMK, and VM networks. A loss of Service Console connectivity will rarely mean loss of VM network connectivity in our environment, so setting the isolation response to "Leave Powered On" makes the most sense. Is this right?

jmcdonald1
VMware Employee

Your logic is sound. If the service console and VM networks are on different physical NICs, then it is definitely possible that you can have one go down and not the other. If you foresee problems with the connectivity of the service console NIC, then I agree with leaving the VM's powered on, so that you avoid downtime on the VM's because of a false alarm. In this configuration you will still have proper failover in the case that an entire host goes down, because the disk lock will be released and the VM's will be powered on on the other nodes of the cluster.

Also as an FYI, VMware recognized this as a problem because of all the different types of configurations that you may have in a cluster. Thus Virtual Machine HA has emerged, and as of VirtualCenter 2.5 Update 2, full support for monitoring individual virtual machine failures based on VMware Tools heartbeats has been added to HA. We want to provide as much flexibility as possible to the administrator depending on the configuration of their environment.

Cheers,

/Jon

vmproteau
Enthusiast

Great. I just couldn't get the concept through my head. Last couple questions:

  1. When the other Hosts try and fail (because the file is locked) to bring up the VM on themselves, will they continue to try over and over again until the isolated Host is visible or will it only try once?

  2. This behavior appears to be by design, but are there any negatives to the other Hosts constantly trying to access the VMDK, or is it just a constant benign check?

  3. Along those same lines, I assume the other Hosts are in fact constantly trying to bring up the VMDK since you said a transition from network isolation to complete failure will trigger an actual HA fail over.

jmcdonald1
VMware Employee

1 - It will continue to try to power them on until the host joins back to the cluster. Thus if you lose network connectivity to the server while the VM's are still powered on, and you subsequently have a host failure, the VM's will be powered up on the other servers the next time HA tries to power them on.

2 - From my experience there is no downside to it. The file locks are there for this reason so that we do not have multiple items trying to write to a file at the same time and corrupt the file.

3 - Yes, from my understanding of HA, this is the expected behavior in this configuration.
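A toy sketch of the retry behaviour described in the answers above, under the assumption that each power-on attempt simply checks whether the disk lock is still held. The function name and the list-of-booleans input are invented for illustration; the real retry interval and logic are not described in this thread.

```python
from typing import Iterable, Optional


def simulate_restart_attempts(lock_held_per_attempt: Iterable[bool]) -> Optional[int]:
    """Return the 1-based attempt on which the power-on succeeds, or None if it never does.

    Each element says whether the VM's disk lock is still held when that attempt is made.
    """
    for attempt, lock_held in enumerate(lock_held_per_attempt, start=1):
        if lock_held:
            # VM is still running on the isolated host: the attempt fails harmlessly.
            continue
        # Lock released (e.g. the isolated host has now actually failed).
        return attempt
    return None


# Host is isolated for three attempts, then fails outright before the fourth.
print(simulate_restart_attempts([True, True, True, False]))

# Host stays isolated (and its VMs stay up) for every attempt that was made.
print(simulate_restart_attempts([True, True, True]))
```

The failed attempts change nothing on disk in this model, which matches the point that the file lock is what keeps two hosts from writing to the same file.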

Cheers,

/Jon

vmproteau
Enthusiast

I think I finally understand it!!!! Thanks for the great explanation.

I think what originally confused me was that the default setting was "Power Off VM". Our environment is conducive to default standards. Interestingly, I think new Host builds now have "Leave VM powered on" as the standard.

jmcdonald1
VMware Employee

We actually changed it in U2 and beyond (if I remember correctly...;)...) to 'leave powered on' as the default setting. We also added a setting to shut down the VM gracefully in this release.

Keep looking out for new additions, we are always trying to improve our products!

Cheers,

/Jon
