We have a vSphere cluster with two cluster nodes.
The problem: We added a new physical switch to our network, which triggered a recalculation of the spanning tree. As a consequence, HA on each host concluded that the host was isolated, so all VMs were shut down.
The question: Are there any parameters that delay the shutdown of the VMs, so that there is enough time for HA to receive the heartbeat again, which in turn would keep HA from shutting down the VMs?
If not, we thought about a second Service Console (new vSwitch) on the hosts, attached to a separate physical switch that is totally isolated from the main network (see attachment). Do you think that will work? Might that be the better solution to our problem?
Thanks for any answers.
Hi, yes, you can extend the delay. Take a look at Duncan Epping's HA deep dive over at http://www.yellow-bricks.com/vmware-high-availability-deepdiv/; down near the bottom you'll find the advanced settings. But take the time to read the whole article so you know what you're changing and the implications.
Referring to this paragraph:
das.failuredetectiontime – Amount of milliseconds, timeout time, for
isolation response action (with a default of 15000 milliseconds). It’s
a best practice to increase the value to 60000 when an active/standby
Service Console setup is used. For a host with two Service Consoles and
a secondary isolation address it’s a best practice to increase it to at
least 20000. I would recommend to always increase it to at least 30000+
Does that mean that the VMs are not shut down until the detection time (default 15000 ms) is over? Or does it mean that the other host declares the host dead after this time?
The host is declared dead after this time by the other hosts. However, the host declares itself dead at 13 seconds (with the default 15-second timeout) and uses the last 2 seconds to initiate guest shutdowns. As per Duncan's gotcha:
"I thought this issue was something that was common knowledge but a recent blog article by Mike Laverick proved me wrong.
The default value for isolation/failure detection is 15 seconds. In other words the failed or isolated host will be declared dead by the other hosts in the HA cluster on the fifteenth second and a restart will be initiated by one of the primary hosts.
For now let’s assume the isolation response is “power off”. The “power off”(isolation response) will be initiated by the isolated host 2 seconds before the das.failuredetectiontime. A “power off” will be initiated on the thirteenth second and a restart will be initiated on the fifteenth second.
Does this mean that you can end up with your VMs being down and HA not restarting them?
Yes, when the heartbeat returns between the 13th and 15th second the “power off” could already have been initiated. The restart however will not be initiated because the heartbeat indicates that the host is not isolated anymore."
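To make the timing in the quote concrete, here is a minimal sketch in Python. It is only a simplified model of the behaviour described above, not VMware code: the function name `ha_outcome` and the outcome strings are made up for illustration, and the only inputs taken from the article are the 2-second lead of the isolation response and the das.failuredetectiontime value.

```python
# Simplified model of the timing described in the quoted article.
# Assumption: the isolation response fires 2000 ms before
# das.failuredetectiontime, as stated in the quote. Everything else
# (function name, outcome strings) is hypothetical.
ISOLATION_LEAD_MS = 2000


def ha_outcome(outage_ms, failure_detection_ms=15000):
    """Classify what happens when heartbeats are lost for outage_ms."""
    isolation_response_at = failure_detection_ms - ISOLATION_LEAD_MS
    if outage_ms < isolation_response_at:
        # Heartbeat returns before the isolated host acts.
        return "no action"
    if outage_ms < failure_detection_ms:
        # Isolation response already fired, but the other hosts
        # see the heartbeat again and never initiate a restart.
        return "powered off, not restarted"
    # The other hosts declare the host dead and restart the VMs.
    return "powered off, restarted"


if __name__ == "__main__":
    for outage in (10000, 14000, 20000):
        print(outage, "->", ha_outcome(outage))
    # Same 14 s outage with the timeout raised to 60000 ms.
    print(14000, "->", ha_outcome(14000, failure_detection_ms=60000))
```

Under this model, raising das.failuredetectiontime to 60000 means a network interruption shorter than 58 seconds triggers nothing, which is the effect the original poster is after. Whether a spanning tree recalculation fits inside that window depends on the network and is not something the sketch can tell you.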