HA Isolation Response difficulties

Sven_Vollmann · ‎11-10-2009


Hello everybody,

We have a vSphere cluster with two cluster nodes.

The problem: We added a new physical switch to our network, which triggered a spanning tree recalculation. As a consequence, HA on each host concluded that the host was isolated, so all VMs were shut down.

The question: Are there any parameters that delay the VM shutdown, so that HA has enough time to receive the heartbeat again and therefore does not shut down the VMs?

If not, we thought about an alternative solution: a second Service Console (on a new vSwitch) on each host, attached to a separate physical switch that is completely isolated from the main network (see attachment). Do you think that will work? Might that be the better solution to our problem?

Thanks for your answers.

sincerely yours

Daniel S.


NTurnbull · ‎11-10-2009

Hi, yes, you can extend the delay. Take a look at Duncan Epping's HA deepdive over at http://www.yellow-bricks.com/vmware-high-availability-deepdiv/. Down near the bottom you'll find the advanced settings, but take the time to read the whole article so you know what you're changing and what the implications are.

Thanks,

Neil


Sven_Vollmann · ‎11-10-2009

Hello,

Referring to this paragraph:

"das.failuredetectiontime – Amount of milliseconds, timeout time, for isolation response action (with a default of 15000 milliseconds). It’s a best practice to increase the value to 60000 when an active/standby Service Console setup is used. For a host with two Service Consoles and a secondary isolation address it’s a best practice to increase it to at least 20000. I would recommend to always increase it to at least 30000"

Does that mean that the VMs are not shut down until the detection time (default 15000 ms) is over? Or does it mean that the other host declares the host dead after this time?

Thanks,

Daniel

NTurnbull · ‎11-10-2009

The host is declared dead by the other hosts after this time. However, the host declares itself isolated at 13 seconds (with the default 15-second timeout) and uses the last 2 seconds to initiate the guest shutdowns. As per Duncan's gotcha:

"Isolation gotcha

I thought this issue was something that was common knowledge but a recent blog article by Mike Laverick proved me wrong.

The default value for isolation/failure detection is 15 seconds. In other words the failed or isolated host will be declared dead by the other hosts in the HA cluster on the fifteenth second and a restart will be initiated by one of the primary hosts.

For now let’s assume the isolation response is “power off”. The “power off”(isolation response) will be initiated by the isolated host 2 seconds before the das.failuredetectiontime. A “power off” will be initiated on the thirteenth second and a restart will be initiated on the fifteenth second.

Does this mean that you can end up with your VMs being down and HA not restarting them?

Yes, when the heartbeat returns between the 13th and 15th second the “power off” could already have been initiated. The restart however will not be initiated because the heartbeat indicates that the host is not isolated anymore."
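To make the timing in the quote concrete, here is a minimal Python sketch of that timeline. It is only an illustration of the description above, not VMware code: the function names and outcome labels are made up for this example, and the 2-second lead is taken from the quoted article.

```python
# Illustrative model of the timeline described in the quote above.
# Assumption: the isolation response fires 2 s before das.failuredetectiontime.
ISOLATION_LEAD_MS = 2000


def ha_timeline(failure_detection_ms=15000):
    """Return (isolation_response_ms, restart_ms) for a given detection time."""
    return failure_detection_ms - ISOLATION_LEAD_MS, failure_detection_ms


def outcome(heartbeat_returns_ms, failure_detection_ms=15000):
    """Classify what happens to the VMs if the heartbeat returns at the given time."""
    power_off_at, restart_at = ha_timeline(failure_detection_ms)
    if heartbeat_returns_ms < power_off_at:
        # Heartbeat is back before the isolated host acts: nothing happens.
        return "no action"
    if heartbeat_returns_ms < restart_at:
        # The gotcha window: VMs are powered off, but no restart is triggered.
        return "powered off, not restarted"
    # Host was declared dead by the others, so a restart is initiated.
    return "powered off, restarted"


if __name__ == "__main__":
    for t in (5000, 14000, 20000):
        print(t, outcome(t))
```

With the default of 15000 ms, a heartbeat that returns at 14000 ms falls into the window where the VMs are already powered off but are not restarted.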

Thanks,

Neil


Sven_Vollmann · ‎11-10-2009

Ok, thanks Neil....

We'll set the isolation response time to 40 seconds... this should solve our problem.
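As a rough sanity check on that value, a small Python sketch can compare a network outage length against the point where the isolation response would start. The helper name and the outage durations are hypothetical, and the 2-second lead comes from the article quoted above. How long the outage really lasts depends on the switch configuration, which isn't stated in this thread; classic 802.1D spanning tree with default timers can keep a port from forwarding for roughly 30 to 50 seconds.

```python
def survives_outage(outage_ms, failure_detection_ms, lead_ms=2000):
    """True if heartbeats return before the isolated host starts its isolation response.

    Assumption: the isolation response starts lead_ms before
    das.failuredetectiontime, as described in the quoted article.
    """
    return outage_ms < failure_detection_ms - lead_ms


if __name__ == "__main__":
    detection_ms = 40000  # the value proposed above
    # Hypothetical outage durations in seconds, for illustration only.
    for outage_s in (10, 30, 38, 50):
        ok = survives_outage(outage_s * 1000, detection_ms)
        print(f"{outage_s:>2} s outage -> VMs stay up: {ok}")
```

In this model a 40-second detection time covers outages shorter than 38 seconds, so it is worth measuring how long the heartbeat is actually lost during a topology change.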

thanks and best regards

Daniel
