Solved: Re: Is Isolation Response always das.failuredetect...

rickardnobel · ‎05-26-2011

From Duncan Epping's "HA/DRS Technical Deepdive" I can see that (with default settings) the following will happen:

on 13 sec: a host which hears from none of the partners will ping the isolation address

on 14 sec: if no reply from isolation address it will trigger the isolation response

on 15 sec: the host will be declared dead by the remaining hosts, which confirm this by pinging the missing host

on 16 sec: restarts of the VMs will begin

My first question is: Do all these timings derive from das.failuredetectiontime? That is, if das.failuredetectiontime is set to e.g. 30000 (30 sec), will a potentially isolated host ping the isolation address at the 28th second and perform the Isolation Response action at the 29th second?

Or are the Isolation Response timings hardcoded, always happening at 13 sec?

My second question: if the answer to the above is yes, why is the recommendation to increase das.failuredetectiontime to 20000 when using multiple Isolation Response addresses? If the above is correct, this would make the potentially isolated host test its isolation addresses at the 18th second, and the restart of the VMs would begin at the 21st second, but what would really be gained by this?
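To make the arithmetic in the question concrete, here is a minimal Python sketch. It assumes, as the question does, that every event is a fixed offset from das.failuredetectiontime (-2, -1, 0 and +1 seconds, read off the default 15-second timeline above). The names `HaTimeline` and `ha_timeline` are hypothetical and exist only for this illustration; they are not part of any VMware API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HaTimeline:
    """Event times in seconds after the last heartbeat was received."""
    isolation_ping_s: float
    isolation_response_s: float
    declared_dead_s: float
    vm_restart_s: float


def ha_timeline(failure_detection_ms: int = 15000) -> HaTimeline:
    """Derive the timeline, ASSUMING fixed offsets relative to the setting.

    The offsets (-2, -1, 0, +1 seconds) are taken from the default
    timeline quoted from the book; that they stay fixed for other
    values is the assumption under discussion in this thread.
    """
    x = failure_detection_ms / 1000
    return HaTimeline(
        isolation_ping_s=x - 2,
        isolation_response_s=x - 1,
        declared_dead_s=x,
        vm_restart_s=x + 1,
    )


if __name__ == "__main__":
    for ms in (15000, 20000, 30000):
        print(ms, ha_timeline(ms))
```

With 20000 this gives an isolation test at 18 seconds and VM restarts at 21 seconds, matching the figures in the question.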

My VMware blog: www.rickardnobel.se

depping · ‎05-27-2011

Yes, the relationship between these timings is das.failuredetectiontime.

Increasing das.failuredetectiontime is usually recommended when an additional das.isolationaddress is specified. The reason for this is that the "ping" and the "result of the ping" need time, and by adding 5 seconds to the failure detection time you allow for this test to complete correctly, after which the isolation response can be triggered.

Duncan

ps: thanks for picking up the book

depping · ‎05-27-2011

Also see: http://www.yellow-bricks.com/2011/04/04/das-failuredetection-time-and-the-isolation-response/

I will also add my answer to a blog article to make sure people can find it on Google.

-d

admin · ‎05-27-2011

There are several KB articles touching on this property:

Adjusting the VMware High Availability failover timeout value ...

9) In the Advanced Options (HA) dialog box: a) In the option name field, enter das.failuredetectiontime. b) For the value, enter the new timeout value in milliseconds. 10) Click OK. 11) Click OK...
Published: 3/16/10 | Created Date: 5/23/07 | Last Modified Date: 1/27/11 |

Setting Multiple Isolation Response Addresses for VMware High Availability ...

dialog, enter the option name and the corresponding value: · Option: das.failuredetectiontime · Value: a value in milliseconds that represents the timeout value (20 seconds = 20000...
Published: 1/28/11 | Created Date: 8/12/08 | Last Modified Date: 1/28/11 |

Advanced Configuration options for VMware High Availability ...

possible. These settings are available in *VirtualCenter 2.0.2* and above: · das.failuredetectiontime = <value> This option/value pair changes the default failure detection timeout, where <value...
Published: 4/18/11 | Created Date: 11/4/08 | Last Modified Date: 4/18/11 |

Determining if your VMware HA cluster experienced an Isolation Response ...

Settings. 2) Click VMware HA > Advanced Option. 3) Add the das.failuredetectiontime = <value> option/value pair to the cluster’s settings, where <value> represents the desired timeout...
Published: 5/25/11 | Created Date: 11/21/07 | Last Modified Date: 5/25/11 |

rickardnobel · ‎05-27-2011

Duncan wrote:
Yes, the relationship between these timings is das.failuredetectiontime.
Increasing das.failuredetectiontime is usually recommended when an additional das.isolationaddress is specified. The reason for this is that the "ping" and the "result of the ping" need time, and by adding 5 seconds to the failure detection time you allow for this test to complete correctly, after which the isolation response can be triggered.

Thanks a lot for the answer, Duncan. However, there is still something mysterious here.

If the (isolated) host always does the Isolation Test at das.failuredetectiontime - 2 and the Isolation Response at das.failuredetectiontime - 1, and the timings are, as you say, tied to the value of das.failuredetectiontime, what will actually change if this value is set to anything else?

If the Isolation Test is done at X - 2, the Isolation Response at X - 1 and the VM restart at X + 1, what will be different if X is 15000 or 20000 or 30000? This does not seem to actually give the host any more time to ping multiple Isolation Addresses.

depping · ‎05-29-2011

Good question, to which I don't have the answer. I have been told by engineering that it is -1. So if they use a hard-coded value rather than a percentage, that would indeed not make sense. I will ask them.

Duncan

rickardnobel · ‎05-29-2011

Duncan wrote:
Good question, to which I don't have the answer. I have been told by engineering that it is -1. So if they use a hard-coded value rather than a percentage, that would indeed not make sense. I will ask them.

This has been unclear to me since the recommendations do not seem to add up. I guess either the Isolation Test always happens at the 13th second no matter the value of das.failuredetectiontime, or else increasing das.failuredetectiontime makes no difference from the perspective of multiple isolation addresses and perhaps only delays the restart of the VMs if a host has actually crashed.
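The two possibilities above can be compared with a small Python sketch. It computes how much time the host would have between starting the Isolation Test and triggering the Isolation Response under each hypothesis. Both functions are hypothetical illustrations of the reasoning in this thread, not a description of how HA is implemented; the 13-second constant is simply the default-timeline value quoted earlier.

```python
def window_if_relative_ms(failure_detection_ms: int) -> int:
    """Hypothesis A: test at X - 2 s, response at X - 1 s (both relative)."""
    test_at = failure_detection_ms - 2000
    response_at = failure_detection_ms - 1000
    return response_at - test_at


def window_if_hardcoded_ms(failure_detection_ms: int,
                           test_at_ms: int = 13000) -> int:
    """Hypothesis B: test fixed at 13 s, response still at X - 1 s."""
    response_at = failure_detection_ms - 1000
    return response_at - test_at_ms


if __name__ == "__main__":
    for x in (15000, 20000, 30000):
        print(f"X={x}: relative window={window_if_relative_ms(x)} ms, "
              f"hardcoded window={window_if_hardcoded_ms(x)} ms")
```

Under hypothesis A the window stays at 1000 ms whatever the setting, so a larger value only postpones every event. Only under hypothesis B would raising the value to 20000 leave extra time (6000 ms) for pinging several isolation addresses.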

depping · ‎05-31-2011

I discussed it and think some recommendations were mixed up. There used to be a recommendation to increase the failuredetectiontime in environments that had issues with the network. It is most definitely -1 and +1 on the failuredetectiontime.

rickardnobel · ‎05-31-2011

Thanks for checking this, Duncan. So this means that it actually serves no purpose to increase das.failuredetectiontime to get more time for multiple isolation addresses, correct?

depping · ‎06-01-2011

You are correct. I think at some point multiple statements were merged into a single statement which is technically inaccurate. Sorry about all the confusion.

rickardnobel · ‎06-01-2011

It is just nice to play some part in clearing up issues and inconsistencies like this.

I wrote an article last night about this setting on rickardnobel.se.
