rickardnobel
Champion

Is Isolation Response always das.failuredetectiontime - 1?


From Duncan Epping's "HA/DRS Technical Deepdive" I can see that (with default settings) the following will happen:

At 13 sec: a host which hears from none of its partners will ping the isolation address

At 14 sec: if there is no reply from the isolation address, it will trigger the isolation response

At 15 sec: the host will be declared dead by the remaining hosts; this will be confirmed by pinging the missing host

At 16 sec: restarts of the VMs will begin

My first question is: do all these timings come from das.failuredetectiontime? That is, if das.failuredetectiontime is set to e.g. 30000 (30 sec), will a potentially isolated host ping the isolation address at the 28th second and perform the Isolation Response action at the 29th second?

Or are the Isolation Response timings hardcoded, so that this always happens at 13 sec?

My second question: if the answer to the above is yes, why is the recommendation to increase das.failuredetectiontime to 20000 when having multiple Isolation Response addresses? If the above is correct, this would make the potentially isolated host test its isolation addresses at the 18th second, and the restart of the VMs would begin at the 21st second, but what would really be gained from this?
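
To make the question concrete, here is a minimal Python sketch of the timeline under the assumption being asked about, namely that every step is a fixed offset from das.failuredetectiontime (test at -2 sec, response at -1 sec, restart at +1 sec). The function name ha_timeline is made up for illustration and is not part of any VMware tool.

```python
def ha_timeline(failure_detection_time_ms=15000):
    """Return the second at which each HA step would occur.

    Assumption (not confirmed at this point in the thread): all steps are
    fixed offsets from das.failuredetectiontime.
    """
    t = failure_detection_time_ms // 1000  # milliseconds to whole seconds
    return {
        "isolation_test": t - 2,      # host pings the isolation address
        "isolation_response": t - 1,  # isolation response is triggered
        "host_declared_dead": t,      # remaining hosts declare the host dead
        "vm_restart": t + 1,          # restarts of the VMs begin
    }


# Compare the default with the two values mentioned in the question.
for value_ms in (15000, 20000, 30000):
    print(value_ms, ha_timeline(value_ms))
```

With the default of 15000 this reproduces the 13/14/15/16 second sequence from the book; with 20000 it gives 18/19/20/21.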

My VMware blog: www.rickardnobel.se
1 Solution

Accepted Solutions
depping
Leadership

You are correct. I think at some point multiple statements were merged into a single statement which is technically inaccurate. Sorry about all the confusion.

10 Replies
depping
Leadership

Yes, these timings are all relative to das.failuredetectiontime.

Increasing das.failuredetectiontime is usually recommended when an additional das.isolationaddress is specified. The reason is that the "ping" and the "result of the ping" need time, and by adding 5 seconds to the failure detection time you allow this test to complete correctly, after which the isolation response can be triggered.

Duncan

PS: thanks for picking up the book :)
depping
Leadership

Also see: http://www.yellow-bricks.com/2011/04/04/das-failuredetection-time-and-the-isolation-response/

I will also add my answer to a blog article to make sure people can find it on Google :)

-d

admin
Immortal

There are several KB articles touching on this property:

Adjusting the VMware High Availability failover timeout value ...

9) In the Advanced Options (HA) dialog box: a) In the option name field, enter das.failuredetectiontime. b) For the value, enter the new timeout value in milliseconds. 10) Click OK. 11) Click OK...
Published: 3/16/10 | Created Date: 5/23/07 | Last Modified Date: 1/27/11

Setting Multiple Isolation Response Addresses for VMware High Availability ...

dialog, enter the option name and the corresponding value: · Option: das.failuredetectiontime · Value: a value in milliseconds that represents the timeout value (20 seconds = 20000...
Published: 1/28/11 | Created Date: 8/12/08 | Last Modified Date: 1/28/11

Advanced Configuration options for VMware High Availability ...

possible. These settings are available in *VirtualCenter 2.0.2* and above: · das.failuredetectiontime = <value> This option/value pair changes the default failure detection timeout, where <value...
Published: 4/18/11 | Created Date: 11/4/08 | Last Modified Date: 4/18/11

Determining if your VMware HA cluster experienced an Isolation Response ...

Settings. 2) Click VMware HA > Advanced Option. 3) Add the das.failuredetectiontime = <value> option/value pair to the cluster’s settings, where <value> represents the desired timeout...
Published: 5/25/11 | Created Date: 11/21/07 | Last Modified Date: 5/25/11

rickardnobel
Champion

Duncan wrote:

Yes, these timings are all relative to das.failuredetectiontime.

Increasing das.failuredetectiontime is usually recommended when an additional das.isolationaddress is specified. The reason is that the "ping" and the "result of the ping" need time, and by adding 5 seconds to the failure detection time you allow this test to complete correctly, after which the isolation response can be triggered.

Thanks a lot for the answer, Duncan. However, there is still something mysterious here. :)

If the (isolated) host always does the Isolation Test at das.failuredetectiontime - 2 and the Isolation Response at das.failuredetectiontime - 1, and the timings are, as you say, related to the value of das.failuredetectiontime, what will actually change if this value is set to anything else?

If the Isolation Test is done at X - 2, the Isolation Response at X - 1 and the VM restart at X + 1, what will be different if X is 15000, 20000 or 30000? This does not seem to give the host any more time to ping multiple Isolation Addresses.
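
The point can be checked with a few lines of Python. This is only a sketch under the assumption that the offsets are fixed at X - 2, X - 1 and X + 1 seconds; the function name isolation_windows_ms is hypothetical and purely for illustration.

```python
def isolation_windows_ms(x_ms):
    """Return the gaps between HA steps for a given das.failuredetectiontime.

    Assumes fixed offsets of -2 s (test), -1 s (response) and +1 s (restart).
    """
    isolation_test = x_ms - 2000
    isolation_response = x_ms - 1000
    vm_restart = x_ms + 1000
    return {
        # Time the host has to ping its isolation address(es)
        "test_to_response": isolation_response - isolation_test,
        # Time between the isolation response and the restart of the VMs
        "response_to_restart": vm_restart - isolation_response,
    }


for x_ms in (15000, 20000, 30000):
    print(x_ms, isolation_windows_ms(x_ms))
```

Under that assumption the gaps come out the same for every X, so a larger value only shifts the whole sequence later.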

depping
Leadership

Good question, to which I don't have the answer. I have been told by engineering that it is -1. So if they use a hard-coded value rather than a percentage, that would indeed not make sense. I will ask them.

Duncan

rickardnobel
Champion

Duncan wrote:

Good question, to which I don't have the answer. I have been told by engineering that it is -1. So if they use a hard-coded value rather than a percentage, that would indeed not make sense. I will ask them.

This has been unclear to me since the recommendations do not seem to add up. I guess either the Isolation Test always happens at the 13th second no matter the value of das.failuredetectiontime, or else increasing das.failuredetectiontime makes no difference from the perspective of multiple isolation addresses and perhaps only delays the restart of the VMs if a host has actually crashed.

depping
Leadership

I discussed it and think some recommendations were mixed up. There used to be a recommendation to increase the failuredetectiontime for environments that had issues with the network. It is most definitely -1 and +1 on the failuredetectiontime.

rickardnobel
Champion

Thanks for checking this, Duncan. So this does mean that it actually serves no purpose to increase the das.failuredetectiontime to get more time for multiple isolation addresses, correct?

depping
Leadership

You are correct. I think at some point multiple statements were merged into a single statement which is technically inaccurate. Sorry about all the confusion.

rickardnobel
Champion

It is nice to play some part in clearing up issues and inconsistencies like this.

I wrote an article last night about this setting on rickardnobel.se
