From Duncan Epping's "HA/DRS Technical Deepdive" I can see that (with default settings) the following will happen:
on 13 sec: a host which hears from none of the partners will ping the isolation address
on 14 sec: if no reply from isolation address it will trigger the isolation response
on 15 sec: the host will be declared dead by the remaining hosts; this will be confirmed by pinging the missing host
on 16 sec: restarts of the VMs will begin
My first question is: do all these timings derive from das.failuredetectiontime? That is, if das.failuredetectiontime is set to e.g. 30000 (30 sec), will a potentially isolated host ping the isolation address at the 28th second and perform the Isolation Response action at the 29th second?
Or are the Isolation Response timings hardcoded, so that it always happens at 13 sec?
My second question: if the answer to the above is yes, why is the recommendation to increase das.failuredetectiontime to 20000 when using multiple isolation addresses? If the above is correct, this would make the potentially isolated host test its isolation addresses at the 18th second and the restart of the VMs begin at the 21st second, but what would really be gained from this?
Yes, these timings are all relative to das.failuredetectiontime.
Increasing das.failuredetectiontime is usually recommended when an additional das.isolationaddress is specified. The reason for this is that the "ping" and the "result of the ping" need time, and by adding 5 seconds to the failure detection time you allow this test to complete correctly, after which the isolation response could be triggered.
Duncan
ps: thanks for picking up the book
Also see: http://www.yellow-bricks.com/2011/04/04/das-failuredetection-time-and-the-isolation-response/
I will also add my answer to a blog article to make sure people can find it on google
-d
There are several KB articles touching on this property:
9) In the Advanced Options (HA) dialog box: a) In the option name field, enter das.failuredetectiontime. b) For the value, enter the new timeout value in milliseconds. 10) Click OK. 11) Click OK...
Published: 3/16/10 | Created Date: 5/23/07 | Last Modified Date: 1/27/11 |
dialog, enter the option name and the corresponding value: · Option: das.failuredetectiontime · Value: a value in milliseconds that represents the timeout value (20 seconds = 20000...
Published: 1/28/11 | Created Date: 8/12/08 | Last Modified Date: 1/28/11 |
possible. These settings are available in *VirtualCenter 2.0.2* and above: · das.failuredetectiontime = <value> This option/value pair changes the default failure detection timeout, where <value...
Published: 4/18/11 | Created Date: 11/4/08 | Last Modified Date: 4/18/11 |
Settings. 2) Click VMware HA > Advanced Option. 3) Add the das.failuredetectiontime = <value> option/value pair to the cluster’s settings, where <value> represents the desired timeout...
Published: 5/25/11 | Created Date: 11/21/07 | Last Modified Date: 5/25/11 |
Duncan wrote:
Yes, these timings are all relative to das.failuredetectiontime.
Increasing das.failuredetectiontime is usually recommended when an additional das.isolationaddress is specified. The reason for this is that the "ping" and the "result of the ping" need time, and by adding 5 seconds to the failure detection time you allow this test to complete correctly, after which the isolation response could be triggered.
Thanks a lot for the answer Duncan. However, there is still something mysterious here..
If the (isolated) host always performs the Isolation Test at das.failuredetectiontime - 2 and the Isolation Response at das.failuredetectiontime - 1, and the timings are, as you say, relative to the value of das.failuredetectiontime, what will actually change if this value is set to anything else?
If the Isolation Test is done at X - 2, the Isolation Response at X - 1 and the VM restart at X + 1, what will be different if X is 15000 or 20000 or 30000? This does not seem to give the host any more time to ping multiple isolation addresses.
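To make the point concrete, here is a minimal sketch of the timeline, assuming the fixed offsets discussed in this thread (test at X - 2, response at X - 1, host declared dead at X, restart at X + 1). The function name `ha_timeline` and the dictionary keys are made up for illustration; this is not how HA is implemented, only the arithmetic.

```python
def ha_timeline(failure_detection_ms=15000):
    """Event times in seconds, ASSUMING fixed offsets from das.failuredetectiontime.

    The offsets (-2, -1, 0, +1) are taken from the timings quoted in this
    thread; the function itself is a hypothetical illustration.
    """
    x = failure_detection_ms // 1000  # milliseconds -> whole seconds
    return {
        "isolation_test": x - 2,      # isolated host pings the isolation address
        "isolation_response": x - 1,  # isolation response is triggered
        "declared_dead": x,           # remaining hosts declare the host dead
        "vm_restart": x + 1,          # restart of the VMs begins
    }


for ms in (15000, 20000, 30000):
    t = ha_timeline(ms)
    # Time the host has between starting the isolation test and acting on it.
    window = t["isolation_response"] - t["isolation_test"]
    print(ms, t, "test-to-response window (s):", window)
```

Under this assumption the whole timeline simply shifts later as X grows, while the window between the Isolation Test and the Isolation Response stays at one second for every value of X.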
Good question, to which I don't have the answer. I have been told by engineering that it is -1. So if they use a hardcoded value rather than a percentage, that would indeed not make sense. I will ask them.
Duncan
Duncan wrote:
Good question, to which I don't have the answer. I have been told by engineering that it is -1. So if they use a hardcoded value rather than a percentage, that would indeed not make sense. I will ask them.
This has been unclear to me since the recommendations do not seem to add up. I guess either the Isolation Test always happens at the 13th second no matter the value of das.failuredetectiontime, or else increasing das.failuredetectiontime will not matter from the perspective of multiple isolation addresses and will perhaps only delay the restart of the VMs if a host has actually crashed.
I discussed it and think some recommendations were mixed up. There used to be a recommendation to increase the failuredetectiontime in environments that had issues with the network. It is most definitely -1 and +1 on the failuredetectiontime.
Thanks for checking this, Duncan. So this does mean that it actually serves no purpose to increase the das.failuredetectiontime to get more time for multiple isolation addresses, correct?
You are correct. I think at some point multiple statements were merged into a single statement which is technically inaccurate. Sorry about all the confusion.
It is just nice to play some part in clearing up issues/inconsistencies like this.
I wrote an article last night about this setting on rickardnobel.se