I have some network configuration work to do, and when I started it triggered a spanning tree re-election. Luckily it only took a few seconds, so there was no vSAN outage, apart from all ESXi hosts on the non-preferred site going red with failures to access the vSAN datastore. The network issue was too short to trigger a failover, cool!
I need to continue with the network work, so small network issues may happen again, and I want to avoid a vSAN failure. I can't migrate everything to one site, because the network issues may occur between site 1 and site 2, but also between site 1 and the witness and/or between site 2 and the witness.
So with site 2 in maintenance mode, if the network fails for a few seconds between site 1 and the witness, everything is lost.
I was wondering which timeouts can be adjusted to control how long HA waits before it powers off VMs and fails over?
I would much rather have a few VMs crash because their storage was inaccessible for a few seconds than a complete power-off and failover of a full site. I know for sure (been there, sadly) that some OSes tolerate a disk outage with no problem while others crash miserably on the first disk issue.
But I am not sure that only HA needs to be tuned here; maybe some vSAN parameters too (perhaps ones specific to stretched clusters). Any ideas?
More generally, does a document exist somewhere that describes all the parameters related to HA / vSAN failures (such as timeouts)?
No, you can't really change those timeouts. You could of course temporarily disable HA completely, but it would probably be best to just use maintenance mode and migrate the VMs between locations to avoid downtime. That is what makes the most sense to me.
The problem is that I can't be sure which links will be impacted during these network operations.
It could be between site 1 and the witness site, between site 1 and site 2, or between site 2 and the witness.
So if I have site 2 in maintenance mode and the link between site 1 and the witness goes down, vSAN shuts off.
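To make that concern concrete, here is a minimal sketch of the quorum reasoning in Python. It is a deliberate simplification, not how vSAN is implemented: it assumes three fault domains with one vote each (`site1`, `site2`, `witness`), assumes a site in maintenance mode contributes no vote, and ignores the preferred-site tie-break. The function names are made up for illustration.

```python
def reachable(start, links_up, offline=frozenset()):
    """Return the fault domains reachable from `start` over healthy links.

    links_up: iterable of (a, b) pairs, each a working bidirectional link.
    offline:  fault domains that contribute nothing (assumption: a site in
              maintenance mode is treated as offline).
    """
    if start in offline:
        return set()
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for a, b in links_up:
            for src, dst in ((a, b), (b, a)):
                if src == node and dst not in seen and dst not in offline:
                    seen.add(dst)
                    stack.append(dst)
    return seen


def site_has_quorum(site, links_up, offline=frozenset()):
    """Simplified rule: a site keeps access if it sees at least 2 of 3 votes."""
    return len(reachable(site, links_up, offline)) >= 2


# Site 2 in maintenance, link between site 1 and witness down: no quorum.
links = [("site1", "site2"), ("site2", "witness")]
print(site_has_quorum("site1", links, offline={"site2"}))  # False

# Same link failure with site 2 active: site 1 still sees site 2.
print(site_has_quorum("site1", links))  # True
```

Under these assumptions, keeping both data sites active means no single link failure leaves site 1 alone, which is why putting site 2 in maintenance mode makes the remaining site more fragile rather than less.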
Just to be sure I understand: is HA alone responsible for powering off VMs in a vSAN stretched cluster? vSAN will not do it itself?
Also, do you know the default timers? Is 15 seconds the actual delay before HA powers off a complete vSAN site? (regarding this: https://kb.vmware.com/s/article/1000510)
Because last time I think the network outage lasted about 4 or 5 seconds, and it did not cause any power-off / restart.
Thanks for the info 👍
Do you know the default timers? Specifically the vSAN one you mention, which would start to power off everything when quorum is lost on a site?
Edit: I just read your clustering deep dive PDF, but I am still not sure. It mentions a delay of about 30 seconds before the isolation response (HA), but I don't know how long the delay is for the internal vSAN kill mechanism.
I understand I could achieve what I want by temporarily setting VSAN.AutoTerminateGhostVm to 0 and increasing das.config.fdm.isolationPolicyDelaySec, but that would not be supported at all.
You need to keep the following in mind:
If you have a failure between Site A and Site B, with vSAN the "preferred site" will automatically take ownership of ALL components. This means that if the preferred site is Site A, Site B will lose access to ALL components, including the local components. If HA is active, HA will restart all the VMs that were running in Site B in Site A. Without "AutoTerminateGhostVm" enabled, vSAN would not kill the VMs in Site B, which means you would have two identical instances of the same VM running at the same time. I would not recommend this! Let me repeat, I would not recommend this ever!
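The behaviour described above can be sketched as a small Python model. This only restates the reasoning from this post; the function and its parameter names are hypothetical, and it is not an interface to vSAN or HA.

```python
def partition_outcome(preferred, vm_sites, ha_enabled=True,
                      auto_terminate_ghost_vm=True):
    """Model an inter-site link failure in a stretched cluster.

    preferred: name of the preferred site, which keeps ownership of all
               components during the partition.
    vm_sites:  dict mapping VM name -> site it runs on before the failure.
    Returns a dict mapping VM name -> list of instances after the failure.
    """
    outcome = {}
    for vm, site in vm_sites.items():
        if site == preferred:
            # The preferred site keeps storage access, so nothing changes.
            outcome[vm] = [f"running on {preferred}"]
            continue
        instances = []
        if not auto_terminate_ghost_vm:
            # vSAN does not kill the VM, which has lost access to its storage.
            instances.append(f"ghost on {site} (no storage access)")
        if ha_enabled:
            # HA restarts the VM on the site that still owns the components.
            instances.append(f"restarted on {preferred}")
        outcome[vm] = instances
    return outcome


vms = {"vm1": "A", "vm2": "B"}
print(partition_outcome("A", vms, auto_terminate_ghost_vm=False)["vm2"])
# ['ghost on B (no storage access)', 'restarted on A']
```

With the ghost VM termination disabled and HA enabled, `vm2` ends up with two instances, which is exactly the situation warned against here.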
Yes, I am fully aware of this.
Do you know what the delay is before vSAN's internal mechanism kills VMs on the isolated site? Is it tunable to be a little more permissive?
Just for info, I tested 2 scenarios:
complete site network isolation
single ESXi network isolation
Both had about the same delay: VMs were powered off after around 60 seconds and the restart happened a few seconds later.
I also tested single ESXi network isolation for 30 seconds and then resolved the isolation: the VMs running on this ESXi host kept running and were not restarted.
So for my initial question, having at least 30 seconds of delay before VMs are restarted is enough to prevent failovers if I get temporary network issues.
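For anyone who wants to reason about their own outage lengths, the observations above can be summarised in a few lines of Python. The two thresholds are only the values I measured in my environment, not documented defaults, and anything between them was not tested.

```python
# Values observed in the tests above, not documented defaults.
SURVIVED_SECONDS = 30   # isolation of this length caused no restart
POWER_OFF_SECONDS = 60  # VMs were powered off at around this point


def expected_result(outage_seconds):
    """Classify a network outage against the observed thresholds."""
    if outage_seconds <= SURVIVED_SECONDS:
        return "no restart observed"
    if outage_seconds < POWER_OFF_SECONDS:
        return "untested"
    return "power-off and restart observed"


for seconds in (5, 45, 60):
    print(seconds, expected_result(seconds))
```

The 4 to 5 second spanning tree re-election from my first post falls well inside the range where no restart was observed.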