Hello there,
I have some network configuration work to do, and when I started, it triggered a spanning tree re-election. Luckily, it only took a few seconds, so there was no vSAN outage, apart from all ESXi hosts on the non-preferred site turning red with failures to access the vSAN datastore... But the network issue was too short to trigger a failover, cool!
I need to continue with the network work, so small network issues may happen again, and I would like to avoid a vSAN failure. I can't migrate everything to one site, as the network issues may occur between site 1 and site 2, but also between site 1 and the witness and/or between site 2 and the witness.
Hence, with site 2 in maintenance, if the network fails for a few seconds between site 1 and the witness, everything is lost.
I was wondering: which timeouts can be adjusted to control how long HA waits before it powers off VMs and fails over?
I would much rather have a few VMs crash because their storage is inaccessible for a few seconds than a complete power-off and failover of a full site. I know for sure (been there, sadly) that some OSes tolerate a disk outage with no problem, while others crash miserably at the first disk issue.
But I am not sure that HA is the only thing to tune here; maybe some vSAN parameters matter too (perhaps special parameters for stretched clusters). Any ideas?
More generally, does a document exist somewhere that describes all the parameters related to HA / vSAN failures (such as timeouts)?
Thanks