In a standard stretched cluster configuration (RAID1, FTT=1 across sites), if the inter-site connection is broken, VMs will fail over to whichever site is configured as Preferred (this can of course be changed should that site be down/impaired).
VMs won't run on both sides simultaneously - that wouldn't make sense, as the cluster would then be split-brained, and which set of data would you use following the outage?
Once the connection between the sites is re-established, the delta data from the Preferred site is synced to the other site.
More information regarding failure scenarios and required HA settings etc. can be found here:
And we (VMware vSAN product team) know this is a problem, and we are looking to fix this in the future. The reason you end up in this situation today is that the Witness VM binds itself to 1 location. This means the other location will lose quorum, and as such all VMs which are stretched will lose access to their storage objects, and those VMs will be killed by vSAN automatically.
Again, this is a known concern, and the team has it listed as an issue we need to solve in the future. Unfortunately, I can't comment on when this will be.
Provided communications with the witness are still live during a scenario where the replication link has failed, I would have thought VMs could remain running 50/50, with of course the ability to fail across clusters disabled until the replication is re-enabled and data resynced. I understand the concept of a split brain, but this is what the witness server is essentially supposed to prevent.
From what depping is saying, this is how the vSAN development guys want it to work, but it needs development work to facilitate it?
The problem is that the Witness Appliance is not a witness, but it hosts witness objects. A host can only be part of 1 cluster, or 1 partition in this case, so the witness host will bind itself to the preferred location, which causes the secondary location to lose quorum.
Yes, this needs development work, and it is being looked at.
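To make the quorum point above a bit more concrete, here is a rough sketch in Python. It is only an illustration, not how vSAN is implemented: it assumes each stretched object has one replica per data site plus one witness component, that every component carries a single vote (a simplification), and that a partition needs more than 50% of the votes to keep the object accessible. The names `Component` and `accessible_sites` are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    host_site: str  # "preferred", "secondary", or "witness"
    votes: int = 1  # assumption: one vote per component


def accessible_sites(components, witness_partition):
    """Report which data sites keep quorum while the inter-site link is down.

    witness_partition is the data site whose partition the witness host
    has joined; a host can only be a member of one partition.
    """
    total = sum(c.votes for c in components)
    result = {}
    for site in ("preferred", "secondary"):
        votes = sum(
            c.votes
            for c in components
            if c.host_site == site
            or (c.host_site == "witness" and witness_partition == site)
        )
        # Quorum requires strictly more than half of all votes.
        result[site] = votes * 2 > total
    return result


obj = [
    Component("replica-a", "preferred"),
    Component("replica-b", "secondary"),
    Component("witness", "witness"),
]

print(accessible_sites(obj, witness_partition="preferred"))
# {'preferred': True, 'secondary': False}
```

Because the witness host can only sit in one partition, the preferred site ends up with 2 of 3 votes and the secondary with 1 of 3, even though the secondary can still reach the witness over the network.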