It is explained in the same paragraph in the vSAN Stretched Cluster & 2 Node Guide - "The policy specifies "SFTT=2" which, during a PFTT violation is counted globally across sites due to quorum implications."
My question is, isn't SFTT=2 supposed to allow two host failures in the surviving site? Why does the whole site go offline?
Good point, considering VMware states the following:
"If PFTT = 1 and SFTT = 2, and one site is unavailable, then the cluster can tolerate two additional host failures." - Introduction to Stretched Clusters
depping, can you please clarify?
We have a lab environment, but atm it's only 3+3+1, which means I could only test SFTT=1, and not even FTM=RAID5.
I can confirm the document is at least right about the failure scenario outcomes with the above setup.
Does anyone happen to have a 6+6+1 environment to test the point raised by the opening post?
With PFTT and SFTT it works as follows: you can tolerate 1 full site failure, and on top of that SFTT host failures. So with PFTT=1 and SFTT=2 you can tolerate a preferred site failure and then 2 host failures, and data would still be available. Where it gets tricky is when the Witness fails, as now the "witness" is the "site failure". In this case, if 3 hosts failed (2 in preferred and 1 in secondary, for instance), then data would become unavailable.
Just wrote a blog with some diagrams which will hopefully explain it a bit better: http://www.yellow-bricks.com/2018/03/19/vsan-stretched-cluster-pftt-and-sftt-what-happens-when-a-full-site-fails-and-multiple-hosts-fail/
Thanks depping, but that would mean the scenario outcome in the document is wrong.
Or is there another explanation?
To repeat, the document says:
PFTT=1, SFTT=2, FTM=R5/6
Page 140 says:
Single site failure (PFTT) and dual disk, disk group, host failure across remaining sites (SFTT).
VMs will stop running and the cluster will be offline until a site is brought back online.
Nowhere does it say that the witness is offline.
Let me add to the conversation.
This should be looked at from a "how many components are available" view.
Consider a VM with PFTT=1, SFTT=2, FTM=R5/6. In this situation:
A vSAN object consists of 6 components in the Preferred Site and 6 components in the Secondary Site (assuming the object is <255GB and a Stripe Width rule of 1).
The Preferred Site has 6 votes from the hosts it is distributed across.
The Secondary Site has 6 votes from the hosts it is distributed across.
The Witness "Site" has 6 votes on the vSAN Witness Host alone.
*vSAN will add an additional vote to ensure an odd count; in this case, that makes 19 votes.
If we lose a single site (say Preferred, but could be Secondary or vSAN Witness Host), we've lost 6 of 19 votes.
Now we have 13 of 19 votes available, which is >50% available.
If we lose 2 disks/disk groups/hosts in the Secondary site, we're now at 11 of 19 votes available, which is still >50% of the components available.
Also losing the vSAN Witness Host at that point would leave only 5 of 19 votes (remember, we had an extra to ensure an odd count). Because vSAN does not have >50% of the votes available, the object is not accessible.
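The vote arithmetic above can be replayed as a small sketch. This is only an illustration of the counting in this post, not of how vSAN is implemented: the names are made up, each failed host is assumed to carry exactly one vote, and the extra tie-breaking vote is assumed to sit on a component that never fails.

```python
SITE_VOTES = 6                               # votes per site, as stated in the post
EXTRA_VOTE = 1                               # added to make the total odd
TOTAL_VOTES = 3 * SITE_VOTES + EXTRA_VOTE    # 6 + 6 + 6 + 1 = 19


def has_quorum(remaining: int) -> bool:
    """The object stays accessible only while more than 50% of votes remain."""
    return 2 * remaining > TOTAL_VOTES


# Failures applied one after another: (description, votes lost).
# Assumption: one vote per failed host in the Secondary site.
steps = [
    ("Preferred site fails", SITE_VOTES),
    ("2 hosts fail in the Secondary site", 2),
    ("vSAN Witness Host fails", SITE_VOTES),
]

remaining = TOTAL_VOTES
history = []
for label, lost in steps:
    remaining -= lost
    history.append((label, remaining, has_quorum(remaining)))
    print(f"{label}: {remaining} of {TOTAL_VOTES} votes, accessible={has_quorum(remaining)}")
```

Under those assumptions the object stays accessible after the first two steps and only loses quorum once the Witness is gone as well.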
We cover this in the Stretched Cluster & 2 Node Guide here: Storage and Availability Technical Documents
I hope this helps.
I will see if we can get the docs.vmware.com information updated appropriately.
Thank you for the detailed explanation.
The vSAN component voting mechanism is very clear now.
It will be greatly appreciated if the OP question could also be directly addressed, which is:
Whether that specific outcome for that specific scenario (on that specific page) is actually valid in that document?
Really useful topic (I was really confused after reading the failure scenarios in the document).
But I think there is another scenario in the document that is not clear to me:
Policy: PFTT = 1, SFTT = 2, FTM = R5/6
Scenario: Dual disk, disk group, host failure on one site (SFTT)
vSAN Behaviour: Site marked as failed by vSAN, component rebuilds will begin when the site comes online again.
Since the SFTT is 2, and the only 2 failures are in 1 single site, vSAN should still be able to run the VM in that site without marking the entire site as failed (the vote count should be 17/19, if I understand correctly).
Is that correct for you?
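For reference, the 17/19 figure can be checked with a few lines. This only verifies the vote arithmetic, using the 19-vote total from the breakdown earlier in the thread and assuming one vote per failed host; it says nothing about whether vSAN actually marks the site as failed in this scenario.

```python
TOTAL_VOTES = 19      # 6 + 6 + 6 plus one tie-breaker, per the earlier breakdown
failed_hosts = 2      # both failures in the same site, no site failure
votes_per_host = 1    # assumption: each failed host holds one component (one vote)

remaining = TOTAL_VOTES - failed_hosts * votes_per_host
has_quorum = 2 * remaining > TOTAL_VOTES   # strictly more than 50% of votes

print(f"{remaining} of {TOTAL_VOTES} votes remain, quorum={has_quorum}")
```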
I'm digging up an old thread, but it gave me some valuable information and I'd like to ask about some specific details.
I'm currently testing a 4+4+1 stretched cluster with PFTT=1 and SFTT=1, RAID5.
I'm running failure resilience tests and I found a strange behavior.
When I fail 2 ESXi hosts on the same site (site A), VMs are restarted on both sites (site A and site B).
From the docs I believed that the vSAN storage on site A would be shut down, and that VMs on site A (on the 2 ESXi hosts still alive) would read and write to the vSAN storage on site B over the cross-site link.
But then, why are VMs restarted on site A? Wouldn't it be more logical to restart them on site B?
How can I know from which site a VM is currently accessing its data?
How can I know if vSAN is considering a site as failed?
There currently is no direct integration between HA/DRS and vSAN. I understand why you would expect this, but it doesn't work like that. HA will restart the VMs based on the last known resource utilization and restart each VM where it feels it would fit best. So even though the vSAN components in that fault domain may not be available, HA may still try to restart the VMs in that fault domain. This has been raised with the vSAN engineering team before as something we can improve; I will raise it again for you.
One thing to point out is that if HA and DRS deemed the whole location failed, the available compute resources would go to waste, as VMs would only run in 1 location. It is a tricky situation, to be honest.
Thank you for your information. Yes I agree with you, it's a shame to waste resources on one complete site.
But I think something is missing from my understanding. In a 4+4+1 scenario with 2 hosts down on site A, PFTT=1 and SFTT=1:
Are the VMs running on site A accessing their data via the vSAN storage on site B through the cross-site link?
If so, then the waste of resources has to be weighed against the performance decrease of accessing data via the cross-site link instead of locally (for READS at least).
What I'm not sure about is how to read the "failure scenario matrix": Failure Scenario Matrices | vSAN Stretched Cluster Guide | VMware
Single site failure (PFTT) and single disk, disk group, host failure across remaining sites (SFTT)
Site marked as failed, disk/disk group/host also marked as failed.
Disk and disk group failures will not affect VM running state.
VMs will continue running if they are running on a host/site other than the ones that failed.
If the VM was on the failed host/site a HA restart of the VM will take place.
What I understand is that HA should power off the VMs still running on site A and restart them on site B. But maybe my English is the problem here.
Because what I understood is that with 4+4+1 and 2 ESXi hosts down on the same site, the whole site (failure domain) is considered down.