rleon_vm
Enthusiast

VSAN Stretched Cluster & 2 Node Guide - Failure Scenario Question

Hi all,

On page 138 of the following VSAN Stretched Cluster & 2 Node Guide:

https://storagehub.vmware.com/export_to_pdf/vsan-stretched-cluster-2-node-guide

... It gives you a scenario where:

PFTT=1, SFTT=2, FTM=R5/6

Note: FTM is actually RAID6 in this case, because SFTT=2.

So, at minimum, we need a 6+6+1 Stretched Cluster, because RAID6 requires at least 6 hosts at each of the two sites (+1 witness at a 3rd location).
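
To make that sizing concrete, here is a minimal sketch of the host-count math (the helper name is hypothetical, not official VMware tooling; the per-FTM counts follow vSAN's documented placement rules):

```python
# Minimum hosts per data site for a given local protection policy.
# Illustrative only; numbers follow vSAN's documented placement rules:
# RAID-1 needs 2*FTT+1 hosts, RAID-5 needs 4 (3 data + 1 parity),
# RAID-6 needs 6 (4 data + 2 parity).

def min_hosts_per_site(sftt: int, ftm: str) -> int:
    if ftm == "RAID-1":
        return 2 * sftt + 1
    if ftm == "RAID-5/6":
        if sftt == 1:
            return 4   # RAID-5
        if sftt == 2:
            return 6   # RAID-6
    raise ValueError("unsupported SFTT/FTM combination")

print(min_hosts_per_site(2, "RAID-5/6"))  # 6 -> hence a 6+6+1 cluster
```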

In one of the failure scenarios (page 139), it says:

Single site failure (PFTT) and single disk, disk group, host failure across remaining sites (SFTT)

Translation: one site is down (say the Preferred site; it doesn't matter which), and then there is a single disk/disk group/host failure in the surviving site.

Outcome: VMs get HA-restarted at the surviving site.

The above looks right.

But in another scenario (page 140), it says:

Single site failure (PFTT) and dual disk, disk group, host failure across remaining sites (SFTT).

This time, it says the whole cluster will be offline and all VMs will stop running at both sites.

My question is, isn't SFTT=2 supposed to allow two host failures in the surviving site? Why does the whole cluster go offline?

Thanks for your input.

12 Replies
roman79
Enthusiast

Hi @rleon_vm,

It is explained in the same paragraph of the vSAN Stretched Cluster & 2 Node Guide: "The policy specifies "SFTT=2" which, during a PFTT violation is counted globally across sites due to quorum implications."

My question is, isn't SFTT=2 supposed to allow two host failures in the surviving site? Why does the whole cluster go offline?

Good point, considering VMware states the following:

"If PFTT = 1 and SFTT = 2, and one site is unavailable, then the cluster can tolerate two additional host failures." - Introduction to Stretched Clusters

depping, can you please clarify?

rleon_vm
Enthusiast

We have a lab environment, but at the moment it's only 3+3+1, which means I can only test SFTT=1, and not even FTM=RAID5 (that needs 4 hosts per site).

I can confirm the document is at least right about the failure scenario outcomes with the above setup.

Does anyone happen to have a 6+6+1 environment to test the point raised in the opening post?

Thanks!

depping
Leadership

With PFTT and SFTT it works as follows: you can tolerate 1 full site failure, and on top of that SFTT host failures. So with PFTT=1 and SFTT=2 you can tolerate a Preferred site failure and then have 2 host failures, and data would still be available. Where it gets tricky is when the Witness fails, as now the Witness is the "site failure". In that case, if 3 hosts were to fail (2 in Preferred and 1 in Secondary, for instance), then data would become unavailable.
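
One way to express that rule is the quick sketch below (purely illustrative, with a hypothetical helper name; vSAN's real accessibility check is vote-based, as discussed later in this thread):

```python
# A simple sketch of the tolerance rule above (illustrative only):
# the cluster tolerates one "site" failure -- a data site OR the
# witness -- plus up to SFTT additional host failures.

def data_available(site_failures: int, host_failures: int,
                   pftt: int = 1, sftt: int = 2) -> bool:
    # A witness failure consumes the PFTT "site" budget
    # just like a data-site failure does.
    return site_failures <= pftt and host_failures <= sftt

print(data_available(1, 2))  # preferred site down + 2 hosts -> True
print(data_available(1, 3))  # witness down + 3 hosts (2+1)  -> False
```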

depping
Leadership

Just wrote a blog with some diagrams which will hopefully explain it a bit better: http://www.yellow-bricks.com/2018/03/19/vsan-stretched-cluster-pftt-and-sftt-what-happens-when-a-ful...

rleon_vm
Enthusiast

Thanks depping, but that would mean the scenario outcome in the document is wrong.

Or is there another explanation?

To repeat, the document says:

PFTT=1, SFTT=2, FTM=R5/6

Page 140 says:

Single site failure (PFTT) and dual disk, disk group, host failure across remaining sites (SFTT).

Outcome:

VMs will stop running and the cluster will be offline until a site is brought back online.

Nowhere does it say that the witness is offline.

Jasemccarty
Immortal

Let me add to the conversation.

This should be looked at from a "how many components are available" view.

Consider a VM with PFTT=1, SFTT=2, FTM=R5/6. In this situation:
A vSAN object comprises 6 components in the Preferred Site and 6 components in the Secondary Site (assuming the object is <255GB and a Stripe Width of 1).

The Preferred Site has 6 votes from the hosts it is distributed across.
The Secondary Site has 6 votes from the hosts it is distributed across.
The Witness "Site" has 6 votes on the vSAN Witness Host alone.

*vSAN will add an additional vote to ensure an odd count; in this case, that makes 19 votes.

If we lose a single site (say Preferred, but could be Secondary or vSAN Witness Host), we've lost 6 of 19 votes.

Now we have 13 of 19 votes available, which is >50% available.

If we lose 2 disks/disk groups/hosts in the Secondary Site, we're now at 11 of 19 votes available, which is still >50% of the votes.

Losing the vSAN Witness Host as well would result in only 5 of 19 votes remaining (remember, we had an extra vote to ensure an odd count). Because vSAN does not have >50% of the votes available, the object is not accessible.
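
As a rough illustration of that arithmetic (not how vSAN assigns votes internally; names and numbers follow the example above):

```python
# Vote arithmetic from the example above: 6 votes per data site,
# 6 on the vSAN Witness Host, plus 1 extra vote for an odd total of 19.
TOTAL_VOTES = 19
SITE_VOTES = {"preferred": 6, "secondary": 6, "witness": 6}

def object_accessible(lost_votes: int) -> bool:
    """An object stays accessible only while >50% of its votes remain."""
    return (TOTAL_VOTES - lost_votes) > TOTAL_VOTES / 2

print(object_accessible(SITE_VOTES["preferred"]))      # 13/19 -> True
print(object_accessible(SITE_VOTES["preferred"] + 2))  # 11/19 -> True
print(object_accessible(SITE_VOTES["preferred"] + 2
                        + SITE_VOTES["witness"]))      # 5/19  -> False
```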

We cover this in the Stretched Cluster & 2 Node Guide here: Storage and Availability Technical Documents

I hope this helps.

Jase

Jase McCarty - @jasemccarty
Jasemccarty
Immortal

I will see if we can get the docs.vmware.com information updated appropriately.

Jase McCarty - @jasemccarty
rleon_vm
Enthusiast

Hi all,

Thank you for the detailed explanation.

The VSAN component voting mechanism has been made very clear now.

It would be greatly appreciated if the OP's question could also be directly addressed, which is:

Is the outcome stated for that specific scenario (on that specific page) of the document actually correct?

Thanks again.

Daniele_S
Contributor

Really useful topic (I was really confused after reading the failure scenarios in the document).

But I think there is another scenario in the document that is not clear to me:

Page: 139

Policy: PFTT = 1, SFTT = 2, FTM = R5/6

Scenario: Dual disk, disk group, host failure on one site (SFTT)

vSAN Behaviour: Site marked as failed by vSAN; component rebuilds will begin when the site comes online again.

Since SFTT is 2 and the only 2 failures are in a single site, vSAN should still be able to run the VMs in that site without marking the entire site as failed (the vote count should be 17/19 if I understand correctly).

Does that look correct to you?

Thanks! D

Sharantyr3
Enthusiast

Hi there,

I'm digging up an old thread, but it gave me some valuable information and I'd like to ask about some specific details.

I'm currently testing a 4+4+1 stretched cluster with PFTT=1, SFTT=1, FTM=RAID5.

I'm running failure resilience tests and I found some strange behavior.

When I fail 2 ESXi hosts on the same site (site A), VMs are restarted on both sites (site A and site B).

From the docs, I believed that the vSAN storage on site A would be shut down, and that the VMs on site A (on the 2 remaining live ESXi hosts) would read and write to the vSAN storage on site B over the cross-site link.

But then why are the VMs restarted on site A? Wouldn't it be more logical to restart them on site B?

How can I tell from which site a VM is currently accessing its data?

How can I tell whether vSAN considers a site failed?


depping
Leadership

There currently is no direct integration between HA/DRS and vSAN; I understand why you would expect this, but it doesn't work like that. HA will restart VMs based on their last known resource utilization, placing each VM where it feels it would fit best. So even though the vSAN components in that fault domain may not be available, HA may still try to restart the VMs in that fault domain. This has been raised with the vSAN engineering team before as something we can improve; I will raise it again for you.

One thing to point out: if HA and DRS deemed the whole location failed, the available compute resources there would go to waste, as VMs would only run in 1 location. It is a tricky situation, to be honest.

Sharantyr3
Enthusiast

Hi,

Thank you for the information. Yes, I agree with you; it's a shame to waste the resources of one complete site.

But I think something is missing from my understanding: in a 4+4+1 scenario with 2 hosts down on site A, PFTT=1, SFTT=1.

Are the VMs running on site A accessing their data via the vSAN storage on site B through the cross-site link?

If so, then the wasted resources have to be weighed against the performance decrease of accessing data over the cross-site link instead of locally (for reads, at least).

What I'm not sure about is how to read the "failure scenario matrix": Failure Scenario Matrices | vSAN Stretched Cluster Guide | VMware

Single site failure (PFTT) and single disk, disk group, host failure across remaining sites (SFTT)

->

Site marked as failed, disk/disk group/host also marked as failed.

->

Disk and disk group failures will not affect VM running state.

VMs will continue running if they are running on a host/site other than the ones that failed.

If the VM was on the failed host/site a HA restart of the VM will take place.

What I understand is that HA should power off the VMs still running on site A and restart them on site B. But maybe it's my English that is the problem here :)

Because what I understood is that with 4+4+1 and 2 ESXi hosts down on the same site, the whole site (failure domain) is considered down.
