VMware Cloud Community
RSEngineer
Enthusiast

vSAN Stretched Cluster SFTT=0

Hello, all.

I'm trying to size a VxRail cluster and I need to understand the implications with regard to failures of a 2+2+1 set up. I assume that a 2+2+1 set up means a PFTT=1 (default - and I want that) and an SFTT=0 (??). Correct on the SFTT? If so, then with an SFTT of 0, how are different local failures handled (short of an entire site failure)? I can't seem to find any helpful material from VMware on this scenario, meaning a nice deep dive.

Also, VMware doesn't seem to go too deep on failure scenarios in vSAN in general. 

Can anyone help?

16 Replies
TheBobkin
Champion

@RSEngineer 
"I assume that a 2+2+1 set up means a PFTT=1 (default - and I want that) and an SFTT=0"
Correct as a minimum of 3 data-nodes per site would be required for placement of SFTT=1,SFTM=RAID1 Objects (and 4-nodes each side if wanted to use SFTT=1,SFTM=RAID5).
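As a rough illustration of the host counts above, here is a minimal Python sketch. The function names are made up for this example and are not part of any VMware tool. It assumes the usual rule that a RAID1 mirror tolerating n failures needs 2n+1 hosts, and it adds RAID6 (6 hosts) as an extra case that the reply above does not mention.

```python
def min_hosts_per_site(sftt, sftm="RAID1"):
    """Minimum data hosts needed in EACH site to place the site-local layout.

    Hypothetical helper for illustration only.
    """
    if sftt == 0:
        return 1  # a single unprotected local copy fits on one host
    if sftm == "RAID1":
        return 2 * sftt + 1  # mirror tolerating n failures needs 2n+1 hosts
    if (sftm, sftt) == ("RAID5", 1):
        return 4
    if (sftm, sftt) == ("RAID6", 2):
        return 6
    raise ValueError(f"unsupported combination: SFTT={sftt}, SFTM={sftm}")


def max_raid1_sftt(hosts_per_site):
    """Highest RAID1 SFTT a site of this size can place (capped at 3)."""
    return min(max((hosts_per_site - 1) // 2, 0), 3)


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, "hosts per site -> max RAID1 SFTT:", max_raid1_sftt(n))
```

With 2 data nodes per site the helper returns 0, which is why a 2+2+1 layout can only carry SFTT=0 policies.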

 

If a node failed on one site then it would attempt to repair as much of the data from that site as possible (assuming it won't all fit) on the remaining node on that site. If a Disk-Group failed then it would repair the data from it on either the other node on that site or the remaining Disk-Groups on the node with the failure (assuming it has multiple Disk-Groups). It would violate a PFTT=1 Storage Policy to try to repair any of this data onto the other site in the cluster, as then both copies would be in a single Fault Domain.
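The placement constraint can be sketched in a few lines of Python. This is a toy model, not vSAN's placement engine: the host names, site labels and free-space figures are hypothetical, and it treats one component at a time rather than modelling partial repairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    site: str      # fault domain the host belongs to
    free_gb: int   # hypothetical free capacity


def repair_targets(hosts, failed, needed_gb):
    """Names of hosts that could receive a rebuilt component.

    Only hosts in the SAME site as the failed host qualify; placing the
    rebuilt copy in the other site would put both replicas in one fault
    domain and break PFTT=1.
    """
    return [
        h.name
        for h in hosts
        if h.site == failed.site
        and h.name != failed.name
        and h.free_gb >= needed_gb
    ]


if __name__ == "__main__":
    cluster = [
        Host("N1", "A", 500),
        Host("N2", "A", 400),
        Host("N3", "B", 900),
        Host("N4", "B", 900),
    ]
    n2 = cluster[1]
    print(repair_targets(cluster, n2, 300))  # fits on the surviving site-A host
    print(repair_targets(cluster, n2, 600))  # too big: no valid target
```

Note that the hosts in site B are never candidates, however much free space they have.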

 

"Also, VMware doesn't seem to go too deep on failure scenarios in vSAN in general. "
Sorry but I completely disagree and not to be mean but these can be found with very simple Google searches e.g.:
https://core.vmware.com/resource/vsan-availability-technologies#section13
https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.vsan-monitoring.doc/GUID-35A4B700-6...
https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.vsan.doc/GUID-08911FD3-2462-4C1C-AE...

 

While technically personal blogs, @CHogan  and @depping  are VMware employees that have been working with vSAN since the start and frequently cover in very deep detail how vSAN reacts in failure scenarios depending on the configuration etc.:
http://www.yellow-bricks.com/
https://cormachogan.com/vsan/

RSEngineer
Enthusiast

Thank you very much for the feedback. Let me ask a few things further. 

"If a node failed on one site then it would attempt to repair as much of the data from that site as possible (assuming it won't all fit) on the remaining node on that site, "

For an SFTT of 0, that seems quite "tolerant," no? Let me give a scenario to showcase what I mean. 

Site A has 2 nodes - N1 and N2. 

Site B has 2 Nodes - N3 and N4

A VM is running on N1 and its VMDK is on N2. N2 dies altogether. N1 survives. I would have thought that no attempt to repair or rebuild would take place at Site A because the SFTT=0. I would have thought that the mirrored copy of that VMDK on Site B would be immediately leveraged by vSAN and that the VM in Site A would remain in place (adhering to Site affinity rules) and access I/O across the inter-site link. 

But if repairing and rebuilding DO INDEED take place at Site A, as you describe, and it has a SFTT of 0, then why do we say there is no protection? Can you clarify what exactly - in practice - an SFTT of 0 really means? 

depping
Leadership

There's a very extensive stretched cluster guide to be found here:

https://core.vmware.com/resource/vsan-stretched-cluster-guide

It is pretty straightforward:

PFTT = Primary Failures To Tolerate = Protection Across Locations

SFTT = Secondary Failures To Tolerate = Additional Protection Within Locations

With PFTT you ensure a copy of the data is available in both locations; with SFTT you can additionally protect that copy locally against failures. So you have a RAID-1 configuration across locations and, if desired, RAID-1, 5, or 6 within each location.

What is the benefit of SFTT? It adds two things:

1. If a local copy fails, data repair will happen locally

2. It adds an extra level of availability, as you can tolerate more failures before the VM becomes inaccessible

So even if PFTT=1 and SFTT=0 and a disk fails in a location, vSAN will still try to repair the impacted VM/Objects to meet the specified policy!
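One way to see what each setting costs is to compute the raw-capacity multiplier of a policy. The sketch below is a back-of-the-envelope Python model with assumed overhead figures (RAID5 as 3 data + 1 parity, RAID6 as 4 data + 2 parity); it ignores witness components and slack space, and the function name is hypothetical.

```python
from fractions import Fraction

# Assumed site-local overhead for erasure coding (3+1 and 4+2 layouts).
ERASURE_OVERHEAD = {"RAID5": Fraction(4, 3), "RAID6": Fraction(3, 2)}


def raw_multiplier(pftt, sftt, sftm="RAID1"):
    """Raw capacity consumed per unit of VM data for a given policy."""
    site_copies = pftt + 1  # PFTT=1 keeps one copy in each of two sites
    if sftt == 0:
        local = Fraction(1)            # single copy inside each site
    elif sftm == "RAID1":
        local = Fraction(sftt + 1)     # n+1 mirrors inside each site
    else:
        local = ERASURE_OVERHEAD[sftm]
    return site_copies * local


if __name__ == "__main__":
    for policy in [(1, 0, "RAID1"), (1, 1, "RAID1"), (1, 1, "RAID5")]:
        print(policy, "->", raw_multiplier(*policy), "x raw capacity")
```

Under these assumptions PFTT=1, SFTT=0 consumes 2x, while adding SFTT=1 with RAID1 doubles that again to 4x.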

RSEngineer
Enthusiast

Why do my posts keep getting deleted!?!?!

I HAVE RESPONDED TO DUNCAN EPPING 3 TIMES! AND EVERY TIME I DO, THE POST GETS DELETED!  WHY???

TheBobkin
Champion

@RSEngineer, If it helps any, we can see this one.

RSEngineer
Enthusiast

Thanks, TheBobkin.

 

Must say...It's annoying. Please forgive my complaining, but this board is flaky. Has so many quirks. I typed a nice, detailed response to Duncan and reposted it 3 times. 

 

OK, so bottom line...

 

I will accept that even with an SFTT of 0 that vSAN will try to rebuild WITHIN the site where the failure occurred. But I don't see why if the SFTT is 0. Zero to me, and to every other engineer who I know works with vSAN and sells it as an SE,  means that, once node 2 in my example fails, vSAN should failover to the mirrored VMDK and that's that. If even with an SFTT of 0 vSAN is still going to try to rebuild at the site where the failure occurred, then why add extra nodes and waste money? Just do a 2+2+1, as opposed to say, 3+3+1. Just add some extra disk on the 2 nodes at each site and let the rebuild occur. 

 

In fact, some people think that the site should have failed altogether once node 2 failed with an SFTT of 0 for that site -- and all VMs should fail to the secondary site. Again, I accept that that is not the case, but it still doesn't make much sense. Think of a 3-node cluster with an FTT of 1 and RAID 1 in place. If 1 node fails, your SPBM will be in violation (FTT of 0) AND if another failure occurs, the whole cluster will be down because you will have lost quorum.

TheBobkin
Champion

@RSEngineer 
"Must say...It's annoying. Please forgive my complaining, but this board is flaky. Has so many quirks. I typed a nice, detailed response to Duncan and reposted it 3 times. "
Sure, but bear in mind it has existed on this new platform for all of a week now (migrated from Jive to Khoros with a LOT of changes). Not to knock this platform or any other, but I have worked with enough 'put your words in the box, we keep them safe...' platforms/forms/websites/youNameIts that if I am writing something longer than 2 sentences then it is being done locally in Notepad++. This isn't always the platform's fault; maybe your login token for X was just about to expire when you clicked the button.

 

"But I don't see why if the SFTT is 0. Zero to me, and to every other engineer who I know works with vSAN and sells it as an SE, means that, once node 2 in my example fails, vSAN should failover to the mirrored VMDK and that's that"
Because it is the SFTT, i.e. Secondary to the Primary (the default cluster setting, which still exists and has been the default all along). If you have PFTT=1, SFTT=0, this behaves exactly as it would have before SFTT even existed, i.e. it WOULD fail over to running off the remaining data-replica and it WOULD try to rebuild the lost replica if there was still available space and available Fault Domains (without violating the SP).

 

"If even with an SFTT of 0 vSAN is still going to try to rebuild at the site where the failure occurred, then why add extra nodes and waste money?"
So that multiple concurrent or staggered failures can be withstood even if the other site fails. Sure, this costs more, but so does anything else that provides durability in depth.

 

"Just add some extra disk on the 2 nodes at each site and let the rebuild occur. "
This won't particularly help if a motherboard, boot device or a shared controller is the point of failure. If you mean add them just after the failure occurred then maybe you are working with folks that can replace/add disks a lot faster than is realistic for most (especially in the current climate) - this also means you have to choose hardware that has ample free slots and that these are readily accessible.

 

"In fact, some people think that the site should have failed altogether once node 2 failed with an SFTT of 0 for that site -- and all VMs should fail to the secondary site."
Are you talking about in a 2+2+1 or 3+3+1? I would think that most (vSAN customers or otherwise) would prefer some of their data redundant shortly after a failure vs NONE of their data redundant.

 

"If 1 node fails, your SPBM will be in violation (FTT of 0) AND if another failure occurs, the whole cluster will be down because you will have lost quorum. "
As I said above, a policy with SFTT=0 works exactly the same as it did before SFTT was even a thing; nothing has changed in this regard.

depping
Leadership

The problem is that people don't understand what vSAN is.

vSAN is a distributed object-based storage system. Availability is specified on a per-object basis. If you create a policy and that policy states:

1 copy of the data per fault domain (PFTT=1), then that is what you get! Even when a host fails. As long as there is a remaining host in the other fault domain vSAN will try to comply with the policy that you created.

With vSAN there's really no such thing as a failed site. The RAID tree within a location for an object may be inaccessible, but that doesn't render the site failed. Failures are handled on a per-object basis, individually. Even when all hosts within a site are down, it is the components in that site that are marked as inaccessible.
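The per-object view can be illustrated with a small Python sketch. It is a deliberate simplification: it assumes one vote per location (real vSAN assigns votes per component) and treats an object as accessible when more than half of its votes and at least one full data replica survive. The location names and the function are hypothetical.

```python
def object_accessible(votes, data_locations, failed):
    """Toy accessibility check for ONE object.

    votes          -- dict mapping location -> number of votes it holds
    data_locations -- locations that hold a full data replica
    failed         -- set of locations that are currently unavailable
    """
    alive_votes = sum(v for loc, v in votes.items() if loc not in failed)
    has_replica = any(loc not in failed for loc in data_locations)
    # Strict majority of votes AND an intact replica are both required.
    return has_replica and 2 * alive_votes > sum(votes.values())


if __name__ == "__main__":
    votes = {"site_a": 1, "site_b": 1, "witness": 1}
    replicas = ["site_a", "site_b"]
    for failed in [set(), {"site_a"}, {"witness"}, {"site_a", "witness"}]:
        state = "accessible" if object_accessible(votes, replicas, failed) else "inaccessible"
        print(sorted(failed), "->", state)
```

In this model nothing ever marks a whole site as failed; each object is simply evaluated against whichever of its own components are still reachable.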

Alain81
Contributor

Hello,

Interesting topic, thanks for the valuable information. I need to validate what I understood if I may:

1- A 2 + 2 + 1 vSAN stretched cluster with PFTT = 1 and SFTT = 0 is a valid design supported by VMware

2- In this scenario, if 1 node fails, the cluster will try to rebuild the lost data on the other node within the same site. That is part of the architecture of vSAN and works independently of SFTT being 0. If successful, this will re-protect the lost data, i.e. it will recopy it from the mirror on site 2 to the remaining node on site 1, provided that node has enough space. (Performance impact during this phase?)

3- In this scenario, if Site 1 fails (both nodes), but we still have the witness and site 2 (both nodes), the cluster would still be fully functional but data is no longer protected

Thank you,

Alain

TheBobkin
Champion

Hello @Alain81 

1. Yes, this is how data has been stored in Stretched and 2-node clusters as the default setting since day 1 of these existing.

2. Yes, this is the PFTT=1 part, i.e. there is another full data-replica still available on the other site (+ Witness components for quorum) to rebuild the data back to a PFTT=1,SFTT=0 redundant state.
The performance impact during repair should be fairly minimal, as the adaptive resync scheduler should limit (storage) throughput for the resync data if it is causing too much contention with normal VM IOs at that time. That being said, how much it has to repair depends on how much data needs to be recreated, and how long this will take depends on the performance of the hardware, the Disk-Group(s) configuration and how many nodes are available to receive the new data (assuming a larger than 2+2+1 cluster).

3. Correct.
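For sizing a 2+2+1, the two questions in point 2 (will it fit, and how long will it take) can be roughed out as below. This is only a planning sketch in Python: the function names, the 80% fill ceiling and the throughput figure are assumptions for illustration, not VMware guidance, and real resync rates vary because the scheduler throttles them against VM I/O.

```python
def fits_after_rebuild(capacity_gb, used_survivor_gb, used_failed_gb, max_fill=0.8):
    """True if the surviving node can absorb the failed node's data.

    max_fill is an assumed ceiling on how full the survivor may become.
    """
    return used_survivor_gb + used_failed_gb <= capacity_gb * max_fill


def resync_hours(data_gb, throughput_mb_s):
    """Naive resync duration at a constant sustained throughput."""
    seconds = data_gb * 1024 / throughput_mb_s
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical site: two nodes of 10 TB raw each.
    print(fits_after_rebuild(10_000, 4_000, 3_500))  # survivor ends at 75% full
    print(fits_after_rebuild(10_000, 4_000, 4_500))  # survivor would exceed 80%
    print(round(resync_hours(3_500, 512), 2), "hours at an assumed 512 MB/s")
```

In other words, a 2+2+1 only re-protects fully after a node loss if each node is kept well under half full.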