LR2
Contributor

Self-healing requirement in a vSAN stretched cluster

We are planning to have 4 nodes in the primary data center + 4 nodes in the secondary data center. I understood from reading that self-healing requires N+1 nodes to work - Virtual Blocks: vSAN Deployment Considerations (vmware.com)

I have a debate going on about this subject in the "stretched cluster" scenario - as per some folks, our scenario is considered N+N. However, as per the VMware documentation below: "A RAID-5 erasure code typically amplifies a single write operation into 4 I/Os, but since this occurs on each site, a single write operation will be a total of 8 I/Os. Since vSAN emphasizes data consistency in its architecture, it must wait for all the I/O operations of the participating hosts on both sites to complete before the write acknowledgment can be sent back to the VM."

So as per this there are two write operations, one on the preferred site and one on the secondary site, which effectively means vSAN treats the RAID-5 sets in the two data centers as separate entities, even though it resembles a single stretched RAID-5.
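Just to put the arithmetic from that quote into a quick sketch (Python, my own illustration; the only "real" number is the 4 I/Os per site from the doc, the workload size is invented):

```python
# Rough sketch of the write amplification quoted above (classic RAID-5
# read-modify-write path; per-site figure per the VMware doc).

IOS_PER_WRITE_PER_SITE = 4   # read data + read parity + write data + write parity
SITES = 2                    # each site holds its own RAID-5 set

guest_writes = 1000
backend_ios = guest_writes * IOS_PER_WRITE_PER_SITE * SITES

print(f"{guest_writes} guest writes -> {backend_ios} backend I/Os "
      f"({IOS_PER_WRITE_PER_SITE} per site x {SITES} sites)")
```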

So in this case, as per my understanding, we need N+1 nodes on each site for self-healing to work. Let me know your thoughts.

Kindly look at Figure 3, and the explanation below it, in the following doc: "Performance with vSAN Stretched Clusters | VMware"

 

TheBobkin
Champion

Hello @LR2 and welcome to the community.

Just a small clarification: objects stored with a RAID-5 policy have to do 2 writes + 2 reads for each write IO (single site; double this for PFTT=1, SFTT=1, FTM=R5+R5). This is because vSAN has to read and then write both the data block being changed and the parity block that must be updated based on that change; this is 'standard' RAID-5 write amplification. However, there have been significant changes to how this works in 7.0 U2 and again in 7.0 U3, which basically use new algorithms to reduce it: https://core.vmware.com/blog/raid-56-erasure-coding-enhancements-vsan-7-u2 https://core.vmware.com/blog/improving-raid-56-vsan-7-u3-using-heuristics
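To show where those 2 reads + 2 writes come from, here is a toy sketch (plain XOR parity math only, not vSAN's actual code path):

```python
# Toy illustration of the RAID-5 read-modify-write: the new parity is derived
# from the old data, the old parity and the new data.

def rmw_new_parity(old_data: int, old_parity: int, new_data: int) -> int:
    """Parity for the stripe after overwriting one data block.

    On disk this means reading old_data and old_parity (2 reads), then
    writing new_data and new_parity (2 writes) = 4 I/Os per guest write;
    this happens on each site for an R5+R5 object in a stretched cluster.
    """
    return old_parity ^ old_data ^ new_data

# Tiny 8-bit "blocks": the stripe parity stays consistent after the update.
a, b, c = 0b10100001, 0b01101100, 0b00001111
parity = a ^ b ^ c
new_a = 0b11110000
assert rmw_new_parity(a, parity, new_a) == new_a ^ b ^ c
```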

What I gather your main concern is here is whether a 4+4+1 stretched cluster can heal back to full redundancy after a data-node failure when using a PFTT=1, SFTT=1, FTM=R5+R5 policy (like in Figure 3, as you mentioned). The answer is no, as this policy requires a minimum of 4+4+1 nodes for component placement and you would be left with 3+4+1 (or 4+3+1).

That being said, an R5+R5 policy already has so much redundancy built into it that being able to automatically heal back to full redundancy isn't so much of a concern: such a cluster could tolerate a whole site down plus another disk/Disk-Group/node failure in the remaining site (e.g. 0+3+1 remaining) and the data would still be accessible. So while going N+1 (a 5+5+1 configuration) would be nice, I wouldn't see it as a necessity; the vast majority of solutions wouldn't have this level of redundancy and would be basing their recovery strategy on 1-2 elements failing.
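If it helps, the placement logic boils down to something like this (just a sketch; the 4-hosts-per-site minimum for a RAID-5 leg is the only 'real' number in it, the helper function is mine):

```python
# Sketch of the component-placement argument: a PFTT=1/SFTT=1/FTM=R5+R5 object
# needs 4 data hosts per site (plus the witness), so a 4+4+1 cluster has no
# spare host to rebuild onto after losing a data node.

R5_HOSTS_PER_SITE = 4  # minimum hosts per site for one RAID-5 leg

def can_self_heal(site_a_hosts: int, site_b_hosts: int) -> bool:
    """True if a node failure in either site still leaves enough hosts there
    to re-place a full RAID-5 leg."""
    return (site_a_hosts - 1 >= R5_HOSTS_PER_SITE and
            site_b_hosts - 1 >= R5_HOSTS_PER_SITE)

for a, b in [(4, 4), (5, 5)]:
    verdict = "can" if can_self_heal(a, b) else "cannot"
    print(f"{a}+{b}+1 stretched cluster {verdict} re-protect after a data-node failure")
```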

kastlr
Expert

Hi,

 

I need to add a comment here, as I disagree that a 5+5+1 stretched cluster with EC (or RAID-5) would only be "nice".

IMHO it should be the recommended way when the majority of VMs are protected with pFTT=1, sFTT=1 and FTM=EC.

Simply because there are some scenarios where vital services or tasks would fail when a single node is absent/lost on a 4+4+1 stretched cluster (also true for 3+3+1 when using FTM=MIR).

 

Imagine the following scenario.

The majority of VMs are protected with pFTT=1, sFTT=1 and FTM=EC, so we would have a "mirrored RAID-5".

Let's assume a node would fail.

 

At the datacenter where the node failed there would be 3 nodes left.

So our cluster would still have 7 fully operational hosts and the witness.

VMs could operate normally; users might report a performance dip.

 

While access to the VMs is still possible, some tasks would fail:

  • adding new Disks (using pFTT=1, sFTT=1 and FTM=EC) to ANY VM would fail
  • Backups would also fail, as Snapshots (aka delta disks) inherit the SPBM policy from the base disk

This is caused by the fact that creating new objects with pFTT=1, sFTT=1 and FTM=EC in vSAN requires at least 2 * 4 nodes (as long as the Force provisioning flag isn't set).

While we still have 7 fully operational hosts, only one datacenter would fulfill the requirement of 4 nodes per datacenter.
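A small sketch of that check (my own illustration; only the 4-hosts-per-site requirement and the Force provisioning exception come from the behaviour described above):

```python
# New pFTT=1/sFTT=1/FTM=EC objects need 4 hosts in *each* site, so 3+4
# surviving hosts is not enough even though 7 hosts are still fully operational.

EC_HOSTS_PER_SITE = 4

def can_provision_ec_object(hosts_per_site, force_provisioning=False):
    """Can a new RAID-5/EC object (one leg per site) be created right now?"""
    if force_provisioning:
        return True  # placed without full compliance, to be re-protected later
    return all(hosts >= EC_HOSTS_PER_SITE for hosts in hosts_per_site)

print(can_provision_ec_object([4, 4]))  # healthy 4+4+1 -> True
print(can_provision_ec_object([3, 4]))  # one node lost -> False: new disks and snapshots fail
```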

So while a 4+4+1 stretched cluster using FTM=EC is a really reliable solution and enables customers to survive a full datacenter failure (when properly sized), there is still a measurable benefit in adding an extra node to each datacenter.

 

I always argue the other way round.

Based on my experience, a datacenter outage isn't resolved within hours; it usually takes days or weeks.

Additionally, that would mean the surviving resources have to handle twice the load, which would stress them more than usual.

And therefore I recommend that customers still invest in (N+1) + (N+1) + 1 stretched cluster designs.

As this is the only way to protect their business against a whole site failure AND minimize the impact on regular operations, even in the case of a site failure.
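To make the "twice the load" point concrete (the workload numbers are invented, only the doubling is real):

```python
# After a full site failure the surviving site carries both sites' VMs, so an
# extra node per site directly lowers the per-host load during the outage.

TOTAL_WORKLOAD = 800  # arbitrary units of CPU/RAM demand across the whole cluster

for hosts_per_site in (4, 5):
    normal = TOTAL_WORKLOAD / (2 * hosts_per_site)   # both sites available
    site_down = TOTAL_WORKLOAD / hosts_per_site      # one whole site lost
    print(f"{hosts_per_site}+{hosts_per_site}+1: {normal:.0f} units/host normally, "
          f"{site_down:.0f} units/host after a site failure")
```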


Just my 2 cents.

 

Regards,

Ralf


Hope this helps a bit.
Greetings from Germany. (CEST)