VMware Cloud Community
georgemason
Contributor

Enabling fault domains on existing vSAN cluster

Hi,

I have been considering enabling fault domains on my vSAN infrastructure, as all servers do not reside in the same rack. The storage is about 70% consumed across 4 hosts. I figure that doing so will generate a lot of sync traffic as vSAN works to get VMs into a compliant state after the policy has been changed, but is there any way I can work out how much sync traffic is going to be generated?

Essentially I'm trying to work out what sort of effect this might have on the workload and whether it needs to be done out of hours. Also trying to establish whether there is any way to stop the sync from overrunning into working hours.

Basically any relevant info or experience of this would be welcome!

Thanks in advance

George

1 Reply
TheBobkin
Champion

Hello George,

"I have been considering enabling fault domains on my vSAN infrastructure, as all servers do not reside in the same rack."

How are the servers distributed?

Just so you are aware, each node essentially acts as its own Fault Domain (FD) by default when none are configured:

e.g. in a 4-node standard cluster, a RAID1, FTT=1 Object will be distributed as data+data+witness components on nodes 1, 2, 3. As Objects are added and space is assigned, this evens out data component usage across all available nodes (e.g. creating another Object will place witness+data+data on nodes 2, 3, 4). Placement is dictated by the relative free space on the nodes, together with the need to place the components across the minimum number of FDs required to be compliant with the Storage Policy.

So, with a 4-node cluster spread across racks, you most likely have one of the following layouts:

2x2 - in which case creating 2 FDs won't work, as you require at least 3 FDs for RAID1, FTT=1 component placement (or 4 FDs for RAID5, FTT=1).

4x1 - in which case there is no benefit to defining explicit FDs: as per my first point above, if your Objects are compliant with their Storage Policy, components are already distributed in this manner.
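
To make the FD-count check above concrete, here is a minimal Python sketch. The function and table names are made up for illustration (they are not part of any vSAN tooling), and the table only contains the two policy figures mentioned above:

```python
# Minimum number of fault domains per storage policy, using only the
# figures quoted in this reply. Names here are hypothetical.
MIN_FDS = {
    ("RAID1", 1): 3,  # data + data + witness
    ("RAID5", 1): 4,
}


def layout_supports(fd_sizes, raid, ftt):
    """Return True if a layout (hosts per FD) has enough FDs for the policy."""
    return len(fd_sizes) >= MIN_FDS[(raid, ftt)]


# 2x2: two FDs of two hosts each
print(layout_supports([2, 2], "RAID1", 1))
# 4x1: every host is effectively its own FD
print(layout_supports([1, 1, 1, 1], "RAID1", 1))
```

The 2x2 layout fails the check for either policy, while 4x1 passes both, which matches the default behaviour with no FDs configured.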

In the case of a larger cluster, e.g. a 16-node cluster divided into 4x4 FDs, then yes, there would potentially be a huge amount of resync. It may not be comparable to extremes such as changing the structure of all Objects (e.g. changing from RAID1 to RAID5, or changing Stripe-Width), but it is definitely something that should be done out of hours (and, depending on the hardware/infrastructure and size, it may take days).
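
For a rough feel of how much data might need to move, a toy estimate along the following lines may help. This is only a sketch under loud assumptions: the inventory format, host names and sizes are invented for illustration, and the model simply assumes that when two components of one Object land in the same FD, all but the largest must be relocated. Real vSAN placement may well move more than this, so treat the result as a lower bound at best.

```python
from collections import defaultdict


def estimate_resync_gb(objects, host_to_fd):
    """Toy lower-bound estimate (GB) of component data that must move.

    objects:    {object_name: [(host, size_gb), ...]}  (hypothetical inventory)
    host_to_fd: {host: fault_domain_name}              (the planned FD layout)
    """
    total = 0.0
    for components in objects.values():
        sizes_by_fd = defaultdict(list)
        for host, size_gb in components:
            sizes_by_fd[host_to_fd[host]].append(size_gb)
        for sizes in sizes_by_fd.values():
            # Keep the largest component where it is; the rest must leave this FD.
            total += sum(sizes) - max(sizes)
    return total


# Hypothetical example: h1 and h2 share a rack, h3 and h4 are on their own.
host_to_fd = {"h1": "rackA", "h2": "rackA", "h3": "rackB", "h4": "rackC"}
objects = {
    "vm1-disk": [("h1", 100.0), ("h2", 100.0), ("h3", 0.01)],  # both data copies in rackA
    "vm2-disk": [("h2", 50.0), ("h3", 50.0), ("h4", 0.01)],    # already spread over 3 FDs
}
print(estimate_resync_gb(objects, host_to_fd))
```

In this made-up example only vm1-disk is affected, since both of its data components would sit in the same FD.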

There is of course the capability in vSAN to throttle resync cluster-wide, which can prevent such activities from causing contention for bandwidth if they do need to run during regular hours (some businesses have no 'out of hours' at all).

Bob
