VMware Cloud Community
Philch
Contributor

Converting a standard vSAN cluster to a stretched cluster: what happens to the IO activity?

Hi everybody,

I have a vSAN cluster with 4 hosts and some VMs in production.

After adding 4 new hosts to the standard vSAN cluster and converting it to a stretched cluster (creating both sites, host groups and VM groups), does the mirroring, and therefore the move of data to the new site, start automatically?

What happens regarding IO activity on the vSAN datastore, and what is the performance impact on the running VMs?

Is it possible to monitor it?

Is it possible to manage the data move so as to avoid a performance impact?

Thank you in advance for your help and your answers.

Best regards.

6 Replies
afoignant
Contributor

Hi

The vSAN FAQ document, page 19, says that converting a standard vSAN cluster to a stretched cluster is easy:

"Yes, it is easy to convert a standard (non-stretched) vSAN cluster to a stretched cluster. This is performed in the “Fault Domains & Stretched Cluster” section of the vSAN UI. More details can be found in the vSAN Stretched Cluster Guide ."

It is also explained here :

Storage and Availability Technical Documents

I suppose the question is more about the impact on the existing VMs and data on the primary site, before the conversion begins and after it has been done.

TheBobkin
Champion

Hello afoignant,

You are correct there, I should stop posting late at night :smileygrin: I have removed my comment to avoid confusion.

From an IO perspective there are two scenarios I can see here:

1. The 4 added nodes have been in use for some time and have had data moved onto them with proactive rebalance. Reconfiguring the data placement so that data-replicas are located on each site would be intensive for both reads and writes on all hosts, and the resulting contention for storage resources could increase latency on VMs. However, in 6.6 the resync can be throttled to avoid this contention (at the cost of a longer resync).

2. The 4 added nodes are joined to the cluster and then Fault Domains are configured. Reconfiguring component placement would result in heavy reads from the original hosts (Site A) and heavy writes on the new hosts (Site B) as the data-replicas are written there (and the original 2nd R0 sets are removed from Site A). This would be less impactful than scenario 1 and could also be throttled if needed.

Throttle Resynchronization Activity in the vSAN Cluster

I don't think data will start moving to Site B until the Storage Policy (SP) is re-applied for compliance, so the resync load could be kept low by re-applying the SP to only a few VMs/Objects at a time, and increased at will. I will test this in a lab.

Bob

Philch
Contributor

Hello Bob,

Thank you for your answers.

We should be in case 2. However, we will be on vSAN 6.2, and it seems (from the documentation) that the resynchronization-throttling feature is not available in this version.

Can you confirm?

Best regards.

Philippe

GreatWhiteTec
VMware Employee

Just to clarify: in 6.2, throttle resync was "expanded" (in the background). It can be controlled by advanced commands such as:

- esxcfg-advcfg -s <x> /VSAN/DomCompResyncThrottle

- vsish -e set /vmkModules/vsan/dom/compSchedulers/<component_id>ResyncIopsLimit 1000

But that is something that should probably follow GSS recommendations.

In 6.6 that feature was added to the UI for customers to tweak as needed. In addition, in 6.6 you also have Primary Failures to Tolerate (PFTT) and Secondary Failures to Tolerate (SFTT), which will be very helpful if you are planning a stretched cluster. PFTT can be used to determine which objects to move to the secondary site at your own pace, so this may be something you want to look into as well.

TheBobkin
Champion

Hello Philippe,

I have tested this procedure in vSAN HOL 1808: I configured a 3-node vSAN cluster, migrated the 'core-A' VM to this cluster and its vsanDatastore, added a few disks with different SPs to this VM, and cloned the VM to put more data on the cluster. I then added 3 more nodes to the vSphere cluster (hosts 04-06 require their vSAN vmks added/IPs changed, plus the default gateway changed), configured the Fault Domains and Witness, and created disk-groups on hosts 04-06.

http://labs.hol.vmware.com/HOL/catalogs/lab/3651

As soon as the stretched cluster was configured, all VM Objects went to the state 'Reduced Availability with no rebuild' but remained accessible throughout (checked via the console of running VMs); they then automatically started resyncing as soon as disk-groups were configured on the new hosts. Thus I don't think you will have the choice of selectively resyncing a few VMs/Objects at a time by re-applying Storage Policies (but do also bear in mind that HOL is nested and can have quirky behaviour :smileygrin:)

The reliable and safe method of throttling resync in 6.2 is to lower the number of in-flight resync copy operations from the default of 50, anywhere down to 1, e.g.:

#vsish -e set /vmkModules/vsan/dom/MaxNumResyncCopyInFlight 1

This needs to be applied to all hosts and will of course slow down the resync.

Obviously there is only a need to do this if the additional load of the resync is causing unacceptable latency for production VMs.

You could start the resync with this lowered to 1 on all hosts and gradually increase while periodically checking that VM latency is acceptable.
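Since the setting has to be pushed to every host and revisited as you raise the limit, the step can be scripted. A minimal sketch (the host names and root SSH access are assumptions, not from the thread; it only prints the per-host commands so you can review them before running):

```shell
# Hedged sketch, not an official VMware procedure: generate the per-host
# commands that lower the vSAN resync copy-in-flight limit from the default
# of 50 down to LIMIT. Substitute your own ESXi host names, then run the
# printed lines from a box with SSH access to each host.
HOSTS="esx01 esx02 esx03 esx04 esx05 esx06 esx07 esx08"  # hypothetical names
LIMIT=1   # 1 = slowest resync / least contention; raise gradually toward 50

CMDS=""
for h in $HOSTS; do
  CMDS="${CMDS}ssh root@${h} vsish -e set /vmkModules/vsan/dom/MaxNumResyncCopyInFlight ${LIMIT}
"
done
printf '%s' "$CMDS"
```

Re-running the loop with a higher LIMIT regenerates the commands for the next step up, so you can ratchet the value while watching VM latency.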

GreatWhiteTec

"Just to clarify. In 6.2 throttle resync was "expanded" (in the background). This can be controlled by advanced commands such as:

- esxcfg-advcfg -s <x> /VSAN/DomCompResyncThrottle

- vsish -e set /vmkModules/vsan/dom/compSchedulers/<component_id>ResyncIopsLimit 1000"

I am unsure of the safety of using either of those pre-6.6; I know that 'pausing' resync through similar means could cause mass PSODs, and I can't recall whether throttling could have the same effect.

Also, in the ResyncIopsLimit vsish command, those are NOT component UUIDs being referenced; they are CMMDS UUIDs referencing the cache-tier SSDs of the host the command is run on.

Bob

Philch
Contributor

Hi GreatWhiteTec and Bob,

Thank you for your answers.

Bob, I will use your URL. I didn't think we could create a cluster in the HOL. Great, I will test it.

Your suggestion about lowering the resync value is a very good idea; it is what we usually do when migrating storage whose impact is unknown, whenever that is possible.

I will come back to report once the cluster expansion and the stretched-cluster conversion have been done (it should happen in February).

Thank you again for your help.

Philippe
