VMware Cloud Community
Shervinn
Contributor

SRM and vSphere Replication - Reprotect very slow

Greetings,

I am doing some work for a customer, trying to optimize their current SRM setup.

Environment is set up in Production and DR, with 4 hour RPO. 10Gbps link with <1ms RTT, unshaped and near 0% packet loss.

When running recovery plans, the Reprotect step is extremely time-consuming.

For a 100GB test VM, the initial Reprotect takes roughly 30 minutes. After failback, the second Reprotect takes roughly 50 minutes, all with no data changes.

This is the same for Production VMs.

From all the documentation I have seen, vSR initiates a full sync between the Production and DR and does a checksum of the entire vDisk.

Is there any way to work around this or speed up the checksum process? I've read that this has been optimised in vSphere Replication 6.0 onwards?

ESX, SRM, vSR are all version 6.0 upwards. vDisk is thin and Datastores are all VMFS.

8 Replies
basher
VMware Employee

Hello

As a matter of fact, vSphere Replication does have optimizations around full syncs; see the VMware vSphere Replication 6.0 Release Notes.

This is especially effective for thin-provisioned and lazy-zeroed thick disks on VMFS.

Stefan

Director - VMware Site Recovery Manager
Shervinn
Contributor

Thanks Stefan,

Yes this release has already been applied and the disks are thin-provisioned.

Are these times normal?

Baoth
Enthusiast

Sorry to drag up an old thread, but it's not been answered, so...

I've just experienced a similar thing.

Running a DR test over the weekend, a planned migration of a set of VMs took place. Sites connected by 2 x 10GB links. Two of the VMs within the Protection Groups that formed part of the Recovery Plan were on the larger side: one, for example, was around 4TB, and the other just under that.

Failover (planned migration) is not too bad in fairness; it took around 35 minutes from Site A to B. Reprotection took around 4 hours. Failback (again, planned migration) took around an hour and a half.

On the second reprotect, I initiated it at 14:55, and at 20:09 the Protection Groups finally displayed an "OK" status. Even then, vSphere Replication was still working through Initial Sync.

We're using vSphere Replication version 6.1.2.

Based on this, if we tried to do a full DR test via planned migration, we'd never fit the activities into the agreed testing window.

I have a ticket logged for VMware Support to assist.

Baoth
Enthusiast

In case this helps anyone in the future, given the versions we were using, the performance is expected.

The re-protect activity initiates what looks like an "initial sync" in vSphere Replication, but it is in fact running checksum activities. The time taken largely depends on the size of the VMDK files being checked. The bulk of the checksum time is taken up by hashing, as vSphere Replication does not have a Changed Block Tracking mechanism.

If more speed is required, array based replication should be considered. There unfortunately isn't a way around it.

I worked through our test results, and our checksum activity ran at around 1.2TB per hour (per re-protect direction).
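For anyone wanting to size a testing window, here is a rough back-of-the-envelope sketch in Python. It assumes the checksum time scales linearly with VMDK size at the roughly 1.2TB per hour seen in our environment; the function name and the default rate are purely illustrative, and your own rate may differ considerably depending on storage and hosts.

```python
def estimate_checksum_hours(vmdk_tb: float, rate_tb_per_hour: float = 1.2) -> float:
    """Estimate hours needed to checksum a VMDK of the given size.

    Assumption: time grows linearly with disk size. The 1.2 TB/hour default
    is one environment's observed figure, not a product specification.
    """
    if rate_tb_per_hour <= 0:
        raise ValueError("rate_tb_per_hour must be positive")
    return vmdk_tb / rate_tb_per_hour


if __name__ == "__main__":
    # Example disk sizes in TB; replace with your own VMDK sizes.
    for size_tb in (0.1, 1.0, 4.0):
        hours = estimate_checksum_hours(size_tb)
        print(f"{size_tb} TB: about {hours:.1f} hours per re-protect direction")
```

Measure the rate for one known VM in your own environment first, then pass it in as `rate_tb_per_hour` rather than relying on the default.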

There is an unsupported workaround for the re-protect time, though, which was given to me by VMware Support with the caveat that it "should technically work, but is unsupported." If you cannot wait for SRM and VR to complete their work through their processes automatically, you might consider the following steps. Note that this is unsupported by VMware, and I accept no liability for these steps:

  • Manually power off the VM in DR site
  • Unregister it from the DR site inventory
  • Locate the VM in PROD
  • Re-register it in the PROD inventory
  • Power it back on in PROD
  • Re-configure vSphere Replication, but point at the existing replica
  • Select Yes to using seeds when prompted, as that will remove the requirement to copy the entire VMDK over the WAN
vbrowncoat
Expert

Support is wrong about those steps helping speed things up. The time would be exactly the same (plus the additional time to do those added on steps) as that is exactly the same process that VR goes through when you run a reprotect (it uses the previous source VM as a seed and calculates the checksum difference between the source and target). When you run a reprotect (or when you use a seed) the disks are not copied over the WAN, only changes/differences are.

The checksum takes as long as it does not because VR doesn't use CBT (it uses a lightweight delta mechanism to track changes, which is better suited to its purpose than CBT), but because VR doesn't currently have a way of recognizing that it is reprotecting (using something that should be almost exactly the same) vs. using a seed (something that could have some things the same and potentially a lot different). That is something we're working on though.

Baoth
Enthusiast

Appreciate the reply and details.

I was told the same thing by support, in that it's the same process, but wouldn't the time taken depend on the size of the VMDKs the checksum is running against?

In our case, the VMs in question came back in the production site in the same time it takes to power them on. After all, all we were doing was essentially powering up a VM that was powered off. (Of course, we had to clean up the SRM DB with a script and ensure replication was set up again.) It might be worth adding that we were working with a 4TB VM, which is where the checksum took in the region of 6 hours, and it was much quicker to power off at DR and power on at the PROD site.

rshenoy
Enthusiast

It is very evident that reprotect takes time. The reason is that, before it begins replicating the incremental data, it runs a checksum on the source and destination VMDKs, comparing them block by block to work out from what point data needs to be replicated.

So until the checksum is completed, synchronization is not going to complete.
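To illustrate the general idea, here is a minimal Python sketch of a block-by-block checksum comparison against a seed copy. This is not vSphere Replication's actual implementation: the block size, the SHA-256 hash and the function names are all assumptions chosen for the example. The point it shows is that every block on both sides has to be read and hashed even when nothing has changed, so the time follows the disk size rather than the amount of changed data.

```python
import hashlib

BLOCK_SIZE = 4  # bytes; tiny on purpose for the demo, real block sizes are far larger


def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Hash every block of the disk image (the expensive, size-dependent step)."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def differing_blocks(source: bytes, target: bytes,
                     block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the indexes of blocks whose checksums differ between the two sides."""
    src = block_digests(source, block_size)
    tgt = block_digests(target, block_size)
    total = max(len(src), len(tgt))
    return [
        i for i in range(total)
        if i >= len(src) or i >= len(tgt) or src[i] != tgt[i]
    ]


def sync_from_seed(source: bytes, seed: bytes,
                   block_size: int = BLOCK_SIZE) -> bytes:
    """Bring the seed in line with the source, copying only the differing blocks."""
    result = bytearray(seed[:len(source)])
    result.extend(b"\x00" * (len(source) - len(result)))  # grow if the seed is shorter
    for i in differing_blocks(source, seed, block_size):
        start = i * block_size
        result[start:start + block_size] = source[start:start + block_size]
    return bytes(result)


if __name__ == "__main__":
    source = b"AAAABBBBCCCCDDDD"
    seed = b"AAAAXXXXCCCCDDDD"  # one block differs from the source
    print(differing_blocks(source, seed))
    print(sync_from_seed(source, seed) == source)
```

In this toy example only one block would need to be sent across, but all four blocks were still hashed on each side to find that out.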

When you configure replication for the first time, this does not matter, as you do not have any data residing at the target site, so you don't see any checksum being performed.

Hope this helps.

Regards

Ritesh
