Solved: Reprotect with vSphere replication

SteveShepherd · ‎03-26-2013

Hello,

We are using SRM 5.1 with vSphere replication.

I'm trying to get a rough idea of how long a reprotect operation would take after a failover, in particular the synchronisation of storage.

The manual says that:

"The full synchronization that appears in the recovery steps mostly performs checksums, and only a small amount of data is transferred through the network."

However, in my testing this step still takes a long time to complete: 40 minutes for one 50GB guest. My question is: what is the bottleneck here, and can I speed things up?

In a live scenario I would have much larger guest VMs and I'm worried a reprotect could take days to synchronise the storage.

Thanks,

Steve

emild · ‎03-26-2013

Hello,

Is there a chance the original VM was deleted before the replication was reversed?

Thanks,

Emil

SteveShepherd · ‎03-26-2013

Hi Emil,

No, I checked, and the full VMDK file for the guest was still present at the original protected site.

Steve

mikez2 · ‎03-26-2013

There is an initial sync where the disks are compared, and any blocks that changed since the failover have to be sent over the network. This gets the disks back into sync. Only the blocks that changed have to be sent, but there's not much you can do to reduce that.

If the entire disk changed, pushing 50GB in 40 minutes is ~171 megabits/second.
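
As a rough check of that figure, here is a minimal Python sketch of the arithmetic. It assumes 1 GB is counted as 1024 MB, which is what reproduces ~171; with decimal units the result is closer to 167. The function name is only illustrative.

```python
def throughput_mbit_per_s(size_gb: float, minutes: float, mb_per_gb: int = 1024) -> float:
    """Average rate needed to move size_gb in the given time, in megabits/second."""
    megabits = size_gb * mb_per_gb * 8   # GB -> MB -> megabits
    return megabits / (minutes * 60)     # divide by elapsed seconds

print(round(throughput_mbit_per_s(50, 40)))        # 171 (1 GB = 1024 MB)
print(round(throughput_mbit_per_s(50, 40, 1000)))  # 167 (1 GB = 1000 MB)
```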

Things to look at include network bandwidth and latency, and storage and network speeds on both sites. Are there other storage or network loads on either site that are slowing down VR?

SteveShepherd · ‎03-26-2013

The thing is - in my testing I didn't change anything on the guest between failing over and running the reprotect, so I wouldn't expect there to be a lot of changed blocks to copy over the network.

So I assumed that it was the checksum process that was taking the most time, and I wondered if there was any way to speed that up. But since I can't find out how it works, I'm not sure what to look at.

Steve

stuartclements · ‎03-27-2013

Hi Steve,

I'm not sure if this helps, but there is information about calculating the required bandwidth for VR here:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=203726...

Hope this helps,

Stuart


mikez2 · ‎03-27-2013

The full sync basically just reads the data for the disk and checksums it. It also makes remote requests to read the data on the other site and return a set of checksums. It only sends data if the checksums don't match. So, assuming not much of the disk has changed, there's very little except checksums going over the network. You're going to be bound mostly by how quickly you can read the vmdk files on both sites, to a lesser degree by the network latency between sites, and, of course, by whether the link between sites stays up.

Check whether the link between the sites is good. Are connections often getting dropped or throttled in some way? Is the latency extremely high, like hundreds of milliseconds? This can slow things down.

On the source site, make sure the storage is not overloaded with other jobs. In the case of NAS/iSCSI, also check that there aren't other network jobs hogging bandwidth. Secondary sites sometimes run periodic jobs, like backups, that people forget about and that can be resource intensive.

On the destination site, do similar checks and make sure the VR server is getting a reasonable amount of network and CPU resources.
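
To illustrate the idea described above, here is a minimal Python sketch of a checksum comparison. It is not how VR is actually implemented: the block size, the use of SHA-256, and the function names are all assumptions made for the example. The point it shows is that both sides have to read every block, while only the mismatched blocks would need to cross the network.

```python
import hashlib

BLOCK_SIZE = 4  # tiny block size for the demo; purely illustrative

def block_checksums(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Read the whole 'disk' and return one checksum per block."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

def blocks_to_send(source: bytes, target: bytes, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return indexes of blocks whose checksums differ or that the target lacks."""
    src = block_checksums(source, block_size)  # full read on the source site
    dst = block_checksums(target, block_size)  # full read on the target site
    return [i for i, s in enumerate(src) if i >= len(dst) or s != dst[i]]

source = b"AAAABBBBCCCCDDDD"
target = b"AAAABBBBXXXXDDDD"
print(blocks_to_send(source, target))  # [2]
```

Even when nothing has changed and the list comes back empty, both disks have still been read end to end, which is why read speed on each site matters more than bandwidth.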

SteveShepherd · ‎03-27-2013

Thanks Mike. I'll do some more investigation.

Steve

PxPxger · ‎09-30-2014

The problem is the checksum. If your reprotect is hanging at 10%, sign into the DR host that currently has the VM and run vim-cmd hbrsvc/vmreplica.getstate followed by the VM number. This will show you how slowly the checksum is going. Also, if you look at esxtop in network mode, you will see tons of traffic over the management network for this. The reprotect is horrible because the checksum is extremely slow.
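
For a rough sense of scale, here is a small Python sketch that extrapolates from the one measurement reported earlier in the thread (a 50GB guest taking 40 minutes, or roughly 21 MB/s). It assumes the checksum pass scales linearly with disk size and that no changed data needs to be sent, neither of which is guaranteed. The larger disk sizes are hypothetical examples.

```python
def estimated_sync_hours(disk_gb: float,
                         observed_gb: float = 50,
                         observed_minutes: float = 40) -> float:
    """Scale the observed full-sync rate linearly to another disk size."""
    rate_gb_per_min = observed_gb / observed_minutes  # 1.25 GB/min in the reported test
    return disk_gb / rate_gb_per_min / 60

# 500 GB and 2000 GB are hypothetical sizes, not figures from the thread.
for size_gb in (50, 500, 2000):
    print(f"{size_gb} GB: about {estimated_sync_hours(size_gb):.1f} hours")
```

At that rate a 2 TB disk would take a little over a day, so it is worth measuring the rate in your own environment before relying on any estimate.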
