Mikky83
Contributor

vSphere Replication is too slow compared to bandwidth

Hi,

I replicated VMs from local storage on PS to local storage on the DR site via vSphere Replication 5.1. For test purposes I set the RPO to 15 min. The link between the sites is 1 Gbps (test link).

I measured the vSphere Replication traffic and the results are disappointing. Max speed is only 30-40 Mbit/s. My conclusion is that the vSphere Replication algorithm can't use all the available bandwidth. Local storage performance is not the problem; copy/paste on VMFS runs at around 800 Mbit/s.

Anybody else with this behavior?

Thanks,

8 Replies
mikez2
VMware Employee

There are a few things to consider:

* VR tries to optimize bandwidth utilization for lots of VMs replicating at once. So single-VM replication performance isn't really a perfect way to measure performance as things scale up.

* If you're desperate, you can tweak some advanced config options to increase the number of transfer buffers available for disks. This blog is pretty useful in explaining how to do this: http://blogs.vmware.com/vsphere/2012/06/increasing-vr-bandwidth.html

* You mention you have a 1Gbps link but you don't mention the latency. If you have a high bandwidth, high latency link, then it might count as a long-fat-network or LFN (See  http://en.wikipedia.org/wiki/Long_fat_network to see how to calculate if you have an LFN). The TCP congestion control algorithms in the ESX 5.1 TCP stack aren't optimal for LFNs.
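As a rough illustration of the LFN check in the last point, the sketch below computes the bandwidth-delay product (bandwidth multiplied by round-trip time) and compares it with the 10^5-bit (12,500-byte) threshold given on the linked page. The 20 ms round-trip time is an assumed figure, since the latency of the 1 Gbps link is not stated in the thread, and the function names are illustrative only.

```python
# Threshold above which a path is usually called a long fat network (LFN):
# 10^5 bits, i.e. 12,500 bytes.
LFN_THRESHOLD_BITS = 10**5


def bdp_bits(bandwidth_bps, rtt_ms):
    """Bandwidth-delay product in bits for a link speed (bit/s) and RTT (ms)."""
    return bandwidth_bps * rtt_ms / 1000


def is_lfn(bandwidth_bps, rtt_ms):
    """True if the bandwidth-delay product exceeds the LFN threshold."""
    return bdp_bits(bandwidth_bps, rtt_ms) > LFN_THRESHOLD_BITS


if __name__ == "__main__":
    link_bps = 1_000_000_000  # 1 Gbps link from the original post
    assumed_rtt_ms = 20       # ASSUMPTION: latency is not given in the thread
    bits = bdp_bits(link_bps, assumed_rtt_ms)
    print(f"BDP: {bits:.0f} bits ({bits / 8:.0f} bytes)")
    print(f"LFN: {is_lfn(link_bps, assumed_rtt_ms)}")
```

With the assumed 20 ms round trip, the product is 20,000,000 bits (2.5 MB), well above the threshold; plug in the measured latency of your own link to get a meaningful answer.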

As an FYI, upcoming versions of ESX/VR are likely to have fixes for the single VM, buffer, and LFN issues.

davelee2126
Contributor

I'm seeing something very similar in a customer's environment.  We have a vSphere 5.0 cluster on each site, a 30Mb/s link between the two sites with less than 10ms round trip, and we're replicating 20 or so VMs from HQ to the DR site using vSphere Replication.

We've seeded the replication site using one-time backups onto a USB drive (using VeeamZip) which were restored to the DR site a couple of days later.  Some of the especially busy VMs generated between 10Gb and 50Gb of changes in that time and it's really struggling to get the replicated VMs up to date.  As an example, we kicked off replication of a VM this afternoon and after checksumming it had 1.9Gb of changes that needed to be replicated.  7 hours later, it's only transferred 1.3Gb.  At a conservative estimate, we should be able to transfer around 10Gb an hour over this link so this seems incredibly slow.  It does seem that the data is being drip fed down to the DR site by vSphere Replication.  The networking guys are seeing less than 5Mb/s traffic on the link.
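For anyone wanting to check the arithmetic behind those figures, here is a minimal sketch. It assumes the "Gb" sizes above mean gigabytes (decimal) and that the 30Mb/s link speed is in megabits per second; the function names are made up for this example.

```python
def link_capacity_gb_per_hour(link_mbps):
    """Theoretical transfer volume in gigabytes per hour for a link in Mbit/s."""
    return link_mbps * 3600 / 8 / 1000


def effective_rate_mbps(transferred_gb, hours):
    """Average rate in Mbit/s implied by moving `transferred_gb` GB in `hours`."""
    return transferred_gb * 8 * 1000 / (hours * 3600)


if __name__ == "__main__":
    # 30 Mbit/s link, ignoring protocol overhead
    print(f"Link capacity: {link_capacity_gb_per_hour(30):.1f} GB/hour")
    # 1.3 GB transferred in 7 hours, as observed
    print(f"Observed rate: {effective_rate_mbps(1.3, 7):.2f} Mbit/s")
```

Under those assumptions the link could carry about 13.5 GB an hour, so 10 an hour is indeed a conservative estimate, while the observed transfer averages well under 1 Mbit/s.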

I understand the point about it being optimised for multiple VMs but how many VMs do you need to be replicating for it to use a reasonable amount of bandwidth?  From what I understand, vSphere Replication is positioned at the SMB market that perhaps haven't the budget for storage array replication.  20-30 VMs must be quite normal in these kind of environments.  Initial replication could take weeks and weeks if you've not got the ability to pre-seed the DR site.  I'd love to investigate the advanced settings to speed it up but I really don't want to go down that route on a customer's environment if VMware consider it unsupported.

Sorry, rant over and apologies for the post hi-jack but just wanted to add my experience of using it.  Don't get me wrong, I like SRM as a product but my experience of using it with storage array replication is much better than what I'm seeing with vSphere Replication unfortunately.

Dave

mikez2
VMware Employee

There may be other factors besides what's going on over the WAN link.

For instance, there have been cases where people are using iSCSI or NFS storage on the replica site and the storage network is being shared with the WAN traffic as well as other traffic. There is a limit to the amount of data that is allowed to be in-flight to the VR server before the VR server acknowledges it. So if there's a slow-down to the storage on the replica site, that can limit how fast the data is allowed to be sent from the primary site.
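The in-flight limit behaves like a sliding window: if at most N bytes may be unacknowledged and each acknowledgement takes T seconds (network round trip plus the write on the replica side), throughput cannot exceed N/T. The sketch below illustrates that bound only. The 1 MB in-flight figure and the acknowledgement times are made-up values; the actual limit used by vSphere Replication is not stated in this thread.

```python
def max_throughput_mbps(in_flight_limit_bytes, ack_time_ms):
    """Upper bound on throughput (Mbit/s) when at most `in_flight_limit_bytes`
    may be outstanding and each acknowledgement takes `ack_time_ms`."""
    return in_flight_limit_bytes * 8 * 1000 / ack_time_ms / 1e6


if __name__ == "__main__":
    hypothetical_limit = 1_000_000  # ASSUMPTION: 1 MB in flight, illustration only
    for ack_ms in (10, 100):        # fast vs. slow replica-side storage (assumed)
        rate = max_throughput_mbps(hypothetical_limit, ack_ms)
        print(f"ack time {ack_ms} ms -> at most {rate:.0f} Mbit/s")
```

The point of the example is the proportionality: a tenfold increase in acknowledgement time cuts the achievable rate tenfold, regardless of how fast the WAN link is.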

davelee2126
Contributor

I don't think the storage is an issue in my case.  The DR site is very simple, a single ESXi host with direct attached storage.  It's only running a DC, File Server, vCenter and the VRMS and VRS appliances.  Fairly low IOPS and no latency on the storage.  I might have a look at the performance on the VRS appliance in the DR site tomorrow though, see if that could have any bearing on it.

Smoggy
VMware Employee

How many disks do the VMs have? The protocol is (in current versions) constrained by the buffers detailed here:

http://blogs.vmware.com/vsphere/2012/06/increasing-vr-bandwidth.html

This is per VMDK, so VMs with more than one VMDK obviously benefit. Further real-world data was generously shared by hosting.com here:

http://blogs.vmware.com/vsphere/2012/05/when-creating-a-disaster-recovery-solution-using-site-recove...

You are correct that officially it is not supported to alter the advanced settings for buffer and extent counts, simply because when the solution is QA'd we don't adjust the default values, so wildly changing these (especially up massively) could take you into unknown territory. That said, if you were only to increase them for the duration of your initial sync tasks (and then put them back to the defaults) you could argue that the VMs are not actually replicating yet, so if the process failed you would start again. I do understand that you are in production so are probably concerned: "yeah, it's OK for you to say that, but what if I ramp these up and blow some stack out of the water, bringing the source host down with production VMs on it?" I can speak to R&D to see if there are any safety guards there. In my lab I've played with doubling both values a few times and not hit issues. As Mike said, we are looking to make changes in this area.

Note that in vSphere 5.1 these settings were doubled over those used in 5.0.

I'll do some more testing in the lab see what else I can find out.

davelee2126
Contributor

Okay, so perhaps I have an apology to make to vSphere Replication.  The networking chaps had made a bit of a mistake in the QoS being applied to the link and it was throttling all traffic to 5Mb/s.  Now this has been fixed, things are buzzing along much better.  We've been given free rein of the bandwidth for the moment (as it's the weekend) and everything has caught up nicely.  We've even copied a few of the smaller VMs in their entirety rather than taking an initial copy over to the DR site first.

I'd be interested to know if we can safely double the extents and buffers in 5.0 environments to match the values used in 5.1 environments.  I've got another SRM / vSphere Replication job coming up and I'm not sure if the customer is on 5.0 or 5.1 so could be really useful information.

I'm sure the original poster (whose post I've kind of hijacked - apologies!) would be interested as well :)

Dave

mikez2
VMware Employee

With the caveat that changing advanced config options isn't supported, it should be relatively safe.

DylanGoh
Contributor

Hi Dave,

Would you mind sharing the statistics from after the network QoS was fixed?

Where can we see the size of the data changes for each replication interval?

I hope we can share real numbers for the WAN replication rate versus the data change rate.
