cypherx
Hot Shot

vSphere replication 6.1.1 seems much slower than vSphere replication 1.0

We upgraded our vSphere environment from 5.0 to 6.0 Update 2.  As part of this we also moved to vSphere Replication 6.1.1 and SRM 6.1.

Because of some complexities with the upgrade, we just started over, so our upgrade process was this:

Pause all vSphere replications under the 5.0 system using the C# client in the SRM plugin.

At the DR site, move all replicated machines (30) to a subfolder called \Replication, so that when we stop and remove replication at HQ this data will not be deleted and we can use it as seeds.

Stop all vSphere replications at the source (HQ).

Tear down all protection groups and plans in SRM.

Completely uninstall SRM at both the HQ and DR sites.

Power off and delete the VRMS and vSphere replication virtual appliances at HQ and DR sites.

Unregister vSphere Replication at the HQ and DR sites using the web URL to go to the Extensions.

Reboot both vCenter servers.

Upgrade HQ vCenter server to 6.0 update 2 and install VUM.

Upgrade DR vCenter server to 6.0 update 2 and install VUM.

Deploy the new vSphere Replication 6.1.1 OVF appliance at both the HQ and DR sites.  Had to fiddle with a new IP pools setting on our dvSwitch - not something we ever had to do before, but once I figured it out they were able to power on.

Go to the web management interface of each vR server and configure it with the new vsphere.local admin user credentials and accept the untrusted certificate.

Install SRM 6.1 on HQ site.

Install SRM 6.1 on DR site.

In the new vSphere 6.0 Web Client, pair the HQ and DR sites, providing the proper credentials for the @drvsphere.local site.

Rebuild our 30 VM replications, carefully repointing the replicated data folder to the appropriate VM subfolder under \Replicated - where we moved the data earlier.  Configure it to utilize the data found there as seed copies.

Rebuild our SRM protection groups and recovery plans.

Ok, so far things seem good, but I was wondering why the new 6.1.1 vSphere Replication isn't maxing out the WAN bandwidth we have allocated to it.  Before, with vSphere Replication 1.0, we could easily peak at the 80 Mbps max (which we allocated via QoS on the Cisco 2901 WAN router by matching TCP ports 44046 and 31031).  We have a 100 Mbps link to our remote site, so we've always had to use the Cisco QoS feature to control this bandwidth.  Using NetFlow in WhatsUp Gold we could see spikes from individual ESXi hosts add up to the 80 Mbps, though not always once things were caught up.
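
In case it helps with troubleshooting, here is the quick check I use from an ESXi host (with SSH enabled) to confirm that replication traffic really is leaving on the ports our router QoS class matches - the two port numbers are simply the ones from our existing QoS config, not an official list:

# List active connections on the ports our QoS policy matches
esxcli network ip connection list | grep -E '44046|31031'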

Now, after the upgrade, it has taken 3 days to complete the "initial full sync", and the properties have shown that a lot of that time was spent comparing checksums on the remote files as well as transferring data.  Because we have 30 VMs doing this, I would have thought we would EASILY see 80 Mbps aggregate throughput across 6 ESXi hosts.  For the first two days we barely peaked at 25 Mbps.  Just today I'm finally seeing some larger bandwidth usage, with some ESXi hosts at approximately 20 Mbps.  As of today I see 1 VM in Sync (RPO Violation), 3 VMs in Sync, 22 VMs in OK, and 4 VMs in Initial Full Sync.

Is there a reason that Initial Full Sync is so throttled or slow?

For example, our Exchange server is in Initial Full Sync and is at 73%.  Checksum compared: 1.02 TB of 1.40 TB, and Transferred: 24.43 GB of 24.43 GB.
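
For what it's worth, this is roughly how I spot-check a VM's replication progress from the host shell; the hbrsvc subcommand names may differ slightly between builds, and the VM ID is just an example:

# Find the VM's inventory ID on the host it is registered on
vim-cmd vmsvc/getallvms | grep -i exchange
# Ask hostd for the replication state/progress of that VM (42 is an example ID)
vim-cmd hbrsvc/vmreplica.getState 42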

I just want to make sure there's no other place to check where vR traffic can be throttled.  I know a lot has changed from the initial release of VR to this latest release.

Also, can you vMotion a machine that is in Initial Full Sync or Sync?  The reason I ask is because I moved a VM that was in Initial Full Sync from an ESXi 5.0 patch 12 host to an ESXi 6.0u2 host (in preparation to evacuate the source host for an ESXi upgrade), and it crashed hostd on the destination host, orphaning the VM (though it still ran fine).  Off hours I had to RDP to it to shut it down, then manually add it to another ESXi host so I could at least manage it.  Now hostd on that destination ESXi host will not start, even after reboots.  The reason I ask this about replication is because every time you reboot the host or try to restart hostd, a backtrace is generated in hostd.log that has some language in it regarding replication.

--> Panic: Assert Failed: "!repDiskInfo->GetDiskReplicationId().empty()" @ bora/vim/hostd/hbrsvc/
--> Backtrace:
-->
--> [backtrace begin] product: VMware ESX, version: 6.0.0, build: build-3620759, tag: hostd
--> backtrace[00] libvmacore.so[0x00316373]: Vmacore::System::Stacktrace::CaptureFullWork(unsigne
--> backtrace[01] libvmacore.so[0x00146D79]: Vmacore::System::SystemFactoryImpl::CreateBacktrace(
--> backtrace[02] libvmacore.so[0x00311DA0]
--> backtrace[03] libvmacore.so[0x00311E76]: Vmacore::PanicExit(char const*)
--> backtrace[04] libvmacore.so[0x0010767B]: Vmacore::PanicVerify(char const*, char const*, int)
--> backtrace[05] hostd[0x0086FC75]
--> backtrace[06] hostd[0x008361CE]
--> backtrace[07] hostd[0x0083EA98]
--> backtrace[08] hostd[0x00B461F5]
--> backtrace[09] hostd[0x00B46705]
--> backtrace[10] hostd[0x00B46ADB]
--> backtrace[11] hostd[0x00B47484]
--> backtrace[12] hostd[0x00878530]
--> backtrace[13] hostd[0x0083DAAE]
--> backtrace[14] hostd[0x0083EA51]
--> backtrace[15] libvmacore.so[0x000DAD28]
--> backtrace[16] libvmacore.so[0x000DB41B]
--> backtrace[17] libvmacore.so[0x000DD4E3]
--> backtrace[18] libvmacore.so[0x002555C0]
--> backtrace[19] libvmacore.so[0x002597CA]
--> backtrace[20] libvmacore.so[0x002599EC]
--> backtrace[21] libvmacore.so[0x002614CF]
--> backtrace[22] libvmacore.so[0x002562B8]
--> backtrace[23] libvmacore.so[0x0025AB53]
--> backtrace[24] libvmacore.so[0x0032023C]
--> backtrace[25] libpthread.so.0[0x00006D6A]
--> backtrace[26] libc.so.6[0x000D5D9E]
--> [backtrace end]

It will always fill up /vmfs/volumes/830250eb-944d5a3c-ee55-8197af9d8600 to 100% with a core dump.  I delete the core dump and get back to 71% free on this volume; however, if I try to restart hostd (or reboot the ESXi host itself) it fills up to 100% again with a core dump file.
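
For anyone who hits the same loop, this is roughly what I run to find and clear the dump before the volume fills again; the volume path is from my environment and the *zdump* name pattern is an assumption, so check what is actually sitting on your volume:

# Show how full each volume is
df -h
# Look for the hostd core dump on the affected volume
find /vmfs/volumes/830250eb-944d5a3c-ee55-8197af9d8600 -name '*zdump*' -exec ls -lh {} \;
# Once support has a copy, delete the dump file to get the space back
# rm /vmfs/volumes/830250eb-944d5a3c-ee55-8197af9d8600/<dump-file>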

SR 16140535006 has been created for that last issue, but really the main purpose of my thread (sorry for being so long) comes down to these two questions:

Does anything throttle VR traffic in 6.1.1 besides the QoS on my router?  In VR 1.0, VR traffic always exited via the management NICs, which we did not have on a dvSwitch, so dvSwitch Network I/O Control shares would not throttle that traffic - hence us resorting to router QoS.

Can you vMotion a VM from a 5.0 host to a 6.0 host while VR is doing an Initial Full Sync or Sync without it crashing the destination 6.0u2 host?
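
For the first question, part of what I'm trying to confirm is which vmkernel interface carries the replication traffic in 6.x.  On builds that support vmknic tagging, something like this (vmk0 is just an example) shows whether an interface is tagged for replication at all, and therefore whether the traffic rides the management network or a dvSwitch portgroup where NIOC shares could apply:

# List vmkernel interfaces, then show the traffic types a given one is tagged for
esxcli network ip interface list
esxcli network ip interface tag get -i vmk0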

6 Replies
admin
Immortal

vSphere Replication traffic is throttled by default. As far as I know, this series of blog articles is still relevant for Replication 6.x, so you may be able to improve throughput by tinkering with the advanced configuration options:

http://blogs.vmware.com/vsphere/2012/06/increasing-vr-bandwidth.html

vSphere Replication 5.5 Performance Findings - VMware vSphere Blog

http://blogs.vmware.com/vsphere/2013/10/a-few-cautionary-notes-about-replication-performance.html
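
If you want to see what tuning knobs exist on the host side, something along these lines (run from the ESXi shell) will dump any host-based replication (HBR) advanced settings; the exact option names vary by build, and if nothing comes back the relevant tuning lives in the vSphere Replication appliance instead, so treat this as a sketch and check the articles above before changing anything:

# List all advanced settings and show the HBR-related entries, if present
esxcli system settings advanced list | grep -A 6 'Path: /HBR/'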

The second issue looks like a bug. You should probably open a support request, but as a workaround you could try creating a new VM and pointing it to the existing VMDKs, then delete the problem VM and reconfigure replication on the new VM.

cypherx
Hot Shot

Thank you very much, that documentation is extremely helpful.

After a few days (we did the upgrade Tuesday), vSphere Replication now seems to be chugging along up to our router-defined QoS threshold of 80 Mbps.  Perhaps for the first few days it just takes a long time doing checksums of the existing "seeds" at the other end, so it really wasn't transferring much because the replication servers were busy reading and calculating those checksums to make sure the seed disks were intact.

Yes, for the last issue I do have a case open with VMware.  My 8th host is unusable because the vMotion to it caused hostd to stop.  Even though off hours I manually imported those two VMs to another working server and rebooted that 8th host, hostd still will not stay started, with the backtrace -> Panic: Assert Failed: "!repDiskInfo->GetDiskReplicationId().empty()" @ bora/vim/hostd/hbrsvc/

cypherx
Hot Shot

We vMotioned a VM from an ESXi 5.0 server to a different ESXi 6.0u2 server and guess what... the same exact hostd crash and everything.  Now I have 6 orphaned (but running) VMs on a second ESXi host that will never recover.

Seems like a major bug.

vbrowncoat
Expert

Can you PM me the SR#?

skykid
VMware Employee

It looks like a bug that is fixed in the next vSphere 6.0 patch, which will be released in July or August. Currently, we can provide a debug patch with that fix to the customer to check whether it works. Thanks!

cypherx
Hot Shot

That's good to know that this has VMware's attention.  Currently, out of my 8 ESXi hosts, host number 6 is running one machine, but since hostd will not stay started I cannot manage this ESXi host in any way.  Luckily the machine is still chugging along fine and we have MS RDP access to it, so there is no downtime.

If I want to use any vim-cmd or vSphere Client functions on this host 6, what I have to do is edit the /etc/vmware/hostd/vmInventory.xml file to remove the reference to this running VM.  Then I can restart hostd and it stays started.  I can manually right-click and import the .vmx file for this running machine on this host, it will show connected, and I can work with the host and VM using the standard VMware tools and management.  However, about an hour or two later hostd will crash again.  hostd.log shows it trying to restart itself 3 times or so, but then it just gives up with the same backtrace as usual.
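
In case anyone needs the exact sequence, this is roughly what I do each time; the datastore path and VM name below are examples, and I back the file up first:

# Back up the inventory file before touching it
cp /etc/vmware/hostd/vmInventory.xml /etc/vmware/hostd/vmInventory.xml.bak
# Remove the ConfigEntry block for the problem VM
vi /etc/vmware/hostd/vmInventory.xml
# Restart the management agent
/etc/init.d/hostd restart
# Later, re-register the VM by pointing at its .vmx file
vim-cmd solo/registervm /vmfs/volumes/datastore1/MyVM/MyVM.vmx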

We tried removing the replication.  In the vSphere Web Client I stopped it.  On both source and destination it no longer shows as replicating.  However, when I stopped it there was an error in the task list about a resource being in use, even though the entry is now gone.  So we found all references to hbr in the .vmx file and removed them.  Off hours I remotely powered off the VM via Remote Desktop, imported it to a different ESXi 6.0 build 3825889 host, and powered it on.  Everything ran beautifully for about an hour and a half, and then the hostd service crashed again.  The VM is still running, though; the only evidence of it is if I SSH to the host and run TERM=xterm esxtop, where you can see its name riding at the top of the list.
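
For reference, this is more or less how we stripped the entries and had hostd re-read the config; the paths and the VM ID are examples from my notes:

# With the VM powered off, show the replication-related lines in its config
grep -i hbr /vmfs/volumes/datastore1/MyVM/MyVM.vmx
# After removing those lines, have hostd reload the VM's configuration
vim-cmd vmsvc/getallvms | grep -i MyVM
vim-cmd vmsvc/reload 42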

Even though there are no references to hbr in this .vmx config file anymore, there is a file in this VM's directory called hbr-persistent-state-RDID-{some guid}.psf.  It hasn't been modified since I last powered on the VM at 6/15/2016 7:27:32 AM EST (and yes, this power-on was after adding the modified .vmx file to inventory with all the hbr line items removed).  So I wonder if the presence of this file makes some part of the ESXi host "think" it should be running through some replication code.
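
If anyone wants to check for the same leftover, a quick find against the VM's directory (datastore and VM name are examples) will show any stale replication state files:

# Look for leftover replication persistent-state files in the VM's folder
find /vmfs/volumes/datastore1/MyVM -name 'hbr-persistent-state-RDID-*.psf'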
