VMware Cloud Community
SteveBrat
Contributor

Machines Replicated Using vSphere Replication Randomly Lock Up

We are experiencing an issue where virtual machines replicated with vSphere Replication randomly lock up during tasks such as a vMotion or snapshot creation. When this occurs, the machine is completely unresponsive, the datastore it resides on shows extremely high latency, and the storage array sees very heavy I/O. You can't kill the machine or do anything else with it; you just have to wait until it becomes responsive again, which normally takes about 30 minutes. We have also had two servers that we could not power off or reboot without them locking up. We have since found another way to replicate those machines and haven't had any issues with them since. I've opened a few tickets with VMware and haven't gotten a real solution. One tech pointed me to a forum post from another customer with the same issue, but that customer resolved it by no longer using vSphere Replication. Has anyone else run into this?

2 Replies
cfsullivan
Contributor

For me this seemed to be resolved by two changes: we stopped quiescing the VMs (they are all Windows) and upgraded the VR appliance to 5.5.x.x. (I was at 5.1.x and had to wait until vCenter was upgraded to 5.5 before I could do that.) Going without quiescing certainly made a difference, but as far as I can recall, the upgrade was also needed before I really stopped seeing these issues. The replicated VMs that were affected particularly badly were SQL Servers.

My test for this issue was pretty simple. I would run a continuous ping against a couple of VMs that were being replicated and one VM on the same host that wasn't. When we had the issue, I would see lots of drops on the replicated VMs only. Now, when I run the same test, I see virtually no drops. This isn't to say that it was a networking issue. It's just that the guest OS would be frozen to the point that its networking failed.
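For anyone who wants to repeat that comparison, here is a rough sketch of it in Python rather than separate ping windows. It is not what I actually ran. The function names and host names are made up for illustration, and it assumes the system `ping` command with Windows or Linux style flags. The exit code is only a rough signal of whether a reply came back.

```python
import platform
import subprocess
import time


def ping_once(host, timeout_s=1):
    """Send one echo request with the system ping; True if it reported success."""
    if platform.system() == "Windows":
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), host]
    else:
        # Assumes Linux-style flags; on macOS -W is in milliseconds.
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 2,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def loss_percent(results):
    """Percentage of failed probes in a list of True/False results."""
    if not results:
        return 0.0
    return 100.0 * results.count(False) / len(results)


def compare(hosts, count, probe=ping_once, interval_s=1.0):
    """Probe every host `count` times and return {host: loss percentage}."""
    results = {host: [] for host in hosts}
    for _ in range(count):
        for host in hosts:
            results[host].append(probe(host))
        time.sleep(interval_s)
    return {host: loss_percent(r) for host, r in results.items()}
```

Calling something like `compare(["replicated-vm-1", "replicated-vm-2", "control-vm"], count=600)` while a snapshot or vMotion is running should show whether the drops are limited to the replicated VMs. Those host names are placeholders.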

By the way, I seem to have lost nothing by not quiescing. I have done a few test failovers without issue.

SteveBrat
Contributor


We are running 5.5 and we do not quiesce the VMs. This is the third machine this has happened to. I know it's vSphere Replication because as soon as I moved the machines to Zerto for replication, they stopped locking up. Right now I'm attempting to remove the replication from this machine and it has locked up. It stays in this state for about 20 minutes and eventually responds. I've opened several tickets with VMware, and they always say it is the storage system, because the DAVG, QAVG, and KAVG stats are all extremely high for the datastore the VM is on when this happens.
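For context on those counters: in esxtop, DAVG is latency at the device (HBA, fabric, array), KAVG is time spent in the VMkernel, QAVG is the queuing part of KAVG, and the latency the guest sees (GAVG) is DAVG plus KAVG. A small sketch of how the numbers relate is below. The function name and the sample values are made up for illustration and are not readings from our environment.

```python
def summarize_latency(davg_ms, kavg_ms, qavg_ms):
    """Relate esxtop latency counters (all in ms) for one device or datastore."""
    gavg_ms = davg_ms + kavg_ms  # guest-observed latency = device + kernel
    if gavg_ms == 0:
        dominant = "none"
    elif davg_ms >= kavg_ms:
        dominant = "device"  # most of the time is spent below the host
    else:
        dominant = "kernel"  # most of the time is spent inside the host
    # QAVG is counted inside KAVG, so report it as a share of kernel time.
    queue_share = qavg_ms / kavg_ms if kavg_ms else 0.0
    return {
        "gavg_ms": gavg_ms,
        "dominant": dominant,
        "queue_share_of_kernel": queue_share,
    }


# Hypothetical sample values, only to show the shape of the result.
print(summarize_latency(davg_ms=400.0, kavg_ms=100.0, qavg_ms=50.0))
```

A breakdown like this only shows where the time is being spent at the moment of the hang; it does not say what is generating the I/O in the first place.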

Capture.PNG
