I have the following setup:
Production & DR Sites
vCenter 6 build 3634793
ESXi 6 build 3825889
SRM 220.127.116.11 build 2700459
vReplication 18.104.22.168 build 3845888
Quite a number of times I have noticed that a bunch of replicated VMs start showing RPO violations. If you look at one of the RPO-flagged VMs in the web client and check the replication details, you see that only a few KB have replicated over a period of time instead of the GBs that should already have replicated.
I usually vMotion the problem VMs to a new host, and the replication then runs to completion within the normally observed times.
Has anyone come across a similar problem? I have yet to figure out what causes it.
You need to do a basic health check from both a network and a storage perspective:
1. Log in to the hosts where the VMs reside, on both the source and the destination side, and check the vmkernel and hostd logs for that time period to confirm whether any connectivity issues were reported. Running esxtop live while the RPO violation is being reported would also help. Watch the network counters (Rx, Tx, etc.).
2. Check for any storage connectivity or disk latency issues.
3. Is there any fluctuation in the network link? Is this a dedicated connection for replication traffic?
4. Also make sure you understand how RPO works: see "Understanding vSphere Replication (VR) Scheduling and RPO Violations" on the VMware vSphere Blog.
"I usually vmotion the problem VMs to a new host and the replication then runs within normal observed times to completion" -> Good observation. Does that mean the RPO violation was never reported again in the next replication cycle on the new host, or do you need to migrate the VM again?
A bit of feedback.
One of the VMs is currently in vReplication "limbo". The VM shows the following:
On the target site replication appliance, the following is seen regarding the above VM:
After a short period of time:
The above entries keep going through the same loop until the VM is vMotioned to another host, and then the replication runs successfully.
Check the logs of the ESXi host on which the replicating VM was registered at the time replication was not progressing.
There may be some clue there about why it was not working. Try /var/log/vmkernel.log or maybe /var/log/hostd.log. You could run 'grep -i hbr' against either of these logs to narrow things down to replication activity.
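As a rough sketch of that search, something like the following could be run in a shell on the affected host. The log paths are the two mentioned above; the helper name hbr_lines is only an illustration, not part of ESXi or vSphere Replication.

```shell
# Hypothetical helper: print case-insensitive "hbr" matches from each log that exists.
hbr_lines() {
    for log in "$@"; do
        # Skip logs that are missing or unreadable on this host.
        [ -r "$log" ] || continue
        # Passing /dev/null as a second file makes grep prefix each match with the file name.
        grep -i hbr "$log" /dev/null || true
    done
}

# Paths as given in the post above.
hbr_lines /var/log/vmkernel.log /var/log/hostd.log
```

Comparing the output from a host where replication is stuck against a host where it completes may show whether the stall lines up with a particular error.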
Is it possible the problem is associated with certain ESXi hosts or does it seem to affect all of them?