vSphere replication RPO

Toes · ‎02-28-2017

Hi everyone,

I have the following setup:

Production & DR Sites

vCenter 6 build 3634793

ESXi 6 build 3825889

SRM 6.0.0.1 build 2700459

vReplication 6.0.0.3 build 3845888

Quite a number of times I have noticed that a bunch of replicated VMs will start showing RPO violations. If you look at one of the flagged VMs in the web client and check its replication details, you see that only a few KB have been replicated over a period of time instead of the GBs that should already have been replicated.

I usually vMotion the problem VMs to a new host, and the replication then runs to completion within the normally observed times.

Anyone come across a similar problem? I have yet to figure out what causes it.

regards,

Craig

Sreec · ‎02-28-2017

You need to do a basic health check from both a network and a storage perspective:

1. Log in to the hosts where the VMs reside, on both source and destination, and check the vmkernel and hostd logs for that time period to confirm whether any connectivity issues were reported. Running esxtop live while the RPO violation is being reported would also help; watch the network counters (Rx, Tx, etc.).

2. Check for any storage connectivity or disk latency issues.

3. Is there any fluctuation in the network link? Is this a dedicated connection for replication traffic?

4. Also make sure you understand how RPO works: see "Understanding vSphere Replication (VR) Scheduling and RPO Violations" on the VMware vSphere Blog.
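As a rough illustration of point 4, here is a minimal Python sketch of a simplified RPO model. It assumes a violation means the newest complete replica is older than the configured RPO, and that a replica reflects the VM state at the moment its sync started. The function names and the steady-rate assumption are hypothetical and are not part of any vSphere Replication API.

```python
from datetime import datetime, timedelta


def rpo_violated(last_sync_started: datetime, rpo: timedelta, now: datetime) -> bool:
    """True if the newest complete replica is older than the RPO allows.

    Simplified model (an assumption): the replica's age is measured from
    the start of the last sync that completed successfully.
    """
    return now - last_sync_started > rpo


def transfer_seconds(delta_bytes: int, bytes_per_second: float) -> float:
    """Seconds needed to ship a delta at a steady observed rate."""
    if bytes_per_second <= 0:
        return float("inf")  # a stalled transfer never completes
    return delta_bytes / bytes_per_second
```

For example, under this model a 2 GB delta moving at 10 KB/s would need roughly 58 hours, so any RPO measured in minutes or hours would be violated long before the sync finished.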

"I usually vmotion the problem VMs to a new host and the replication then runs within normal observed times to completion" -> Good observation. Does that mean the RPO violation was never reported again in the next replication cycle on the new host, or do you need to migrate the VM again?

Cheers,
Sree | VCIX-5X| VCAP-5X| VExpert 7x|Cisco Certified Specialist
Please KUDO helpful posts and mark the thread as solved if answered

Toes · ‎02-28-2017

Thanks for the feedback!

I will need to get the network and storage teams involved too.

I will provide feedback as I discover issues.

Craig

Toes · ‎03-01-2017

A bit of feedback.

One of the VMs is currently in vReplication "limbo". The VM shows the following:

On the target site replication appliance, the following is seen regarding the above VM:

After a short period of time:

The above entries keep going through the same loop until the VM is vMotioned to another host, and then the replication runs successfully.

TIA,

Craig

admin · ‎03-01-2017

Check the logs of the ESXi host on which the replicating VM was registered at the time replication was not progressing.

There may be some clue there about why it was not working. Try /var/log/vmkernel.log or maybe /var/log/hostd.log. You could run 'grep -i hbr' against either of these logs to narrow the output down to replication activity.
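If the logs have been copied off the host for offline review, the same case-insensitive filter can be sketched in Python. This is only an illustrative equivalent of 'grep -i hbr'; the function names are hypothetical, and the log file path is whatever copy you pass on the command line.

```python
import re
import sys
from typing import Iterable, Iterator

# Case-insensitive match on "hbr", mirroring grep -i hbr.
HBR_PATTERN = re.compile(r"hbr", re.IGNORECASE)


def hbr_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that mention hbr, without trailing newlines."""
    for line in lines:
        if HBR_PATTERN.search(line):
            yield line.rstrip("\n")


def hbr_lines_from_file(path: str) -> Iterator[str]:
    """Stream matching lines from a log file copied off the host."""
    # errors="replace" keeps the scan going if the log has odd bytes.
    with open(path, encoding="utf-8", errors="replace") as handle:
        yield from hbr_lines(handle)


if __name__ == "__main__":
    # Usage: python hbr_filter.py <copy of vmkernel.log or hostd.log>
    for log_path in sys.argv[1:]:
        for match in hbr_lines_from_file(log_path):
            print(f"{log_path}: {match}")
```

Running it over the logs from both the old and the new host for the same time window makes it easier to compare replication activity before and after the migration.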

Is it possible the problem is associated with certain ESXi hosts or does it seem to affect all of them?

Toes · ‎03-01-2017

This is across all hosts, a mix of BL460c Gen9 and DL380 Gen9.

The host logs do not seem to give much insight into the problem.

Thanks,

All