VMware Cloud Community
jasonboche
Immortal

Protection Group hung in "Reprotecting..." state

SRM 5.0 GA on vSphere 5.0 GA

After a Planned Migration of a Protection Group completed successfully, the follow-up Reprotect operation failed. Repeated Reprotect operations with the "Force Cleanup" option also fail. As a result, the Protection Group is left in a state of "Reprotecting...". I can manually clean up on the VM and storage side, and I can also delete the associated Recovery Plan. However, the Protection Group remains in a state where I can do nothing with it: I can't delete it, and I can't unpair or remove SRAs at this point because of the Protection Group dependency. Does anyone know how to clean up an SRM 5.0 Protection Group when it's stuck in this state without uninstalling and reinstalling the product?

Thank you,

Jas

VCDX3 #34, VCDX4, VCDX5, VCAP4-DCA #14, VCAP4-DCD #35, VCAP5-DCD, VCPx4, vEXPERTx4, MCSEx3, MCSAx2, MCP, CCAx2, A+
31 Replies
TimOudin
Hot Shot

I should just presume you've tried it but...did you consider restarting SRM services?  This has cleaned up other stuck tasks for me in the past.

Tim Oudin
jasonboche
Immortal

Tim Oudin wrote:

I should just presume you've tried it but...did you consider restarting SRM services?  This has cleaned up other stuck tasks for me in the past.

I've tried this in the past to no avail. The SRM database does too good a job of tracking the state of the objects/environment.

Smoggy
VMware Employee

did you get this resolved?

basher
VMware Employee

Did SRM or VC crash during reprotect operation?

If you go to the offending Protection Group and into the Virtual Machines list, do you see any VMs with errors? If so, could you try to "Remove Protection" on these VMs.

Thanks

Director - VMware Site Recovery Manager
jasonboche
Immortal

Lee Dilworth wrote:

did you get this resolved?

No. I can still reproduce this issue.

More accurately, yes, I resolved it the way I always need to resolve it, which is to uninstall and reinstall SRM. A better resolution is still needed though. I believe the intent is that the Force Cleanup option resolves anything that SRM can't work out through the SRA, but using that option fails as well.

If this were anything other than a demo lab (i.e. a complex customer environment), I'd be pretty upset at having to set up the whole environment again (including startup groups, dependencies, IP customizations, etc.).

Message was edited by: jasonboche

jasonboche
Immortal

Stefan Tsonev wrote:

Did SRM or VC crash during reprotect operation?

If you go to the offending Protection Group and into the Virtual Machines list, do you see any VMs with errors? If so, could you try to "Remove Protection" on these VMs.

Thanks

Neither crashed, as far as I'm aware. The reprotect fails, throwing some errors in the process. The root cause has something to do with failing to reverse replication via the SRA.

At that point, the protection group is left in a pseudo state of a reprotect in progress (mandating that a successful reprotect or cleanup be completed, but force cleanup doesn't work) such that anything it depends on cannot be removed (i.e. array pairs). I'm not able to edit or remove the protection group itself when it is in this state, and I don't recall being able to remove individual VMs, but I'll take a look at that next time it happens.

Message was edited by: jasonboche

Smoggy
VMware Employee

If you can reproduce it and have time, that would be appreciated. I've never seen this state in my environments inside VMware, so I'd be VERY keen to get the log bundles when this issue is occurring, as I've not seen anyone else report it either. If you can raise an SR for it as well, even better; let me know the SR number and I'll ensure it gets the right visibility.

some questions:

- storage platform / SRA in use?

- storage s/w or firmware versions?

- ESX host versions at both sites? I'm guessing 5.0, but with array replication it could be 4.1/4.0 or 3.5, so I need to ask...sorry

Once the issue is reproduced, simply generate the log bundles and send them in via the SR. If possible, I'd also like to see them as soon as you have them. My work email is simply lee@vmware.com

jasonboche
Immortal

Lee,

Pure vSphere 5.0 GA & SRM 5.0 GA environment.

I have some older logs but I'll just jump in an alternate lab and reproduce fresh logs.

I'll take the storage specific questions to email.

Smoggy
VMware Employee

thanks. appreciate you taking the time to reproduce it.

admin
Immortal

Have you double-checked that the storage you're using is at the minimum supported firmware rev. for the latest SRA?  I have seen this when I was working with a falconstor VSA that was not at the min MP4 patch level.  Same results as you where I needed to reinitialize to clear that protection group and successfully go through with the reprotect.  Putting in the upgrade fixed the issue in my case.

jasonboche
Immortal

vmwnelson wrote:

Have you double-checked that the storage you're using is at the minimum supported firmware rev. for the latest SRA?  I have seen this when I was working with a falconstor VSA that was not at the min MP4 patch level.  Same results as you where I needed to reinitialize to clear that protection group and successfully go through with the reprotect.  Putting in the upgrade fixed the issue in my case.

Supported/certified storage: Yes

To me this is an SRM framework/workflow issue. Storage management is the SRA's responsibility. SRM should leave storage problems with the SRA; those issues should not impact protection groups and recovery plans to the point that the SRM application needs to be uninstalled, which could expose an RTO vulnerability for protection groups based on other SRAs/array pairs.

ryan_gallagher
Enthusiast

Jason Boche wrote:

Neither crashed, as far as I'm aware. The reprotect fails, throwing some errors in the process. The root cause has something to do with failing to reverse replication via the SRA.

At that point, the protection group is left in a pseudo state of a reprotect in progress (mandating that a successful reprotect or cleanup be completed, but force cleanup doesn't work) such that anything it depends on cannot be removed (i.e. array pairs). I'm not able to edit or remove the protection group itself when it is in this state, and I don't recall being able to remove individual VMs, but I'll take a look at that next time it happens.

Hi Jason,

It sounds like the root issue may be an SRA failure. If you reproduce this again, definitely provide us the logs so we can analyze them and hopefully work through that issue with you.

You were able to re-run reprotect though, so the plan itself was in an okay state after this failure. The group state, as you say, was "stuck" in "Reprotecting...". This may be because the UI has a bit of an ambiguity about the current combined state of the group and its peer group. Internally we maintain more states for each group, and the UI generally tries to look at the group's state and its peer group's state to figure out a combined state to display to users. Unfortunately, there is a known point in the reprotect process where the UI can't easily determine whether the group is currently running, and so it will display "Reprotecting..." even if the group isn't doing anything (i.e. the UI doesn't look at the various plans' states and contents to determine whether or not the group is actually in use). Internally the group state pairs are consistent though. (I think we have a release note for this issue.)

What's a bit off about this explanation though is that the group was failing during reverse replication -- this UI issue should only happen if the group completed reverse replication successfully but failed in the next reprotect step.  I suspect that the group may actually have failed when we were running that next step, but I'd need logs to know for certain.  Was the failure in the "1. Configure Storage to Reverse Direction" step, or the "2. Configure Protection to Reverse Direction" step?

Otherwise though, you still should have been able to unprotect VMs and once that is done to remove the group itself.  The server is supposed to always be able to unprotect VMs no matter the group state just to avoid "stuck" states as you described.  Besides that the only operation that I would expect that you'd be allowed to perform is to re-run reprotect -- anything else wouldn't actually make sense given the state of the group and its peer.

So again, if you can reproduce this then please send us the logs.  Hopefully uninstalling the server can be avoided.

Also, as for the "Force Cleanup" option, that only applies to the 3rd step, "Cleanup Storage".  We can't skip the previous steps and expect things to work properly but this one we generally can.
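To make the scope of "Force Cleanup" concrete, here is a minimal Python sketch of the three reprotect steps as they are named in this thread. This is not SRM code: `run_reprotect`, `StepFailed`, and the `ok`/`fail` step callables are hypothetical names made up for illustration. The only behavior it models is the one described above: a failure in step 1 or 2 always stops the run, while a failure in step 3 can be skipped when force cleanup is requested.

```python
from typing import Callable, List, Tuple

# Step names as quoted in the thread; everything else here is hypothetical.
STEP_REVERSE_STORAGE = "Configure Storage to Reverse Direction"
STEP_REVERSE_PROTECTION = "Configure Protection to Reverse Direction"
STEP_CLEANUP_STORAGE = "Cleanup Storage"


class StepFailed(Exception):
    """Raised by a step callable to signal that the step did not complete."""


def run_reprotect(steps: List[Tuple[str, Callable[[], None]]],
                  force_cleanup: bool = False) -> List[str]:
    """Run the steps in order and return a log of outcomes.

    Only a failure in the cleanup step may be skipped, and only when
    force_cleanup is True. Any other failure stops the run.
    """
    log = []
    for name, action in steps:
        try:
            action()
            log.append(f"{name}: ok")
        except StepFailed as err:
            if force_cleanup and name == STEP_CLEANUP_STORAGE:
                log.append(f"{name}: failed ({err}), skipped by force cleanup")
                continue
            log.append(f"{name}: failed ({err}), reprotect aborted")
            break
    return log


def ok() -> None:
    """Simulated step that succeeds."""


def fail() -> None:
    """Simulated step that fails."""
    raise StepFailed("simulated failure")


# Failure in step 2: force cleanup makes no difference, the run aborts.
for line in run_reprotect([(STEP_REVERSE_STORAGE, ok),
                           (STEP_REVERSE_PROTECTION, fail),
                           (STEP_CLEANUP_STORAGE, ok)], force_cleanup=True):
    print(line)
```

In this model, a group that fails in "Configure Protection to Reverse Direction" never reaches the cleanup step, so setting force cleanup cannot change the result.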

jasonboche
Immortal

Thanks for the reply, Ryan. I can pretty consistently reproduce the issue. I'm unable to open an SR with my current personal or Global Technology Alliance Partner account, but I'll definitely get the logs to VMware if there is a way that I can (perhaps upload to the root of the FTP site or email to Lee, as the logs should be small enough for a single run).

In your 2nd paragraph, you've described precisely what I'm seeing with the protection group ("Reprotecting..."). You may be correct on the exact point of failure. I've sent a screenshot tonight to Lee's email address; "Configure Protection to Reverse Direction" is the step that fails in the screenshot.

Once in this state of "Reprotecting...", Force Cleanup doesn't resolve the issue and I can't manually clean up by deleting protection groups or recovery plans since the UI sees them as actively running.

ryan_gallagher
Enthusiast

Hi Jason,

It sounds like it's failing in the section of code I suspected then.  Feel free to continue working with Lee on this and sending him whatever logs/screenshots you have as I can get them from him if an SR can't be filed.

I'm curious about a couple things when the group is in this state and just want some confirmation.  If you get a chance and don't mind trying some things out upon reproducing this then can you answer the following for me:

1) Can you edit the recovery plan to remove the group?  I think the plan may be in the Incomplete Reprotect state and it should allow editing.  (Maybe you addressed this already in your original post.)  This would allow other groups to proceed with reprotect.  Otherwise creating a new plan with working groups may be another option.

2) Can you unprotect VMs from the group in the "Protection Group" "Virtual Machines" tab?  This should be enabled.  If so then does it work?  If you remove all the VMs in the group and then run Reprotect again does it now work?  (Given where it's failing I suspect it won't work, but it's worth a shot.)

3) Is the ability to remove the protection group just not available when it's in this state? I suspect it may be given our UI specification.  Otherwise, if it is actually available then with what error does it fail when invoked?

Thanks,

-Ryan

jasonboche
Immortal

Ryan wrote:

Hi Jason,

It sounds like it's failing in the section of code I suspected then.  Feel free to continue working with Lee on this and sending him whatever logs/screenshots you have as I can get them from him if an SR can't be filed.

I'm curious about a couple things when the group is in this state and just want some confirmation.  If you get a chance and don't mind trying some things out upon reproducing this then can you answer the following for me:

1) Can you edit the recovery plan to remove the group?  I think the plan may be in the Incomplete Reprotect state and it should allow editing.  (Maybe you addressed this already in your original post.)  This would allow other groups to proceed with reprotect.  Otherwise creating a new plan with working groups may be another option.

2) Can you unprotect VMs from the group in the "Protection Group" "Virtual Machines" tab?  This should be enabled.  If so then does it work?  If you remove all the VMs in the group and then run Reprotect again does it now work?  (Given where it's failing I suspect it won't work, but it's worth a shot.)

3) Is the ability to remove the protection group just not available when it's in this state? I suspect it may be given our UI specification.  Otherwise, if it is actually available then with what error does it fail when invoked?

Thanks,

-Ryan

As luck would have it, I wasn't able to reproduce the problem for 2 hours earlier this morning, then this afternoon it showed up in a live customer demo :) Reprotect and Reprotect with Force Cleanup do not complete; both fail instantly. This time the failure was in step 1.0/1.1: "Error - Failed to reverse replication for failed over devices. Cannot process device '21666' with role 'target' when expected device with role 'promotedTarget'."
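The error text reads like a role check that runs before replication is reversed: the failed-over device was expected to report the role `promotedTarget`, but it reported `target`. A minimal Python sketch of that kind of check follows. It is an assumption about the shape of the check, not SRM or SRA internals: `find_role_mismatches` and the dictionary layout are hypothetical, and the second device id is invented for the example. Only the two role names and device id 21666 come from the error above.

```python
from typing import Dict, List

EXPECTED_ROLE = "promotedTarget"  # role name taken from the error text


def find_role_mismatches(device_roles: Dict[str, str],
                         expected: str = EXPECTED_ROLE) -> List[str]:
    """Return one message per device whose reported role is not the expected one."""
    problems = []
    for device_id, role in sorted(device_roles.items()):
        if role != expected:
            problems.append(
                f"Cannot process device '{device_id}' with role '{role}' "
                f"when expected device with role '{expected}'."
            )
    return problems


# Hypothetical discovery result: device id -> role reported by the array.
# '21667' is an invented id used only to show a device that passes the check.
reported = {"21666": "target", "21667": "promotedTarget"}
for message in find_role_mismatches(reported):
    print(message)
```

If the check really works along these lines, the failure would persist until the array itself reports the expected role for that device, whatever options are chosen in SRM.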

1)  Yes.  After I remove the PG, the RP goes into an error state: "This plan cannot be run because it does not contain any protection groups." On the PG side, the PG is still in the "Reprotecting..." state and cannot be edited or deleted. I then ran a Reprotect against the Protection Group after the VMs were removed from the PG individually. This resulted in an immediate failure, as did the Force Cleanup option.

2)  Yes. I can remove all VMs from the protection group. At that point, the PG still exists in a "Reprotecting..." state and cannot be edited or deleted.

3)  I've only seen the inability to remove the PG when it's hung in the "Reprotecting..." state.

Screenshots and logs sent to Lee Dilworth via FTP/email followup.

Thank you,

Jas

Message was edited by: jasonboche  Added additional step/response to 1)

ian0x0r
Contributor

Did you ever get this resolved, Jason? I've got the same situation now with Dell EqualLogic kit and I don't really want to go down the route of uninstalling and reinstalling SRM to fix this.

Thanks,

Ian

russiamutha
Contributor

Had the same issue; see the "force removal of protection group" thread below.

The only way to fix this is to reinstall, as there is corruption in the DB. You will need to start with a new DB as well.

http://communities.vmware.com/message/1877533#1877533

jasonboche
Immortal

ian0x0r wrote:

Did you ever get this resolved, Jason? I've got the same situation now with Dell EqualLogic kit and I don't really want to go down the route of uninstalling and reinstalling SRM to fix this.

Thanks,

Ian

Lee and a few others were able to spend considerable time with me on this. In my instance, I was able to solve the problem without uninstalling/reinstalling the environment. I did so by cleaning up the storage and replication which SRM managed at both sites. One or more of the volumes at one or more of the sites was in a precarious state from a replication standpoint. This was causing the SRA to return a status to SRM which SRM did not like, so SRM would not proceed until the precarious state was resolved and a more appropriate status code could be returned by the SRA. Once that underlying storage issue was remedied, the force cleanup worked instantly, as one might expect it to, and I was then able to tear down the recovery plan and the hung protection group, fix the direction of replication as needed for the LUN(s), then simply recreate the protection group and recovery plan.

As I cross-posted in the other thread linked above, I think there needs to be an ability to cut away the protection group & recovery plan without reinstalling the environment. From the looks of the other thread, that person will not be able to forcefully remove the protection group. SRM should not allow the SRA to interfere with the integrity & protection of the rest of the environment, particularly where there could be other array models involved. This is an Achilles' heel from an architecture standpoint.

Thank you,

Jas

ian0x0r
Contributor

Thanks for that.

I think in my instance things are too far gone. I have reconfigured the SAN replication correctly, but I had already removed the recovery plan and removed the VM from the protection group. Creating a new recovery plan and adding the broken protection group to it fails instantly with the error message: Call "DrRecoveryRecoveryManager.Reprotect" for object "DrRecoveryManager2". I think I'm going to end up reinstalling this environment. Good job it's test/dev really.

Thanks,

Ian 
