VMware Cloud Community
TimCYTse
Contributor

SRM Failover stuck for RDMs

Hi all,

I am experiencing an intermittent failover failure in both the failover "test" and the full "recovery" process.

We have 3 x ESXi 5 hosts at the production site and 2 x ESXi 5 hosts at the remote site. The volumes are replicated by EMC RecoverPoint.

vCenter SRM is set up to handle the DR actions.

The problem occurs for a protection group containing two Windows 2008 VMs that share 3 x RDMs in physical mode.

The failover process gets stuck at "Power-on VM > Configure Storage". It keeps waiting and never finishes, shows no obvious error message, and cannot be cancelled.

From what we observed, the last action SRM performs involves "srm-rdm-helper".

The problem seems to occur intermittently and randomly, but we found a pattern: whenever SRM creates more than one "srm-rdm-helper", the problem occurs.

The problem does NOT occur when we unassign all RDMs or keep only a single RDM for the VMs, because no "srm-rdm-helper" is created when no RDM exists, and only one "srm-rdm-helper" is ever created when there is a single RDM.

Through repeated troubleshooting tests, we have basically established that the problem is not related to vSphere HA, DRS, or which ESX host the VM runs on.

We have opened a support case, but so far we haven't had any good news from them.

Has anyone had a similar experience? Has anyone run this configuration successfully? Is there anything else I can try?

Tim

10 Replies
lhevia
Contributor

We have configured:

  • 2x VC 5.0
  • 4x ESXi 5.0 u1 per site
  • SRM 5.0.1
  • HP P6000 SRA 5.0.0
  • 2x HP EVA8100 replicated with HP Command View 10.0 and Continuous Access.

We have the same problem with a 2-node MSCS cluster on Windows 2003 with 2 RDM disks.

However, we don't have any problem if we use a single VM with 2 RDM disks.

We have also opened a support case.

Have you found any solution?

Regards,

Luis.

TimCYTse
Contributor

Hi,

I opened a support case more than 20 days ago. They are still working on it.

So far, I haven't received any solution or workaround.

I will post an update here if there is any good news.

Tim

lhevia
Contributor

We had a WebEx session with a VMware engineer, and he said that it is a deadlock condition that occurs when SRM assigns the RDMs using the srm-rdm-helper.
He said it is a bug and that they are working on a fix.

TimCYTse
Contributor

Hi lhevia,

That's good news for me. Our support case is still under investigation, and they replied that they cannot reproduce the problem in their own environment.

Would you mind doing me a favor? Could you send me your support case number so that I can refer my case owner to it?

My e-mail address is cytsetim@gmail.com.

Many Thanks!

Tim

cqde
Contributor

I have just run into a similar error.

I have three ESX 5 hosts at production and two hosts at DR; all are Build 469512.

SRM is 47459

I have an EMC VNX5100 replicating to a CX3-20 using MirrorView/S

VNX SRA version 5.0.1

MirrorView Enabler 5.0.8

All my VMs fail over, in both test and actual recovery, except the two clusters that are using RDMs.

One of the clusters has two RDMs attached (quorum and data drive) and the test failover worked twice in a row.

The second cluster has three RDMs attached, and during the test it hung in the Power on VMs phase at 10%.

On the DR site it shows the Recovery Plan is at 66%.

On the DR vCenter I get the following error:

Mount VMFS volume
vsphere1
The operation is not allowed in the current state.
Administrator
DR-VCENTER
4/14/2012 4:02:40 PM
4/14/2012 4:02:40 PM
4/14/2012 4:02:40 PM

I see two helper VMs on the DR side.

When I tried to cancel the test, it said it was cancelling and then sat there forever.

At this point, any other SRM operation fails.

Eventually I had to uninstall SRM and clean up all the snapshots before re-installing everything.

I even tried uninstalling SRM, and vCenter at both sites, re-installed everything, ran the test again and it failed at exactly the same point.

At least I can reproduce the error.

I am waiting for VMware support to look at the case.

cqde
Contributor

Does anybody know how to get out of the stuck situation without uninstalling and re-installing SRM?

Cancel does not work; it just hangs and then stops other SRM operations from working.

I have re-installed SRM six times at the same site now. HELP!!!

lhevia
Contributor

It is not necessary to reinstall.

You will have to:

  • Restart the SRM services on both sites
  • Run the test again; it will fail.
  • Delete the VMs named "srm-rdm-helper"
  • Reprotect
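
As a rough illustration of the third step, the sketch below picks leftover helper VMs out of a list of VM names. It assumes the helpers' names begin with "srm-rdm-helper" (the thread only gives that name, so matching on a prefix is an assumption), and `find_helper_vms` is a hypothetical function. Getting the VM names out of vCenter and actually deleting the VMs are left out.

```python
# Name reported in this thread; matching it as a case-insensitive prefix is an assumption.
HELPER_PREFIX = "srm-rdm-helper"


def find_helper_vms(vm_names):
    """Return the names that look like leftover SRM RDM helper VMs."""
    return [name for name in vm_names if name.lower().startswith(HELPER_PREFIX)]


if __name__ == "__main__":
    # Made-up inventory, for illustration only.
    inventory = ["sql-node1", "srm-rdm-helper", "sql-node2", "srm-rdm-helper-2"]
    for name in find_helper_vms(inventory):
        print("candidate for manual deletion:", name)
```

Reviewing the matches by hand before deleting anything seems safer than deleting automatically, since the exact helper naming may differ between SRM builds.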

We are still waiting for a solution to the full problem.

TimCYTse
Contributor

Hi cqde,

I opened a support case for this problem more than a month ago.

VMware support confirmed that it is a bug.

I believe nothing can avoid the problem except waiting for a hotfix.

Hopefully the hotfix will be delivered within the next 2 weeks.

For now, I am chasing VMware support every day.

Tim

TimCYTse
Contributor

Finally... I received the hotfix for this issue today.

VMware-srm-5.0.1-690170

I am downloading it now and hope everything will be OK.

Tim

vlaxa
Contributor

Where did you get hotfix 690170?

Update: I filed an SR; the relevant information is in KB 2020532: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=202053...
