Itzikr
Enthusiast

Error: Network device needed by recovered virtual machine couldn't be found at recovery or test time

Hi,

I'm hitting a very strange showstopper with SRM 4.

Basically, I'm using a vDS (dynamic port allocation) at both the protected and the recovery site. After a successful inventory mapping and the creation of a Protection Group / Recovery Plan, everything works fine at first. Then, when I try to create another PG, the first VM I try to protect times out. I can resolve this by restarting the SRM services; after the restart I can protect the VM, but any attempt to execute a recovery plan ends with the error mentioned above.

It looks like the SRM DB "forgets / deletes" this network mapping, and even a manual network mapping does not work. The GUI doesn't show the network mapping either.

Itzik

Itzik Reich
babyg_wc
Enthusiast

I get a similar issue.

I create a Protection Group and Recovery Plan.

The first time I run a test on the Recovery Plan I have no issues; everything works as it should. If I re-run the test, no problems.

However, after a period of time (not exactly sure if it's time-related, or if it's because I've done something with the protected VM), when I go to re-run (test) the recovery plan I get the same message you have posted.

I am going to do some more testing, and will probably log an SR.

I am running the latest version of SRM (4.0.1), the latest version of ESX with all patches, and the latest version of vCenter.

I have dual EVA8400s, with the latest version of the SRA.

Itzikr
Enthusiast

Hi Mate,

I opened an SR as well. VMware support spent the last two weeks trying to reproduce the issue, and last week they succeeded, so it's a confirmed NASTY bug!

Also, you didn't do anything wrong. Every time you execute a recovery plan, the mapping from the protected network to the recovery network disappears, and you cannot edit it unless you totally wipe out the RP and the PG.

Solutions Architect

VCP,VTSP,MCTS,MCITP,MCSE,CCA,CCNA

EMC²

where Information Lives

If you find this information useful, please award points for "correct" or "helpful".

Itzik Reich
babyg_wc
Enthusiast

There are quite a few "gotcha" bugs in the v4 range of products (I've been using ESX since 2.x). We also have a confirmed vSphere bug with SIX-core NUMA boxes not scheduling 4- and 5-vCPU machines well (KB to be released soon).

Back to SRM.

I find that SRM leaves the PROTECTED-side ESX hosts in a poor state when you actually fail over one or more of your protection groups (no problem for a test run, as it uses snapshots etc.).

Basically, when you fail over a protection group, SRM tells the SAN to fail over, and on an HP EVA the failover automagically unpresents the LUN or LUNs from the ESX hosts on the PROTECTED side and presents them to the RECOVERY side. But SRM doesn't do anything with the ESX hosts at the protected side; they just have their LUN(s) ripped out from under them. That's not a problem in a FULL DR (site gone), but it's less than ideal when you only want to fail over one or two protection groups.

By ripping out LUNs from underneath running ESX hosts, the Service Consoles then spend forever trying to find the missing LUN(s). If a few LUNs are removed, I find that you can even lose contact with the Service Console altogether; the only way to get the SC back is to fail the LUNs back to the protected site.

Check out this article for the issue. It's probably not a problem for EMC arrays with PowerPath, but it is a problem for my HP EVA using VMware Round Robin:

http://virtualgeek.typepad.com/virtual_geek/2009/12/an-important-vsphere-4-storage-bug-and-workaroun...

Itzikr
Enthusiast

True, but that's not my issue. I'm using EMC RecoverPoint with PowerPath/VE installed on the protected ESX servers (the winning combination from my perspective, but I'm biased).

I'm basically waiting for VMware engineering to come up with a patch to resolve the networking issues.

babyg_wc
Enthusiast

Yeah, I have two separate issues, though my "network device" error is not quite as predictable as yours. I can create the 2nd protection group and things still work, though the GUI doesn't show network mappings for the 1st protection group (but it still works, for now). I wonder if it's just a vDS issue?

Anyhow, I'm going to log an SR for both problems, and link them to this article.

This will be 3 SRs on the go now.

Let me know if you get a patch. Also, would it be OK to quote your SR number when I log my SR (if so, what is your SR number)? I find it takes days/weeks to prove something is wrong before you get to the "let's make a patch" level of support.

Brandon.

mal_michael
Commander

The APD bug has been fixed in the ESX400-200912401-BG patch. More info: http://ict-freak.nl/2010/02/25/vsphere-apd-bug-is-solved-in-patch-esx400-200912401-bg/

So, with this patch installed you should not experience the issue anymore.
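
For anyone who wants to confirm the bulletin is actually present on a host before testing, here is a minimal sketch. It assumes an ESX 4.0 service console where "esxupdate query" lists the installed bulletins; the helper name has_bulletin is made up for illustration and is not part of any VMware tool.

```shell
# Bulletin ID as quoted in the post above.
APD_BULLETIN="ESX400-200912401-BG"

# has_bulletin is a hypothetical helper: it reads a bulletin listing on stdin
# and reports whether the given ID appears anywhere in it.
has_bulletin() {
    if grep -qF -- "$1"; then
        echo "installed: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Assumed usage on an ESX 4.0 service console (not executed here):
#   esxupdate query | has_bulletin "$APD_BULLETIN"
```

The helper only matches the ID as a fixed string, so it says nothing about whether the patch resolves the behaviour on a given array.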

babyg_wc
Enthusiast

Most excellent, I will test that out tomorrow. Sorry, I can't award points as I didn't start the thread.

Itzikr
Enthusiast

Just wanted to say that the networking bug still happens even after ESX 4 U1 plus the latest patches.

babyg_wc
Enthusiast

Did some checking: all my ESX hosts are fully updated (including the patch the article reckons fixes the issue). However, the problem still exists.

After ESX 4 U1 you can run an advanced command. HP recommends running the following (note: it requires ESX 4 Update 1):

    esxcfg-advcfg -s 1 /VMFS3/FailVolumeOpenIfAPD

I've logged a call with VMware as well.

http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&taskId=110&prodSeriesId=...
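
A small sketch of how the setting above could be checked before it is changed. It assumes the standard -g (get) and -s (set) flags of esxcfg-advcfg on the service console; the wrapper names apd_cmd and run_apd and the DRY_RUN switch are hypothetical, added only so the commands can be previewed without touching a host.

```shell
# Option path as quoted in the post above.
APD_OPTION="/VMFS3/FailVolumeOpenIfAPD"

# Build the esxcfg-advcfg command line for "get" or "set <value>".
apd_cmd() {
    case "$1" in
        get) echo "esxcfg-advcfg -g $APD_OPTION" ;;
        set) echo "esxcfg-advcfg -s $2 $APD_OPTION" ;;
        *)   echo "usage: apd_cmd get|set <value>" >&2; return 1 ;;
    esac
}

# Print the command when DRY_RUN is 1 (the default in this sketch);
# execute it only when DRY_RUN=0 is set explicitly.
run_apd() {
    cmd=$(apd_cmd "$@") || return 1
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$cmd"
    else
        $cmd
    fi
}
```

Calling run_apd get first and recording the current value makes it easy to revert with run_apd set and the old value if the change does not help.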

babyg_wc
Enthusiast

Hi Itzikr, I've got two SRs logged: one for the "Error: Network device needed by recovered virtual machine couldn't be found at recovery or test time" issue, and another for the storage issue.

Could I have your SR number at all?

My numbers are:

Issue #1 SR#1497681341 (storage issue)

Issue #2 SR#1497592691 (network SRM issue)

I'm going to test the network SRM issue on a site I have that has NO vDS.

Itzikr
Enthusiast

Hi,

My SRM Network issue SR is: 1492451261

Itzik

Witek_Rolka
Contributor

Hi all,

When I create a Protection Group and Configure All the VMs under it, the Recovery Site Network mappings are created for each VM based on the Inventory Mappings.

When I run a Recovery Plan test, what seems to happen is that at the end of the test the Protection Group loses all the Recovery Site Network mappings from every VM in it. If I try to add the appropriate Recovery Site Network to the VM through the Configure Protection option, the setting doesn't get saved, even though I select the matching Distributed Port Group at the DR site from the drop-down box.

The only way I can re-create the Recovery Site Network mappings is to Remove Protection from all the VMs and reconfigure the protection from scratch. This buggers up the VM network change process because the VM IDs change every time the protection configuration process is run.

Look forward to hearing the results of your SR.

Kind regards

Witek

babyg_wc
Enthusiast

Been on the phone/email with VMware. Apparently the bug is logged with a Yellow flag (p1), and is affecting multiple customers.

So just need to wait for them to get back to us with a fix now.

CRO
Contributor

I have also come across this issue, and have logged SR#1498901701.

To add to the different conditions this problem is seen under:

We use standard vSwitches on the protected site, and a DVS (Static Binding) on the recovery site. Our replication is done with RecoverPoint/SE between a CX3 at the protected site and a CX4 at the recovery site.

VMware support said a patch was due in June.

babyg_wc
Enthusiast

Better not be JUNE... as it stands SRM4 is NOT usable....

CRO
Contributor

A bit more info:

The fix for this issue will be supplied via SRM U1, which is due in June. It is currently in QA, so some people have their hands on it already.

As a workaround, I have "downgraded" to standard virtual switches, and in the testing I've done so far it seems to work fine.

Thank goodness for host profiles.

babyg_wc
Enthusiast

Yep, I've given up BETA testing SRM + vDS at this stage. I reverted to standard switches, and that SRM bug is no longer standing in the way. I have successfully used SRM to fail over 120 VMs (then reversed it and failed back).

I've had an SR logged for a number of weeks, with very little (i.e. NO) progress. I got more useful information out of this thread (e.g. standard switches work OK).

Quality control at VMware has certainly gone downhill in the past year or so. How could they release a version of SRM that doesn't even work??? (rant ends).

Thanks to everybody for their input. Sorry I have no really useful/good news at this stage.

Michelle_Laveri
Virtuoso

I do sympathise with your situation. I recall having this problem in my development environment whilst writing the SRM book. At first I thought it was something to do with me or my configuration - I've had some long standing network problems. I did raise this issue of lost switch configuration informally with VMware some months ago, but they said at the time that they hadn't seen anything of this nature.

So in some respects I'm pleased this thread exists, as it at least validates that I'm not going crazy and it's not my kit/configuration after all.

It does seem a shame that VMware cannot fix this with a defcon 1 patch, considering it affects their premium customers who have paid a premium for Enterprise+ and the DvSwitch it provides.

On the subject of quality control, I know personally that there are changes coming through internally at VMware that might change this situation significantly. Unfortunately, I'm not yet at liberty to talk about this publicly. All I can say is that comments like yours (and mine) have been duly noted, and there are steps afoot to address those concerns.

In the meantime, I understand there is a timetable for another dot release that may fix this problem, but from what I understand that might not be for another couple of months.

Regards

Mike Laverick

RTFM Education

http://www.rtfm-ed.co.uk

Author of the SRM Book: http://www.rtfm-ed.co.uk/2010/03/22/new-administrating-vmware-site-recovery-manager-4-0/

Free PDF or at-cost Hard Copy

Regards Michelle Laverick @m_laverick http://www.michellelaverick.com
Itzikr
Enthusiast

Hi,

I haven't updated this post for a while, so here's the status:

VMware has delivered me a special build of SRM 4.0.1 that FIXES the issue.

The remaining issues I have are:

1. After protecting a VM and running a recovery plan test, the network mapping disappears from the GUI. VMware confirmed this to be a vCenter "design" bug rather than an SRM one, and does not have an estimate for a fix.

2. When you Storage vMotion a VM to a protected group for which you have already created a PG and an RP, the moved VM shows as unconfigured, and even if you configure it, it still shows as unconfigured. They confirmed this as another BUG, with no time estimate for a fix.

3. With this special build of SRM, I sometimes get a datastore error, but the VM does boot up successfully.

Overall, I agree with the opinion here that VMware QA started to look bad as of last year. I'm guessing it has to do with the fact that they are growing rapidly, but as Mike mentioned, they are aware of this issue, and overall it took 6 weeks to fix my bug.
