Solved: Another "datastore not found" error

gheywood · ‎09-08-2010

Hello,

Occasionally I get this error while running plans.

Let's say I have five VM's spread over two volumes. Three of the five may recover fine, but the others will generate this error. They will be there, but the "change network settings" step (and onwards) won't have been done.

When it happens, I can press "continue" to finish the task and try the test later, which will normally work.

Last time this happened, I found the following error in the vmware-dr.log file:

Section for VMware vCenter Site Recovery Manager, pid=1532, version=4.0.0, build=build-236215, option=Release

RecordOp ASSIGN: runtimeInfo.runtimeStatus, RSVm-87974

RecordOp ASSIGN: runtimeInfo.finishTime, RSVm-87974

Released VC LRO semaphore, token = '947'

Progress is unchanged

RecordOp ASSIGN: info.error, RSGroup-87813SecondaryShadowVMRecover-7724

Error set to (dr.san.fault.RecoveredDatastoreNotFound) {

dynamicType = ,

faultCause = (vmodl.MethodFault) null,

datastore = (dr.vimext.SanProviderDatastoreLocator) {

dynamicType = ,

primaryUrl = "sanfs://vmfs_uuid:4bfe3ea4-fc89b582-71eb-0024817df61b/",

},

reason = (vmodl.MethodFault) null,

msg = "",

}

RecordOp ASSIGN: info.completeTime, RSGroup-87813SecondaryShadowVMRecover-7724

State set to error

RecordOp ASSIGN: info.state, RSGroup-87813SecondaryShadowVMRecover-7724

Not Starting Tasks All Tasks Complete

Task destroyed

MRT-DoneCallback Task RSGroup-87813SecondaryShadowVMRecover-7724 for RSGroup-87813

SetTaskComplete

SetRuntimeStatus for RSGroup-87813 from running to error

RecordOp ASSIGN: runtimeInfo.runtimeStatus, RSGroup-87813

RuntimeInfoError (dr.san.fault.RecoveredDatastoreNotFound) {

dynamicType = ,

faultCause = (vmodl.MethodFault) null,

datastore = (dr.vimext.SanProviderDatastoreLocator) {

dynamicType = ,

primaryUrl = "sanfs://vmfs_uuid:4bfe3ea4-fc89b582-71eb-0024817df61b/",

},

reason = (vmodl.MethodFault) null,

msg = "",

}

RecordOp ASSIGN: runtimeInfo.finishTime, RSGroup-87813

RecordOp ASSIGN: runtimeInfo.runtimeFault, RSGroup-87813

Task created

FormatField: Optional unset (dr.san.fault.RecoveredDatastoreNotFound.reason)

Starting Task RSGroup-87994SecondaryShadowVMRecover-7942 for step RSGroup-87994

MultipleRecoveryTask Info max 1 cur 1 remain 5

RecordOp ASSIGN: info.startTime, RSGroup-87994SecondaryShadowVMRecover-7942

The datastores are obviously available as the VM's are being presented.
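For anyone wanting to pull these faults out of a longer vmware-dr.log, here is a rough Python sketch. It assumes the fault blocks look like the excerpt above (a `RecoveredDatastoreNotFound` header followed by a `primaryUrl` line inside braces); the function name and the sample lines are just for illustration, and other SRM builds may format the log differently.

```python
import re

# Fault header as seen in the excerpt, e.g.
#   Error set to (dr.san.fault.RecoveredDatastoreNotFound) {
FAULT_RE = re.compile(r"\(dr\.san\.fault\.RecoveredDatastoreNotFound\)\s*\{")
# Datastore locator line, e.g.
#   primaryUrl = "sanfs://vmfs_uuid:<uuid>/",
URL_RE = re.compile(r'primaryUrl\s*=\s*"sanfs://vmfs_uuid:([0-9a-fA-F-]+)/?"')


def find_missing_datastores(lines):
    """Return the VMFS UUIDs named in RecoveredDatastoreNotFound faults, in log order."""
    uuids = []
    in_fault = False
    depth = 0  # brace nesting inside the current fault block
    for line in lines:
        if not in_fault:
            if FAULT_RE.search(line):
                in_fault = True
                depth = 1  # the header line opens the block
            continue
        match = URL_RE.search(line)
        if match:
            uuids.append(match.group(1))
        depth += line.count("{") - line.count("}")
        if depth <= 0:
            in_fault = False  # closing brace of the fault block reached
    return uuids


# Sample lines copied from the excerpt above.
sample = [
    "Error set to (dr.san.fault.RecoveredDatastoreNotFound) {",
    "dynamicType = ,",
    "datastore = (dr.vimext.SanProviderDatastoreLocator) {",
    'primaryUrl = "sanfs://vmfs_uuid:4bfe3ea4-fc89b582-71eb-0024817df61b/",',
    "},",
    'msg = "",',
    "}",
    "FormatField: Optional unset (dr.san.fault.RecoveredDatastoreNotFound.reason)",
]
print(find_missing_datastores(sample))
```

Feeding it the lines of a real log file (for example via `open(path)`) should list one UUID per fault, which can then be matched against the datastores the recovery hosts actually see.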

We are running various Equallogic SAN's to host the VM's. I did log this a while ago and VMware suggested that it could be because of a loss of contact between the two VC's, but the error which they found was actually slightly earlier in the day and related to another issue (a timeout issue).

Any thoughts on this one?

vijayagce · ‎09-09-2010

Hi,

Could you please give me details of the primary and recovery site configurations, such as the number of ESX servers at the primary site, the number of nodes in the primary site cluster, the number of ESX servers at the recovery site, and the number in the recovery cluster? I think it may be a resource problem, because we hit the same issue when the recovery site was short of resources.

Regards,

Vijaya

gheywood · ‎09-10-2010

We have five DL380 G6's at our production site with:

Total CPU: 117 GHZ

Total Memory: 319.96 GB

DR is smaller. We have four DL380 G5's.

Total CPU: 60 GHZ

Total Memory: 73.99 GB

Not sure how much that tells you, but we also run a test environment at DR and continue to run it while DR tests are ongoing. That said, although the boxes are slower and nowhere near as beefy (with older-generation CPU's) and the VM's are obviously slower, I am generally not at the point of maxing out the host CPU's or RAM.

The other difference is storage. We have much quicker storage at production because we have more units and more spindles and use multiple paths to connect to the storage units. While the hosts are configured in DR to use multiple paths, VMware defaults to "fixed path" for the newly connected volumes. They are also connecting to volumes on one SAN (48 disks at DR compared to 128 disks across 6 units in production).

VMware are pointing at the SRA, but the storage is being presented to the hosts and the hosts are seeing the storage (I can see it listed as a datastore and I can browse it). I think perhaps SRM is attempting to register or perform some operation on those failing VM's before the hosts have finished configuring the connection to the datastore (or something like that).

vijayagce · ‎09-10-2010

Did you try this scenario:

Create a recovery plan with only the one VM that failed with the network configuration error.

Then we can pin down where the problem starts.

Regards,

gheywood · ‎09-10-2010

Not specifically one VM, but I split the plan into three smaller plans and it does seem much more reliable (the smaller plans haven't failed). Even with the single plan, though, it would often work, so the failure is intermittent.

vijayagce · ‎09-10-2010

Did you check whether the VM network is set to auto config when creating the recovery plan?

gheywood · ‎09-10-2010

No, my test networks all go onto a specific VMPG on all my plans. Why do you think that could be a factor?

vijayagce · ‎09-10-2010

During SRM certification we used only auto network configuration, because the SRM config guide from VMware says to set the network config to auto.

Can you test with auto config to see whether it works? Also, may I know what VMPG is? I am not aware of it.

Regards,

vijaya

gheywood · ‎09-10-2010

VMPG = Virtual Machine Port Group.

I have set it to auto and all is OK at the moment. I am going to do some more testing over the next few days and will report back.

Also found this in the Admin guide:

"By default, the test network is specified as Auto, which creates an isolated test network. If you would prefer to specify an existing recovery site network as the test network, click Auto and select the network from the drop-down menu."

So it should work on networks other than auto...

TimOudin · ‎09-10-2010

For intermittent errors like this when registering virtual machines, I have found that setting the ESX servers to rescan for storage twice resolves it. There is actually a KB article outlining some storage arrays for which this is recommended. On the recovery site, edit the advanced settings and set SanProvider.hostRescanRepeatCnt = 2. The default for this value is 1. Try it, it can't hurt anything!

Tim Oudin

gheywood · ‎09-10-2010

Just got the error again with the networks set to auto.

Thanks Tim, there are a couple of those settings that I am intending to look at: that one, and some of the timeout ones. I will make a couple of changes and see what happens.

TimOudin · ‎09-10-2010

Found KB 1008283 for reference, all thanks to @dawoo's response to my Twitter rant.

This refers to a total failure to recover a datastore but the concept is the same.

Tim Oudin

gheywood · ‎09-14-2010

Well, after more testing yesterday, changing the rescan value to 2 seems to have done the trick. I did have a few timeout issues (which, when I have seen them in the past, have been to do with communication between the two VC's), but the "datastore not found" error hasn't occurred in about 10 attempts. On Friday, it was occurring maybe 40% of the time.
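As a rough sanity check on whether 10 clean runs means much: if each run still failed independently at roughly the 40% rate seen on Friday (an assumption, since runs on the same storage may well be correlated, and both figures are approximate), the chance of 10 clean runs in a row by luck alone would be small. A quick Python calculation:

```python
failure_rate = 0.40  # rough rate observed before the change
clean_runs = 10      # attempts without the error after the change

# Probability of this many clean runs in a row if nothing had changed,
# assuming each run fails independently at the same rate (a simplification).
p_by_chance = (1 - failure_rate) ** clean_runs
print(f"{p_by_chance:.4f}")  # prints 0.0060
```

So under those assumptions the clean streak would be well under a 1% fluke, which supports the rescan change being the fix rather than coincidence.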

TimOudin · ‎09-14-2010

Is there anything new in the logs after the datastore mount failures?

Tim Oudin