Re: need a little guidance in troubleshooting this...

rickkar1 · ‎06-17-2012

Just need a little guidance troubleshooting the following error message in SRM, i.e., what are the likely causes?

"Error - Failed to recover datastore 'xxx.xxx.xxxx.xxx'. Failed to create snapshots of replica devices. Failed to create snapshot of replica device xxxxxxxxxx. SRA command 'testFailoverStart' failed for device xxxxxxxxx. UnableToConnectException while making snapshot SRMt-xxx-xxx-xxxx-xxxxx : Retries ExceededNull Client xxx.xxx.xxxxx.xxxxxx ........"

makethevapor · ‎06-17-2012

I have worked through the same error a couple of times... Check to make sure that the service account for SRM on the array is not locked out and has admin privileges.

Also check the documentation for your array and make sure that the recovery LUN is presented as recommended. With EMC MirrorView, for example, you have to make an inactive snapshot of the LUN and present it to the cluster. SRM then goes in and activates the snapshot. One time I had to actually delete some old snaps and recreate them to get past that error. Also, with MirrorView you have to have VMWARE_SRM_SNAP in the name of the snapshot. In other words, there are a lot of gotchas in the way the LUN is presented on the array, and they vary from array to array based on what that particular vendor's SRA does exactly.

In RecoverPoint it's a bit different. Just check your documentation.

If all else fails, reinstall the SRA and any array software that is needed, such as EMC Solutions Enabler. Good luck.

jeff9565 · ‎06-17-2012

Can you tell us a few more details?

What storage platform?

When does this error happen?

Jeff

TheITHollow · ‎06-20-2012

If you're using NetApp filers, check that your FlexClone license has not expired.

http://www.theithollow.com

FM-DK · ‎06-21-2012

Hi

I had the same problem.

Check my solution at http://communities.vmware.com/thread/403526?tstart=30

It is a known bug in NetApp SRA.

Regards

André

rickkar1 · ‎06-22-2012

Thx Andre. The challenge is that it is intermittent: it works on some LUNs but not on others...

CorruptedLogic · ‎08-15-2012

Did anyone ever get a resolution for this? I have the exact same issue using SRM 5, Dell EqualLogic SAN arrays (PS5000 & PS6000 in one group @ production site, PS4000 @ recovery site), Dell SRA v 2.1.

Some volumes will test failover perfectly, but others will error exactly as the OP reported.

VMware tech support looked at the SRM logs and can see that an error is being generated by the SRA at the point of failure.

Dell are giving me the runaround, telling me that "we provide the SRA as a convenience and don't really support it". Really, Dell? Really?

Among various other things, I have re-installed the SRA, tweaked advanced settings in SRM, vMotioned the VMs on the LUN(s) in question to a different host (hey, I'll try anything at this juncture!), rebooted Virtual Center on both sides of the equation and bashed my head against my desk for multiple hours.

VMware have kicked their side up to upper management to try to get some traction (thank you VMware, you have been most helpful as usual).

I am currently awaiting a call back from the (third) Dell / EqualLogic tech to see what they can come up with today.

Sorry to be ranty, but this has now been going on for several weeks and we have a full DR test looming in 2 weeks. I'm somewhat at the end of my rope on this one.

Oh, a quick postscript... here is a list of facts that may be of some assistance to other folks (basically, this is a list of observations I have made whilst troubleshooting):

1.    Able to snapshot VM / LUN in ESX.
2.    vMotion to a new host makes no difference.
3.    LUN size seems to make no difference.
4.    Not an overcommitted SRM licensing issue.
5.    Able to snapshot from EQL (protected side).
6.    Replications are completing successfully for all volumes as scheduled (on-demand replicas work also).

rkuczma · ‎08-17-2012

CorruptedLogic, out of interest, are you using Smart Copy replicas or Group Manager replicas?

Did you select the correct option when installing the SRA?

I had a similar problem and EqualLogic support had me raise storage.CommandTimeout to 900 and set storageprovider.fixrecovereddatastorenames in the SRM advanced settings at each site. This fixed it for me (well, that bit anyway; failback still has problems, but that is due to lower-than-ideal firmware).

Regards Richard Kuczma

CorruptedLogic · ‎08-20-2012

rkuczma, I am using Group Manager to make the replicas, and yes, the correct option was selected at install time of the SRA (I had also tried the timeout change you suggested, alas, to no avail).

After a long call with some senior EqualLogic guys on Thursday of last week, we found the root cause of the issue and a workaround. Please bear in mind that the issue I was having was that some LUNs would test failover just fine, whilst others would consistently fail (this suggested that the configuration was not the issue). The problem was thus...

I had a large amount of space delegated on the recovery site for replication (approximately 200% of each replicated LUN plus some wiggle room, per the documentation). Consequently, I had very little Free Space on the recovery site, around 7% or so (that is, un-delegated space available to the array for whatever it needs it for). Ordinarily this wouldn't really be a problem, as I don't NEED the free space: this is a DR site and its sole purpose in life is to sit dormant until needed.

The problem came about by the SRA attempting to clone a pre-existing replica into Free Space (rather than using Delegated Space) and then set that online as a volume. Essentially, whenever this happened, the array ran out of Free Space completely and (in the words of the EQL tech) "really weird things happen when the array runs out of free space". This to me reeks of a design flaw. Surely the SRA should simply set the latest replica of the volume online using the Delegated Space (like Group Manager would if I were to manually set a replica online) rather than make another copy of the replica in Free Space and consume the lot... what is the point of delegating all that space if you're not going to use it come the time?

The workaround was to reduce the amount of Delegated Space (thus increasing the amount of Free Space) so that replicated volumes could be cloned into Free Space and set online. This does make things work and will allow us to test our DR plan; however, this to me is not a long-term solution. Dell / EqualLogic need to take a look at the SRA and have it use the Delegated Space on the array instead of the Free Space (after all, configuring it this way contradicts their own documentation).
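For anyone wanting to sanity-check their own recovery site before a test, here is a rough Python sketch of the free space arithmetic described above. It is only an illustration: the function names, the volume names and every number are made up, and the amount of Free Space each clone needs is something you would have to estimate yourself (the thread does not say how the array accounts for it).

```python
def free_space_gb(pool_capacity_gb, delegated_gb, other_used_gb):
    """Un-delegated, unused space left in the pool (hypothetical model)."""
    return pool_capacity_gb - delegated_gb - other_used_gb


def plan_test_failover(free_gb, clone_needs_gb):
    """Walk the volumes in order and see which clones would fit in Free Space.

    clone_needs_gb maps volume name -> estimated Free Space the clone needs.
    That estimate is an assumption supplied by the caller.
    Returns (fits, fails, remaining_gb).
    """
    remaining = free_gb
    fits, fails = [], []
    for name, need in clone_needs_gb.items():
        if need <= remaining:
            remaining -= need
            fits.append(name)
        else:
            fails.append(name)
    return fits, fails, remaining


if __name__ == "__main__":
    # Illustrative numbers only: a 10 TB pool with about 7% left un-delegated.
    free = free_space_gb(pool_capacity_gb=10000, delegated_gb=8000, other_used_gb=1300)
    needs = {"vol-a": 200, "vol-b": 600, "vol-c": 400}
    fits, fails, left = plan_test_failover(free, needs)
    print("free space (GB):", free)
    print("would fit:", fits)
    print("would NOT fit:", fails)
    print("left over (GB):", left)
```

Under this simple model, whether a given volume fails depends on how much Free Space is left when its turn comes, which would at least be consistent with some LUNs testing fine while others fail.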

Hopefully this will help someone else and alleviate the need to bash your head against a hard object for 2 weeks before resolution.

paxri02 · ‎11-15-2012

For SRA versions 05.00.3x50.0017 or 05.00.3x50.0021

ISSUE:

During testFailover, the SRA creates snapshots of the replicated volumes on the recovery site. The SRA automatically adds 'SRMt-' to the snapshot volume name, and 'SRMt-' and '-R' to the repository volume name, during the testFailover process. The SRA automatically trims the snapshot name to 30 characters but does not trim the repository name to 30 characters, which causes the error "Retries ExceededNull Client" to be reported if the target base volume's name is longer than 23 characters.

WORKAROUND:

Rename all target volumes on the recovery site to 23 characters or fewer to allow for the addition of 'SRMt-' and '-R' to the repository volume name during the testFailover process.
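The 23-character figure is just arithmetic: 'SRMt-' is 5 characters and '-R' is 2, so 30 - 5 - 2 = 23. A minimal Python sketch for checking a list of target volume names before running a test; the function names and the sample volume names are hypothetical, and the 30-character limit is taken from the description above rather than from any array API.

```python
SNAPSHOT_PREFIX = "SRMt-"    # added by the SRA, per the description above
REPOSITORY_SUFFIX = "-R"     # added to the repository volume name
MAX_NAME_LENGTH = 30         # name length limit, per the description above


def max_base_name_length():
    """Longest base volume name that still fits once prefix and suffix are added."""
    return MAX_NAME_LENGTH - len(SNAPSHOT_PREFIX) - len(REPOSITORY_SUFFIX)


def names_too_long(volume_names):
    """Return the volume names that would push the repository name past the limit."""
    limit = max_base_name_length()
    return [name for name in volume_names if len(name) > limit]


if __name__ == "__main__":
    # Made-up names for illustration.
    sample = ["prod-sql-01", "a-very-long-replicated-volume-name"]
    for name in names_too_long(sample):
        print(f"rename before testFailover: {name} ({len(name)} chars)")
```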

Good Luck,

Rick
