VMware Cloud Community
maytrix0
Enthusiast

Problems running test failover with SRM 5 and EqualLogic SAN

I recently upgraded to SRM 5, and since the upgrade I get an error when trying to run a test recovery.

The error in the vSphere Client is:  Error - Failed to create snapshots of replica devices. SRA command 'testFailoverStart' didn't return a response.

Below are parts of the log file.  Has anyone had a similar issue?  Any idea what I should look at next?  Thanks in advance!

2011-12-07T11:46:44.750-05:00 [01668 verbose 'SysCommandLineWin32' opID=5027BD1B-0000002E] Starting process: "C:\\Program Files (x86)\\VMware\\VMware vCenter Site Recovery Manager\\external\\perl-5.8.8\\bin\\perl.exe" "C:/Program Files (x86)/VMware/VMware vCenter Site Recovery Manager/storage/sra/EqualLogic/command.pl"
2011-12-07T11:46:44.750-05:00 [01668 verbose 'SraCommand' opID=5027BD1B-0000002E] Listening for updates to file 'C:\Windows\TEMP\vmware-SYSTEM\sra-status-75-0'
2011-12-07T11:46:44.750-05:00 [02836 trivia 'Recovery' ctxID=56b66da8 opID=5027BD1B-0000002E] [recovery-plan-2538.failoverOrchJob] Received progress update from replication for group: [dr.replication.ProtectionGroup:protection-group-1115], progress: 45
2011-12-07T11:46:45.469-05:00 [02600 verbose 'PropertyProvider' ctxID=56b66da8] RecordOp ASSIGN: items[3], 52dbc1e7-dc88-8db6-658d-bb99386c8000
2011-12-07T11:46:45.469-05:00 [02600 verbose 'PropertyProvider' ctxID=56b66da8] RecordOp ASSIGN: items[3], 52d8a9a7-a1e7-e773-b4e0-f0d4e00dd643
2011-12-07T11:46:49.156-05:00 [01668 info 'AsyncJump'] (4.406s) ==> '37'
2011-12-07T11:46:49.156-05:00 [01668 verbose 'SraCommand' opID=5027BD1B-0000002E] Stopped listening for updates to file 'C:\Windows\TEMP\vmware-SYSTEM\sra-status-75-0'
2011-12-07T11:46:49.156-05:00 [01668 info 'SraCommand' opID=5027BD1B-0000002E] testFailoverStart exited with exit code 0
2011-12-07T11:46:49.156-05:00 [01668 error 'SraCommand' opID=5027BD1B-0000002E] testFailoverStart exited with no response
2011-12-07T11:46:49.203-05:00 [01668 verbose 'Storage' opID=5027BD1B-0000002E] Releasing read op lock on 'array-pair-1073'
2011-12-07T11:46:49.203-05:00 [01668 verbose 'PerformanceMonitor' opID=5027BD1B-0000002E] Performance monitor Token 0 of lock PersistableRWLock-1074. 'Locked' took 4.672 seconds
2011-12-07T11:46:49.203-05:00 [01668 verbose 'PersistableRWLock' opID=5027BD1B-0000002E] Releasing Read lock 'PersistableRWLock-1074'
2011-12-07T11:46:49.203-05:00 [01660 trivia 'Recovery' ctxID=56b66da8 opID=5027BD1B-0000002E] [recovery-plan-2538.failoverOrchJob] Received progress update from replication for group: [dr.replication.ProtectionGroup:protection-group-1115], progress: 50
2011-12-07T11:46:49.203-05:00 [01668 error 'StorageProvider' opID=5027BD1B-0000002E] Failed to create snapshots of replica devices for group 'protection-group-1115' using array pair 'array-pair-1073': (dr.storage.fault.CommandResponseMissing) {
-->    dynamicType = <unset>,
-->    faultCause = (vmodl.MethodFault) null,
-->    commandName = "testFailoverStart",
-->    msg = "",
--> }
2011-12-07T11:46:49.250-05:00 [01668 verbose 'PersistableRWLock'] Destroying NON persisted released token 0 of lock PersistableRWLock-1074
2011-12-07T11:46:49.250-05:00 [03216 verbose 'PropertyProvider' opID=5027BD1B-0000002E] RecordOp ASSIGN: info.progress, dr.recovery.RecoveryManager.test58
2011-12-07T11:46:49.250-05:00 [03216 verbose 'StorageProvider' opID=5027BD1B-0000002E] StartTest completed
2011-12-07T11:46:49.250-05:00 [03216 verbose 'DatastoreGroupManager'] Enabling datastore group computation
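
For anyone triaging the same failure, here is a minimal sketch of how the relevant lines could be pulled out of a log excerpt like the one above. It is not an SRM tool: the regular expression and the function name `sra_events` are assumptions based only on the line layout visible in this excerpt, so other SRM versions may need a different pattern.

```python
import re

# Pattern for one log line as it appears in the excerpt above, e.g.
# 2011-12-07T11:46:49.156-05:00 [01668 error 'SraCommand' opID=...] testFailoverStart exited with no response
# Assumption: this layout is inferred from the excerpt, not from SRM documentation.
LINE_RE = re.compile(
    r"^(?P<ts>\S+) \[(?P<thread>\d+) (?P<level>\w+) '(?P<component>[^']+)'[^\]]*\] (?P<msg>.*)$"
)


def sra_events(lines, components=("SraCommand", "StorageProvider"), levels=("info", "error")):
    """Yield (timestamp, level, component, message) for the selected components and levels.

    Lines that do not match the layout (such as the '-->' continuation lines) are skipped.
    """
    for line in lines:
        m = LINE_RE.match(line)
        if m and m["component"] in components and m["level"] in levels:
            yield m["ts"], m["level"], m["component"], m["msg"]


# Three lines copied from the excerpt above.
sample = [
    "2011-12-07T11:46:49.156-05:00 [01668 info 'AsyncJump'] (4.406s) ==> '37'",
    "2011-12-07T11:46:49.156-05:00 [01668 info 'SraCommand' opID=5027BD1B-0000002E] testFailoverStart exited with exit code 0",
    "2011-12-07T11:46:49.156-05:00 [01668 error 'SraCommand' opID=5027BD1B-0000002E] testFailoverStart exited with no response",
]

for event in sra_events(sample):
    print(" | ".join(event))
```

Filtering this way makes the odd combination stand out: the SRA process exits with code 0, yet SRM records that it received no response.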

15 Replies
russiamutha
Contributor

Have you checked the EqualLogic console to see if there are any warnings? You may have reached the maximum snapshot reserve on the volume, in which case it will error out.

maytrix0
Enthusiast

Thanks, I didn't think of that.  I have looked and there's nothing in there aside from the logins.

russiamutha
Contributor

Are you using SRA 2.0 for EqualLogic? Did you make any changes to replication/snapshot settings after you created the protection group you're trying to test? If so, try rescanning the volumes inside the SRM console so that it picks up the changes.

SundaleICT
Contributor

Hello all,

I am having exactly the same issue. Is there any update on this? It is a matter of urgency.

Out of curiosity, what is the replication "username"? (Not the password, as that is defined in the SAN Group Manager.)

I notice the SRA setup within SRM requires a "username" for replication.

Could this be the issue? I cannot find any documentation regarding this.

Please advise.

Lance Knight

russiamutha
Contributor

SRA only requires username and password for the Equallogic groups. All replication pairing should be done within Equallogic console.

erikterr
Contributor

I also have this problem, with EMC VNX storage...

tobiashansen
Contributor

I would expect to see an error with "snapshot" in the error text if, on the EqualLogic, you added one of a server's volumes to an existing replication schedule and tested failover before a new replica was created, or if you storage-migrated a protected volume away on the VMware side.

In that situation SRM will still let you protect the VM, because a replicated volume exists that corresponds to your configuration.

The test then fails because, when the replicated volume is mounted, there is no snapshot containing the configured VMware guest.

If this is the cause of your problems, check your protected guests on the protected site and verify where the files for your added hard disks are located.

Verify that replication is working for the Equallogic volumes that are the counterpart for those volumes.

If you migrated away, then unprotect and re-protect to update configuration.

I hope this will point you in the right direction

Regards

Tobias

CorruptedLogic
Contributor

Hi, not sure if you ever resolved this issue or not, but I had to contend with the same problem today on an EQL array. I boiled the issue down to the fact that I had multiple replication jobs running at the same time as I was attempting the recovery plan test. EQL will only allow a maximum of 3 volumes to replicate simultaneously (this appears to apply to the snapshotting mechanism also). I had 3 volumes replicating and 4 queued for replication, and then tried to do the recovery plan test on top of that (I did not ask SRM to replicate the latest changes, just to perform the test with the last good replication set on the remote side). Anyway, very long story short, and patience being a virtue (that I am seldom blessed with): I temporarily disabled my scheduled replications on the EQL and waited for the current and queued replications to finish. I then tried the failover test again and bingo, success.

So I guess the summary is: be aware that EQL will only (seemingly) allow 3 replication/snapshot jobs to run concurrently, and an attempt to test a recovery plan while all 3 slots are taken will result in some bizarre SRM behavior (like SRA errors and failed tests).
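
The check described above could be sketched roughly as follows. This is only an illustration: it does not call any EqualLogic or SRM API, the status strings and the function names are hypothetical, and the limit of 3 is the observed behavior from this post rather than a documented value.

```python
from collections import Counter

# Assumption: 3 is the concurrency limit observed in this thread, not a vendor-documented figure.
OBSERVED_CONCURRENT_LIMIT = 3


def replication_load(states):
    """Count running and queued replications.

    `states` maps a volume name to a hypothetical status string:
    "in-progress", "queued" or "idle".
    """
    counts = Counter(states.values())
    return counts["in-progress"], counts["queued"]


def safe_to_test(states, limit=OBSERVED_CONCURRENT_LIMIT):
    """Return True when at least one slot is free and nothing is waiting for a slot."""
    in_progress, queued = replication_load(states)
    return queued == 0 and in_progress < limit


# The situation described above: 3 volumes replicating, 4 queued.
busy = {f"vol{i}": "in-progress" for i in range(3)}
busy.update({f"vol{i}": "queued" for i in range(3, 7)})

# After disabling the schedules and waiting for everything to finish.
quiet = {name: "idle" for name in busy}

print(safe_to_test(busy))
print(safe_to_test(quiet))
```

The workaround actually used was stricter than this check (wait until nothing is replicating at all), which is the safer choice if you are unsure how the array allocates its slots.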

Hope that helps someone!

maytrix0
Enthusiast

Glad you posted as I forgot about this post.

It turns out the issue is actually with the SRA.  If a replication job is in progress, it will try to use the replica that is still being replicated for cloning.  Obviously this won't work, so it fails.

Just get the latest SRA (2.1), which resolves the problem.  I downloaded it a week or so ago and it worked fine.

CorruptedLogic
Contributor

Thanks for the heads up, the new SRA must've been released a couple of days after I downloaded the old version. I hadn't thought of looking for an updated version yet!

Bucketenator
Enthusiast

The new SRA rev also solved a problem (which will affect everyone) where the array 'snapshot' of each replicated volume that's created for testing purposes was preallocated to 100% of the size of the original volume, and this was drawn from the free space on the array, not from the delegated space.   In effect, for each volume, you required 3x the disk space on the DR array!

Now they seem to be thin provisioned instead ... but why are they taking any disk space at all?? Very disappointing ... this is not the way that other array / SRA combos work (e.g. Clariion).

JD

cantique
VMware Employee

Dear gurus,

Just wondering how much space is required in EQL for SRM to work properly, such as:

1. The original volume, suppose 5TB.

2. Need X% for the replica at the replication partner. Do I really need 200%? (Now total 15TB)

3. Need Y% for the local replication reserve. Need 100%? (Now total 20TB)

4. What about the snapshot space needed for the Test workflow? Should it be included in the X%?

I really need some guidance for planning.

Cantique

admin
Immortal

Below are the storage requirements as I can define them, as I have run many DR tests successfully from SRM using EMC Celerra.

For SRM:

To run an actual DR:

1. Suppose the original (primary) volume is 5GB.

2. At least 5GB for the replica at the replication partner array would suffice. (This is the regular replica, used when you run an actual failover.)

Primary LUN         DR LUN

X%            ==    Y%

Particularly for an SRM DR test:

Primary LUN         DR LUN

X%            ==    2Y%

In our case:

5GB           ==    10GB (5GB + 5GB)

I don't follow you on the points below...

3. Need Y% for the local replication reserve. Need 100%? (Now total 20TB)

4. What about the snapshot space needed for Test workflow? Should it be included in the X%

Bucketenator
Enthusiast

As I mentioned previously, because of the way that the EQL SRA works (and the fact that the EQL cannot create a snapshot of a replica - duh!) you're going to need between 2-3 times the disk space at the DR site relative to production.  For example:

  • Prod:
    • Volume 1 size: 500GB (200GB used).
    • Volume 2 size: 500GB (200GB used).
    • Disk space required:
      • Volumes = 1TB (if not thin).
      • Local snapshot reserve ... required for rapid failback, but not strictly necessary. Size will depend on the rate of change of the data and how long snapshots will be maintained.  Let's say 20%, which gives 200GB.  You can also allow the snapshot reserve to borrow from free space.
      • Total: 1.0 - 1.2TB
  • DR:
    • Delegated space: 2 x replicated volume size (to be absolutely safe, and protect against extended comms outages between PROD & DR).
      • Therefore delegated space in this case = 2TB.
    • Additional space will also be required on the DR array for failover tests since the SRA / EQL array cannot create a traditional COFW snapshot (which is very disappointing). Instead, the EQL creates a 'SmartClone' which in effect creates a thin volume representation of the replica.
    • Disk space required:
      • Delegated space = 2TB.
      • Free space (used transiently during tests) = 400GB.
      • Total: 2.4TB
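
The arithmetic above can be reproduced with a short sketch for other volume layouts. The function name and parameters are made up for illustration, 1TB is treated as 1000GB as in the figures above, and sizing the transient test space to the used data (200GB + 200GB) is an assumption read off the 400GB figure, not a rule from EqualLogic documentation.

```python
def dr_sizing(volumes_gb, snapshot_reserve_pct=20, delegated_factor=2):
    """Estimate disk space in GB at the production and DR sites.

    `volumes_gb` is a list of (provisioned_gb, used_gb) tuples.
    All results are whole GB; integer arithmetic avoids rounding noise.
    """
    provisioned = sum(p for p, _ in volumes_gb)
    used = sum(u for _, u in volumes_gb)

    # Production: the volumes themselves, plus an optional local snapshot reserve.
    prod_min = provisioned
    prod_max = provisioned + provisioned * snapshot_reserve_pct // 100

    # DR: delegated space for the replicas, plus free space used transiently
    # during failover tests (assumed here to equal the used data).
    delegated = delegated_factor * provisioned
    test_free = used

    return {
        "prod_min": prod_min,
        "prod_max": prod_max,
        "dr_delegated": delegated,
        "dr_test_free": test_free,
        "dr_total": delegated + test_free,
    }


# Two 500GB volumes with 200GB used on each, as in the example above.
for key, value in dr_sizing([(500, 200), (500, 200)]).items():
    print(f"{key}: {value} GB")
```

With the two example volumes this gives 1000 to 1200 GB at production and 2000 + 400 = 2400 GB at DR, matching the totals in the list.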

Hope this helps.

JD

cantique
VMware Employee

Thanks JD, this is crystal clear.
