VMware Cloud Community
TaupoJohn
Contributor

IBM V7000 SRA with SRM 5.1

There is a conversation going on in another thread

(http://communities.vmware.com/message/2227065#2227065)

- but it is an old thread and started off with different versions of everything.

I'm going to start a new thread based on current versions.

I cannot get the SRA to work with SRM 5.1

When I run a "test" failover for a simple recovery plan - I get this error:

"Error - Failed to create snapshots of replica devices. Failed to create snapshot of replica consistency group test1rep. SRA command 'testFailoverStart' failed for consistency group 'test1rep'. Volume cannot be created, please check test mdiskgroup cofiguration Refer to IBM SAN Volume Controller troubleshooting"

I'm running:

IBM v7000 Version 6.4.1.2 (build 75.0.1211301000)

vCenter 5.1

SRM 5.1.0-941848 on Windows 2008 R2 64 bit (2vCPU 8GB RAM)

SRM advanced settings
I have set the following Advanced Settings options:
    storage.Commandtimeout to 1800 (def 300)
    storageProvider.hostRescanRepeatCnt to 3 (def 1)
    storageProvider.hostRescanTimeoutSec 1800 (def 300)

SRA v2.1.0.121224 (also tried 2.1.0.121108)
non-preconfigured environment

    Pre-Configured Env. not set
    Test MDisk Group ID set
    SpaceEfficient Mode = False

Is anyone having success with the IBM V7000 SRA (any version) with SRM 5.1?

Thanks

rgds, John B

FrankZo
Contributor

Hi!

Seems to me that you have some kind of configuration error in the SVC area. Within the download of the SRA there is a PDF where the right configuration is explained.

You use non-preconfigured, so you let SRM do the work for you, but did you meet all the prerequisites that are described? Is the user used to connect the array manager an admin?
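
If you want to double-check the role from the CLI, something like this should do it (just a sketch - it assumes you can ssh to the cluster, and which user group the SRA user sits in is what matters):

    lsuser          # shows each user and its usergrp_name
    lsusergrp       # shows the groups and the role of each - the SRA user's group should have the Administrator role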

I'm using SVC and SRM 5.0, on the edge of migrating to 5.1. Everything is working fine right here. Please let me know.

TaupoJohn
Contributor

Hi Frank,

There could very well be an error with the SVC or on the V7000 itself. We are new to the V7000 (and definitely have not set up the SRA for it)

Do you have this SRA working with SRM 5.0?

If so - are you using the same version as me? v2.1.0.121224 or v2.1.0.121108?

Are you using non-preconfigured? You're right - I want SRM to do all the work. I don't see why we should set everything up ahead of time 😉

If it's working for you - then there is a good chance we can get it working with SRM 5.1 - and this will help you 😉

I *think* I've set up the SVC correctly. The user that SRM logs in with is an admin. I just filled in the Test MDisk Group ID (we only have one) and not the other boxes

On the V7000 we set up one small lun to replicate (metro) and this is working fine. We have set up one consistency group - and nothing else.

Do we need to present the replicated lun to the recovery site hosts?

I'm not at work now - but I can upload a file with screen dumps of everything if necessary.

rgds, John B

visak
Contributor

In our setup the recovery site volume was mapped to the ESX servers in the initial config. After configuring the consistency group I noticed that the mapped volumes were removed from the recovery site.

If you run the recovery you will notice that the volume is mapped to ESX automatically. Therefore you do not need to map the recovery site volume to the ESX servers.

Visak

TaupoJohn
Contributor

Thanks for that Visak. (I wasn't sure about this)

Do you have the SRA working?

rgds, John B

TaupoJohn
Contributor

Just to provide more detail:

I really can't find anything else that can be configured on the SRM servers for the SRA.

I have also worked out that the "SAN Volume Controller" = SVC = V7000 as far as the PDF documentation is concerned. This was causing a little confusion too, as we come to grips with the terminology they're using.

The V7000 on each site has only one big Storage Pool - and this contains the volumes. One volume is replicated, to test out SRM.

Is the "Test MDisk Group ID" which is set on each SRM server the same as the Storage Pool name? (If not, then that is my problem - but I don't see what else to use.)

Our name is long (Tier1_600GB_10k) but the documentation uses 0 (zero) as the "Test MDisk Group ID"

One thing to note is - the pool name is different on each site. For some reason they are called:

Tier1_600GB_15k - protected site

Tier1_600GB_10k - recovery site

I have configured the SRM server on each site to use the correct name for its "Test MDisk Group ID".

Our test volume is "consistent synchronized" to the recovery site, and the "copy" on the recovery site is not mapped to any esxi hosts.
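
(If anyone wants to check the same thing from the CLI, something like this should show it - I'm going by the SVC CLI reference here, and test1rep is just our consistency group name:)

    lsrcconsistgrp test1rep      # state should be consistent_synchronized
    lsrcrelationship             # lists each remote copy relationship and which consistency group it belongs to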

I would love to be able to extract the essential commands from the perl scripts (and add my own values for the variables) to test these scripts manually.

If anyone could decipher the perl scripts to manually (by entering values instead of using variables) run a test failover - that would be great!
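
In the meantime, here is my rough guess at what a test failover might look like if run by hand on the SVC CLI. This is only a sketch from the CLI reference - I don't know if these are the exact commands the SRA scripts issue - and the volume/host names are made up:

    # create a space-efficient target volume in the test mdisk group / pool
    mkvdisk -mdiskgrp <test_pool> -iogrp 0 -size <source_size> -unit gb -rsize 2% -autoexpand -name test1_fcsnap
    # create a FlashCopy mapping from the replication target volume to it, then start it
    mkfcmap -source test1_target -target test1_fcsnap -copyrate 0
    startfcmap -prep <fcmap_id>
    # present the snapshot volume to the recovery site ESXi hosts
    mkvdiskhostmap -host <recovery_esxi_host> <vdisk_id_of_test1_fcsnap>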

PS - SRM is set up with a simple protection group containing the one test lun and a test VM on that lun.

rgds, John B

visak
Contributor

Our setup is SRM 5.0 with SRA version v2.1.0.121224, which works fine. I have configured the SRA utility using the non-preconfigured setup on both sites.

If the volume size is more than 2 TB, have a look at the V7000 compatibility matrix for additional notes.

I assume you might have gone through this doc.

http://www03.ibm.com/support/techdocs/atsmastr.nsf/5cb5ed706d254a8186256c71006d2e0a/906fb3333c0b35b0...

Thanks -visak

TaupoJohn
Contributor

Thanks for the link Visak (I had to change the www03 to www). I've seen an older copy of this document - so I am going to run through this tomorrow to see if we've missed anything.

I'm glad to hear you have this working with SRM 5.0 - it really should work with 5.1 then.

I'll let you know...

rgds, John B

TaupoJohn
Contributor

Well, we are making progress. I did two things that let us get a lot further:

- set the search path for the Perl bin folder (as on p62 of "0003 - Implementing DR solutions with IBM Storwize V7000 and VMware Site Recovery Manager.pdf")

- set "Test MDisk Group ID" (for the SRA configuration on the SRM Server) to 0 instead of the actual name of the Pool (there is no such thing as an "MDisk Group ID" visible in the GUI - so I had been assuming they meant the pool name; see the lsmdiskgrp sketch below)
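
(It turns out the pools do have a numeric id on the CLI, which is presumably the "MDisk Group ID" the documentation is after. A quick sketch - output trimmed and only illustrative:)

    lsmdiskgrp -delim :
    # id:name:status:mdisk_count:vdisk_count:capacity:...
    # 0:Tier1_600GB_10k:online:...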

Having done this - now the SRA creates a flash copy (when I run a recovery plan test), but the flash copy map just sits at 0%. So it seems to create the empty snapshot volume, but can't copy the contents of the "real" replicated volume to it.

Then we realised that when we create a Flash Copy Snapshot manually - it does the same thing. We thought that it was finished after the first few seconds when a target volume is created, but it turns out it's just sitting there at 0% when trying to copy into the snapshot volume.
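
(For anyone following along, the mapping state can also be watched from the CLI - a sketch, and the exact columns may vary by code level. Note that a copy_rate of 0 means no background copy, in which case progress staying at 0% is expected:)

    lsfcmap -delim :
    # columns include the mapping id and name, the source/target vdisk names, status, progress and copy_rate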

We've logged a call with IBM, so we'll see if they can advise us what's not right with our SAN. Then maybe the SRA scripts will actually work. The good thing is - after I cancel the test recovery plan (and wait a while for it to actually cancel and finish) - the cleanup script works fine. So it seems that the SRA scripts are actually communicating with the SAN OK.

rgds, John B

visak
Contributor

Please check my SRA utility screenshot and SAN advanced settings configuration. Make sure the remote copy consistency group is fully synced before running the recovery.
You can increase the rate of background FlashCopy copying by right-clicking the relevant FC volume under Copy Services > FlashCopy.
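
The same thing can be done from the CLI if you prefer (just a sketch - the mapping name is a placeholder):

    chfcmap -copyrate 80 <fcmap_name_or_id>   # 0 = no background copy; higher values copy faster in the background
    lsfcmap <fcmap_name_or_id>                # check copy_rate and progress afterwards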

Note:

- I am assuming you are not running any IBM FlashCopy Manager software or anything to do with the LUN resignature setting.

FrankZo
Contributor

Well, that's OK. I'm not using the test functionality, only planned migration - reprotect.

I still have a strange issue with the SRA. In the Devices view you can see all the replicated vdisks. vCenter SRM maps the disks to datastores and with that you can create a protection group. But you can also see all the other disks, which belong to Windows / Unix. Why? Really... filter that out.

Then again, if your datastore at site B resides on an SVC vdisk-id, let's say 23, and there is another disk at the remote SVC with the same vdisk-id but belonging to another system, then SRM will show this disk as a 'vmware' disk as well.

In the picture you see that 1 vCenter volume is mapped to 2 devices. One is the "real" VMware device and the other is, in this case, a replicated Windows disk.

Don't you all have this problem? Or is VMware on a separate SVC cluster? We have one SVC cluster with VMware, Unix, Windows, etc.

TaupoJohn
Contributor

I shouldn't need "protect source volume" and "protect target volume" should I?

I'll try it with space efficient mode true though.

rgds, John B

TaupoJohn
Contributor

We only have VMware hosts on the SANs. Also - we only have one replicated pair of volumes so far.

As soon as we get something working, we'll have more.

rgds, John B

TaupoJohn
Contributor

==== UPDATE ====

OK - we got it to work today, but it seems very slow. It took 14 minutes to run the recovery plan test for 1 VM.

All it had to do was make a flash copy of the replicated 50GB LUN, and set up the VM and bring it up. It seems like this should be a lot quicker. Shouldn't the flash copy be pretty much instantaneous? The VM is a small XP one with a thin provisioned 10GB disk - using 4GB at the moment. Nothing is changing on it - so there is nothing replicating on the LUN.

Here is what I set up for the SRA

[attached screenshot: sra config.jpg]

Also I got these settings from vfovermars on the other discussion ( http://communities.vmware.com/message/2232055#2232055 )

[attached screenshot: settings.jpg]

I had been using 1,3,1800,2. I'm sure we gave it this much time yesterday before we cancelled - so perhaps the "true" for space efficient mode helped, and perhaps the resignatureFailureRetryCount (from 1 to 2) helped.

We have a call logged with IBM, so hopefully will find out if this is how long it should take to do a simple test failover. Then of course we have to try it on bigger LUNs with "real" VMs.

Thanks for all your help (everyone) so far - I will keep you notified of what IBM say.

rgds, John B

vladpmf
Contributor

Hi, with small VMs (small volumes) it works for me too; the problem is when I use bigger volumes, such as 400GB/600GB.

FrankZo
Contributor

Not using the test facility, only planned migration.

What is slow? When you do a planned migration it syncs the disks a couple of times. After the unmount at the A site it will sync; afterwards it's going to rescan the HBAs on site B. I've got around 20 ESXi hosts in that site. It rescans twice. So a planned migration is easily 30 minutes plus.

Graceful shutdown of VMs can take up to 15 minutes at our site (we extended the default time-out).

In the screenshot: a planned migration of 1 VM on a 600GB datastore.

Works like a charm

TaupoJohn
Contributor

Like I said:

All it had to do was make a flash copy of the replicated 50GB LUN, and set up the VM and bring it up. It seems like this should be a lot quicker. Shouldn't the flash copy be pretty much instantaneous? The VM is a small XP one with a thin provisioned 10GB disk - using 4GB at the moment. Nothing is changing on it - so there is nothing replicating on the LUN.

No graceful shutdown, as it's a test failover.

rgds, John B

FrankZo
Contributor

I think that 14 minutes is reasonable. It's a per-datastore operation, so 1 VM will take 14 minutes. When you add up to 10 VMs on that datastore it will still be 14 minutes (OK, starting 9 extra VMs will add some time 🙂).

TaupoJohn
Contributor

Ahh OK, thanks for that Frank. I thought the flash copy was supposed to be really quick.

We're still waiting for IBM to phone back - but maybe this is working OK. If so, then we'll test a bigger LUN next.

(vlad says he has problems with bigger LUNs)

rgds, John B

visak
Contributor

For me it took between 13 and 16 minutes (20 VMs).
