Solved: Re: Failback steps for recovered VMs

cswaters1 · ‎03-31-2009

Hi,

I've just been reading the admin guide and some 3rd party vendor documents on reconfiguring SRM for failback for use in a production design.

I notice that there are steps in the admin guide to shut down all virtual machines that were recovered to the recovery site as part of a completed failover (see attachment). The steps guide you to remove the placeholder VM files on the array and delete any files/directories in vCenter at the recovery site that contain VM configuration files created during the protection group creation.

This would mean that if you were performing an 'assisted failback' you would need a system outage for all recovered VMs for the time it takes to complete all reconfiguration tasks and perform a test failback? Is this correct, or is this just a cut and paste from the evaluation guide, where we don't care about system outages as it's an eval?

We use EMC CLARiiONs for storage and I've reviewed the H5583-VMware_Site_Recovery_Manager_with_EMC_CLARiiON_CX3_and_MirrorViewS_Implementation (see attachment) - this document doesn't mention any of these steps at all.

Can anyone who has successfully completed a failback comment?

Surely the placeholder VM files on the storage array and any folders/directories in vCenter can be removed after the failback is completed? That way all changes can be made while the recovery VMs are up and running, and no downtime will be experienced by the business - isn't that the outcome you would want in this situation?

Look forward to your opinion on this subject.

Craig.

Craig Waters | vExpert | Melbourne VMware User Group Leader | website: craigwaters.org | twitter: @cswaters1

cswaters1 · ‎04-03-2009

No comments then? Come on guys, give us some direction here, help out a budding architect (spare any change)?

Smoggy · ‎04-03-2009

Hi Craig,

I am sure your lack of replies was a Friday afternoon-ism.

Your basic understanding of the failback process is correct: during failback, when the storage replication is being reversed, you will need to bring down the VMs that you want to fail back. Usually you would not do this en masse, since you would fail back certain groups of LUNs at a time in a pre-defined order, probably the same order you used to fail over, as that usually represents the "priority" of the VMs to your business.

I need an example to explain, so let's say we start with Site1 (Protected) and LUNA replicated to LUNA-REP at Site2.

When we perform a "Test" failover in SRM we will create (or utilize) an array-based snapshot as the storage to boot the recovered VMs from.

When we perform a "Real" failover in SRM the SRA will usually split LUNA-REP from LUNA, present LUNA-REP as read/write to the ESX hosts at Site2, and fail over.

This means that when you fail back, the storage array has to quiesce / lock LUNA-REP and copy its changes back to LUNA at Site1. On nearly all storage arrays I can think of, ensuring consistency means you need to shut down the VMs running on LUNA-REP so that the array replication software can re-sync all the changed data back to the original site. At this time LUNA-REP will be made read-only to the ESX hosts.

The amount of time taken to copy the changes back to the original site depends on the array type and its replication software. It is also a function of how long the source and destination LUNs have been "apart", so to speak. If the LUNs have been "apart" for a short time and the rate of change of data is relatively small, many arrays are able to do incremental resyncs, where only changed tracks are copied back, which is much faster than doing a full LUN copy.
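
To put rough numbers on the incremental versus full resync point, here is a back-of-the-envelope sketch. The function name, its parameters and the figures in the usage note are all made up for illustration; real arrays track changes at their own granularity and the usable link rate will vary, so treat this as an estimate only.

```python
def resync_hours(lun_gb, change_gb_per_hour, hours_apart, link_gb_per_hour,
                 incremental=True):
    """Rough estimate of how long a resync back to the original site takes.

    All inputs are hypothetical planning figures, not values read from an array.
    """
    if incremental:
        # Only changed data is copied; it can never exceed the LUN size.
        to_copy_gb = min(lun_gb, change_gb_per_hour * hours_apart)
    else:
        # A full resync copies the whole LUN regardless of what changed.
        to_copy_gb = lun_gb
    return to_copy_gb / link_gb_per_hour
```

For example, a 500 GB LUN that has been apart for 48 hours at 2 GB/hour of change, over a link that moves 100 GB/hour, has about 96 GB to copy incrementally against 500 GB for a full copy. The outage window for the VMs on that LUN scales accordingly.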

During failback, the amount of time it takes to clean up placeholders etc. shouldn't be that big a deal since you can easily script that. One thing I always do in my designs is have a non-replicated datastore at each "recovery" site that holds ALL my placeholder VMs. For me this makes more sense than spreading them around, for 2 reasons:

1. I know where they ALL are for troubleshooting / cleanup (with scripts)

2. It stops any other admin deleting them by mistake if they stumble across them in a datastore browser window, for example, and think "hmm, a folder with 3 small VM config files in and no vmdks, must be an old VM we forgot to clean up properly, I'll delete that."
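
As an illustration of the kind of cleanup script mentioned above, here is a minimal sketch that only reports candidate placeholder folders and deletes nothing. It assumes the dedicated placeholder datastore is reachable as an ordinary directory path, and it uses the "config files but no vmdk" description above as its heuristic; the function name is hypothetical and this is not an SRM or vCenter API.

```python
from pathlib import Path


def find_placeholder_dirs(datastore_root):
    """List VM folders that contain a .vmx file but no .vmdk file.

    Assumption: datastore_root is a filesystem path to the dedicated,
    non-replicated placeholder datastore. This is a dry run: it reports
    folder names and removes nothing.
    """
    found = []
    for vm_dir in sorted(p for p in Path(datastore_root).iterdir() if p.is_dir()):
        suffixes = {f.suffix.lower() for f in vm_dir.iterdir() if f.is_file()}
        if ".vmx" in suffixes and ".vmdk" not in suffixes:
            found.append(vm_dir.name)
    return found
```

Keeping the script to a report first, and reviewing the list before removing anything, fits the second reason above: a folder with no disks is easy to mistake for something else.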

Hope this helps,

Lee

PS: will be out (and not online) most of next week...so if you have any follow on questions apologies in advance if my reply is delayed.

cswaters1 · ‎04-05-2009

Thanks for your valued comments Lee, your reputation precedes you!

I think this is a MirrorView 'feature' we are discussing here. I hope an EMC representative can step in and confirm this, as the details I include may not be accurate, but here goes:

Let's say we have primary LUNs on Site A and Secondary replicated LUNs on Site B.

With MirrorView and SRA/SRM, when a controlled failover occurs (i.e. both sites available, not a real disaster, no forced / local promotes occur) MirrorView automatically fractures all LUNs associated with the protection group and then promotes the Secondary LUNs on Site B into Primary LUNs. At the same time the Primary LUNs on site A are demoted to Secondary LUNs.
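
The role swap described above can be pictured with a toy model. This is not the MirrorView or SRA interface, only a hypothetical dictionary of site roles, and it deliberately ignores the fracture and resync details.

```python
def controlled_failover(roles):
    """Swap mirror roles: the secondary is promoted, the primary is demoted.

    roles maps a site name to "primary" or "secondary". Toy model only;
    the real promote/demote is carried out by the array software.
    """
    swap = {"primary": "secondary", "secondary": "primary"}
    return {site: swap[role] for site, role in roles.items()}


before = {"Site A": "primary", "Site B": "secondary"}
after_failover = controlled_failover(before)
after_failback = controlled_failover(after_failover)
```

In this model, applying the same operation a second time returns the roles to where they started, which is the property the failback argument in this post relies on.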

This means that to recover from the above using failback, no manual reconfiguration is required to reverse synchronisation on the SAN. In fact, all that needs to be done is the usual SRM clean-up prior to failback. (Why don't EMC release a vCenter plugin to do this? Actually I think they are about to, and there is one for Celerra if you have access; knowing when it's available or getting hold of it is something else though.) The clean-up steps are:

1. Remove recovery plans from site B

2. Remove protection groups from site A

3. Remove placeholder VMs from non-replicated LUN on site B (does this also become optional as these will be reused in future?? - any suggestions?)

4. Recreate protection groups, inventory mappings and recovery plans in reverse (Site B is the protected site, Site A is the recovery site)

(The above high-level steps do not include snapshot reconfiguration for testing the failover.)
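
The steps above can be sketched as a small checklist generator, which may be handy for a runbook. The function and its parameters are hypothetical; it only builds text and does not talk to SRM or vCenter. Placeholder removal is left as a switch because the post itself is unsure whether that step is optional.

```python
def failback_prep_steps(protected_site, recovery_site, remove_placeholders=True):
    """Return the SRM clean-up steps before failback, in order.

    protected_site is the original protected site, recovery_site is where
    the VMs are currently running after the failover.
    """
    steps = [
        f"Remove recovery plans from {recovery_site}",
        f"Remove protection groups from {protected_site}",
    ]
    if remove_placeholders:
        steps.append(
            f"Remove placeholder VMs from the non-replicated LUN on {recovery_site}"
        )
    steps.append(
        "Recreate protection groups, inventory mappings and recovery plans "
        f"in reverse ({recovery_site} protected, {protected_site} recovery)"
    )
    return steps


for number, step in enumerate(failback_prep_steps("Site A", "Site B"), start=1):
    print(f"{number}. {step}")
```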

So, when the second failover (failback) occurs, the only time the protected VMs are shut down is as part of the recovery plan.

I think that's how it would work. VMware are covering all types of replication in their document (as Lee mentions, some replication solutions work differently to others).

Again, if anyone can contribute to this discussion I would be really grateful!

Thanks,

Craig.

bladeraptor · ‎04-06-2009

Hi

I am writing as an EMC Employee

As you suggest, in the scenario where we have a light site to dark site to light site requirement, the promote feature of MirrorView could indeed be used to fail over and then fail back an SRM environment.

As you are no doubt aware, SRM is a broad framework designed to allow storage vendors and other solution providers to interoperate within a common framework.

SRM v1 as a framework was designed to achieve some very specific criteria, and to do it on a broad basis to allow as many storage vendors on VMware's HCL as possible to participate with their relevant replication solutions.

Due to differences in implementation and sophistication, not all vendors have the components in place to be able to fail over and fail back. As the overriding goal in the first instance was to provide a broad, open framework, failback (in the scenario where the production array remained available) was not considered a mandatory element of recovering from a 'smoking hole in the ground' genuine disaster.

If you are aware of the EMC portfolio you may know that we offer various geo-stretched clustering solutions such as MirrorView Cluster Enabler. This allows the user to fail a cluster between a node located on Site A (say London) and Site B (say New York) and then fail back again. This implementation demonstrates the ability of MirrorView to demote and promote the mirrors as you suggest. In the case of failback from a full failover, reconfiguration of the snapshots is not necessary, as we suggest that both the production and the secondary array have snapshots configured, and snapshots are not involved in a full failback as opposed to a test.

So the ability to do as you suggest with promoting and demoting the mirrors exists now.

You are correct in that there is a Celerra failback vCenter plug-in which allows the user to failback selected or all failed over SRM Celerra Sessions. This was architected by the same team that wrote the SRM plug-in. Basically having written the scripts to fail the environment over, the same group then wrote the scripts to fail it back the other way and clean up the environment.

Now it cannot be emphasized enough that this is a purely EMC development and is not a precursor on EMC's part to a wider VMware-driven SRM failback scenario. VMware, as I understand it (and Lee can comment much more authoritatively), is defining the SRM failback framework as we speak, and my understanding is that it will be broader and deeper than the current EMC articulation.

The logic behind the Celerra failback wizard is being ported to other EMC SRM failback solutions. I have no access to confirmed timescales, but the commitment from my boss, Chad Sakac, is there.

When it does appear, it will ideally work as you suggest and allow a largely painless failover and failback of an SRM environment.

Note, however, that due to the way in which protection groups are created, they are not automatically recreated upon failback and must be recreated manually.

Recovery plans, however, do not need to be recreated - the new protection groups can simply be added back into the existing recovery plans.

I hope this helps

Kind regards

Alex Tanner

cswaters1 · ‎04-06-2009

Thank you for joining this discussion Alex, again your comments and insight are most welcome.

I will be incorporating your comments around snapshots and protection groups into my design (thanks again - this will greatly reduce the manual steps required during a failback reconfiguration of SRM).

I have to ask, is it possible you could approach Chad (virtual geek) and ask the following (maybe you could bring this post to his attention?):

1. Release date / availability of the CLARiiON failback vCenter (VI3) snap-in - (I understand if an exact date cannot be given, but I'd like to know if it will be within the timeframe of my current SRM design - a 3-6 month lifecycle to full implementation).

2. EMC Storage Viewer vCenter snap-in - (I'll be honest, I created another post which I was hoping to get a bite with, but seeing as I have an audience I'll try here.) I cannot beg, steal or borrow to get this utility... Chad even stated that it is free to all EMC customers (that's me), yet I've been pushed back by account managers, technical specialists and other EMC employees (admittedly I am in the backwaters of Australia and this kind of stuff is probably still only just becoming available in the US, but still...).

Can anyone from VMware comment on a time frame for the next release of SRM and what that may introduce (I have to ask...bump ).

Look forward to your reply Alex, if anyone else can contribute please do not hesitate!

Craig.

bladeraptor · ‎04-07-2009

Hi Craig,

You are most welcome. Both Lee and I are based in the UK - so we can understand the backwater comments :]

I am afraid at this time I cannot give you a firm date, but I will escalate with Chad, try to get something within a month timeframe, and get back to you.

As for the EMC Storage Viewer, it is available now on PowerLink in the following section:

Home > Support > Technical Documentation and Advisories > White Papers > Configuration/Administration

The White Paper is titled White Paper: Using EMC Storage Viewer for Virtual Infrastructure Client - A Detailed Review

I will email you privately the details of your local EMC VMware specialist and please let me know if you don't get any joy from that route

I have worked extensively - in my lab and at VMworld US 2008 - with failing over and failing back a CLARiiON SRM environment, and it works well. The SRM failback tool for CLARiiON should simplify this process by automating many of the tasks which are done manually now.

Many thanks

Alex Tanner

depping · ‎04-07-2009

1. Can't comment on this one.

2. http://virtualgeek.typepad.com/virtual_geek/2009/04/where-to-get-the-emc-storage-viewer-vcenter-plug...

3. Reach out to Chad via the link in 2). It's his blog and he can answer your questions for sure.

4. No, I can't comment on the availability of the next version of SRM and its new features. No one can and/or will, I guess...

Duncan

VMware Communities User Moderator

Blogging: http://www.yellow-bricks.com

cswaters1 · ‎04-07-2009

I really appreciate all your feedback, it's good to know there is a support network out there and we are not alone...I'm from the UK originally, but moved out here about 8 years ago, I love the lifestyle, but it can get frustrating when you know more than the vendors regarding their own products (no offence, none taken... :smileysilly: )

I've already posted a thanks on Chad's site for getting the EMC storage viewer out to the masses (good work Fella!). I look forward to your reply Alex in regard to the timeframe for the availability of the CLARiiON SRM Failback Plugin - thanks again for responding to this post, I really appreciate it!

Regards,

Craig.

cswaters1 · ‎04-13-2009

Alex, just wondering if you have spoken with Chad and had some feedback on the availability of the SRM Failback vCenter plugin for CLARiiON?

I'd like to close this post and award you the correct answer.

Regards,

Craig.

bladeraptor · ‎04-14-2009

Hi Craig

The word from Chad is that the target date at this point, but subject to change, is June.

This is in no way a commitment to make the functionality available at that time - but the engineering teams are looking at that sort of time frame.

Hope that helps

Kind regards

Alex Tanner

cswaters1 · ‎04-14-2009

Just marking this post as answered, thanks for all your comments and feedback!

Craig.
