Hi All
We have SRM 5.0 protecting our VMs with EMC VNX array-based replication in asynchronous mode.
To prove our DR capabilities we need to do a full failover (the planned migration option) and run machines live from our DR site for 2 hours.
In preparation for this I have a test LUN that's in its own protection group / recovery plan to play around with, but due to only having 5 spare licenses I can't create many VMs.
One of the questions I have is around the length of time the failover takes, from the VMware / SRM perspective, as the number of protected VMs increases (we have 130). Let's exclude the storage sync requirements for now, as I know that will be an unknown dependent on the rate of change.
I created 2 test VMs on my test SRM LUN, executed a recovery plan to fail the test group over to the 2nd site, re-protected, failed back to primary, and then finally re-protected so I was back in the original state.
I then created an additional 3 VMs, bringing the total to 5, and repeated the same procedure.
Looking at the recovery steps, I noticed that most steps took a similar elapsed time in both runs; powering VMs off and on seems to take roughly the same regardless of whether you have 5 or 50 VMs. The only exception was the "Prepare Protected Site VMs for Migration" step, which increased when the additional VMs were added.
What's actually happening under the hood at this stage? Is there a rough guide to allow you to calculate times, e.g. allow 1 minute per VM, or anything like that?
Also, on a separate note, has anyone done a failover / failback using a VNX array (block Fibre Channel) in async mode using MirrorView? We have spoken to EMC about the failback side of it: does it do deltas or a full resync? We have had 2 conflicting answers back from them. This is an issue for us, as we are doing the failover and failback in 1 weekend; if it's a full resync then we just won't have the time window to resync all the data.
Thanks
Nick
Nick,
There are literally dozens of variables here that could affect your final RTO in this scenario. Most are small variations that quickly grow when scaled out to 100+ workloads. Simple things like how many simultaneous power-on operations are allowed can be affected by vCenter, ESX and SRM versions, but the numbers can swing wildly if you have sequencing dependencies such as multiple recovery plans, VMs in different priority groups and VMs without VMware Tools installed. I know you don't want to hear "your mileage may vary", but the truth is that it will definitely vary, and you're going to have to do some testing with your specific plan to see what your RTO will actually be.
(Multiple little variables x small numbers of minutes) x 100+ machines = potentially large RTOs
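If it helps to put rough numbers on that, below is a quick back-of-the-envelope calculator (a Python sketch; the fixed overhead, per-VM power-on time and concurrency limit are all assumptions for illustration, so substitute figures measured from your own test runs):

import math

# Rough RTO estimator for a single recovery plan. Every number here is an
# ASSUMPTION for illustration; replace with timings from your own tests.
def estimate_rto_minutes(num_vms,
                         fixed_overhead_min=10.0,     # storage promotion, plan setup, etc.
                         per_vm_power_on_min=1.5,     # power on + wait for VMware Tools
                         max_concurrent_power_ops=10):
    # Power-ons run in batches limited by vCenter/ESX concurrency.
    batches = math.ceil(num_vms / max_concurrent_power_ops)
    return fixed_overhead_min + batches * per_vm_power_on_min

for n in (5, 50, 130):
    print(f"{n:>4} VMs -> ~{estimate_rto_minutes(n):.0f} minutes")

The specific numbers don't matter; the point is that the per-VM term is small but multiplies out, and the concurrency limit is exactly one of those version-dependent variables.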
The good news is that you can run the non-disruptive test, and while the storage steps will be different, the power operations should all be the same, meaning you can get a rough idea of what to expect. Just be sure to set the expectation that an actual failover is never going to be exactly the same as an ND test, simply because of the different storage operations involved.
Now, as for your question regarding reversal or re-silvering, the answer again is "it depends". Obviously arrays that are doing synchronous replication have an easier task, because at the point of failover the two SANs contain identical data, so it's possible to simply reverse the direction at the time of failover without worrying about journals and catching up with changed blocks. Async arrays have a harder task because there's no guarantee the second array is consistent with the first. What does one do if, at the time of disaster, the protected side has newer data than the recovery site? Obviously this is a problem! High-end arrays typically use journaling to track updates; however, mid-range arrays (such as a VNX) don't typically have this feature, or it must be licensed.
I have a VNX failover planned for this week, so I may be able to answer the "does VNX re-silver?" question, but my personal recommendation is that whenever you see a mid-range array running in async mode you should prepare the client with the expectation that the array will need to fully re-silver. If the client has a great storage admin who knows how to optimize the replication configuration then you may potentially exceed expectations; however, the more likely scenario is that this is the first time the SAN guy has worked with SRM and there may be "room for improvement".
Just my .02!
Hi
Many thanks for your reply. I did think that might be the case; thankfully all of our VMs have VMware Tools installed, and we have a few P1 VMs, with the rest all at the default P3.
I would be very interested in what you see with the VNX. From the testing we have done, it appears that it only resyncs the deltas. I played around with failing over 5 VMs that had a total of 300 GB of used data and looked at the storage stage of the recovery steps: the initial failover, re-protect, failback and final re-protect all took about the same time, roughly 2 minutes each. Given that we have a 1 Gb link connecting the two sites, there is no way it could do a full sync of 300 GB of data in that small a time window.
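For what it's worth, the arithmetic backs this up. A quick sanity check (a Python sketch; this even assumes the impossible best case of the 1 Gb link running at 100% efficiency):

# How long would a FULL resync of 300 GB take over a 1 Gb/s link?
data_gb = 300                      # used data across the 5 test VMs
link_gbps = 1.0                    # inter-site link at line rate (best case)
seconds = data_gb * 8 / link_gbps  # GB -> gigabits, then divide by line rate
print(f"~{seconds / 60:.0f} minutes")   # ~40 minutes

So even at line rate a full resync would take around 40 minutes, not 2; the 2-minute storage step only makes sense if MirrorView is shipping deltas.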
The only difference between my testing and the actual failover test day is that the business is insisting we cut the links between the 2 data centers to simulate an actual site failure on the day. Obviously during my test failovers the arrays never lose connectivity to each other, so I wonder what happens if the link is lost.
Regards
Nick
Nick,
In the end we wound up executing a forced failover... both the Planned Migration option and the DR option required a pre-sync of the storage, which was hanging at 99%. I won't bore you with the Russian novel of what we were seeing, but suffice to say we didn't solve the issue due to time constraints and instead used the "Force Failover" option, which only executes steps on the recovery side.
Now that we've failed over and used SRM 5.1 to "Re-protect", I can tell you that we are definitely seeing a full re-silver of the original LUN. It's 1024 GB (1 TB) and I would estimate it will take 60-65 hours to complete over a 100 Mb link. This is in line with the original sync operation performed last week, and while I think this is extremely slow, the client and the sales team say it's normal and they're happy.
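For anyone who wants to reproduce that estimate: a full copy of 1 TB over a 100 Mb link comes out in that range once you allow for realistic throughput (a Python sketch; the ~37% effective-efficiency figure is my assumption, just to make the maths explicit):

# Full re-silver of a 1 TB LUN over a 100 Mb/s link.
lun_gb = 1024          # LUN size in GB
link_mbps = 100        # WAN link speed in megabits/sec
efficiency = 0.37      # ASSUMED effective throughput (protocol overhead,
                       # replication scheduling, competing traffic, ...)

hours = (lun_gb * 1024 * 8) / (link_mbps * efficiency) / 3600
print(f"~{hours:.0f} hours")   # ~63 hours

At 100% line rate it would be roughly 23 hours, so the 60-65 hour figure implies an effective throughput somewhere in the 35-40% range.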
