VMware Cloud Community
KENZVMWARE
Contributor

SRM Failover time?

Guys, can someone give me an idea of how long it would take 20 VMs to fail over with SRM? A client is asking.


Accepted Solutions
timantz
Enthusiast

Ken,

In a correctly configured SRM environment, you could see VMs booting at the recovery site somewhere between 10 and 20 minutes after 'hitting the big red button'. Depending on your storage vendor, it can take 5-10 minutes to make the changes to the back-end storage, register the VMs, and finally start to process the virtual machines. SRM will then start them up and make IP changes (if indicated), based on how the recovery plan is configured. A good estimate is 15-25 minutes for VMs to be up and running at the recovery site. Again, this is a ballpark estimate, and your actual mileage may vary.

"There are 10 types of people. Those who understand binary and those who don't."

-Tim Antonowicz @timantz VCDX 112

9 Replies
JeffDrury
Hot Shot

That depends on quite a few things that are beyond the control of SRM. During a failover, SRM relies on the capabilities of the underlying storage to promote the recovery site storage to primary, and this differs from vendor to vendor. It also depends on what is being failed over and what infrastructure needs to be in place before the VMs come up. Will there be an IP address change? Is DNS ready at the recovery site? Is Active Directory ready at the recovery site? Also, during a failover, high-priority VMs are brought up serially, while medium- and low-priority VMs are brought up in parallel, based on the number of ESX hosts you have. If there are 4 ESX hosts at the recovery site, then 4 medium- and low-priority VMs can come up at once. If there is only 1 ESX host at the recovery site, then all 20 VMs will have to come up one at a time.

In a perfectly sterile lab environment with fast storage and no infrastructure considerations, SRM could bring up the VMs in maybe 15 to 20 minutes per VM. Part of the SRM planning process is to determine the RTO you can provide during an actual failure. Usually, business processes and infrastructure changes at the recovery site are the limiting factor, which makes it a question of "how fast can your organization recover" rather than "how fast can SRM recover".
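
The scheduling rule described above (high-priority VMs one at a time, medium- and low-priority VMs in parallel, limited by the number of recovery-site hosts) can be turned into a back-of-the-envelope estimate. The sketch below is only an illustration: the function name, the per-VM time, and the optional storage overhead are hypothetical inputs, not values SRM reports, and the model ignores everything else mentioned in this thread.

```python
import math

def estimate_recovery_minutes(high_priority, other_vms, hosts,
                              per_vm_minutes, storage_minutes=0.0):
    """Rough estimate only; a hypothetical model, not SRM's actual scheduler.

    Assumes high-priority VMs start one after another, and medium/low-priority
    VMs start in waves of one VM per recovery-site host.
    """
    if hosts < 1:
        raise ValueError("need at least one host at the recovery site")
    serial_time = high_priority * per_vm_minutes
    waves = math.ceil(other_vms / hosts)      # e.g. 20 VMs on 4 hosts -> 5 waves
    parallel_time = waves * per_vm_minutes
    return storage_minutes + serial_time + parallel_time

# 20 medium/low-priority VMs at an assumed 2 minutes per VM:
print(estimate_recovery_minutes(0, 20, hosts=4, per_vm_minutes=2))  # 5 waves
print(estimate_recovery_minutes(0, 20, hosts=1, per_vm_minutes=2))  # 20 waves
```

With the same assumed per-VM time, going from one host to four cuts the number of waves from 20 to 5, which is why the host count at the recovery site matters so much for the estimate.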

mullo
Contributor

Hi,

Everything that Jeff suggests is correct, but to give you a rough idea:

I am failing over 15 VMs in 20 minutes (keep in mind this is in test mode). The recovery site is running on an IBM DS4700 with two hosts and no IP/network changes in the recovery.

Hope this at least can give you a rough idea.

Cl3gh0rn
Enthusiast

Hi Kenz,

Jeff's post pretty much says it all. 20 minutes sounds about right, but it will depend on your storage vendor.

Doing a Test Failover will give you a very good indication of how long the actual failover and subsequent failback will take. The longest part of the Recovery Plan always seems to be step 4 of the Recover VM task, "Wait for OS Heartbeat", which just waits for the VMware Tools heartbeat.

There is an interesting post regarding this step relating to ESX 3.5 U3 FYI: http://communities.vmware.com/thread/185949

Hope this is useful.

VSP, VTSP, VCP
Paul_de_Vries
Contributor

Although I will start testing in real datacenters soon, I would like to sharpen this a bit.

Do I understand from these contributions that, depending on the datacenter situation, storage vendor, infrastructure, etc., the failover time (after a 10-20 minute delay from pressing the red button) may range from 15 to 20 minutes per VM to 15-25 minutes for VMs to be up and running at the recovery site?

I assume SRM boots VMs by priority, but also in parallel rather than serially; otherwise, a full site failover with, for example, 100 VMs would take more than 24 hours under the first option.

I'm surprised by the enormous variation in the recovery times stated here.

- Paul

Cl3gh0rn
Enthusiast

Hey Paul,

I think how long it takes depends on a few things. One factor is the SAN you are using and how quickly the SRA scripts promote the secondary (replicated) LUNs to primary at the site you are failing over to. I do not think this will be a major factor anyway.

The 15-25 mins is with reference to how long the entire Recovery Plan will take to complete. The completion of the Recovery Plan being all of your 20 VMs back up and running after being failed over.

So if you had a Recovery Plan with 100 VMs, then based on 20 VMs taking 25 minutes, 100 VMs would take 125 minutes, so only about 2 hours. I have only tested with a small number of VMs so far, though, and I don't know whether the relationship between the number of VMs and recovery time is as direct as this, since you can choose to bring VMs back online in parallel.
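
For reference, the two extremes being compared in this thread work out as follows. This is plain arithmetic on the figures quoted above; the function names are made up for illustration, and neither calculation is a measurement or a claim about how SRM really scales.

```python
def linear_extrapolation(known_vms, known_minutes, target_vms):
    """Scale a measured plan time linearly with VM count (a naive assumption)."""
    return known_minutes / known_vms * target_vms

def fully_serial(per_vm_minutes, vms):
    """Total time if every VM were recovered strictly one after another."""
    return per_vm_minutes * vms

# 20 VMs in 25 minutes, scaled linearly to 100 VMs:
print(linear_extrapolation(20, 25, 100))   # 125.0 minutes

# 100 VMs at 15 minutes each with no parallelism at all:
print(fully_serial(15, 100) / 60)          # 25.0 hours
```

The gap between roughly 2 hours and roughly 25 hours comes entirely from whether the quoted figure is read as a per-plan time or a per-VM time.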

Cl

VSP, VTSP, VCP
JeffDrury
Hot Shot

Paul,

I would reiterate the point that the question should not be "how fast can SRM recover?" but "how fast can the business recover?". If you take the simple example of a multi-tiered application that requires a DB, an application server, and a web server, recovery is a process requiring several steps. SRM could bring all of these VMs up at the same time, but the application would not work, because the app and web servers require the DB to be functional before they are started. In a real-world scenario you would want SRM to bring up the DB server and place a call-out for the proper people from the IT organization to verify that the DB is up and functional, and that all required resources like AD and DNS are working correctly, before moving to the next step of recovering the app server, and so on. The recovery process that each organization needs to follow can vary depending on its specific requirements, so RTO and RPO will be different for each organization that implements SRM. The recovery plans in SRM document the process and ensure that it is followed for a successful recovery. I don't think it is reasonable to assign an RTO before you have gone through the process of creating the recovery plans and understanding the organization's recovery needs.

Yes, SRM could bring everything up in as little as 30 minutes, but that very likely would not result in a successful recovery.
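
The ordering problem in the multi-tier example can be sketched as a dependency graph. The snippet below uses Python's standard `graphlib` module (Python 3.9+) to derive a start order; the VM names are hypothetical, and this only illustrates the reasoning. It is not how SRM recovery plans are defined, and it leaves out the manual verification steps between tiers.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical VM names; each key maps to the VMs it depends on.
dependencies = {
    "db-server": set(),
    "app-server": {"db-server"},
    "web-server": {"app-server"},
}

def recovery_order(deps):
    """Return a start order in which every VM comes after its dependencies."""
    return list(TopologicalSorter(deps).static_order())

print(recovery_order(dependencies))
```

Each step in the resulting order is a natural place for the "verify before moving on" checkpoint described above, which is where most of the real recovery time tends to go.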

depping
Leadership

Anywhere between 10 minutes and 5 hours. It all depends on what kind of storage you are using, how many LUNs and VMs you have, etc. I've configured it several times, and for HP with anywhere between 10 and 50 VMs it took 20 to 25 minutes. For EMC DMX4 with the SRDF SRA, the same number of VMs took me almost an hour. Like I said, it really depends...

Duncan

VMware Communities User Moderator

Paul_de_Vries
Contributor

Guys,

I'm glad I put some emphasis on the differences in failover time, as your last detailed contributions give me (and hopefully other people too) much more insight into the aspects that affect it.

And indeed, business recovery is much more than just the technically successful failover of a bunch of VMs.

Tx,

- Paul
