VMware Cloud Community
Baoth
Enthusiast

SRM testing with a full network fail over / disconnection

Hello

I am working for a client that is performing network failover testing as part of an ongoing DR project. They currently use SRM to protect the primary site (well, a subset of VMs within the site), and have had no issues with an SRM test itself.

However, the project team has been told that if network connectivity between the primary and DR sites were cut, SRM couldn't be used to bring the VMs up at the recovery site. The stated reason is that once connectivity is restored, the SRM databases would have issues because both sites would believe they are the protected site. I presume this is because both databases would have changed in the meantime.

Is there any advice on how to approach a full / proper DR test where the network is disconnected, SRM is used to bring up the protected VMs at the DR site, and everything then plays happily when network connectivity is restored between the sites, please?

Might be worth noting that we are using VMware vSphere 6.0.0, SRM is 6.1.2.1, and during the DR test the primary site will continue to operate with skeleton staff on site.

Thanks

Finikiez
Champion

How are the two sites used? Are they both active, or are VMs running only on the protected site?

If you break the network connectivity between the sites then you have only one option for failing over to the recovery site: run a disaster recovery in SRM.

However, you will end up with two sets of working VMs in this scenario, one on the protected site and one on the recovery site.

So I doubt that you want to do this.

I guess SRM is not the tool to use when you break the network between sites.

Baoth
Enthusiast

Hi Finikiez

Thanks for the reply.

Yes, VMs are running only in the protected site.

I think you are right.

What is your opinion on performing a planned migration prior to breaking the network link? Once the network link is restored, would a planned migration back keep everything ticking over nicely and not break the SRM configuration?

Thanks again.

Finikiez
Champion

Baoth wrote: "What is your opinion on performing a planned migration prior to breaking the network link? Once the network link is restored, would a planned migration back keep everything ticking over nicely and not break the SRM configuration?"

My opinion is that this is the only thing you should do.

Or just do nothing ;) because splitting the network between sites may simply break storage replication (if you replicate storage over Ethernet, as with NetApp) and generate some alarms about remote site availability.

ThompsG
Virtuoso

Hi there,

Sorry a little late to the party but running a failover while the network is offline between the two SRM servers will not break SRM.

With SRM 5.5 and before you could select an option when doing a Planned Recovery to change this to Forced Recovery. This was used in the event that something catastrophic had happened to your protected site and you needed to get things running on the recovery site. This would perform the failover but not run any of the operations at the protected site, i.e. power off VMs, etc.

This has changed slightly with SRM 5.8.x and above, but it is still possible. Obviously the previous option displayed through the GUI was too easy for somebody to make a mistake, so a Forced Recovery now requires an advanced option to be set to put SRM in Disaster Recovery mode. This means you know what you are doing and really want to proceed ;)

Anywho - to get back on topic: once you have run the Forced Recovery and have got your Protected Site back online, with VMs shut down and replication sorted, to put SRM back to a normal Failover state you simply run the Recovery Plan again without the Forced option. As communication is available between both SRM servers at this point, this will check the Protected Site, realise the VMs are powered down, check the array replication state, work out that it is failed over and then finish successfully. Well... that is the glossy brochure.
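
As a rough illustration of that re-run, here is a toy Python model of the checks described above. It is only a sketch: the class, field and function names are made up for this post, and it is not SRM's API or its actual internal logic.

```python
from dataclasses import dataclass


@dataclass
class ProtectedSiteState:
    """Hypothetical snapshot of the Protected Site once the link is back."""
    reachable: bool                # the two SRM servers can talk again
    vms_powered_off: bool          # the original VMs have been shut down
    replication_failed_over: bool  # the array reports a failed-over state


def rerun_outcome(state: ProtectedSiteState) -> str:
    """Mirror the order of checks described above for the second plan run."""
    if not state.reachable:
        return "blocked: SRM servers cannot communicate"
    if not state.vms_powered_off:
        return "blocked: shut down the Protected Site VMs first"
    if not state.replication_failed_over:
        return "blocked: sort out array replication first"
    return "ok: plan should finish and show a normal Failover state"


if __name__ == "__main__":
    print(rerun_outcome(ProtectedSiteState(True, True, True)))
```

The point of the ordering is simply that each prerequisite (link restored, VMs down, replication sorted) has to hold before the second run has a chance of finishing cleanly.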

Read here for more information on this process: Running a Recovery with Forced Recovery

I would be at Code Brown if required to do this, but it can be done :)

The best option, with the least risk of data loss, is a planned failover with both sites healthy. My current employer does this once a year: we fail over to our Recovery Site, run there for a week and then fail back.


The scenario you are describing should ONLY be used in a disaster scenario so the business would need to accept some data loss but it can be done and SRM will not be affected.

NOTE: As Finikiez said, I would NOT be doing this in a DR test scenario unless the business wants to lose data. Deliberately creating a split-brain scenario for your arrays is potentially a career limiting move.

Does this help?

Finikiez
Champion

Good point, I completely forgot about this option :)

Baoth
Enthusiast

Hi ThompsG

Thanks for the info and link :)

I think it does help, yes.

So in my scenario, the added complication is that any VMs that are protected at the primary site need to remain on and working during the DR testing, which is probably why VMware tech support are telling us that SRM will break once the sites are reconnected.

At this stage, they are planning to test that the network failover works properly, and want to use SRM to test what they are protecting during the agreed window for the network testing.

Thinking about this a bit more after your post, it sounds like it would be a much more sensible approach to test these separately - so network failover in November for example, and if that all goes smoothly, look at testing SRM in December.

Just to check my understanding, I could pose the following approach:

  • Run a Planned Failover while both sites are healthy. This means no data loss?
  • Test everything works for an agreed period of time
  • Fail back once testing is complete

This wouldn't cause SRM to have any issues, but if there was a problem, VMware Support would still help if called up.

Cheers

ThompsG
Virtuoso

Hi there,

Okay, I think I'm understanding more of what is being proposed, and yes, this is not the intended use of SRM. You are right, this will be why VMware support are apprehensive :)

Sorry to repeat but want to be clear. The idea is to have the production workloads available at the Recovery Site while still running the production workloads at the Protected Site - think this part is clear. The unclear part is what is the "network failover" testing which is taking place? It cannot be the core switching/routing at the Protected Site otherwise the VMs there would lose network connectivity?

The reason for asking is that, as I'm sure you are aware, you can run a "bubble" test with SRM. The issue here is that this creates multiple isolated networks, so VMs on the same host may be able to talk to each other, but traffic across hosts will fail. This causes issues with multi-tier applications because you cannot confirm that everything worked. To work around this, we created a network that spans our DR hosts (vlan_998) and configured SRM to use this network in a Test failover. This network cannot route outside itself, so it is isolated to the VMs connected to it and should therefore only be used while testing SRM plans. To allow routing between guests on this network, we have a virtual router with the gateway addresses configured on it, which lets the different VMs in the bubble test talk to each other.
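
To illustrate the virtual router idea, here is a minimal Python sketch that checks whether every test VM has a gateway on the router. The subnets, addresses and function name are invented for the example; they are not taken from our environment.

```python
import ipaddress

# Hypothetical gateway interfaces configured on the virtual router in vlan_998
ROUTER_INTERFACES = ["10.10.1.1/24", "10.10.2.1/24"]


def missing_gateways(vm_ips, router_interfaces=ROUTER_INTERFACES):
    """Return the VM addresses that fall outside every routed subnet."""
    networks = [ipaddress.ip_interface(i).network for i in router_interfaces]
    return [
        ip for ip in vm_ips
        if not any(ipaddress.ip_address(ip) in net for net in networks)
    ]


if __name__ == "__main__":
    # The third VM sits in a subnet with no gateway on the router,
    # so it could not reach the other tiers inside the bubble.
    print(missing_gateways(["10.10.1.20", "10.10.2.30", "10.10.3.5"]))
```

Any address returned by the check belongs to a VM that would be stranded in the bubble, which is exactly the multi-tier problem the shared test network is meant to avoid.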

Baoth wrote:
"Just to check my understanding, I could pose the following approach:

  • Run a Planned Failover while both sites are healthy. This means no data loss?
  • Test everything works for an agreed period of time
  • Fail back once testing is complete

This wouldn't cause SRM to have any issues, but if there was a problem, VMware Support would still help if called up."

Correct - this should cause no data loss as long as your arrays are replicating happily. This is the approach we take with a slight modification but this might not suit your particular environment.

We have a layer-2 network which is stretched between our Protected and Recovery sites, with the routing engine running from the Protected site. During a normal production day our internet feed and branch offices (and global MPLS) all terminate at the Protected site. Leading up to the DR testing failover (when we run a planned failover), the branch networks and routing engine are moved to the Recovery site. This allows us to confirm we have a stable network for running the planned recovery.

Due to the way our DMZ works, we move the internet feed and DMZ networks on the night. There is a pause during the Planned Failover (after the VMs power down) where we swing these networks, and then it's full steam ahead again with SRM.

This fits with your suggestion of a separate network failover from the SRM one and makes sense. We essentially do this even though they run in the same week, i.e. we fail over the Branch WAN and routing engine on the Friday and then the VMs on Saturday night. They are two separate exercises, but both are required to complete the task :)

Kind regards.

Baoth
Enthusiast

Hello all

Thanks for the help on this, and apologies for not getting back. I've marked one of the answers as correct, but there were a few that helped me.

Off the back of the advice, and the fact that SRM isn't really built to do what my client wanted to try and do, I managed to talk them round to another way of looking at the situation.

Thanks again

Paul