My management and network team want to run a TEST w...

brucericker · ‎09-16-2009

My management and network team want to run a TEST with the native VLANs, not the test bubble.

I'm assuming that the secondary site VC needs to talk to the primary site VC. Is this correct?

If so, they are suggesting I put the primary VC in its own VLAN and re-IP it, so that when the network to Phoenix is completely cut I can still run my recovery plans from the Denver site and bring up all of the VMs on their normal prod VLANs.

What do you think?

In addition, the primary SRM server is also the production Virtual Center and the license server.

Is re-IPing it the absolute wrong thing to do because of all of the implications? Or do we even need access to the prod VC to run a recovery plan in Denver if the link to Phoenix is down?

brucericker · ‎09-16-2009

According to the following, I do not actually have to be connected to the protected site at all when TESTing a recovery plan. Is this correct? This is from the admin guide:

Test a Recovery Plan

You can run frequent tests, which simulate an actual recovery. You can run test recoveries and edit the recovery plan to fix any problems when you run the tests. SRM runs exactly the same plan that is run for both tests and actual recoveries with the following exceptions:

Recovery tests do not connect to the protected site and shut down virtual machines.

Recovery tests create test networks so that the infrastructure of the protection and recovery site is protected. The test network is removed after the test is completed. This action ensures that the infrastructure of both sites is protected. You can, however, select an actual network to test recovery.

The virtual machines in the recovery site typically start from a datastore that is cloned from the target datastore in the recovery site to ensure that the test is run against a storage infrastructure that is isolated from the production environment.

To test a recovery plan, the following conditions must be in place:

The VI Client must be connected to the recovery site.

The role of Recovery Plans Administrator.

CrisRobinson · ‎09-17-2009

There is no need to re-IP the VC server. The VC server can stay on its normal subnet. For the test, the VMs will be re-IP'd and belong to a new network on the recovery side. VC and SRM will always have visibility to the VMs because they don't need the physical or virtual network to manage the VMs; they rely on the VMkernel and service console connections for those communications.

Am I missing anything in the reasoning?

Thanks!

brucericker · ‎09-17-2009

Cris

Thanks for your response

Here is the scenario

Site A - Primary site - Phoenix AZ

adccluster

2 blade chassis c7000, 8 esx hosts each (16 total)

All prod VM's

Adcutil26 is the VC, primary SRM, and license server; it's a virtual

Site B - Secondary Site - Denver

ddccluster

1 blade chassis c7000, 8 esx hosts total

All DEV/TEST vm's

Ddcutil26 is the VC and secondary SRM; it's a physical

The network group will be blocking off the ADC in Phoenix via ACLs, which is why I was thinking about re-IP'ing the VC. Its network will be brought up in Site B, as will all of the other prod networks. We are TESTing a recovery plan with the production networks.

What are the SRM ramifications of the primary site being blocked off completely?

What happens when I SRM the primary VC to the secondary site?

I believe I can TEST a recovery plan, without needing to contact the primary SRM VC, so do I even need to SRM the primary VC over?

Gracias

-Bruce

CrisRobinson · ‎09-17-2009

I see...

One thing to be careful of is that during a "Test", as opposed to a "Failover", the replication continues. Depending on the array replication you are using, be sure that you either keep that VLAN up or suspend the replication.

Next, I see why the network folks want to make the ACL change; I have been down that road myself. Keep in mind that whatever the network guys do, the ESX servers need to remain visible across the network to the VC\SRM server(s). I highly doubt that you want to re-IP the ESX hosts. I also suspect that there are other applications on physical servers being tested that the VMs need to talk to as well in an isolated environment.

What I suggest doing is this: dual-home the VC. The VC absolutely needs to talk to the ESX hosts, period. Have the network guys create a VLAN or ACL to keep that connection alive. That way the VC can talk to the cluster and you will have admin access to the VC via RDP.

"What are the SRM ramifications of the primary site being blocked off completely?" None other than SAN array replication, as stated before.

"What happens when I SRM the primary VC to the secondary site?" They go into a disconnected state. When you switch back they should re-establish reciprocity.

"I believe I can TEST a recovery plan, without needing to contact the primary SRM VC, so do I even need to SRM the primary VC over?" Correct, you would NOT SRM the primary VC. Let it be.

I hope this helps!

JeffDrury · ‎09-17-2009

Bruce,

There is no need to recover your primary VC at the disaster site. SRM requires a VC server at each site, so you do not have to bring up your VC server in a new location, possibly with a new IP. I would recommend you leave your VC servers out of your recovery plans. If your recovery site cannot see the primary site during the test, it will still bring up VMs and allow you to test functionality, all without affecting the production VMs at your primary site.

CrisRobinson · ‎09-17-2009

Exactly! Don't even include the VC in a recovery plan. In fact, put it on a NON-replicated datastore if possible or set that VM to not power up.

brucericker · ‎09-17-2009

>One thing to be careful of is that during a "Test" as opposed to a "Failover" the replication continues so you want to be sure that depending on the Array replication you are using that you either keep that VLAN up or suspend the replication.

Which VLAN? The SAN replication VLAN?

>Keep in mind that whatever the network guys do you need to be able to have the ESX servers visible across the network to the VC\SRM server(s).

Which ESX servers need to be visible over the network? Does production in Site A need to be seen by the VC in Site B?

>I highly doubt that you want to re-IP the ESX hosts.

Correct, we don't want this.

>I also suspect that there are other applications on physical servers that are being tested that the VM's need to talk to them as well in an isolated environment.

This is correct. The TEST we are running will use all of the production networks from Site A, but run them in Site B.

>The VC absolutely needs to talk to the host ESX servers, period.

Does the primary VC need to continue to talk to all of the ESX hosts in both sites?

The primary VC is on a subnet, 167.96.83.x, that will be blocked via ACLs, as other VMs in prod in Site A are on that same subnet.

>What I suggest doing is this: Dual home the VC. Have the network guys create a VLAN or ACL to keep that connection alive. That way the VC can talk to the cluster and you will have admin access to the VC via RDP.

So if this is the case, does the primary VC in Phoenix need to be dual-homed, or the secondary? How is the connection kept alive? Do we need to change DNS in the recovery site to point adcutil26 to the new VLAN and IP?

>"what are the SRM ramifications of the primary site being blocked off completely?" None other than SAN array replication as stated before.

>"What happens when I SRM the primary VC to the secondary site?" They go into a disconnected state. When you switch back they should re-establish reciprocity.

>"I believe I can TEST a recovery plan, without needing to contact the primary SRM VC, so do I even need to SRM the primary VC over?" Correct, you would NOT SRM the primary VC. Let it be.

brucericker · ‎09-17-2009

Okay...thanks Jeff. So essentially, if I remove the primary VC from being replicated at all, I should be good to go there. I can do that, no problem...the primary VC won't come up in the recovery plan.

>If your recovery site cannot see the primary site during the test it will still bring up VM's and allow you test functionality, all without affecting the production VM's at your primary site.

Excellent, that's what I was wondering.

So when the ACLs are put in place and the primary can no longer see the secondary, and vice versa, does SRM actually try to do anything like a true failover?

What happens with SRM when the ACLs are put in place and the primary and secondary sites can no longer see each other for several hours during this test, and then when they can again hours later?

CrisRobinson · ‎09-17-2009

Yes, if you are replicating across the network, you are either routing between subnets or have a spanned VLAN. Either way, the source array will complain when you break that network connection while a replication is running. I suspect it is asynchronous, based on the distance between Phoenix and Denver.

What arrays are you using? EMC, HDS, IBM? FC or iSCSI?

You need to keep the communications alive between VC and ESX. I suspect that the network guys can VLAN you off with their failover scripts.

We really only care about Site B. If we lose contact with Site A, that is OK. When the network is switched back, it will re-establish communications on the SRM side. It may spit out a few errors.

Only the DR side, Denver, needs to have the VLAN between the VC and hosts established. Remember also that the ESX cluster has a heartbeat between hosts for HA\DRS, VMotion, etc., if you are licensed for all that, so your best bet really is to have a script that preserves their network via a VLAN as well as the VC-to-ESX communications.
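
As a rough way to sanity-check a proposed ACL before the test window, the flows that must survive can be written down and checked against the deny rules. The sketch below is only an illustration: the Denver management addresses, the Phoenix VC address, the /24 mask on 167.96.83.x, and the `flow_allowed` helper are all assumptions, not values from this environment.

```python
import ipaddress

# Hypothetical addressing. Only the 167.96.83.x subnet appears in this thread,
# and the /24 mask is an assumption.
DENVER_MGMT = ipaddress.ip_network("10.20.0.0/24")
PHOENIX_PROD = ipaddress.ip_network("167.96.83.0/24")

# Each deny rule blocks traffic in both directions between two networks.
DENY_RULES = [(DENVER_MGMT, PHOENIX_PROD)]


def flow_allowed(src, dst, deny_rules=DENY_RULES):
    """Return True if no deny rule matches the (src, dst) pair in either direction."""
    s = ipaddress.ip_address(src)
    d = ipaddress.ip_address(dst)
    for net_a, net_b in deny_rules:
        if (s in net_a and d in net_b) or (s in net_b and d in net_a):
            return False
    return True


denver_vc = "10.20.0.10"                     # hypothetical Denver VC/SRM address
denver_hosts = ["10.20.0.21", "10.20.0.22"]  # hypothetical Denver ESX hosts
phoenix_vc = "167.96.83.26"                  # hypothetical Phoenix VC address

# Flows that have to keep working during the test: Denver VC to Denver ESX hosts.
required = [(denver_vc, host) for host in denver_hosts]
blocked_required = [flow for flow in required if not flow_allowed(*flow)]

print("required flows blocked by the ACL:", blocked_required)
print("Denver VC -> Phoenix VC allowed:", flow_allowed(denver_vc, phoenix_vc))
```

An empty list for the required flows, together with the Phoenix VC being unreachable, matches what the test needs: Denver stays manageable while Phoenix is fenced off.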

JeffDrury · ‎09-17-2009

"So when the acl's are put in place and the primary can no longer see the secondary, and vice versa, does SRM actually try to do anything like a true failover?"

SRM does not do anything automatically. This is a good thing, in that SRM will not initiate a very permanent failover on its own. An admin is required to initiate any kind of test or failover. As long as you click the "Test" button on the recovery plan, you are good. If you hit the "Run" button, you will initiate an actual failover that will cause your SAN to mount the data at the recovery site as the primary copy. Once you do this, failback is a manual process, and you will likely need to get your SAN vendor involved. Whatever you do, don't hit the "Run" button unless it is an actual failover.

"What happens with SRM when the acl's are put in place and the primary and secondary sites can no longer see each other for several hours during this test....and then when they can hours later?"

SRM does not care that it can't see the other site. You can still test your recovery plan without having communication with the primary site. When the link is restored, SRM will again establish reciprocity with the primary site. You may need to stop/start the SRM service at each site to speed up this process. As Cris mentions, your data replication, which is handled by your storage vendor, may be sensitive to the loss of communication to the primary site. If you are doing asynchronous data replication and it can't replicate for several hours, those changes should be queued and replicated when the remote site is back up. This may saturate your WAN link with replication traffic, or the link may not be able to replicate the changes with the available bandwidth. Again, this is an issue that can change depending on the type of storage that you are using.
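
To get a feel for the catch-up window, a back-of-the-envelope estimate can help. This is a simplified sketch with made-up numbers: the change rate, link speed, and the `catch_up_hours` helper are assumptions, and it ignores any compression or deduplication the replication product may apply.

```python
def catch_up_hours(outage_hours, change_rate_mbps, link_mbps):
    """Estimate hours needed to drain the replication backlog after the link returns.

    Returns None when the link cannot keep up with the ongoing change rate,
    in which case the backlog never drains.
    """
    backlog_megabits = outage_hours * 3600 * change_rate_mbps
    # New changes keep arriving while the backlog drains, so only the spare
    # bandwidth is available for catching up.
    spare_mbps = link_mbps - change_rate_mbps
    if spare_mbps <= 0:
        return None
    return backlog_megabits / spare_mbps / 3600


# Hypothetical: 4-hour outage, 20 Mbit/s average change rate, 100 Mbit/s WAN link.
print(catch_up_hours(4, 20, 100))  # 1.0

# Hypothetical: change rate higher than the link can carry.
print(catch_up_hours(4, 60, 50))   # None
```

Running the test during a quiet period lowers the change rate, which shrinks both the backlog and the time the WAN link spends saturated afterwards.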

brucericker · ‎09-17-2009

Okay...on the Denver side the VLAN is already there and established and will remain untouched. Nothing will happen in Denver to the VC or the hosts on the Denver VLANs. Also, nothing will happen in Phoenix between the VC and the hosts in Phoenix.

Are you saying that the hosts in Denver need to talk to the VC in Phoenix, even though they are different clusters?

Phoenix will just be cut off from Denver via ACLs, so everything remains up and running in Phoenix, just not accessible from Denver.

I was worried that when this is done, SRM would attempt to fail over. If this doesn't occur, how is this ACL blocking different from a true disaster? What would trigger SRM to attempt a failover?

The test recovery plan is set to bring down the test VMs on those hosts and bring up the prod VMs from the protected stores that are replicating, right now on the Auto test bubble network. I am going to alter those to come up on their prod networks so there is full communication and testing between all of the VMs and some physicals in Denver.

One of those subnets is 167.96.83.x. This is where the primary VC is, and this subnet will then be live in Denver but cannot also be live in Phoenix, as the real prod systems are on that same subnet there. Hence the ACLs and the primary VC being unavailable.

So it seems I do not need to do anything at all except change the recovery plan networks from Auto to their prod VLANs.

I will not have to do anything with SRM at all? No break in protection or anything prior to the ACLs being put in place to block everything?

It is asynchronous, and it's an EMC CLARiiON CX4-480.

Thanks for your help, Cris and Jeff. I appreciate it. Trying to get my head around this...

brucericker · ‎09-17-2009

Okay thanks Jeff. I knew that, I am definitely only running test...

So if a bomb hits this building and I am a smoking hole, SRM will not automatically fail over to Denver?

CrisRobinson · ‎09-17-2009

>Are you saying that the hosts in denver need to talk to the VC in Phoenix, even though they are different clusters?

No, they are not talking other than SRM.

>It is asynchronous and its emc clarion cx4 -480

MirrorView/A or RecoverPoint? Jeff is right, it will just queue up the block changes until the network comes back. If you do this on the weekend, depending on your business, there will probably be fewer changes to the data.

No problem. We are all here to help each other out!

CrisRobinson · ‎09-17-2009

Nope. Just hope you're out fishing when it happens!

brucericker · ‎09-17-2009

Okay good....thanks gents

It's RecoverPoint

brucericker · ‎09-17-2009

So I do not need to stop replication on the SAN either.

My only real concern, then, is to get the primary VC out of the recovery plan?

Speaking of recovery plans, do you know if multiple plans can be run against RecoverPoint at the same time? I know this depends on the SRA, correct? So far I have only attempted to run one recovery plan at a time.

CrisRobinson · ‎09-17-2009

I have not tried to do more than one at a time. I suspect that they will queue up.

Gonna have to try that!

JeffDrury · ‎09-17-2009

Page 23 of the SRM documentation, http://www.vmware.com/pdf/srm_10_admin.pdf, shows the following:

Table 2-2. SRM Configuration Maximums

Item                          Maximum
Protected virtual machines    500
Protection groups             150
Replicated LUNs               150
Running recovery plans        3

You may want to check the EMC documentation for the SRA to see if there are any stipulations on the number of active recovery plans with RecoverPoint.
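
If it helps when planning the test, the limits from that table can be compared against what you intend to run. This is just a sketch: the limits are the ones quoted above, while the planned counts and the `over_limit` helper are made up for illustration.

```python
# Limits as quoted from Table 2-2 of the SRM 1.0 admin guide.
SRM_MAXIMUMS = {
    "protected_vms": 500,
    "protection_groups": 150,
    "replicated_luns": 150,
    "running_recovery_plans": 3,
}


def over_limit(planned, limits=SRM_MAXIMUMS):
    """Return {item: (planned, maximum)} for every item that exceeds its maximum."""
    return {
        item: (count, limits[item])
        for item, count in planned.items()
        if count > limits[item]
    }


# Hypothetical counts for a test run.
planned = {
    "protected_vms": 120,
    "protection_groups": 6,
    "replicated_luns": 12,
    "running_recovery_plans": 4,
}

print(over_limit(planned))  # {'running_recovery_plans': (4, 3)}
```

Any limit the SRA or array imposes would be a separate check on top of this.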

brucericker · ‎09-17-2009

Thanks Jeff

I have checked, and EMC says they do support it, so I will be testing multiple recovery plans running at once tomorrow.

I'll keep you posted
