fordian
Hot Shot

To better understand SRM


Hello,

We have two sites. Each site has 4 ESX servers and a CLARiiON CX380, but everything is managed by a single VirtualCenter, and all the ESX servers are in the same VC Datacenter. In effect, we have two physical sites that we manage as one.

I know that with SRM we need two VCs, one at each site, but do we also need two Datacenters in VirtualCenter? I ask because we currently use MirrorView/S to synchronize the LUNs, and we have a LUN resignature concern: we cannot present the mirror and the mirrored LUN with the same UUID when all the ESX servers are managed by a single VC.

Can SRM manage this for us?

I hope my problem is clear.

Thank you

Dominic.

fordian
Hot Shot

Oops ... that should read: we manage all ESX servers in one Datacenter.

bladeraptor
VMware Employee

Hi

I am writing this as an EMC employee.

My understanding is that you cannot implement SRM without two separate vCenter instances, and consequently, if you create and use Datacenter objects, you will have two 'Datacenter' objects as VMware understands them: one at the protected/production site and one at the recovery/remote site.

The two vCenter instances are critical because they allow the creation of a separate SRM database at the remote/recovery site, populated with the details of the protected/production site.

Without the second vCenter you would effectively be trying to replicate the production/protected VMware site configuration to itself.

The second recovery/remote site also provides important features such as inventory mapping, which lets you recover your environment in terms of folders, networks and resource pool objects at the recovery site. This would not be possible with only one vCenter instance.

As far as I can see, if you wish to implement a disaster recovery solution using SRM, you will need to accept two vCenter environments.

Are you saying that all the hosts can see all the storage, i.e. hosts at site A can see the storage at site B and vice versa?

The LUN issue should not appear, due to the mandated use of the LVM resignature option, which gives the snapped and remote volumes a different identity.

However, I take it that you will have some ESX hosts at the recovery/remote site ready to bring online when recovering the protected/production site, and these will need to be part of the second, separate vCenter instance.

Please let me know if that does not make things clear.

Many thanks

Alex Tanner

fordian
Hot Shot

If SRM uses LVM resignaturing, then you have all the LUNs from both sites in the same Datacenter. Am I wrong, or does SRM simply require two Datacenters?

bladeraptor
VMware Employee

Hi

I am not sure what you mean by "if SRM uses LVM resignaturing, then you have all the LUNs from both sites in the same Datacenter".

SRM can operate in a scenario where the Foundation ESX licensing model is used, and consequently I don't believe one has to use the Datacenter object at all.

If you have two separate vCenter instances, as SRM appears to require, then, since vCenter is not currently federated, you will have two independent Datacenter objects. Even if you named the Datacenter objects the same, they would still be separate.

It is then a case of how you allocate your hosts among the vCenter instances.

I would not associate the use of LVM resignaturing with the Datacenter object, but rather with the cluster object: you can have a number of clusters beneath a Datacenter object, and I believe the cluster is the boundary that determines the relevance of LVM resignaturing, rather than the Datacenter object.

However, all of this is secondary to the need for two vCenter instances and two SRM instances.

Regards

Alex Tanner

Smoggy
VMware Employee

SRM will handle all of the LVM settings for you, so these do not need to be set manually.

In the current SRM architecture you have two separate VC instances (one at each site). Part of the reason for this approach is that, should anything happen to the source site (as traditionally happens in a disaster), everything you need to invoke your recovery plans is held ready to go in the running VC/SRM setup at the recovery site. This means that to perform a successful recovery you have no reliance on any element of the source (protected) site.

In terms of the LVM settings, these affect the UUID stamped into the metadata of each LUN/VMFS volume. When your ESX hosts scan their devices, if they see a UUID they think they have seen before but something about the physical device is different, then by default the LVM settings ensure that the VMFS volume on that device is not mounted. This is what is known as ESX treating the device as a snapshot (I know the term snapshot gets used far too often to mean different things!).
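A rough way to picture that default behaviour is the following minimal Python sketch of the decision logic. This is not the actual ESX LVM implementation; the UUIDs and device ids are invented for the example.

```python
# Minimal sketch of the ESX "snapshot LUN" decision, for illustration only.
# NOT the real ESX LVM code; the UUIDs and device ids below are made up.

def treat_as_snapshot(known_volumes, observed_uuid, observed_device):
    """known_volumes maps a VMFS volume UUID to the physical device id the
    volume was last seen on. If a UUID that is already known shows up on a
    *different* physical device, the conservative default is to treat the
    volume as a snapshot/replica and leave it unmounted."""
    known_device = known_volumes.get(observed_uuid)
    return known_device is not None and known_device != observed_device

known = {"4a7b-vmfs1": "naa.6006016000000001"}

# The source LUN: same UUID on the same device, so it mounts normally.
print(treat_as_snapshot(known, "4a7b-vmfs1", "naa.6006016000000001"))  # False
# A MirrorView replica: same VMFS UUID on a different physical LUN id,
# so it is flagged as a snapshot and not mounted unless resignatured.
print(treat_as_snapshot(known, "4a7b-vmfs1", "naa.6006016000000002"))  # True
```

Resignaturing simply stamps a new UUID into the replica's metadata, so the "same UUID, different device" condition no longer triggers.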

In a best practice DR environment you should not have to be concerned about presenting the source and replica luns to a single ESX host at the same time since it is very unusual to have ESX hosts at the source site zoned into the replica copies at the DR site. SRM ensures the correct LVM settings are used so even if you have an "odd" zone setup that does allow hosts at both sites to see all devices (source and replicas) then you will not be at risk.

cheers

Lee Dilworth

fordian
Hot Shot

"In a best practice DR environment you should not have to be concerned about presenting the source and replica luns to a single ESX host at the same time since it is very unusual to have ESX hosts at the source site zoned into the replica copies at the DR site."

That is the point: our customer wants to use both the production site and the "recovery site", i.e. use the recovery site to run VMs from the production site, VMotion between the two sites, and present LUNs from one site to the other and vice versa, all managed by one VC.

Can SRM work with this setup?

Dominic

Smoggy
VMware Employee

First, the current version of SRM does not support a single-VC use case, so if the customer won't budge on that one then you cannot currently use SRM. There are many downsides, in DR terms, to using a single VC, hence the reason the SRM architecture utilizes two.

When you have two VCs, SRM can work active/active, and by that I mean you can have two production datacenters, each replicating to the other and each acting as DR for the other in the event of a site loss. SRM supports this configuration.

In my experience, VMotion between sites with the current VC 2.5/ESX 3.5 suite does not really give you much of a use case, since it is only an option for customers whose DCs are close together, and even then all you end up with is a VM running at one site that is accessing its virtual disks at the other site, which is not that desirable. The other point here is that to make this work you also have to have SAN zones that allow LUNs to be accessed across sites, and potentially across security boundaries.

Granted, I can see some obscure use cases for wanting to be able to VMotion a VM across sites, but nothing that really makes it worthwhile architecting for right now. OK, one day you might want to VMotion everything across to the other site whilst you perform some huge hardware upgrade, but there are other ways round this with Storage VMotion that still give you no downtime. Going forward, there are other things coming that will make VMotion over distance, and tighter integration with storage replication, more worthwhile.

Where you are right now, I think the most important thing to establish is what your customer really wants. Is it simply a solution that lets them run two production DCs so they don't have kit standing idle at one location? If yes, then SRM can do this for them now in an active/active setup. I find a lot of customers think they need cross-site VMotion without really thinking it through, and without thinking about the implications for things like vCenter design, points such as:

VC Inventory / Layout: be careful with the design. As everything is stretched, you need to be very consistent and accurate with naming conventions across all the inventory objects the VMs will use.

DRS/HA settings: ensure that you know which VMs are important and define the correct settings per VM for recovery. Unless you have N+1 capacity spare at each site, you will need to put in place HA/DRS settings that bring the most important VMs online first, so you don't end up in a failure situation with all your dev/test VMs online and half the production VMs down because you did not set the correct priorities in HA. In SRM this is something the recovery plan handles and you can control.
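The ordering idea can be sketched like this (the VM names and priority values are invented for the example; in practice the SRM recovery plan or per-VM HA restart priorities do this for you):

```python
# Hypothetical sketch of priority-ordered recovery: the kind of ordering an
# SRM recovery plan (or per-VM HA restart priorities) gives you.
vms = [
    {"name": "dev-build01", "priority": 3},
    {"name": "prod-sql01",  "priority": 1},  # most important: power on first
    {"name": "prod-web01",  "priority": 2},
    {"name": "test-ui01",   "priority": 3},
]

def recovery_order(vms):
    """Return VM names in power-on order: lowest priority number first,
    so production comes up before dev/test."""
    return [vm["name"] for vm in sorted(vms, key=lambda v: v["priority"])]

print(recovery_order(vms))
# ['prod-sql01', 'prod-web01', 'dev-build01', 'test-ui01']
```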

Split Brain: if you run the two sites as one big HA/DRS cluster, ensure you test out the various failure scenarios. For example, if DRS (or a manual VMotion) moves a bunch of VMs from site 1 to site 2 but no failure has occurred at that time, you now end up with the VMs' CPU/memory/network contexts running on hosts at site 2 but accessing their VMDKs at site 1. This will work, but it is not always desirable from a latency point of view (it might be a non-issue if bandwidth is sufficient). However, what happens next if you now suffer a disk outage at site 1? At this point the VMs will not crash immediately at site 2, and it will take HA some time to realise these VMs have an issue. Try it and see: if you disconnect storage from a VM, the VM will cling on to life (assuming the IO pattern is normal) for quite some time before a bluescreen is seen.

Storage Presentation: if your customer wants the zone across the sites to effectively be "open" to all ESX hosts, then ensure you understand the implications of the ESX LVM settings with regard to snapshot / disk resignature. You could potentially have ESX hosts that at some point access both a source and a target LUN at the same time, if someone or something altered the LVM defaults. As we discussed earlier, with SRM this is handled for you, as in the SRM architecture you would not have this kind of open zone.

Zoning: if the VSANs / zones are truly open, or all hosts are in the same zone, then certain fabric events can be a potential pain. Any rogue events such as RSCNs will disrupt both sites at the same time if all the ESX hosts are on the same open fabric, so be careful here. It is not something that is too common, but I have seen it hurt a few customers; it usually comes down to a bad HBA or cables, but it can be a real pain to track down.

VC / ESX limits: as you build the design out for a campus cluster, ensure the design won't have you quickly reaching the limits of what is supported in terms of things like the maximum number of VMs per VC, LUNs per ESX host, paths per LUN per ESX host, etc.
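A back-of-envelope check along those lines could look like the sketch below. The limit values are placeholders, not the published configuration maximums for any particular release, so substitute the real numbers for your exact VC/ESX versions.

```python
# Rough design sanity check against platform maximums. NOTE: the numbers
# below are PLACEHOLDERS for illustration; always substitute the published
# configuration maximums for your specific VirtualCenter / ESX release.
LIMITS = {
    "vms_per_vc": 2000,
    "luns_per_host": 256,
    "paths_per_lun_per_host": 8,
}

def check_design(design):
    """Return a list of the limits the proposed design would exceed."""
    problems = []
    for key, limit in LIMITS.items():
        if design.get(key, 0) > limit:
            problems.append(f"{key}: {design[key]} exceeds limit {limit}")
    return problems

design = {"vms_per_vc": 1200, "luns_per_host": 300, "paths_per_lun_per_host": 4}
print(check_design(design))  # ['luns_per_host: 300 exceeds limit 256']
```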

fordian
Hot Shot

Thank you, very good answer.

Dominic
