Solved: Re: What's the benefits of using SRM over HA and VMotion

jhiraldo · ‎02-09-2009

I have a 2 site environment connected via fiber. I can do a campus cluster between the 2 sites or an SRM failover from one side to the other.

1) Cluster environment with HA and VMotion with Lefthand will give me an always up environment and I don't need to buy any SRM license.

A) If one building goes down I can still move on without a hiccup

B) Don't have to worry if my SRM will work when site 1 goes down.

2) Failover environment with SRM

A) I can separate the sites. What are the benefits?

B) SRM will do the failover (but then I have to do the failback with some manual intervention), which means some downtime.

Can someone tell me what the benefits of SRM are when I can do a cluster environment and always be up?

TomHowarth · ‎02-10-2009

No Dave, you are incorrect in your assumption. A Metro Cluster is two devices, but the SAN-attached machine will only see one logical LUN. The NetApp deals with the continuous replication between sites.

If you found this or any other answer useful please consider the use of the Helpful or correct buttons to award points

Tom Howarth VCP / VCAP / vExpert
VMware Communities User Moderator
Blog: http://www.planetvm.net
Contributing author on VMware vSphere and Virtual Infrastructure Security: Securing ESX and the Virtual Environment
Contributing author on VCP VMware Certified Professional on VSphere 4 Study Guide: Exam VCP-410

bladeraptor · ‎02-09-2009

Hi,

I am writing this as an EMC employee,

Please see inline for my comments on the specific points highlighted in your post.

However I would headline it by saying that for the most part as things stand currently from the majority of vendors and also reflecting bandwidth / link latency and replication architectures - we should be careful to separate a VMware capability designed to address local site uptime from that providing for catastrophic production site failure.

For the most part HA and VMotion are about providing local site uptime - a series of technologies that allow services located in the production environment to offer as close to 100% uptime as is possible.

VMotion and HA are not, in terms of most vendors' architectures and VMware best practices, designed to span geographically dispersed sites, and by their nature they anticipate common locally sited storage, hosts and 'network' links.

This is in line with IT infrastructure realities such as planned downtime for hardware and software maintenance and upgrades. HA and VMotion are designed to deal as much with eliminating the need for planned downtime on a local site as with counteracting unplanned downtime.

SRM is designed to operate cross-site - whether local campus / MAN or WAN.

It anticipates the loss of the production site, and the key business driver in this scenario is the recovery of services at the secondary site as quickly as possible with minimal data loss. SRM anticipates unplanned downtime - so while retaining a high uptime rating is a good thing, when a disaster happens recovering services is all that matters, and SRM is designed not to rely on any elements of the infrastructure - hosts, network or storage - remaining on the primary site.

1) Cluster environment with HA and VMotion with Lefthand will give me an always up environment and I don't need to buy any SRM license.

VMotion does not function if one element is down - it requires both 'hosts' to respond, so (unless LeftHand is doing something clever that is not VMware related) VMotion will fail if one of the hosts is down. This 'cleverness' is offered by some vendors, but wariness needs to be exercised around ESX host overhead, the impact of network latencies, and the complexity and cost of deploying a solution that is trying to fudge the design principle behind HA and VMotion - i.e. a local site dependency.

A) If one building goes down I can still move on without a hiccup

Again if you are talking VMotion and one of the ESX hosts is in a building that has gone down (unless LeftHand is doing something clever that is not VMware related) VMotion will fail.

In the scenario you are talking about it may be possible to retain some uptime from the environment - but truly distributed implies that you will have storage, networks and hosts at both sites, and this separation is not conducive to HA and VMotion operating across those distributed resources - unless some kind of abstraction layer is sewing them together to appear as one environment. In that case there is an overhead, as everything that is on site one must be duplicated and available all the time on site two, and there is management, configuration and network overhead required to achieve this.

If the requirements justify this in terms of all the related overhead, and the distances and latencies are not an issue, then what you describe may be possible and workable - but consider this:

What you are describing is an extended single site environment and while you could tolerate a very limited local site failure - albeit running at half capability - you have not created a true DR scenario.

If an incident of significant magnitude were to take out this campus environment, you could not recover. I would suggest that if an environment comprising two sites is close enough to enable this 'stretching' of VMotion and HA in what you suggest is a non-disruptive fashion, it is close enough to be taken out by a range of natural and man-made catastrophes.

So even if you go ahead with the deployment as you suggest, I would still consider some type of recovery strategy in the event that both local sites are taken out - depending on the severity of the service level requirements you are being asked to meet.

B) Don't have to worry if my SRM will work when site 1 goes down.

This is the whole beauty of SRM. SRM creates two additional databases, one per 'site', that hold the configuration information of the other site. The recovery is driven through recovery plans that are created at the recovery side and are not reliant on communication from the production site to be enabled in the event of production site failure. SRM is about being able to recover the entirety of your VMware infrastructure in the event of catastrophic production site failure - i.e. nothing is left of the production site - hosts, storage or networks.

2) Failover environment with SRM

The failover environment with SRM is a replica of the production site. This is achieved at the simplest level through the inventory mapping function.

A) I can separate the sites. What are the benefits?

That the sites are not dependent on each other in the event that one site fails. Each site can be configured completely differently, down to host IP addresses, and this can then be mapped to allow either site to be recovered in the other site's environment with minimal reconfiguration or reconstruction.

B) SRM will do the failover (but then I have to do the failback with some manual intervention), which means some downtime.

That is the scenario with the current version of SRM. SRM, however, is a constantly evolving framework that will be undergoing constant revisions, and the road map includes the possibility of a seamless failback mechanism. This wasn't included in V1 as the desire was to create an initial iteration of the SRM framework that was as inclusive of storage vendors as possible. Not all vendors can do failback.

Regards

Alex Tanner

weinstein5 · ‎02-09-2009

The one thing I would ask is which building the SAN that houses your virtual machines resides in - because if you use VMotion and HA as your DR solution, the SAN is the single point of failure: if the building with the SAN goes down then HA and VMotion will not work. If your answer to that is "well, I am replicating my data between the buildings", that is fine - then what you will gain with SRM is a quicker recovery time, because SRM will automate the failover, whereas without SRM you would have to bring up your environment manually, causing a longer outage. Also, with SRM you would be able to test your DR procedures without affecting the production environment.

If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful

jhiraldo · ‎02-09-2009

If I have a cluster environment between 2 sites using SAN/iQ or NetApp and site1 goes down with the NAS and hosts, all the VMs will restart automatically on site2 without data loss, because the data would have been copied across multiple sites, and the VMs should come back online. After I get site1 back up, site2 would update any changes. I get my hosts up and the resources will be applied to the cluster that needs them. I understand that if both sites go down then I don't have a way to restart anything, and that goes for both solutions, with or without SRM, unless I have backed up all my VMs to tape or a 3rd location. This is what NetApp and HP are selling to me as a solution and what I have read in their white papers. Am I misunderstanding their design and how it fits in regards to a faster DR solution?

weinstein5 · ‎02-09-2009

I do not think HA will fail over because it will see the SAN as two different LUNs - have you tested this? Since you have the replication between the sites, implementing SRM should not be a problem - if you do not want SRM, other options could be Double-Take, Vizioncore or PlateSpin.

If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful

jhiraldo · ‎02-09-2009

It's not that I dislike SRM; it's that with these 2 solutions I don't see a need for it. I need someone who has implemented a NetApp or SAN/iQ cluster design to comment on this, to verify that this solution is what they're really selling and not a marketing or sales pitch.

JeffDrury · ‎02-09-2009

Weinstein5,

For HP SANiQ that is not correct, not sure about NetApp.

With SANiQ deployed in a campus cluster between two locations, there is a single LUN presented to the ESX servers. If one side of the deployment goes down the LUN is still presented to the surviving ESX hosts, and HA will recognize that the VMs are down due to ESX host failure and restart them on the surviving ESX hosts. There is no interruption in service from a storage perspective. SANiQ is essentially providing the same synchronous replication at the block level that the products you listed would perform at the file level. You could not use SRM between these two sites because the storage is presented as a single site and SRM would not recognize that there are two distinct storage clusters. In effect this is giving you better DR protection than SRM as long as one of the sites survives in the event of a disaster. You could also add a third site as an SRM target to back up the data at the two main sites if they both fail.

bladeraptor · ‎02-09-2009

Hi

Please see my thoughts inline to your original comments

If I have a cluster environment between 2 sites using SAN/iQ or NetApp and site1 goes down with the NAS and hosts, all the VMs will restart automatically on site2 without data loss, because the data would have been copied across multiple sites, and the VMs should come back online.

>>>> Clearly we have moved away from the earlier assertions around VMotion - as a VM restart is not in keeping with the capabilities of VMotion - so the predominant capability here is HA and what you have is a stretched HA environment where effectively hosts are looking at 2 SANs and you have stretched networks, DNS etc.

>>>>>There will be some form of delay in which the system confirms that the other side is down - I would verify how this is achieved to avoid split brain syndrome, and whether it will understand network outages and other incidents that impact cross-site communication without causing an inadvertent 'failover'.
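As a way of framing that question to the vendor, here is a minimal sketch of one common approach to avoiding split brain: a tie-breaker at a third location that lets only one site take over when the two sites lose contact. The `Witness` class, the `site_action` function and the site names are all hypothetical, invented for illustration; this is not how SAN/iQ or NetApp actually make the decision.

```python
class Witness:
    """Hypothetical tie-breaker at a third location.

    It grants storage ownership to the first site that asks and refuses
    every other site afterwards, so two isolated sites can never both win.
    """

    def __init__(self):
        self.owner = None

    def request(self, site):
        if self.owner is None:
            self.owner = site
        return self.owner == site


def site_action(site, peer_reachable, witness):
    """Decide what one site should do.

    witness is None when this site cannot reach the witness at all.
    """
    if peer_reachable:
        return "normal"        # both sites see each other; nothing to do
    if witness is not None and witness.request(site):
        return "take_over"     # peer unreachable and the witness chose us
    return "stand_down"        # isolated, or the other site already won


# Inter-site link fails while both sites are still alive:
witness = Witness()
print(site_action("site1", False, witness))  # take_over
print(site_action("site2", False, witness))  # stand_down
```

The point of the sketch is that a site which can reach neither its peer nor the witness cannot tell a dead peer from its own dead link, so it has to stand down rather than bring storage online.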

>>>>>I would be very careful when considering claims of no data loss. There is no storage system on the market that can offer zero data loss without some form of host operating system and often application integration. The modern caches on many servers particularly ESX hosts at 16-32GB means that there could be any number of transactions passing through host cache when a failure occurs. Furthermore you may get all the relevant data to array cache or disk - but the question is then what state is that data in - if the data is coming from an application such as a database or mail construct - there might be a loss of synchronicity between the database and logs portion. Many applications can recover from this - but with the loss of transactions or mail - hence the existence of the VSS and VDI frameworks for quiescing applications prior to conducting array operations

>>>>Taking into account my earlier comments about data loss, your comment that "VMs will restart automatically on site2 without data loss, because the data would have been copied across multiple sites, and the VMs should come back online" is a statement that basically covers exactly what SRM will do.

>>>I am confused by your use of the term 'multiple sites' - I take it you mean across two sites, as opposed to some form of cascaded architecture?

After I get site1 back up site2 would update any changes.

>>>>Again I would validate how these changes are mapped onto the operations of Virtual Machines and applications and whether they are applied in a consistent fashion ensuring that your services are resumed without data loss or corruption

I get my hosts up and the resources will be applied to the cluster that needs them. I understand that if both sites go down then I don't have a way to restart anything, and that goes for both solutions, with or without SRM, unless I have backed up all my VMs to tape or a 3rd location. This is what NetApp and HP are selling to me as a solution and what I have read in their white papers. Am I misunderstanding their design and how it fits in regards to a faster DR solution?

>>>> I can't comment on their design or capabilities but can perhaps as I have tried to do briefly here stress some of the questions and concerns it is worth posing around what is being offered

>>>>As suggested earlier the fact that the two sites are so close implies that they could from a DR perspective be considered a single site and would suggest that you investigate a DR location that anticipates local loss of utility services such as power or inclement natural conditions

Regards

Alex Tanner

Jay_Judkowitz · ‎02-09-2009

jhiraldo,

I agree that when you look at sites that are close together, have spanned IP ranges, active/active storage, high bandwidth, and the ability to be managed with one VirtualCenter, there is some overlap of benefits with SRM.

I would say there are two big distinctions.

With SRM, you get a much more well defined failover.

The VMs start in a specified order
You can set some VMs to be started serially with others starting in parallel
You can designate VMs at the recovery site to suspend to make room for recovery VMs
You can have callout scripts and predefined breakpoints to make sure that critical non-VMware activity is done at the right time and place
You can set the resource pool at the remote site (with the same size or different as the source resource pool) so that you get a predictable and defined QOS on CPU and memory
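To illustrate the ordering idea in the list above, here is a minimal sketch that groups VMs into start-up waves: waves run one after another, and VMs inside a wave may start in parallel. The function name, the VM names and the numeric priorities are assumptions made for the example. This is not the SRM API, and a real recovery plan also covers steps such as suspending VMs, callout scripts and breakpoints.

```python
from itertools import groupby


def startup_waves(vms):
    """Group VMs into start-up waves by priority.

    vms is a list of (name, priority) tuples; a lower number starts earlier.
    Returns a list of waves. Waves are run serially, and the VMs inside
    one wave can be powered on in parallel.
    """
    ordered = sorted(vms, key=lambda vm: vm[1])  # stable sort keeps input order within a priority
    return [
        [name for name, _ in group]
        for _, group in groupby(ordered, key=lambda vm: vm[1])
    ]


plan = startup_waves([("web1", 2), ("db", 1), ("web2", 2), ("report", 3)])
print(plan)  # [['db'], ['web1', 'web2'], ['report']]
```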

Once you have that well defined failover plan, you can test it and audit the results

Testing will automatically snap the recovery LUNs so you can power on the recovery VMs without interrupting replication
You can specify a test network at the second site that SRM will automatically put the recovery VMs on during a test so that they do not interfere with the running VMs
You can therefore do non-disruptive DR testing any time without warning. The recovery plan executes the same as for failover, but in a "test bubble" where storage and network IO are safely segregated away from production work.
There is a test results page for the recovery plan which lists all test runs, how long they took and how successful they were. From this page, you can drill down to each test run and see exactly what steps succeeded and failed and how long they took to run.
With the history page, you can grade your organization over time. With the detailed reports, you can troubleshoot specific runs.

So, I would say that even in the scenario you describe, SRM makes preparation, testing, and failover much more repeatable, reliable, and auditable. You can ensure proper ordering and QOS and you can make sure it works the same way every time. Obviously, the benefits of this structured test and failover are increased as the size of the deployment increases. What you might do manually for 5 VMs and may script for 20 VMs becomes impossible to maintain for 100 VMs without some extra tools like SRM.

Hope that helps.

Thanks,

Jay

JeffDrury · ‎02-09-2009

Jay,

While I completely agree with your points about SRM I believe there are also some drawbacks in implementing SRM for this scenario:

Licensing

Implementing SRM requires licensing and maintaining another VI Server as well as the SRM licenses.

Failover / Failback

During an actual failover the process requires manual initiation and will be slower than HA.
Failback is dependent on the capabilities of the underlying storage. Some vendors do this better than others but it is not automated and built into the current version of SRM.

If the benefits of testing and controlled failover are desired then SRM may be the best fit. However, if the underlying storage and networking can present two physical sites as one logical site to ESX, why not use HA / VMotion? Failback with HA / VMotion is much easier as it would only require bringing up the ESX hosts and VMotioning the running VMs back, which could also be done via DRS. Licensing is much easier / cheaper as you do not need another VI Server with SQL hanging around to manage a failover. Additionally, hardware utilization is improved as you do not have a warm site waiting for the primary to fail.

To look at this another way in a single site you could have two ESX hosts using SRM to migrate VM's between each other, but you would be more likely to use HA / VMotion as it is a much easier and reliable process.

Jay_Judkowitz · ‎02-09-2009

Jeff,

Like the original post by jhiraldo, you bring up valid points. Here is how I would respond to each.

Licensing - It will not surprise you given my position at VMware, but I would contend that the value provided by the features I described are well worth the licensing costs. For enterprise class reliability, predictability, and auditablity of DR protection of hundreds of x86 workloads, there really is no comparable solution.
Recovery time - Initiation is a one time effort - you just hit the button. Does that add to the RTO? Sure, it can add a couple of minutes. But, in many cases, that's a positive thing. With most customers I speak to, a site failover is actually a business or financial decision, not a technical one. Furthermore, this prevents split brain scenarios and false positives. But, if you absolutely, positively want automation of the failover, it's a simple enough program to write using the SRM API.
Failback - Yes, the time and effort to set up the failback is greater than just letting DRS move half the workloads back automatically. Obviously, this is an area the product could improve upon (and no, I'm not going to answer specific roadmap questions on a public forum). I would say that the price here, setting up the protection in the reverse direction before failback, is not that bad an operational hit considering that (a) it's all UI driven and (b) it's really infrequent. If I were still in IT, I would be optimizing for the most common occurrences, which in the case of DR, ought to be testing. Failback must be supported and supported reliably, but flawless repeatable testing should outweigh the importance of the simplest and most hands-free failback, in my opinion.

That said, some reasonable people will disagree. Each IT organization has their own set of priorities. Also, I think scale has a lot to do with things, as I said in my last post. So, not everyone will use SRM in all situations, but to answer the original question in this thread, for the reasons stated in this post and in my previous one, this is why I would consider SRM even with the environment described.

Thanks,

Jay

jhiraldo · ‎02-09-2009

JeffDrury -

Here is a sample of what SAN/iQ can do in regards to what I have been posting here. Take a look at the PDF and let me know what you think about, see the link below.

www.vxplore.com/PDF/LHN_VXplore%20London%202008.pdf

Smoggy · ‎02-10-2009

Interesting discussion. Having implemented both types of solution, all I can add is that we have two solutions here for two use-cases, and I will try to share my own experiences.

Campus cluster / stretched HA environments work well if you have the right kind of infrastructure, but they are not really DR solutions, as typically the two sites are very close together and most customers I work with do not consider a DR site true DR if it is located within a certain distance of the primary. We had a couple of customers a few years ago whose "campus" solution was wiped out entirely when the UK oil field disaster struck and took out both datacentres at the same time (they were 0.5 miles apart). Extreme example maybe, but it illustrates the difference.

If you can live with the limitations of a campus cluster solution and they fit your needs then they can work well. As we say in the UK take what the whitepapers say with a pinch of salt until you've tried it yourself.

With any cross site storage architecture I have implemented, there will be *some* kind of pause whilst the system sorts things out. The amount of time this takes depends entirely on what failed. It could be 2 seconds, or it could be 2 minutes or more, and then you need to wait for HA to kick in. So when talking about failover initiation I would not say SRM and stretched HA solutions are really any different time wise. Indeed, if you wanted to automate the initiation of an SRM recovery plan you can do this, though if it were my pair of sites I would want this process to be kick-started manually by someone once the true nature of the event was understood. With an SRM recovery plan the storage integration "tells" the storage to come online, rather than having to wait for a failover heartbeat or similar to be detected by the storage itself.

Going back to campus clustering: although array/disk shelf failover can be automated, this does not always happen automatically in my experience. Sometimes it may require a manual intervention (click a button, or type a command to fail over) and you need to have the process defined clearly for that event. Losing a controller in either site for most vendors should be no big deal and the failover operation should take care of the storage side. If you lose the entire site, then manual intervention will (probably) be required to fail over, though it can sometimes be possible to script round this using staged heartbeats. Again, this still adds time to the failover.

If we look at failback, with the campus implementation the process is not as simple as bringing up Site1 and then just VMotioning the VMs back from Site2; again, it depends on the failure. If you lost Site1 completely and have had to fail over to the disk shelves at Site2, then the VMs will (once HA has restarted them) all be running from the disk shelves at Site2. If you simply VMotion them back to Site1 when it's ready, the storage will still be accessed via Site2's controller / disks until you tell the storage arrays to go back to their default configuration, which will require restarting the VMs again and will incur downtime in the same way as SRM failback would. I cannot imagine you would want a situation where Site1 came back online and you VMotioned 50% of your workload back to Site1 but left 100% of your disk workload running at Site2. I think in all cases the customers I have put this in with have wanted the storage to "go back to how it was", ready for the next event or failure.

I think the DR benefits of SRM have already been outlined above so I won't go through these again. The biggest difference in terms of the customer feedback I receive is that the ability to perform automated, repeatable, non-disruptive DR testing is one of the key factors moving customers towards SRM.

The only other items you need to be thinking about with campus cluster are below. I am not adding these to say "SRM is better"; these are simply things I have had to work through when implementing campus cluster, and some of these nuances don't always make it into the whitepapers/datasheets, shall we say.

VC Inventory / Layout: be careful with the design. As everything is stretched, you need to be very consistent and accurate with naming conventions across all inventory objects the VMs will use.

DRS/HA settings: with campus clustering ensure that you know which VMs are important and define the correct settings per VM for recovery. Unless you have N+1 capacity spare at each site you will need to put in place HA/DRS settings that bring the most important VMs online first, so you don't end up in a failure situation with all your dev/test VMs online and half the production VMs "down" because you did not set correct priorities in HA. In SRM this is something the recovery plan handles and you can control.

Split Brain: if you run the two sites as one big HA/DRS cluster, ensure you test out the various failure scenarios. For example, if DRS (or manual VMotion) moves a bunch of VMs from Site1 to Site2 but no failure has occurred at that time, you now end up with the VMs' CPU/memory/network contexts running on hosts at Site2 but accessing their VMDKs on Site1. This will work but is not always desirable from a latency point of view (it might be a non-issue if bandwidth is sufficient). However, what happens next if you now suffer a disk outage at Site1? At this point the VMs will not crash immediately at Site2, and it will take HA some time to realise these VMs have an issue. Try it and see: if you disconnect storage from a VM, the VM will cling on to life (assuming the IO pattern is normal) for quite some time before a bluescreen is seen.

Storage Presentation, if your vendor wants the zone across the sites to effectively be "open" to all ESX hosts then ensure you understand the implications of the ESX LVM settings with regards snapshot / disk resignature. You potentially will have ESX hosts that could at some point access both a source and target lun at the same time if someone or something altered the LVM defaults.

Zoning: if the VSAN / zones are truly open, or all hosts are in the same zone, then certain fabric events can be a potential pain. Any rogue events such as RSCN will disrupt both sites at the same time if all ESX hosts are on the same open fabric, so be careful here. It is not something that is too common but I have seen it hurt a few customers. It usually comes down to a bad HBA or cables but can be a real pain to track down.

VC / ESX limits: as you build the design out for campus cluster, ensure the design won't have you quickly reaching the limits of what is supported in terms of things like max number of VMs/VC, max number of LUNs/ESX host, max number of paths/LUN/ESX host etc.
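The DRS/HA settings point above can be sketched as follows: when the surviving site cannot hold everything, the most important VMs should claim capacity first. This is a deliberately simplified model that only looks at memory and uses made-up VM names and sizes. The function and its parameters are hypothetical, and VMware HA's real restart priority and admission control behave differently in detail.

```python
def vms_to_restart(vms, capacity_gb):
    """Pick which VMs to restart on the surviving site.

    vms is a list of (name, priority, memory_gb) tuples; a lower priority
    number means a more important VM. VMs are considered in priority order
    and restarted only if they still fit in the remaining memory.
    """
    chosen = []
    remaining = capacity_gb
    for name, _priority, memory_gb in sorted(vms, key=lambda vm: vm[1]):
        if memory_gb <= remaining:
            chosen.append(name)
            remaining -= memory_gb
    return chosen


inventory = [
    ("dev1", 3, 8),
    ("prod-db", 1, 16),
    ("prod-web", 2, 8),
    ("test1", 3, 8),
]
print(vms_to_restart(inventory, 32))  # ['prod-db', 'prod-web', 'dev1']
```

Without the priority sort, the same 32 GB could be taken by dev and test VMs before production gets a chance, which is the failure situation described above.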

hope that is of some use. As much as I like SRM solutions I also like the campus cluster / single pane of glass approach as well where it works/fits. Both use-cases are valid but ensure you work out what you actually need. I think long term some of the roadmap stuff we cannot go into here will see a coming together of a lot of the above but I'll leave that for Jay to talk about sometime

cheers

Lee Dilworth

lcw1982 · ‎07-26-2011

Is there any way to achieve HA within the protected site, yet make use of SRM to the DR site should the protected site fail?

TedH256 · ‎07-26-2011

Well of course. HA functions within a given cluster and protects against host failure.

SRM functions between sites/clusters, and protects against storage failure - two totally different things.

lcw1982 · ‎07-27-2011

I see, in that case I can configure VMware HA/ FT together with SRM?

TedH256 · ‎07-27-2011

HA and FT are two separate technologies. I am not sure whether a VM pair configured with FT can also be part of a protection group. Check the documentation.

However I am wondering if you are not certain about the proper role of HA and SRM?

HA is a feature that can be enabled on an ESX cluster. It will cause the VMs that are running on host A to start up on host B, in the event that host A fails.

HA has nothing to do with SRM, which is a totally separate product. SRM protects VMs that are running on replicated storage LUNs. If the administrator clicks the "failover" button, then it is presumed that storage at the protected site has failed, and VMs are brought up (and if necessary customized) on the DR site using the replicated storage.

So - yes, you can use HA and SRM both at the same time. Again, you will need to check to see whether SRM supports protecting FT VMs (I am thinking not - but even if not, just put the FT protected VMs on a datastore that is not part of an SRM protection group).
