Solved: Re: SRM 4.0 on HDS9990 and VMs consistency

germoles · 10-26-2009

Hi all,

this post is an attempt to better understand the inner workings of SRM 4.0 with respect to VM consistency (which is our main goal).

Our farm is going to be upgraded from ESX 3.5 Upd 3 to vSphere, and we do need to implement a DR solution which will be entirely based on an HDS9990 storage solution.

What is not clear to us is whether SRM 4.0 is **THE** best product we can integrate into our environment in order to guarantee VM consistency.

From the SRM 4.0 documentation it seems that an SRM 4.0-based DR solution should suit our needs.

Our questions are: does SRM always guarantee VM consistency?

Does it rely only on HW-based storage replication mechanisms? What is the role of the storage vendor's Replication Adapters?

Wouldn't it be better to rely on a snapshot-based solution (i.e. one built on VMware Consolidated Backup technology) instead?

Thank you very much in advance for your help.

Best regards,

Salvatore

Smoggy · 10-26-2009

SRM is an integral part of a disaster recovery solution, which sounds like the project you are working on.

In an HDS / SRM solution the data replication element is handled by TrueCopy at the array level. The consistency of the data is determined by the replication mode you apply to the volumes on the array (sync / async), and all of this works at the block level.
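
To make the sync / async difference concrete, here is a minimal sketch in Python. It is not TrueCopy's API or its actual transfer mechanism: it assumes a deliberately simplified model in which asynchronous replication ships changes in fixed cycles, and the function name, the `interval` parameter and the timestamps are all made up for illustration.

```python
def replicated_writes(write_times, disaster_time, mode, interval=None):
    """Return the write timestamps that exist at the recovery site.

    Toy model (an assumption, not how any real array works internally):
    - "sync": every write acknowledged before the disaster is at both sites.
    - "async": changes are shipped every `interval` time units, so only
      writes made up to the last completed cycle reached the recovery site.
    """
    acknowledged = [t for t in write_times if t <= disaster_time]
    if mode == "sync":
        return acknowledged
    if mode == "async":
        if not interval or interval <= 0:
            raise ValueError("async mode needs a positive interval")
        last_cycle = (disaster_time // interval) * interval
        return [t for t in acknowledged if t <= last_cycle]
    raise ValueError(f"unknown mode: {mode!r}")


writes = [1, 4, 9, 12, 14]   # hypothetical write timestamps
at_dr_sync = replicated_writes(writes, disaster_time=15, mode="sync")
at_dr_async = replicated_writes(writes, disaster_time=15, mode="async", interval=10)
lost = sorted(set(at_dr_sync) - set(at_dr_async))
print("lost with async:", lost)
```

In this toy run the writes made after the last completed cycle are the ones missing at the recovery site, which is the data-loss window your RPO has to tolerate.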

A lot of customers ask about VM consistency when looking at DR / SRM solutions, but you need to think about the scenarios in which you would use SRM/HDS. If you have a sudden disastrous event, you will not have time to make things consistent in nearly any scenario you can think of, because such events are sudden and usually catastrophic. So if SRM took quiesce points, this would not give you a realistic view of how things would happen in real life.

One of SRM's strengths is that it allows you to perform non-disruptive DR tests without affecting your production environment. SRM does this by using the storage replication adapters (SRAs) to communicate with the array at the recovery site and create a snapshot of the storage (containing the protected VMs) at the array level using array functionality. These array snapshots are then presented by SRM to the ESX hosts at the recovery site, and the VMs are recovered and connected to the defined "test" network(s).

If we allowed SRM to integrate with something like VSS and ensure that all data was filesystem-consistent rather than crash-consistent at the recovery site, this would give you false confidence that your environment could be recovered after a sudden DR event. The reason is simple: if a plane or meteor suddenly lands on your datacenter at 3:30 AM on a Sunday, how would you know? How would you have time to quiesce everything?

For this reason SRM does NOT do this and simply uses the latest copy of the data on the storage array for testing (and also for real failovers). This is massively useful in that you can now prove to the business that, in the event of a sudden unexpected outage, you can recover the environment from cold and that all your OSes and applications can crash recover themselves to the last known state.

RDBMS systems are one good example. In a previous life I was involved with a well-known RDBMS for a long time. One feature it had (like most of them) was the ability to do roll-forward logging in the event of a system crash. Although you could always tell the business that this kind of protection was in place via the application transaction logs, it was very difficult for the company to test your claims, as that meant pulling the plug from the server. Now with SRM you can test this scenario VERY easily by simply running a recovery plan in test mode, bringing the RDBMS up at the recovery site and then proving the database recovers as its configuration says it should.
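
As a rough illustration of what crash recovery from a transaction log means, here is a small Python sketch. It is a toy model, not the mechanism of any particular RDBMS: the log record format and the `crash_recover` function are hypothetical. The idea is that after a crash the database replays its log and keeps only the changes belonging to committed transactions.

```python
def crash_recover(log):
    """Rebuild state from a transaction log after a crash (toy model).

    Each record is either ("set", txid, key, value) or ("commit", txid).
    Changes from transactions with no commit record are discarded,
    because the crash hit before they finished.
    """
    committed = {txid for op, txid, *_ in log if op == "commit"}
    state = {}
    for op, txid, *args in log:
        if op == "set" and txid in committed:
            key, value = args
            state[key] = value
    return state


# Hypothetical log as found on the replicated disk at the recovery site:
# transaction 1 committed, transaction 2 was still in flight at the crash.
log_on_disk = [
    ("set", 1, "balance", 100),
    ("commit", 1),
    ("set", 2, "balance", 250),
]
recovered = crash_recover(log_on_disk)
print(recovered)
```

A crash-consistent copy is exactly what this kind of recovery is designed to handle, which is why an SRM test run is a fair check of it.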

Don't forget that consistency across storage devices during replication updates can be logically enforced at the array level, but the blocks of data used during any recovery with any array will always be the latest copies. This means that if you have multi-tiered applications or applications that use many disks, these are normally grouped together on the array to ensure consistency during replication updates. During recovery the OSes and applications crash recover, and indeed most if not all of the modern OSes we know and love are more than capable of doing this.
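
The grouping point can be sketched in a few lines of Python. This is only a conceptual model under my own assumptions (the `Write` type, the volume names and the cut points are invented, and no array works literally like this): when volumes are replicated as one group, the recovery site sees a single cut through the global write order; when they are replicated independently, each volume can be cut at a different point and a later write can arrive without an earlier one it depends on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    seq: int       # global order in which the application issued the write
    volume: str    # hypothetical volume name
    payload: str


def replica_with_group(writes, cut_seq):
    """All volumes in one group: one cut point for every volume."""
    return [w for w in writes if w.seq <= cut_seq]


def replica_without_group(writes, cut_per_volume):
    """Volumes replicated independently: each has its own cut point."""
    return [w for w in writes if w.seq <= cut_per_volume[w.volume]]


def is_write_order_consistent(writes, replica):
    """True if the replica holds every write up to the newest one it has."""
    present = {w.seq for w in replica}
    if not present:
        return True
    newest = max(present)
    return all(w.seq in present for w in writes if w.seq <= newest)


writes = [
    Write(1, "log", "begin tx"),
    Write(2, "data", "update row"),
    Write(3, "log", "commit tx"),
]
grouped = replica_with_group(writes, cut_seq=2)
ungrouped = replica_without_group(writes, {"log": 0, "data": 2})
print(is_write_order_consistent(writes, grouped))
print(is_write_order_consistent(writes, ungrouped))
```

In the ungrouped case the data volume contains the row update but the log volume does not yet contain the record that precedes it, which is the situation grouping on the array is meant to prevent.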

Your point about VCB is relevant: even with a solution like SRM in place with HDS TrueCopy, as in your example, there is still a place for VCB. Just because you have DR in place doesn't mean you don't need backup. Backups give you versioning and should go hand in hand with any DR solution. Most customers replicate their backup vaults to their DR sites as well, so that they not only have the latest data (provided by the storage replication) but also have the backups available should they ever need to do version recovery.

Remember VCB is a backup solution, NOT DR. In a DR situation SRM will recover your environment following a pre-programmed recovery plan you have created to ensure everything comes up in the right order, and it requires little operator input to do this (a mouse click). If you relied on backup images for DR you would be there a long time restoring, activating and sequencing, and in most cases you would get nowhere near the RTO/RPO targets you might have set for yourself.
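
To show what "comes up in the right order" amounts to, here is a tiny Python sketch. It is not SRM's API or its plan format: the priority labels, VM names and `boot_order` function are hypothetical and only illustrate the idea of a plan that sequences VMs by priority.

```python
# Hypothetical priority labels; lower number means "power on earlier".
PRIORITY_ORDER = {"high": 0, "normal": 1, "low": 2}


def boot_order(vms):
    """Return VM names in power-on order.

    `vms` is a list of (name, priority) pairs. Python's sort is stable,
    so VMs with the same priority keep the order they were listed in.
    """
    ordered = sorted(vms, key=lambda vm: PRIORITY_ORDER[vm[1]])
    return [name for name, _priority in ordered]


plan = [
    ("web01", "low"),
    ("db01", "high"),
    ("app01", "normal"),
    ("db02", "high"),
]
print(boot_order(plan))
```

Here the databases start first, then the application tier, then the web tier, without anyone having to remember the sequence at 3:30 AM.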

Hope this helps,

Lee Dilworth
