VMware Cloud Community
germoles
Contributor

SRM 4.0 on HDS9990 and VM consistency

Hi all,

this post is an attempt to better understand the inner workings of SRM 4.0 with respect to VM consistency (which is our main goal).

Our farm is going to be upgraded from ESX 3.5 Upd 3 to vSphere, and we do need to implement a DR solution which will be entirely based on an HDS9990 storage solution.

What is not clear to us is whether SRM 4.0 is **THE** best product we can integrate into our environment to guarantee VM consistency.

From the SRM 4.0 documentation, it seems that an SRM 4.0-based DR solution should suit our needs.

Our questions are: does SRM always guarantee VM consistency?

Does it rely only on hardware-based storage replication mechanisms? What is the role of the storage vendor's Replication Adapters?

Wouldn't it be better to rely on a snapshot-based solution (i.e. one built on VMware Consolidated Backup technology) instead?

Thank you very much in advance for your help.

Best regards,

Salvatore


Accepted Solutions
Smoggy
VMware Employee

SRM is an integral part of a disaster recovery solution, which sounds like the project you are working on.

In an HDS / SRM solution, the data replication element is handled by TrueCopy at the array level. The consistency of the data is determined by the replication schedules you apply to the volumes on the array (sync / async), and all of this works at the block level.

A lot of customers ask about VM consistency when looking at DR / SRM solutions, but you need to think about the scenarios in which you would use SRM/HDS. If you have a sudden disastrous event, you will not have time to make things consistent in nearly any scenario you can think of, as such events are sudden and usually catastrophic. So if SRM took quiesce points, this would not give you a realistic view of how things would happen in real life.

One of SRM's strengths is that it allows you to perform non-disruptive DR tests without affecting your production environment. SRM does this by using the storage replication adapters (SRAs) to communicate with the array at the recovery site and present a snapshot of the storage (containing the protected VMs) at the array level, using array functionality. These array snapshots are then presented by SRM to the ESX hosts at the recovery site, and the VMs are recovered and connected to the defined "test" network(s).

If we allowed SRM to integrate with something like VSS and ensure that all data at the recovery site was filesystem consistent rather than crash consistent, this would give you false confidence that your environment could be recovered after a sudden DR event. The reason is simple: if a plane or meteor :) suddenly lands on your datacenter at 3:30 AM on a Sunday, how would you know? How would you have time to quiesce everything?

For this reason SRM does NOT do this and simply uses the latest copy of the data on the storage array for testing (and also for real failovers). This is massively useful, because you can now prove to the business that, in the event of a sudden unexpected outage, you can recover the environment from cold and that all your OSes and applications can crash recover themselves to the last known state. RDBMSs are one good example. In a previous life I was involved with a well-known RDBMS for a long time. One feature it had (like most of them) was the ability to do roll-forward logging in the event of a system crash. Although you could always tell the business that this kind of protection was in place via the application transaction logs, it was very difficult for the company to test your claims, as that meant pulling the plug on the server. Now with SRM you can test this scenario VERY easily by simply running a recovery plan in test mode, bringing the RDBMS up at the recovery site and then proving that the database recovers as its configuration says it should.
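To make the roll-forward idea concrete, here is a minimal sketch of what a database does when it starts from a crash-consistent copy. It is a toy model under stated assumptions, not any vendor's actual recovery code: the record layout and the names `crash_recover`, `data_pages` and `redo_log` are made up for illustration.

```python
# Illustrative only: a toy redo log, not a real RDBMS recovery routine.

def crash_recover(data_pages, redo_log):
    """Roll forward: re-apply the writes of committed transactions only."""
    committed = {rec["txn"] for rec in redo_log if rec["op"] == "commit"}
    recovered = dict(data_pages)  # work on a copy of the on-disk pages
    for rec in redo_log:
        if rec["op"] == "write" and rec["txn"] in committed:
            recovered[rec["key"]] = rec["value"]
    return recovered


# State captured by the array at the moment of the "crash":
# the log reached disk, but the changed data pages were never flushed.
data_pages = {"balance_a": 100, "balance_b": 50}
redo_log = [
    {"txn": 1, "op": "write", "key": "balance_a", "value": 70},
    {"txn": 1, "op": "write", "key": "balance_b", "value": 80},
    {"txn": 1, "op": "commit"},
    {"txn": 2, "op": "write", "key": "balance_a", "value": 0},  # never committed
]

recovered = crash_recover(data_pages, redo_log)
print(recovered)  # {'balance_a': 70, 'balance_b': 80}
```

Transaction 1 is rolled forward from the log and the uncommitted transaction 2 is ignored, which is the behaviour an SRM test run lets you observe on the real database without pulling the plug in production.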

Don't forget that consistency across storage devices during replication updates can be logically enforced at the array level, but the blocks of data used during any recovery, with any array, will always be the latest copies. This means that if you have multi-tiered applications, or applications that use many disks, these are normally grouped together on the array to ensure consistency during replication updates. During recovery the OSes and applications crash recover, and indeed most if not all of the modern OSes we know and love are more than capable of doing this.
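Why grouping matters can be shown with a small simulation. This is only a sketch under simplifying assumptions: two LUNs ("log" and "data"), an application that always writes the log record before the matching data page, and hypothetical helper names that do not correspond to any array API.

```python
# Toy model: each write is (sequence_number, lun). The application always
# writes the log record before the matching data page.
writes = [(1, "log"), (2, "data"), (3, "log"), (4, "data"), (5, "log"), (6, "data")]


def replicate_grouped(writes, cut):
    """Consistency group: both LUNs stop at the same point in the write order."""
    return [w for w in writes if w[0] <= cut]


def replicate_independent(writes, cut_per_lun):
    """No group: each LUN is cut at its own point in the write order."""
    return [w for w in writes if w[0] <= cut_per_lun[w[1]]]


def is_write_order_consistent(replica, writes):
    """True if the replica is a prefix of the original write sequence."""
    return replica == writes[:len(replica)]


grouped = replicate_grouped(writes, cut=3)
independent = replicate_independent(writes, {"log": 1, "data": 4})

print(is_write_order_consistent(grouped, writes))      # True
print(is_write_order_consistent(independent, writes))  # False
```

In the ungrouped case the recovery site holds data write 4 without log write 3, a state that never existed at the protected site, so the application's crash recovery may not cope with it. With the LUNs grouped, the recovery site always sees a state the application really passed through.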

Your point about VCB is relevant: even with a solution like SRM in place with HDS TrueCopy, as in your example, there is still a place for VCB. Just because you have DR in place doesn't mean you don't need backup. Backups give you versioning and should go hand in hand with any DR solution. Most customers replicate their backup vaults to their DR sites as well, so that they not only have the latest data (provided by the storage replication) but also have the backups available should they ever need to do version recovery. Remember, VCB is a backup solution, NOT DR. In a DR situation SRM will recover your environment by following a pre-programmed recovery plan that you have created to ensure everything comes up in the right order, and it requires little operator input to do this (a mouse click). If you relied on backup images for DR, you would be there a long time restoring, activating and sequencing, and in most cases you would get nowhere near the RTO/RPO targets you might have set for yourself.

Hope this helps,

Lee Dilworth

germoles
Contributor

Hi Lee,

please excuse the delay in my feedback.

I truly thank you so much for your answer, which I think couldn't have been more complete and useful.

Here's what we'll be doing in the next few weeks:

- upgrade of our current ESX 3.5 farm to vSphere 4.0

- install of the remote site's farm

- configuration of the local and remote HDS9990 SANs

- pairing of the local datastores with the remote ones (we'll be using HDS's Universal Replicator feature to replicate the datastores' content to the remote site; by temporarily suspending the pairing we'll also be able to produce a safe 'golden copy' of our replicated datastores)

- SRM server install on both local and remote sites

- creation of all the SRM recovery plans and test

Thank you again for your help!

Best regards,

Salvatore

Smoggy
VMware Employee

Glad to have helped.

You might want to start by reviewing this document:

Most issues with getting the HDS SRA working with SRM are caused by an incorrect HORCM setup, so if you're not familiar with this then it's worth looking at the doc and working with your storage team.
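As a rough illustration of the kind of mistake to look for, the sketch below checks a HORCM configuration file for its main sections and for device groups that have no matching remote-instance entry. It is a hedged example, not taken from the HDS document: the sample file content (group name, host names, service names, port, command device) is entirely hypothetical, the helper names are made up, and the Hitachi CCI documentation remains the authority on the real file format.

```python
# Hypothetical sample: every name, address, and port below is a placeholder.
SAMPLE_HORCM = """\
HORCM_MON
#ip_address   service   poll(10ms)   timeout(10ms)
localhost     horcm0    1000         3000

HORCM_CMD
#dev_name
CMD_DEVICE_PLACEHOLDER

HORCM_DEV
#dev_group   dev_name   port#   TargetID   LU#
SRM_GRP      vmfs_01    CL1-A   0          1
SRM_GRP      vmfs_02    CL1-A   0          2

HORCM_INST
#dev_group   ip_address    service
SRM_GRP      remote-host   horcm1
"""


def parse_horcm(text):
    """Split the file into sections, skipping blank lines and # comments."""
    sections, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("HORCM_"):
            current = line
            sections[current] = []
        elif current is not None:
            sections[current].append(line.split())
    return sections


def check_horcm(text):
    """Return a list of human-readable problems (empty if none were found)."""
    sections = parse_horcm(text)
    problems = [f"missing section {name}"
                for name in ("HORCM_MON", "HORCM_CMD", "HORCM_INST")
                if name not in sections]
    # Device pairs may be declared in HORCM_DEV or HORCM_LDEV.
    dev_rows = sections.get("HORCM_DEV", []) + sections.get("HORCM_LDEV", [])
    if not dev_rows:
        problems.append("no devices in HORCM_DEV or HORCM_LDEV")
    dev_groups = {row[0] for row in dev_rows}
    inst_groups = {row[0] for row in sections.get("HORCM_INST", [])}
    for group in sorted(dev_groups - inst_groups):
        problems.append(f"group {group} has no HORCM_INST entry")
    return problems


print(check_horcm(SAMPLE_HORCM))  # []
```

A check like this only catches structural slips such as a group name spelled differently in two sections. It says nothing about whether the LUN, port, or service values actually match the arrays, which is where the storage team comes in.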

hope that helps,

Lee

germoles
Contributor

Wow, Lee... you're very kind, and it looks like you're a true storage guru!

Our HDS9990s are directly managed by Hitachi support specialists, but we'll surely tell them to pay extra attention to the HORCM setup when configuring their SRA.

Thank you again and best regards,

Salvatore

mjamal
Contributor

Hi germoles, I'm wondering if you managed to put the HDS horcm files in place for both SRM servers?

I have a similar setup (USP-V arrays) and need an example of the horcm files, as the PDF from HDS had some inconsistencies with names/IDs and LUNs. Any chance you could share the horcm config files for both sites? Also, does the format change if you are in active-active site mode, with source and target LUNs on both sites?

many thanks in advance..

Mo
