SRM w/ EMC CX4 vs HP SAN/iQ Cluster and NetApp FAS 3140 cluster w/out SRM

jhiraldo · ‎02-04-2009

I'm trying to implement DR for my company and I'm undecided between 3 vendors that are offering their services to me.

I have to implement a DR solution and I have 3 sites to work with.

Site1 and site2 are connected via fiber and are 500 feet apart; site3 is out of state, and this is where I would like to back up all my VMs in case of a major city disaster (site3 is a later project down the road, but I will take suggestions).

I've been working with 3 companies and they suggested different designs and now they have made everybody in the team confused.

We would like to know if anyone here has implemented any of the vendors and what vendor they chose and why.

What made you decide on one over the other? Do you have SRM with failover, or HA and VMotion in a cluster environment? What is the best approach when creating a DR environment: high availability in a cluster environment, or a failover solution with SRM?

I hope I explained myself in a way where you can understand what we are trying to accomplish, please feel free to ask me any further questions.

We are looking for the following features:

Ease of use = I want to be able to use the NAS without having to be a GURU with storage.

Usable space = The EMC CX4 claims to provide 70% usable space, while NetApp and HP SAN/iQ are around 50%.

Practicality = What would be the best scenario for this type of environment (CX4 with SRM, NetApp stretch cluster, or SAN/iQ Cluster)?

Here is a picture of the EMC CX4 design. I don't have the other vendors' designs, but as soon as I get them I'll post them, thanks.

JeffDrury · ‎02-04-2009

In your attached EMC design I don't see how the 3 sites are implemented; it only looks like a 2 site topology. I could definitely see how you could do the three sites with HP SANiQ. The two sites that are connected via fiber could be a single SAN campus cluster and an SRM protected site, with the remote site serving as your DR/recovery site. At the campus site half of your equipment would be on each side, and because of ESX clustering and SANiQ replication you could lose an entire site and still have availability of data and, through HA, still maintain your VMs. If both of the campus sites failed you could use SRM to recover at the DR site.

If you look closely at VMware's SRM documentation you will find that SANiQ is used for most if not all of the example configurations. The reason for this is how tightly the SANiQ SRA is integrated with SRM. Additionally, SANiQ has a detailed failback plan to move from the recovery site back to the protected site. This is not an easy process with EMC and NetApp. As for the usable space, the reason that SANiQ is 50% is its redundancy. With SANiQ you can lose half of your storage hardware and still maintain availability of your data; that level of failure tolerance would likely not be available with the other solutions. Licensing is another issue to consider. Be sure to find out the cost of each licensing add-on with the EMC product. I believe SANiQ includes all of the licensing necessary for SRM with the base product cost. It's not fun to be in the middle of an EMC implementation and find out that you need to purchase a license to finish the project.

jhiraldo · ‎02-04-2009

Hello and thank you for your rapid response,

I only have a T1 connection to the 3rd site, which is about 150 miles away, and EMC wants to sell me an Avamar solution.

The reason why the 3rd site is not in the EMC diagram is because I cannot find a solution to replicate to that site within my budget.

Money is a factor here like always, and the Avamar is too much money for what I want to do. The solution you suggested is our top pick, but I don't know how to replicate it to my 3rd site without putting in 2 more T1s and some kind of expensive dedup technology, which would put me out of budget. Do you know if SRM will work with the bandwidth/distance constraints I'm facing? I would be fine with just backing up my VMs to tape, taking the tapes off site and hopefully restoring them at the 3rd site in a DR situation.

bladeraptor · ‎02-05-2009

Hi

I am writing as an EMC employee - so while I won't claim to be a customer deploying a particular technology - I will say that the team I belong to and myself have now been involved in a number of SRM deployments. I can only hope that other contributors to this post are as free with their credentials :]

SRM, as we are all aware, is a framework provided by VMware. It ensures that vendors provide solutions that work with the core VMware technology.

The SRM framework is evolving all the time, and new features such as failback, cascaded sites, and many-to-1 and 1-to-many configurations are on the roadmap. There are differences in how the vendors have chosen to adhere to the framework, but I would suggest that to claim that one vendor is more tightly integrated than another into a framework that provides for a generic feature set is probably a bit of an oversimplification. Recent statistics would seem to indicate that the number of downloads of the various SRAs for the different vendors reflects the overall market share of the vendors in question.

From an EMC perspective all of our Replication products support SRM.

However as will be the case with other vendors, the current SRM framework does not necessarily reflect the full capability of the underlying replication technology - yet!

Celerra Replicator is a good example of this. The native Celerra replication product supports 1024 configured replication sessions, 256 active replication sessions, and 4 file system or iSCSI replication sessions from a particular source object, so a 1 to 4 fan out ratio. The Celerra can natively replicate from source to destination, then use the destination as a source to replicate to a third Celerra: A-to-B, B-to-C. So if there was support for this feature within SRM currently - you could achieve your objectives as stated with Celerra Replicator.

EMC replication products embody a failback model and the process of failing back is designed to be simple. As you may be aware, EMC MirrorView technology was used to demonstrate SRM at VMworld in September of last year and I worked closely with my VMware colleagues to get that particular demo working well - so we had a lot of experience with failing the system over and back, and we could failover and failback in this particular environment in minutes. The challenge is not so much to document a failback procedure as to be able to implement it in the technology so it is quick and intuitive - EMC is working closely with VMware to make this a reality. I enclose the link to our guide for setting up SRM with Celerra using our Celerra simulators - so you can become familiar with the technology without needing to buy it first.

I notice that you have MirrorView and SnapView in the diagram - were you planning to replicate using MirrorView for fibre channel presented LUNs as well as Celerra Replicator for iSCSI LUNs? If you plan to use Celerra Replicator for all your replication, MirrorView is not required.

Clearly one of the key considerations when selecting a vendor is not only their ability to integrate with the current SRM standard but also their ability to embrace newer developments in the SRM framework as they come online. With the release of the next generation of VMware products and solutions, the use of frameworks (of which SRM is only the first) will be widespread. When selecting the vendor who will provide the platform for your ESX environment over the next 3-5 years, it is worth investigating whether they demonstrate a clear intention to integrate with key frameworks such as the vStorage and vCenter offerings.

In terms of your comments around a choice of approach to this project it would seem to me that perhaps you may need to start at a slightly higher level than what the technology can offer you and perhaps focus on what the benefits of any solution will be back to your business. With the three site model, what is it that the business is trying to achieve?

For example you could consider that the two local sites are seen as a single site from a geographic recovery perspective (i.e. the SRM consideration) - with the second local site providing backup and recovery of the first site - this would imply that you would use a different approach for these two local sites that didn't involve SRM but used any number of technologies to create regular backups of the production environment for rapid restore from the second local site to the primary.

For this, EMC offers products such as Replication Manager, which through its integration with the Virtual Center API can capture VMware-snapshot-consistent, array-based replicas of entire VMFS volumes. These can be mounted to another node in the cluster, or to any ESX host managed by the same Virtual Center instance, and recovered either by literally navigating down to the VMFS datastore and copying the files back to the destination (in the case of single VM corruption) or by backing the mounted VMFS volume off to tape, disk or virtual tape if required.

SRM could then be used to provide the geographic Disaster Recovery in the event of total local 'two' site failure. From the looks of things Avamar is being proposed for the remote leg not just to provide the benefits of deduplication - i.e. in a VMFS volume full of C:\ drives reducing the amount of data to be backed up by up to 90% - but also because Avamar replication includes compression and WAN optimisation and may be the best way of getting the data across what appear to be fairly limited links.

The combination of the deduplicated volume of data, the WAN optimisation and compression should keep the impact of transporting the data to the remote site to a minimum. This would clearly exclude the use of the SRM framework - but this is a reflection of the scalability of the available links. Without understanding the volume of data being moved over the links I could not comment on whether a technology like asynchronous Celerra Replicator could be used instead - but I suspect that with anything short of a 100 Mbps link, probably not.

The separation between deploying a disaster recovery solution, a local and remote backup and recovery approach, and maintaining uptime is important. The local site VMware features you mention, such as HA and VMotion, will provide you with local site uptime - a key value proposition for the business but perhaps separate from your ability to recover your environment across sites.

There are solutions that offer at least metropolitan capabilities to 'stretch ESX cluster' but these are characterized by a degree of cost and complexity. VMotion for example anticipates that both hosts are up and consequently loses some of its pertinence when deployed in disaster scenarios that anticipate the losses of one of those hosts.

I suspect that there are a number of different ways of architecting this - but having a clearer understanding of what the business objectives are in providing this business continuity would certainly help clarify which are the best options. Would you be happy to contact me privately to provide your company details? I will ensure that the local US VMware specialists get in touch with you to pursue this further. (I am based in the UK)

Kind regards

Alex Tanner

JeffDrury · ‎02-05-2009

SRM will work fine at any distance; the bigger concern is the ability of the storage to replicate the data in a timely manner. This can be a tricky calculation as replication technologies differ in the way they replicate data. I can speak to SANiQ replication as that is the technology that I have the most experience with.

Your replication will involve an initial copy of the base data, which should be done with the systems on the same LAN, and then delta changes at the block level thereafter. The first copy is usually huge, as that is an exact copy of all of your data. After the first copy SANiQ monitors block level changes and only replicates the changes made to the base image. So if you have a 1TB volume with 10GB of daily changes, you need a WAN link that can push that amount of data in the given replication period. I believe a T1 running at full bore for 24 hours can push ~12-14GB.
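As a rough check on that T1 figure, here is a minimal back-of-envelope sketch in Python. It assumes the nominal T1 line rate of 1.544 Mbit/s; the `efficiency` factor (protocol overhead, other traffic on the link) is a made-up parameter for illustration, and the function name is hypothetical, not anything from SANiQ.

```python
# Back-of-envelope estimate of how much data a WAN link can move in a window.
T1_BITS_PER_SEC = 1_544_000  # nominal T1 line rate (1.544 Mbit/s)


def link_capacity_gb(bits_per_sec, hours, efficiency=1.0):
    """Return decimal gigabytes (10**9 bytes) transferable in `hours`.

    `efficiency` is an assumed fraction of the line rate that is actually
    usable for replication payload (overhead, other traffic, etc.).
    """
    seconds = hours * 3600
    usable_bits = bits_per_sec * efficiency * seconds
    return usable_bits / 8 / 1e9


if __name__ == "__main__":
    ideal = link_capacity_gb(T1_BITS_PER_SEC, 24)
    derated = link_capacity_gb(T1_BITS_PER_SEC, 24, efficiency=0.8)
    print(f"T1, 24 h, full line rate: {ideal:.1f} GB")
    print(f"T1, 24 h, 80% usable:     {derated:.1f} GB")
```

At the full line rate this works out to about 16.7 GB per day, and about 13.3 GB if only 80% of the link is usable, which is in line with the 12-14GB estimate.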

SANiQ allows you to specify the frequency of the replication as well as the amount of bandwidth it will take on your WAN link. So you could set up a once daily replication that uses 100% of the WAN link, or an hourly replication that uses 50% of the WAN, or whatever combination of settings you would like to use based on your SLA.

Again, this replication technology is built in to the basic SANiQ license, so there is no need to worry about add-on licensing and support costs. There is also the option of using a hardware SAN at the primary site(s) and a VSA at the remote sites to cut hardware costs. That however is dependent on the level of service you need to supply at the DR site. If you would like to talk this through you can send me a private message and I will supply my contact info.
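To see whether a given frequency and bandwidth setting can keep up with the daily change rate, a sketch like the following may help. It is plain arithmetic, not a SANiQ tool: the function names are hypothetical, the T1 rate of 1.544 Mbit/s is assumed, changes are assumed to accumulate evenly over the day, and compression or repeated overwrites of the same blocks (which would shrink the delta) are ignored.

```python
T1_BITS_PER_SEC = 1_544_000  # nominal T1 line rate (assumed link)


def transfer_hours(delta_gb, bandwidth_fraction, bits_per_sec=T1_BITS_PER_SEC):
    """Hours needed to send `delta_gb` decimal gigabytes over the link when
    replication may use `bandwidth_fraction` (0 < f <= 1) of the line rate."""
    if not 0 < bandwidth_fraction <= 1:
        raise ValueError("bandwidth_fraction must be in (0, 1]")
    bits = delta_gb * 1e9 * 8
    return bits / (bits_per_sec * bandwidth_fraction) / 3600


def schedule_keeps_up(daily_change_gb, interval_hours, bandwidth_fraction):
    """True if each interval's changes can be sent before the next run starts.

    Assumes changes accumulate evenly over the day (a simplification).
    """
    delta_gb = daily_change_gb * interval_hours / 24
    return transfer_hours(delta_gb, bandwidth_fraction) <= interval_hours


if __name__ == "__main__":
    # 10 GB of daily change, replicated once a day at 100% of the T1
    print(schedule_keeps_up(10, 24, 1.0))
    # Same change rate, replicated hourly at 50% of the T1
    print(schedule_keeps_up(10, 1, 0.5))
```

Under these assumptions the once daily run at 100% finishes in roughly 14.4 hours, while the hourly run at 50% falls behind, because half a T1 can only move about 8.3 GB per day.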

Thanks,

Jeff
