A Little History

 

In a previous article (http://communities.vmware.com/blogs/ManualAutomation/2008/05/15/the-big-plan-business-continuity) I discussed why I was looking closely at SRM and what I needed to get done before I could implement the product. Now that I've successfully tested the product, I'd like to give an update.

 

The Celerra Code Upgrade

The code on both of my Celerras was upgraded to 5.6 in mid-July. It wasn't pretty - no fun being in the data center until 3:00am. To EMC's credit, their CE hung in there with me, got the problems escalated and ultimately we got the VMware data stores working again. We were bitten by the LUN resignaturing "bug". EMC knows the code upgrade causes this, but for some reason we were surprised and found out the hard way at about 12:00am.

 

It took another month to recover other services such as CIFS and iSCSI replication. When I was young, my father insisted that when I handled someone else's property, I should always return it in the same or better condition than when I first received it. My main problem with EMC in this respect is that they left me with a system that didn't work like it did before they upgraded it. I'm past the CIFS and iSCSI replication problems now, but I'm still experiencing problems with CAVA that didn't exist before. Luckily, I don't think it's anything too difficult or serious, and I will be calling EMC support to get this last problem resolved.

 

While I've given feedback on this event to EMC support, note that I am still a fan of their unified storage product. It's not right for all companies or all situations, but it is for my environment. Also, to be fair, many Celerra customers may never need to experience a code upgrade event. The only reason to upgrade is if you need some feature or improved capability that the upgrade provides. I've had an EMC CE tell me that they have retired EMC hardware that was still running its originally installed code, making that code over three years old! This says volumes about the code's stability and reliability. In my case, I needed the expanded functionality of iSCSI LUN replication and compatibility with VMware Site Recovery Manager.

 

The Evaluation

Anyone semi-familiar with installing VMware products will have no problem getting SRM installed. Note that you'll need to obtain the Storage Replication Adapter (SRA) from your storage OEM and install it in the proper sequence per the documentation. In my case I used documentation from EMC and VMware to install and configure the product. See the "Additional Resources" section at the end of this article.

 

One of the things that's awesome about VMware is the amount of attention they've given me regardless of whether I was working for a large $3 billion enterprise or a mid-sized $500 million company. In this case, my sales rep offered to have a local VMware systems engineer (we'll call him "Dave") come out on-site and work with me to complete a proof-of-concept.

 

I already had SRM and the SRA components installed, but I wanted a technical resource on hand in case I needed one while performing that first test. Well, I needed it and got it. Keep in mind I hadn't purchased the product yet(!). Dave was able to help me work through a couple of issues we ran into during that first session, such as file system sizing and licensing. It only took 2-3 hours, but when we finished, I had 4 VMs running in my remote data center 325 miles away! (Thanks to Dave and Ken!)

 

Another tip I learned during this session: review the SRA log. In the case of the Celerra's SRA, it documents every command it executes and the results. It's a great way to learn what SRM is really doing behind the scenes with your storage in order to get the LUN(s) set up and ready to be used as a data store by ESX.

 

Subsequent Test Results

I have more testing to do but can report that I'm starting 4 VMs from a single replicated LUN in 8 minutes. And I don't mean just the time to power on: I mean everything from pressing the "big red (test) button", through powering up the VMs and starting the Windows services, to recovery plan completion. Try that using physical servers! Sorry, but even restoring servers from a B2D solution that's replicated to your DR site won't be as fast.

 

I demonstrated SRM for the DR team and initially got a "that's all?" kind of reaction. I quickly realized that SRM, in combination with array-based replication, worked too well! Meaning, it did such a good job of hiding the complexity and number of steps required to get from A to Z that my non-technical DR teammates didn't understand what SRM was really bringing to the table. If there's only one thing you take away from this article, make sure it's this: explain in simple terms the steps SRM executes in the background before you run a demonstration.

 

Talking about the virtues of SRM is one thing (the recovery run book, the steps it automates, the testing capabilities (which are awesome, by the way), etc.); demonstrating these product features for your DR team is another. If your experience is like mine, you'll find it dramatically influences the discussions on the project plan. In my case, we will be significantly changing the testing phases - actually streamlining them, thanks to SRM.

 

I wouldn't declare SRM to be a perfect specimen of engineering excellence; I reserve that title for Windows ME (yes, that's a joke). There are a few things that could be improved. I would like finer-grained control over when my VMs are powered on - I'd like to be able to specify dependencies between VMs. It seems like VMware is bent on specifying everything as "High", "Medium" and "Low". What if I want six groupings instead of just three? There are also a number of folks complaining about the lack of fail-back. Yes, there's no "big red button" to press to perform a fail-back, but most storage OEMs, including EMC, are providing documentation describing how to get this done. Finally, I'd like VMware to consider non-array-based replication capabilities. I don't think you'll replicate 20 VMs this way, but it sure would be nice for those one or two one-offs for which you don't want to replicate an entire LUN. I can also imagine customers with smaller implementations or those with non-supported back-end storage using this feature.
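
To show what I mean by dependencies between VMs, here's a minimal Python sketch. To be clear, this is not something SRM does and it doesn't use any SRM or VMware API; the VM names and the dependency map are made up for illustration. It simply groups VMs into power-on "waves" so that each VM starts only after everything it depends on.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical VM names and dependencies (assumptions for illustration only).
# Each key lists the VMs that must be powered on before it.
depends_on = {
    "dc01": [],
    "sql01": ["dc01"],
    "app01": ["sql01"],
    "web01": ["app01"],
    "file01": ["dc01"],
}

def power_on_waves(deps):
    """Group VMs into waves; each VM depends only on VMs in earlier waves."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # VMs whose dependencies are all met
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as started
    return waves

waves = power_on_waves(depends_on)
for number, wave in enumerate(waves, start=1):
    print(f"Wave {number}: {', '.join(wave)}")
```

Even this tiny example produces four waves, one more than "High", "Medium" and "Low" can express, and the number of groupings falls out of the dependencies instead of being fixed up front.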

 

Because the POC exercise was a success, it was easy to convince management to purchase the product. I think purchasing Site Recovery Manager is the best endorsement I can give it and VMware. Now I can't wait to see what the next version brings!

 

Additional Resources

SRM Product Site: http://www.vmware.com/products/srm

SRM Product Documentation: http://www.vmware.com/support/pubs/srm_pubs.html (The Getting Started PDF is particularly useful; also pay attention to the compatibility matrix.)

SRM VMTN Forum: http://communities.vmware.com/community/vmtn/mgmt/srm

SRM Book: http://www.rtfm-ed.co.uk/?p=584 (Mike's blog is also a good one to watch.)

Storage OEM Docs: The EMC documentation can be obtained by registering on their Powerlink (http://powerlink.emc.com/) site and searching for "Site Recovery Manager". For other OEMs, contact your sales representative, search their web site or call support.