Our non-SRM DR solution is simple. If the main office goes away, present the latest data from RecoverPoint to ESX hosts at the DR site, bring everything up, and plug into our WAN to present everything to remote offices and clients working from home. Our DR test is also simple. Cut the network connection between the main office and the DR site to simulate the office going down. Present a snapshot copy of the data on RecoverPoint to ESX hosts at the DR site. Bring everything up and DON'T connect it to the WAN. No crazy re-IPing, no changes to networks at DR. Very simple.
Along comes SRM. Our recovery plan in SRM is just as simple. Cut the network connection, fail stuff over.
We set up a recovery plan with a couple of test VMs so we could kick the tires on this SRM thing. We hit the "Test Plan" button and it did everything flawlessly. The cleanup process is great. Everything worked as it should until ... we tried it with the network connection to the DR site cut. Then we got "Failed to create snapshots of replica devices" when SRM tried to take a snapshot on RecoverPoint.
My first call to VMware had me in a conference call with VMware and EMC. It was discovered that there was a bug in the RecoverPoint software that could cause this. We needed to upgrade to the latest release. We did that and no dice. Still the same error.
The next tech I spoke to at VMware told me that this is the way it's supposed to work. To do a test with the link down, you have to select "Run Recovery" and then do a "Planned Migration".
If this is the way you have to do a test, why even have a test button?
For a planned migration it tells you right on the screen that "the process will permanently alter virtual machines and infrastructure of both the protected and recovery datacenters". I don't want to alter ANYTHING for a test. VMware says that after the planned migration and testing, we should run a re-protect. This will reverse the replication and copy the data from DR back to the main office. How is this ever a good thing for a test? When we test, we have application owners beat their systems to death with bogus transactions and run test scripts. We DON'T want that stuff replicated back to our production site EVER.
VMware went down the line of "You can use VRF at the DR site to create a bubble to recover everything into". Well, OK, how do I add the physical machines we have to recover into the bubble? "Well, the network team can make changes on the switch ports". What about VPN? "They would have to make changes there too. And also, you would have to create jump box VMs for people to access the applications in the bubble". But in a real DR situation, you wouldn't do any of that. For a real DR, we just let it rip. For a test, we create a bubble and make all kinds of network changes.
So, with my manual process, our test is exactly the same as our real DR except for one step. With SRM, our test looks NOTHING like a real DR. How is that a good thing?
Here's what I think is happening with the error message. SRM is connecting to the RecoverPoint appliance at the protected site to do the snapshot. When the link is down, it can't get there. If it just pointed to the appliance at the recovery site to do the snapshot, it would work.
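To put that guess in concrete terms, here is a toy Python model of it. This is only a sketch of my theory, not SRM or RecoverPoint code: the `Appliance` class, the `run_test` function, and the `use_recovery_side` switch are all made up for illustration, and I'm assuming the snapshot request simply fails when the appliance it targets is unreachable.

```python
from dataclasses import dataclass


@dataclass
class Appliance:
    """Hypothetical stand-in for a RecoverPoint appliance as seen from the DR site."""
    site: str
    reachable: bool


def create_test_snapshot(target: Appliance) -> str:
    # Assumption: an unreachable appliance produces the error we see in SRM.
    if not target.reachable:
        raise ConnectionError("Failed to create snapshots of replica devices")
    return f"snapshot created via {target.site} appliance"


def run_test(protected: Appliance, recovery: Appliance, use_recovery_side: bool) -> str:
    # Pick which appliance receives the snapshot request.
    target = recovery if use_recovery_side else protected
    try:
        return create_test_snapshot(target)
    except ConnectionError as exc:
        return f"error: {exc}"


# Link to the main office is cut, so only the DR-side appliance is reachable.
protected = Appliance(site="protected", reachable=False)
recovery = Appliance(site="recovery", reachable=True)

print(run_test(protected, recovery, use_recovery_side=False))
print(run_test(protected, recovery, use_recovery_side=True))
```

If the theory is right, the first call is what we are hitting today and the second is what I would expect a test to do when the link is down.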
So here are my questions:
Am I the only one who tests the DR plan by just cutting the link to the main data center and bringing up VMs at the DR site?
I see a lot of people posting here that they get the same error. Do these people have the link to their DR site up or down?