Goatie
Enthusiast

My experience using VMware Data Recovery (VDR) in a disaster recovery test

Hi all,

Thought I'd share my experiences using VDR in a disaster recovery test my company performed a few weeks ago.

Bit of background:

Our DR plan is to recover about 80 servers for our first-stage DR set, mostly VMs but with a handful of physical servers backed up using Symantec Backup Exec System Recovery (BESR) for image backups. All are recovered into a separate site running vSphere 4.0.

Due to various incompatibilities and other issues we recover using the following methods:

  • VDR

  • Stand-alone converter from BESR images

  • Built-in converter in vCenter client for older versions of BESR

  • BESR CD booted into a blank VM, for images that neither version of VMware Converter can handle (usually slight vagaries in the config files that stop Converter from detecting the OS or disks).

I'll focus on the VDR restores here; some issues are down to planning or architecture, and some are issues with the product itself and its resiliency to faults.

One definite positive: the backups and the dedupe are great.

Easy to configure, BUT you do need to check them daily, as sometimes they'll wig out and just stop working for no apparent reason.

Due to various architectural reasons (which are being rectified) our layout is as follows:

Two separate data centres linked by 10Gb dedicated fibre

The VDR vApps run in the primary data centre (PDC), while the physical backup servers in the secondary data centre (SDC) host CIFS/Windows shares that the VDR vApps connect to and use as backup destinations (we don't yet have any production-network-attached vSphere hosts in the SDC). This setup caused some network contention issues.

Ways to make VDR backups fail:

  • If for some reason you back up the same server from two separate jobs or VDR appliances, you'll consistently get snapshot errors.

  • The guide says a max of 500GB per destination datastore; I'd go for 200 to 250GB MAX. If something goes wrong (and unfortunately it does happen often) and VDR needs to do a full index or integrity check because a backup reported a fault, doing so on a 500GB datastore TAKES DAYS! During this time NO BACKUPS RUN! Keep them small and have lots of vApps rather than larger datastores and fewer vApps. I went from two VDR vApps with two 500GB LUNs each to six VDR vApps running 250GB LUNs. Backups fail less, and an integrity check rarely means I miss a night's backup.

  • Keeping the datastores small also means that if, during a DR, a datastore reports as corrupt and has to rebuild its index, you'll only be out of action for a short time rather than three days.

Issues we had during our DR test:

  • You can't restore a VDR image whilst the integrity check is running. Had to cancel it and hope it was all good (which it was).

  • Lock the VDR vApps so DRS doesn't move them around during the DR restores, and point all the VMs from that vApp to restore onto the same host the vApp is running on, although stagger the destination datastores if possible to avoid VMDK fragmentation and to get better FC LUN path utilisation. This results in a far faster restore for all VMs, as you're not sucking data in and out of the host's NICs; all traffic is memory copies. (A rough sketch of scripting the DRS part follows this list.)

  • Due to not really thinking about restores (a common failing I've found) and only thinking about backups, we ended up putting a lot of our early first-stage restores on the one backup server. Our main file servers (3TB of group/home/profile data) along with our VDR backups of Citrix and other key first-line servers were all stored on the one physical backup server. So when we kicked off all those restores at once (well, nearly at once), the VDR restores started failing with read or write faults. Solution: spread the load and increase the number of outbound NICs on the backup servers.

  • Restoring too many VMs at once caused many issues, including random restore failures as well as surprisingly varied restore times. For example, I kicked off a simultaneous restore of nine Citrix XenApp servers, all backed up on the same VDR vApp and the same destination datastore (to obtain the best dedupe performance): some restored in 40 minutes (20GB each), others took SIX hours, and a few just failed. Retrying those was successful.

  • The error messages are WAY too vague and quite unhelpful when you're trying to resolve restore failures in the heat of the moment; 'read error' or 'write error' is the extent of it. Not too bad when we're doing a DR test, but if it was for real I wouldn't be happy. And it really doesn't give VMware tech support much to go on either! Help yourselves out here, guys!

  • There's no view of the destination datastore sizes and available space. The restore wizard is slow and clunky, so when you're flying blind associating VMs with datastores it doesn't feel too good. It means a lot of fluffing around gathering VM sizes, checking and noting down where you'll be restoring each VMDK to, and then running the wizard. All that would be cut out if you just had a view of the datastore free space when selecting the server/datastore destinations in the wizard.
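
For the DRS point above, here's a rough pyVmomi sketch of one way you could pin the VDR appliances so DRS leaves them alone during restores, by adding a per-VM DRS override with enabled=False. This is only an illustration: the vCenter address, credentials, cluster name and VM names are placeholders, not our actual environment.

# Sketch only: disable DRS for the VDR appliance VMs via per-VM overrides.
# Hostname, credentials, cluster and VM names below are placeholders.
import atexit
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab shortcut; use proper certificates in production
si = SmartConnect(host="vcenter.example.local", user="administrator",
                  pwd="password", sslContext=ctx)
atexit.register(Disconnect, si)
content = si.RetrieveContent()

def find_by_name(vimtype, name):
    # Return the first inventory object of the given type with the given name.
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    return next(obj for obj in view.view if obj.name == name)

cluster = find_by_name(vim.ClusterComputeResource, "DR-Cluster")

for vm_name in ["VDR-vApp-01", "VDR-vApp-02", "VDR-vApp-03"]:
    vm = find_by_name(vim.VirtualMachine, vm_name)
    override = vim.cluster.DrsVmConfigSpec(
        operation=vim.option.ArrayUpdateSpec.Operation.add,
        info=vim.cluster.DrsVmConfigInfo(key=vm, enabled=False))  # DRS won't migrate this VM
    spec = vim.cluster.ConfigSpecEx(drsVmConfigSpec=[override])
    cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)

You can of course do the same thing by hand from the cluster's DRS VM override settings; the script is just handy when you have several appliances to pin and unpin either side of a test.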

Solutions to our issues:

  • Move the VDR vApps so they run from the SDC

  • Set the VDR vApps to use local VMDKs rather than CIFS shares

  • Increase the number of NICs in the teams on the backup servers (teamed NICs don't increase inbound I/O, only outbound: inbound traffic must all be addressed to the one MAC address, whereas outbound traffic is spoofed to appear as though it came from the primary MAC but is in fact spread across the other NICs).

  • Recognise which VMs you'll be recovering during each phase of a DR restore and make sure they are spread across multiple VDR vApps rather than pooled into the one vApp or, worse, onto the one job/datastore. This increases restore throughput.

  • Restore VMs from the VDR vApp onto its vSphere host rather than sending the data out onto the LAN to other vSphere hosts

  • Keep your VDR datastores small (150 to 200 GB I reckon) and have more vApps.

  • Having more VDR datastores does increase wasted VMDK overhead (the recommendation is to allow 50-100% disk overhead on top of the quantity of data backed up, for future change data growth over the life of the backup -- 7 years for most of ours), so use thin provisioning on the VMDKs or on the LUNs at the SAN level.
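
To make the sizing concrete, here's a quick back-of-envelope sketch of the destination layout maths from the last two points. Only the ~200GB cap, the 50-100% overhead allowance and the two-destinations-per-vApp layout come from what's described above; the protected-data figure is just an example.

# Rough sizing sketch: how many VDR destination stores (and vApps) a given amount
# of protected data needs once you cap each store and allow change-data headroom.
# protected_gb is an example value, not our real figure.
import math

protected_gb = 3000        # total size of the VMs you plan to back up (example)
overhead_factor = 1.5      # 50-100% headroom for change data; 1.5 = 50% extra
store_cap_gb = 200         # keep each dedupe store small so integrity checks stay short
stores_per_vapp = 2        # destinations per appliance, as in our two-LUNs-per-vApp layout

required_gb = protected_gb * overhead_factor
stores = math.ceil(required_gb / store_cap_gb)
vapps = math.ceil(stores / stores_per_vapp)

print(f"{required_gb:.0f} GB of destination space -> {stores} stores across {vapps} VDR appliances")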

Ultimately this is a free solution and, for us, only a stop-gap between using BESR for VM backup, the phasing out of VCB, and our VMware Site Recovery Manager implementation.

Happy to answer any questions about our DR, and really looking forward to SRM to get rid of a LOT of our manual tasks and hopefully make DR a one-day exercise again. :)

Overall the DR test was a success, although the ease with which we could kick off many VDR restores at once really bit us in the rear, and I've had to break the restore process down into smaller chunks so we don't hit those bottlenecks in the future. The issues we had definitely left a sour taste in management's mouth about VDR as a product though, and they are still a little skeptical about using it (hey, it's free!).

That said, VDR did do its job in the end and got us through. :)

Cheers,

Steve

9 Replies
DSTAVERT
Immortal

How much time did it take, and, knowing what you know now and implementing the changes, what do you think the difference might be?

This would make a great start for a Document.

-- David -- VMware Communities Moderator
parkut
Contributor

Thank you very much for putting the result of your test here. Much food for thought.

Goatie
Enthusiast

Hi,

It took us about five days to recover everything: all the VMs, three SQL and one Exchange cluster, an Oracle AIX server, and about 10TB of data to restore.

The work was done in 8-hour days with no overnight working (as there would be in a real DR), except for large data restore jobs through Backup Exec.

As senior management weren't too happy with the results (some of the data backups weren't fully recovered, on top of the time it took), we've been asked to do it again in early Jan. This time around we're running two 8-hour shifts around the clock to see how a real DR would perform, how much downtime we would have, etc. We'll be able to focus on the VDR/BESR recoveries continuously with this method, and then the trailing end will (hopefully) just be data restores of our file servers.

We won't be in a position to implement most of the recommendations prior to the 24hr DR (due to not having a production-attached vSphere cluster in the DR site yet); only some minor shuffling of data storage locations and a change to the restore order will be done at this point.

One issue that I still need clarification on (but I have a good idea at this point): if/when we move the VDR appliances out to the DR site, they will not be able to directly mount the VMs they back up onto themselves, as they won't have access to the Prod SAN. Normally a VDR vApp snapshots the source VM, mounts the VMDK onto itself and backs it up; instead, the backups will be running over the network between the source host and the VDR vApp. This I/O through the management network may cause issues and will need to be tested before a full move is performed.

I'm not sure how much impact this will have on backup times, probably not too much as the incrementals are pretty darn quick. We have 10Gb dedicated fibre between sites; it's just a matter of ensuring our network is configured correctly to take advantage of it.
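
As a rough sanity check on that, here's a back-of-envelope sketch. Only the 10Gb link figure comes from our setup; the daily change rate and the usable-link fraction are pure assumptions for illustration.

# Back-of-envelope estimate of how long a nightly incremental would take if the
# VDR appliances sit in the SDC and pull backup data over the inter-site link
# instead of hot-adding the VMDKs. Change rate and usable fraction are assumptions.
protected_gb = 10 * 1024       # roughly the 10TB data set mentioned earlier (illustrative)
daily_change_rate = 0.02       # assume ~2% of data changes per day (illustrative)
link_gbps = 10                 # dedicated 10Gb fibre between sites
usable_fraction = 0.3          # assume only ~30% of the link is realistically usable

changed_gb = protected_gb * daily_change_rate
gb_per_hour = link_gbps * usable_fraction / 8 * 3600   # Gbit/s -> GB per hour
hours = changed_gb / gb_per_hour
print(f"~{changed_gb:.0f} GB changed -> roughly {hours * 60:.0f} minutes over the link")

Under those assumed numbers (and ignoring the fact that dedupe sends even less), the incrementals look small next to the link, so the real question is whether the management network on the prod hosts copes.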

I'll update this further after our 24hr DR run.

Cheers!

ericsl
Enthusiast

If your management is not happy with the Recovery Time Objective (RTO) of VDR, maybe you should consider proposing an alternative solution that does not involve restoring anything, such as the various third-party solutions (Veeam, Vizioncore, etc.) that constantly keep replica copies of virtual machines available. This would probably shorten your RTO to within a few hours, essentially the amount of time required to start the various VMs...

~Eric

www.myManagedBackup.com

Goatie
Enthusiast

Hi all,

Well our re-run of DR went quite well this time.

We ended up using two teams of two doing 8 hours each (with a bit of overlap), then a break, then starting again at 8am. We ended up needing five shifts all up (a total of about 45 hours), which equated to roughly the same amount of time as the previous DR took (five 8-hour days), but this run was more successful.

We had planned to shift VDR backup servers around and such, but in the end we kept the architecture the same as the previous DR Test and just modified how we ran it.

Three key tasks made a significant difference:

  • Breaking up the DR into sections and ensuring all servers in a particular section were fully complete (aside from large data restore actions) before starting on servers in the next section.
  • Not restoring more than 4 or 5 VMs from the one VDR appliance or source server at any one time, and not running more than TWO Backup Exec data file recoveries from the one backup server at a time (there's a rough scheduling sketch after this list).
  • End-to-end shifts. By the end of the first 8-hour shift we had only got to the point of having AD and vCenter up (the first 4 or 5 hours are chewed up with pre-DR tasks: reconfiguring hosts, SAN and network before a single restore is even started). Without the back-to-back shifts, starting again on day two at 8am, we would have lost 14+ hours of possible file server recovery time, because the second shift recovered all of the file servers and Exchange and kicked off a few large (4TB in total) Backup Exec recoveries.
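
To show what I mean by keeping the concurrency down, here's a very simple, illustrative Python sketch of queuing restores so that no more than a handful run against any one VDR appliance at a time. The appliance and VM names are made up, and in practice we just worked from a list on paper.

# Illustrative only: group pending restores by source VDR appliance and release
# them in waves so no appliance ever has more than the cap running at once.
from collections import defaultdict

MAX_CONCURRENT_PER_APPLIANCE = 4   # 4-5 per appliance worked for us; tune to taste

# (vm_name, source_appliance) pairs -- hypothetical example data
pending = [
    ("CTX-XA-01", "VDR-01"), ("CTX-XA-02", "VDR-01"), ("CTX-XA-03", "VDR-01"),
    ("CTX-XA-04", "VDR-01"), ("CTX-XA-05", "VDR-01"), ("FILESRV-01", "VDR-02"),
    ("FILESRV-02", "VDR-02"), ("EXCH-01", "VDR-03"),
]

by_appliance = defaultdict(list)
for vm, appliance in pending:
    by_appliance[appliance].append(vm)

wave = 1
while any(by_appliance.values()):
    batch = []
    for appliance, queue in by_appliance.items():
        batch.extend((vm, appliance) for vm in queue[:MAX_CONCURRENT_PER_APPLIANCE])
        del queue[:MAX_CONCURRENT_PER_APPLIANCE]
    print(f"Wave {wave}: start {len(batch)} restores -> {batch}")
    wave += 1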

A great benefit of breaking the restore up into sections was that the DR coordinator (and we) had a better understanding of where we were at any given time and whether we were falling behind. From the previous DR I had rough timings for each restore, and I made an estimated time chart of when each section would start and end. This helped a great deal, and I highly recommend everyone have one so that management has greater visibility of where you are in the DR process, which is especially important in a real DR.
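
As a rough illustration of the time chart idea, something as simple as the following is enough to give the coordinator projected start/end times per section. The section durations and kick-off time below are placeholder figures, not our actual numbers.

# Build a simple projected timeline from rough per-section restore estimates.
# Section names mirror the phases described above; durations are placeholders.
from datetime import datetime, timedelta

sections = [
    ("Pre-DR tasks (hosts, SAN, network)", 4.5),
    ("AD and vCenter", 3.0),
    ("Citrix and first-line servers", 6.0),
    ("File servers and Exchange", 10.0),
    ("Backup Exec data restores", 20.0),
]  # (name, estimated hours)

start = datetime(2011, 1, 10, 8, 0)   # example kick-off time
for name, hours in sections:
    end = start + timedelta(hours=hours)
    print(f"{start:%a %H:%M} - {end:%a %H:%M}  {name}")
    start = end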

Some VDR recoveries still took differing times for similar systems. An example of this was recovering the Citrix XenApp servers: all the VMs are 30GB and have the same applications installed, and we restored five at a time, with most taking 1-1.5 hours and then one VM taking 3 hours. I think that may be due to higher fragmentation of that VM within the VDR datastore than the others, but I'm not sure.

Our next step will be to implement NetApp SnapMirror and VMware Site Recovery Manager (including having live AD and vCenter at the DR site). We're adding another 15 applications to protect for the next DR, so hopefully SRM will rein in the restore time at the same time as we increase the DR scope. We hope that SRM and SnapMirror will enable us to have an RTO of hours instead of days, but only time (and a lot of work!) will tell.

Good luck with everyone's DRs and may you never have to actually use them for real!

Cheers!

DSTAVERT
Immortal

Thanks for the update.

Considering that you have a 10Gb connection between the data centres, couldn't you do some cloned images on a daily basis so that some initial VMs would be on site and only require minor updates? I could see AD and system state information being shipped off, plus smaller, less active VMs. Just asking the question.

-- David -- VMware Communities Moderator
Goatie
Enthusiast

Yes, that is a good idea to look into, although from previous experience the Backup Exec System Recovery VSS client and the VMware Tools VSS client seem to conflict on random servers: it works on some and consistently fails to close down the VSS snapshot on others.

We're running BESR backups of the vSphere infrastructure and they're being kept in the secondary data centre. At the moment that's enough, as we have the SRM installation in the pipeline. If we didn't have that, then yes, I'd look further into doing some nightly clones.

Cheers!

DSTAVERT
Immortal

Do you have SAN replication available? It would be a shame not to use as much of the 10Gb connection as possible.

-- David -- VMware Communities Moderator
D1Q4
Contributor

Hey Goatie,

Thanks a lot for sharing your experience using VDR.

I have some experience using VDR too, and now I'm stuck because the integrity check takes a very long time. I want to add another datastore as a destination for all the backup jobs, but the integrity check is pointing at the main virtual adapter (/SCSI-1:0/), so I cannot add a new vdisk and mount it as the primary backup destination.

My point is: is it OK if I manually cancel/stop the integrity check?

Thanks for your advice.
