I love the fact that VMware is providing VDR as part of the vSphere package. It's definitely a step in the right direction, although I'm still inclined to think this software hasn't been put through the wringer in terms of proper QA. I'm just putting out a feeler to see how many others have experienced some of the same issues I'm having.
To start, I'm backing up my VMs via a network share on a standalone Windows 2003 server that has a NAS attached to it.
Some of the issues I've noticed:
1) Backups take an inordinate amount of time. I can understand the first backup, but my VMs don't change very much from day to day. Most of the data being manipulated is located on RDMs, and these are backed up using Tivoli, not VDR (I use VDR solely for the OS partitions). Each partition is approximately 25GB and there are 15 VMs, yet my backup window (10pm - 6pm) isn't sufficient to complete the process.
2) Integrity checks for the backups are taking a crazy amount of time and will usually stop due to my window being closed (see point #1)
3) I'm getting inconsistent "failures" for certain VMs (the report will simply state that a VM failed to back up, not much else). The failures vary per night and don't always hit the same VMs (I'm not sure if this is related to #1, where the window is closing while VDR is executing).
4) I had the most difficult time setting up the remote share from the VDR appliance in vSphere. The username and password would never be accepted (even though the same share with the same user/pass worked fine from a Windows machine). I finally narrowed the problem down to the simple fact that the VDR appliance can't handle passwords that have special characters in them (this password had an "@" and a ","). The console showed a CIFS error -22 whenever I attempted to mount the share. Changing the password to include only numbers and letters was sufficient to work around this issue.
5) Snapshots are not being created for no apparent reason, which fails the VDR process. I'm fully able to take a manual snapshot with or without the memory state, so I'm not sure why VDR can't do it. This issue is very intermittent. I had it often when I first set up VDR, but now it only happens every so often (without any type of consistency).
I think that's all I can think about for now..
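For anyone hitting the password problem in item 4, here is a minimal sketch of a pre-check you could run before setting the share password. It assumes, based only on my workaround above, that letters and digits are safe; that is not a documented VDR rule, and the function name is made up for illustration.

```python
import string

# Assumption: only ASCII letters and digits are safe, based on the
# workaround described in this post. This is not a documented VDR rule.
SAFE_CHARS = set(string.ascii_letters + string.digits)


def unsafe_password_chars(password: str) -> list[str]:
    """Return the characters that fall outside the letters-and-digits set."""
    return sorted({ch for ch in password if ch not in SAFE_CHARS})


if __name__ == "__main__":
    # Hypothetical passwords, for illustration only.
    print(unsafe_password_chars("pa@ss,word1"))
    print(unsafe_password_chars("password1"))
```

An empty list means the password only uses characters that worked for me; anything else is worth changing before blaming the share itself.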
Absolutely the same issues here:
1. Constant auth issues across various clients
2. Timezone hacks required to get the appliance time to work properly
3. Integrity check unusably slow, often fails and tends to create corrupt restore points
4. Random snapshot failures, even though I can manually snap and all DNS is configured correctly
And to top it off - no backup success/failure notification
Very poor even for a 1.0 product
A very disappointed enterprise customer here with regards to VDR
Has anyone opened a Support Request on VDR? If not, please do so, as it needs to be documented and forwarded to the product support team for review. Thanks.
VMware Communities User Moderator
Some thoughts - if possible, the suggestion to open up an SR is the right one since it will help the team determine root cause
1) Make sure VMware Tools is updated on the VMs - especially for Windows VMs, since we leverage VSS to quiesce the VMs. You mention that you are using VDR to protect the OS partitions; which operating systems are they?
2) There is no need to scan the virtual disks for HW7 VMs since the vmkernel provides changed-block information to VDR, while for HW4 VMs, VDR needs to scan the virtual disks at every backup to determine changed blocks. Do you have both VM HW revs in your environment? Do you see a performance difference?
3) Re: destination disk, performance of vmdks and RDMs has been observed to be better than CIFS. Some dedupe store guidelines are posted here
4) You can override the backup window by selecting the job, right-clicking and choosing "bring into compliance". This should begin backups of VMs that did not fit in the window. Is your backup window really 10pm-6pm or is it 10pm-6am?
5) Snapshots are executed by the vmkernel - VDR sends a call to execute one. So it is still a VMware problem, but the root cause lies with the snapshot engine.
6) I have observed random/odd issues with VDR in the communities where a reboot or a redeploy resolves the problem. It may be worth trying both if your issues continue - note that the state is not lost since it is kept in the dedupe store. So, once you have redeployed the appliance, reattaching the dedupe store/destination disk will bring the state back (jobs, logs, restore points).
You should note that my infrastructure is completely new since vSphere. We did not have any ESX servers (or VMs) until vSphere was released (this was by design). All our VMs are P2V's...
1) We're using Windows 2003 R2 (and a few Linux CentOS/Ubuntu VMs). VMware Tools is the one that comes with ESX4
2) We only have HW7 unless of course the VDR appliance uses HW4..
3) I guess I could use a LUN.. If there's a significant performance hit using CIFS that is. I'd prefer to use CIFS as I'm dumping the Store onto Tape from a Backup Proxy server.
4) My backup job is between 10pm and 6am. My mistake. I know I can bring things into compliance, but there definitely seems to be some problem doing this. I had a "copying <VM-NAME>..." task that ran for 8 hours (no reason this should happen) - the VM was only 20GB in size and it wasn't the first backup. I had to reboot the VDR appliance (I've done this on numerous occasions, it's getting frustrating).
5) I had this at the start, but less and less.. This isn't so much a concern now (for snapshots) as it seems to be working (just thought I'd mention it)
6) Redeploying the VDR is a PITA for our environment - the reason being that our network assigns externally-accessible IP addresses to each machine that connects (we don't have a company-wide firewall), hence I need to configure iptables/shorewall on each of my appliances. It gets frustrating having to reconfigure this every time.
FYI.. I was unable to mount a CIFS share until I ran:
yum install samba-client (which also installs samba-common). Once this was done, CIFS would mount.
I don't want to troubleshoot the problem in the forums - and it seems that you have done a lot of work to get it to this point. It is probably best if you open up an SR and really see what is going on. The fact that this is a brand new vSphere environment is a good thing, since this is the VDR sweet spot. However, it is troubling that you are running into such a variety of issues - I could understand one issue, but not as many as you have seen, which is why I am somewhat puzzled.
Having said that
1) It would be interesting to know if the performance issues are related to the CIFS share at all. A test of backing up to an RDM or VMDK should provide us this data point.
2) In terms of redeploying - again, some of the issues that you are seeing are fairly random (snapshots failing, copying an unchanged OS disk for 8 hours). Other contributors to the forum have resolved random issues by redeploying. If it is a PITA, then it is probably best to get with VMware support and see what they suggest - it may come down to this, but maybe it can be postponed to the absolute end. One thought around redeploy would be:
1) Power off original VDR
2) Deploy new VDR appliance
3) Assign IP address of original VDR to new VDR appliance
4) Attach original destination disk to new VDR appliance
Will this still require a reconfiguration of your network? (I'm just not familiar enough with your constraints.)
Another thought that came to mind is the fact that you are using separate arrays for your source vmdk and the destination for the backups. This means that all the data blocks are travelling over the LAN - not sure what the network bandwidth is and what the backup traffic is contending with (i.e are the VDR backups running at the same time as the TSM backups over the same network?).
Are the source vmdks (OS disks) on shared storage?
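To put a rough number on the bandwidth question, here is a back-of-the-envelope sketch using the figures posted earlier in this thread (15 VMs, roughly 25GB each, a 10pm-6am window). It assumes the worst case where every disk is copied in full, which incremental backups should not need; the function name is only illustrative.

```python
def required_throughput_mb_s(vm_count: int, gb_per_vm: float, window_hours: float) -> float:
    """Sustained MB/s needed to copy every disk in full within the window."""
    total_mb = vm_count * gb_per_vm * 1024  # 1 GB taken as 1024 MB
    window_seconds = window_hours * 3600
    return total_mb / window_seconds


if __name__ == "__main__":
    # Figures from this thread: 15 VMs, ~25GB each, 8-hour window.
    rate = required_throughput_mb_s(15, 25, 8)
    print(f"{rate:.1f} MB/s")  # prints "13.3 MB/s"
```

If the path between the source and the CIFS share can sustain that rate comfortably, raw bandwidth alone is unlikely to explain the window overrun, and contention with other backup traffic becomes the more interesting question.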
VDR is definitely a step in the right direction. The problem is it doesn't really cut it as an Enterprise solution.
Have you checked out esXpress at all? It's the same concept as VDR (VMs backing up VMs) except it's been around a lot longer and is much more 'polished'
Community Rep for PHD Virtual Technologies Inc
Let me add this issue to the mix:
"Trouble reading from the destination volume, error -2241 ( Destination index invalid/damaged)."
I can no longer use VDR with this CIFS share (unless of course I nuke the current backups, which kind of defeats the purpose of VDR!). I've tried restarting the appliance, redeploying via the OVF template..
This really should never have gotten out of beta.
This VDR is a great attempt but clearly not ready for production yet. VMware should speed up on releasing a stable and very well tested VDR (they are supposed to release a new version by the end of June which supports file-level restore). I also have several issues, like many others in this post. I will list them briefly:
Operations take too long to perform (regardless of your destination location)
Backups fail without helpful information
Backups do not start at the backup intervals
VDR crashes during backup/restore
I appreciate the comments and we take them to heart. When we GA a product, it has gone through customer and internal validation that matches our exit criteria for GA. However, we know that it will not be a perfect product and that "real world" testing by customers helps us make a better product. I am not sure if you submitted any SRs for the issues that you outlined, but that will provide more data about your environment/testing. This then drives changes in our internal QA environment, which we hope will close that gap in our QA test matrix. Some comments:
1) The file restore client is available today for Windows (experimental support). Sorry I don't have link handy, but it is in one of the forum posts
2) We have seen some snapshot and connectivity issues when name resolution does not occur. Most customers have been able to work around this by adding the hostnames to the /etc/hosts file on the VDR appliance.
3) You are correct that backups do not start exactly at the opening of the backup window. However, unless there is a VDR process running that "locks" the dedupe store (integrity check for example), the backups should start soon after the backup window. I usually look for consistency of time stamps of the restore points - (backed up roughly the same time each day)
4) When you say "VDR crashes" - what specifically is happening (error messages, state of the VDR virtual appliance, etc.)?
5) When the backups fail, there should be an error message (yes, I agree that some may not be customer decipherable) - can you share what these error messages were?
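For the name-resolution workaround in item 2, a minimal sketch of generating the hosts-file lines is below. The hostnames and addresses are placeholders (documentation ranges), and the helper name is hypothetical; substitute your own vCenter and ESX hosts and review the output before appending it on the appliance.

```python
def hosts_lines(hosts: dict[str, str]) -> list[str]:
    """Build hosts-file lines: IP, fully qualified name, then short name."""
    lines = []
    for fqdn, ip in sorted(hosts.items()):
        short = fqdn.split(".")[0]
        # Only add the short alias when it differs from the full name.
        names = fqdn if short == fqdn else f"{fqdn} {short}"
        lines.append(f"{ip}\t{names}")
    return lines


if __name__ == "__main__":
    # Placeholder names and addresses, not taken from any real environment.
    example = {
        "vcenter.example.local": "192.0.2.5",
        "esx01.example.local": "192.0.2.11",
    }
    print("\n".join(hosts_lines(example)))
```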
Answer to your items:
3) Backup window is between 5pm and 8am. After I created the backup job it ran right away, and that's the only backup I have after 3 days. The status shows as waiting/idle and there's nothing else locking it. I have not manually forced it because I want to test that the backup window actually works.
4) In a separate test environment, the VDR appliance would crash (reboot) in the middle of a backup. The logs would just show errors saying the task terminated unexpectedly, possibly due to a power failure or system crash.
5) Backup failures show different messages like
trouble writing to destination volume, error -22,
the task terminated unexpectedly possibly due to a power failure or system crash
can't access backup set, execution errors
I have tested the destinations (network share & vmdk on SAN) and they seem fine, with no I/O problems.
Name resolution is not a problem
Having all of these issues myself. VDR is definitely not ready for a small business, let alone enterprise.
What is most annoying though is the constant WAITING. I get hardly any feedback as to what is going on.
Failing snapshots, random terminations, slow backing up etc..etc..etc...etc..
Really don't see how it ended up in 4.0 in this state. This isn't even beta level.
I have a different set of problems...
When I tried to do a rehearsal restore, there were no hosts showing up in the list, and to be able to select another destination for the restore, I had to move the VM to another cluster. If I didn't, only ESX host local disks showed up in the destination list.
If I select another host for destination, the restore-VM disappears...
Anyone else seen this behavior?
There is a VDR update!
Latest Released Version: 1.0.1 | 07/09/09 | 176771
The v1.0.1 release addresses the following
Large Temporary Files Removed as Expected
Data Recovery modifies virtual machines' vmdk files' settings so a snapshot can be created for backup purposes. In the past, the vmdk file's settings were sometimes left configured for snapshots even after the backup was complete. This led to these virtual machines being left in snapshot mode while accumulating snapshots that were undetected by vSphere Client. This process has been redesigned so that these temporary files are no longer left behind. In previous versions of Data Recovery, this issue can be resolved by following the process described in the knowledge base article titled "Delete ddb.delete entries and snapshots left behind by Vmware Data Recovery".
Backups Can Be Completed While Integrity Checks Are Running
Data Recovery can complete backup operations at the same time that an integrity check is running. In the past, when an integrity check was running, backups could not be completed.
Improved Integrity Check Backup Speed
Integrity check has been optimized for faster performance. In the past, comparable integrity checks took longer to complete.
Improved VMotion Licensing Support
Virtual machines can be moved between hosts using VMotion without producing licensing issues. In the past, if a virtual machine was moved between hosts using VMotion, licensing checks sometimes produced errors.
Reduced Data Recovery Backup Appliance Shutdown Time
The Data Recovery Backup appliance now shuts down more quickly than it did before. In the past, the appliance often took 15 minutes to shut down.
Improved Support for Different Time Zones
In the past, Data Recovery did not consistently handle time zones with positive offsets relative to GMT. For example, Data Recovery could encounter issues with data associated with the Paris time zone, which has an offset of +1, whereas data associated with the New York time zone, which has an offset of -5 was handled as expected. These issues no longer occur.
Data Recovery Supported with Essentials Plus Licenses
Data Recovery is included in Essentials Plus licenses. In the past, using Data Recovery with Essentials Plus licenses failed. Backup jobs created with Essentials Plus licenses failed with the error "License not available to perform operation. Feature hotplug not licensed..."
Integrity Check Optimized to Run During Idle Times
Before running regularly scheduled integrity checks, the Backup Appliance determines if the current time is during a backup window. If the current time is not during a backup window, the integrity check runs. If the current time is during a backup window, the backup appliance checks the backup schedule to determine if there will be a time in the next 24 hours that will not be during a backup window. If there is a time in the next 24 hours that is not during a backup window, the Backup Appliance waits for that time. If there is no time that is not during a backup window in the next 24 hours, the Backup Appliance completes the integrity check
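
The integrity check scheduling rule in that last note can be sketched roughly as follows. This only models the decision as the release note words it; the 15-minute polling step, the function names, and the example 10pm-6am window are assumptions for illustration, not how the appliance is actually implemented.

```python
from datetime import datetime, timedelta
from typing import Callable


def integrity_check_start(
    now: datetime,
    in_backup_window: Callable[[datetime], bool],
    step: timedelta = timedelta(minutes=15),  # assumed polling granularity
) -> datetime:
    """Return when the integrity check should run, per the rule quoted above."""
    if not in_backup_window(now):
        return now  # outside any backup window: run immediately
    horizon = now + timedelta(hours=24)
    t = now + step
    while t <= horizon:
        if not in_backup_window(t):
            return t  # first free slot within the next 24 hours: wait for it
        t += step
    return now  # no free slot in the next 24 hours: run the check anyway


def nightly_window(t: datetime) -> bool:
    """Example backup window: 10pm to 6am (hypothetical schedule)."""
    return t.hour >= 22 or t.hour < 6


if __name__ == "__main__":
    print(integrity_check_start(datetime(2009, 7, 9, 23, 0), nightly_window))
```

With a backup window that covers all 24 hours, the function falls through to the last branch and the check runs regardless, which matches the final sentence of the note.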
I wish I had the same issues as what was reported above. I can't even get as far as having a slowdown. Initially I was getting 3902 errors (file access error) when trying to perform an initial backup. Of course, there doesn't seem to be any information on that error.
I couldn't log in as root onto the appliance (it would just hang). So, I rebooted the appliance, then gave it a static IP. Even though it seems to be configured correctly, I can not open the tool in the vSphere Client. (Could not connect: The operation is not allowed on non-connected sockets).
Don't think I will bother putting in an SR- this product does not look reliable.
Just venting here- echoing what others feel about the product not being ready. I guess I'll look for a third-party product. Very disappointed in a feature that was one of the main reasons we went to ver 4.
This really should never have gotten out of beta.
Yet ANOTHER reason they need to have a PRIVATE beta among those of us on THIS forum that use ESX DAILY. Every time I post this no one at VMware acknowledges it, YET we continue to have products released which CLEARLY are not ready, to the point that 'never should have been released from BETA' is a valid complaint.
I keep posting and harping about needing some type of BETA program for even vExperts or a SPECIAL closed BETA so WE can test it ( WE being the people that umm.. actually USE the product, gee what a novel concept). Because apparently whoever BETA tests at VM Ware (other than being taken out and beaten with a wet noodle and fingers removed so they can NEVER touch software again) don't actually USE the product, they just assume that since the appliance runs ... well gee it looks fine! Yeah release it!
3 times now this has caused serious undermining issues, the first being that debacle last year when 3.5 FIRST came out .. the 2 weeks time out where VM's wouldn't start... That should have been a clue to VM Ware then that the END users should actually have a vested interest in making sure the product.. uh.. let me .. oh yeah.. WORK?!?!? Interesting.
But no, VM Ware and other beta testers want to wave the 'NDA' flag, well phooey!
You want an actual WORKING product or not, VMware? You want to keep it among 15 people in California who couldn't BETA test if their life depended on it.. that would be secondary to actually letting people ensure that it's ready.
Gee I am not an Enterprise CEO, but that doesn't make sense to me.
BETA, it should be given to those competent enough to ACTUALLY test it THOROUGHLY before giving it a green light. This is not being done at the present.
Time and time again, we deal with inept, lackluster, faulty software. Is this an enterprise class product or not?
Oh it is?!??! Well why doesn't someone TREAT it as such?