Has anyone actually been able to tell whether the reclaim job does anything to reclaim space? My storage usage only ever increases and never shrinks. I even ran a test: I deleted ALL restore points for ALL VMs so there were absolutely no restore points left, not even one, deleted all backup schedules so no backups would run, then rebooted the VDR and waited for it to run its recatalog job, then the reclaim job, then the integrity check job. After all of that completed, which took about 3 days, I looked at the storage locations and all 200+GB of SLAB and DAT files were still there. So again I ask: how do you shrink the amount of data that is NOT being used anymore?
In the VDR Admin Guide under "Reclaim", there's a note that says the following:
"NOTE When reclaim operations free space in files, those files are not compacted to reflect the new free space. As a result, the size of files on the deduplication store does not decrease, even when reclaim operations are reclaiming space. The space which is free is reserved and used for future backups."
I don't think the statement "The space which is free is reserved and used for future backups." is true, because when I started backups again during that test and let them run for 2 weeks, I saw new slab files being created and the space usage increased from 200GB (after reclaim) to 285GB (after 2 weeks of fresh backups). If I read the statement correctly, the first 200GB of backups should use the reserved space and NOT add to it until that space fills up. To me it seems that VDR doesn't know it has 200GB of reserved space free to use and just continues to create new slab files.
What version of VDR are you using? There is a slight difference in behavior between VDR 1.1 and earlier versions - the following is how VDR 1.1 works.
For VDR, once a slab file is created in the dedupe store, it will never shrink or be deleted, but the space in it will be reused for future backups; so note that during a reclaim job, the slab files are not deleted even after deleting all restore points. For example, if you began with a brand new dedupe store and backed up 4 VMs that created 10 slab files, then even if you deleted all restore points for all 4 VMs, the 10 slab files would remain. However, VDR will reuse these slab files in the future when new backups are added.
But we think this is what you are seeing: while VDR avoids creating new slab files whenever possible, it does not take into account the size of the already existing slab files. A slab file can be anywhere between 80 MB and about 1GB in size. Slab files grow as needed but VDR never shrinks them, so the following scenario is possible:
Let's say the VDR dedupe store has ~200GB of data stored in, say, 350 slab files. If all the restore points are deleted, the 350 slab files that collectively take up 200GB are still present. As VDR re-fills the dedupe store with data, it uses existing slab files as it finds them. However, it is possible that VDR will pick slab files that haven't grown to their full size yet and fill them with data before reusing the space in the slab files that are already at their maximum size.
So one additional thing to check is whether the number of slab files has gone up. If it hasn't, that confirms VDR is reusing the existing slab files, though it may grow slab files that are not at their maximum size.
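A rough model of the behavior described above (the numbers, class, and fill order are mine for illustration, not VDR internals): slab files grow toward a cap but never shrink, and a new backup may land in an under-filled slab instead of reusing reclaimed space in a full one.

```python
SLAB_MAX = 1024  # MB; assumed maximum slab size


class Slab:
    def __init__(self, size_mb, used_mb):
        self.size_mb = size_mb   # on-disk file size (never shrinks)
        self.used_mb = used_mb   # live data inside the slab

    def reclaim(self, mb):
        # Reclaim frees space *inside* the file; the file size is unchanged.
        self.used_mb = max(0, self.used_mb - mb)

    def store(self, mb):
        self.used_mb += mb
        # The file only grows when live data exceeds its current size.
        self.size_mb = min(SLAB_MAX, max(self.size_mb, self.used_mb))


full = Slab(size_mb=1024, used_mb=1024)   # a slab already at max size
small = Slab(size_mb=80, used_mb=80)      # an under-filled slab

full.reclaim(500)   # restore points deleted: 500 MB free inside 'full'
small.store(500)    # ...but a new backup grows the small slab instead

# Total on-disk usage went from 1104 MB to 1604 MB even though 500 MB
# was sitting free inside 'full'.
print(full.size_mb, small.size_mb)  # 1024 580
```

This is only a sketch of the "grow before reuse" effect described in the post, not the actual allocation policy.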
Hope this helps explain things.
I am using version 126.96.36.1997, which is the latest.
I will test again whenever I get a chance and see if what you say is correct.
I hope the next version of VDR will be smart enough to shrink, consolidate, and delete slabs to help customers reduce their space allocation.
Can you explain what the following two lines from an integrity check mean?
Remaining: 19664 files, 105336.4 GB
Completed: 5050 files, 27760.2 GB
I've noticed according to the log that the 'remaining' file count doesn't change from integrity check to integrity check but the number of GB does.
I've also noticed that the 'completed' file count is increasing at a constant rate of 156 files from check to check each day, and the GB count is also increasing at a constant rate of 882.5 GB each day. If this 882.5 GB is the amount of data VDR thinks has changed each day, then this is a FULL backup of ALL my VMs.
I'm just trying to understand what these numbers really mean.
Also, does the reclaim job provide any more info other than just saying 'Task completed successfully'? It would be helpful to show some stats or something useful. BTW I already have the logging level set to 6.
I think there is one solution to your problem.
You could rotate between several destinations:
- create a new destination
- wait for a restore point to be written to it
- unmount the old destination and delete the corresponding directory
Concerning the other performance problems I can see in this post (like recatalog or integrity checks that take a while), there are some rules to apply:
- Create one destination per VM. The goal is to limit deduplication. In theory, you could deduplicate a large number of VMs of different types with many restore points. In practice, this runs fine for a week or two; after that everything goes wrong and no more backups are done.
- In the same way, limiting the number of restore points at the beginning is a good idea.
- Also limit the number of backup jobs (the MaxBackupRestoreTasks parameter in the datarecovery.ini file). It depends on the number of physical destinations you have. In my case, I have 2 SANs, so MaxBackupRestoreTasks=2. In the future I may raise this number depending on the performance I get.
- And sometimes, when it's possible (once a week is good for me), reboot your appliance.
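For reference, the setting mentioned above looks something like this in datarecovery.ini on the appliance (the section name and exact file location are from memory and may differ by VDR version, so check your own appliance):

```ini
[Options]
; Limit concurrent backup/restore tasks; match it to the number of
; physical destinations (2 SANs here, so 2 concurrent tasks).
MaxBackupRestoreTasks=2
```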
Since I applied these rules, Datarecovery has become a good solution for my organisation, and my nerves.
fgl, did you test whether your slab files just grow and no new ones are being created? I was also startled at how VDR never seemed to free disk space after reclaim jobs with deleted restore points. I haven't run VDR 1.2 long enough yet, but I don't think 1.2 handles that any differently.
When I upgraded to VDR 1.2 I started fresh with a new datastore, but my retention policy is set for 8 weeks and I'm currently on week 6, so I won't really know for a few more weeks. Then again, this is the longest I've been able to keep VDR running without any problems, and I have not had to delete snapshots due to the problems I had with previous versions. I'll provide an update after my 8-week retention policy is hit.
I agree with this statement. It seems that (in my case) VDR is using around 340GB for its backups; despite removing a couple of servers from the backup job (which in theory has made a few slab files available to store data), it still grows at the same rate as usual. This is very concerning. I really hope VMware can change the reclaim job to properly reduce the space used by the backup SLAB files. I use DFSR to store an offsite copy of these files, so it would really help if VDR managed them properly in the first instance.
Well, it still doesn't seem that VDR 1.2 resolves the problem of reclaiming space: after 10 weeks of an 8-week retention cycle I am still seeing new slab files created. Although restore points are being deleted, VDR just doesn't seem to be reusing the space. I am now down to 40GB of free space on my 500GB storage. If I create another 500GB storage location and add it to the backup job, will VDR start using the new storage, or will it wait until it fills up the first 500GB location before moving to the second?
That doesn't sound good...
Actually, since VDR only does incremental backups apart from the first one, I wonder if it is even feasible to delete/reuse space without some kind of merge-process (is that what the reclaim jobs are supposed to do?). Since everything depends on all the previous backup steps. Or maybe I'm just applying craplogic here.
VDR will only fill the dedupe store you select in the backup job definition. Note that each dedupe store is completely independent, meaning that it will perform a full backup the first time you back up a VM to a new dedupe store.
Like Mightyman and Azmir said...
VDR should adjust if a backup will exceed space given...meaning it will delete restore points as needed, like my good ole Dish DVR.
If your backup destination is a VMFS or NFS datastore formatted for VDR, the old DAS (direct-attached storage) way of thinking doesn't really apply. E.g. 1TB will show 1TB used, but it should reuse previously used blocks, like a VMFS volume on a SAN would. Think of it like a can of Coke with a safety valve to let the overflow out (deleting older restore points): you'll always have a full can.
You using CIFS/SMB? If so, go to VMFS or NFS.
I think you'll be fine but have spare space ready or adjust policy.
We're finding ourselves in the same situation here. It's frustrating to know that there appears to be plenty of "available" space within the slab files, but it's not being used, and backups keep consuming fresh disk instead.
Azmir - Do you know if there are any plans to "fix" this in future VDR releases?
vmbr, mightyman and azmir:
VDR is definitely not working as designed in regards to its ability to reclaim space.
I originally had a retention policy of 60 days, 4 weeks, 1 month, and zero years, and it continued to create new slab files even 75 days into the cycle.
I then reduced the policy to 30 days, 4 weeks, zero months, and zero years, and I also added another 250GB to the original 500GB, bringing my total to a 750GB datastore. After 2 weeks VDR still continues to create new slab files, even though it is deleting old restore points and supposedly reclaiming the empty space.
As a last attempt to prove myself wrong and show that VDR is doing what it's supposed to do, I reduced the policy all the way down to 1 day, 1 week, zero months, and zero years, stopped all backups, and let VDR run through 2 weeks of cycles where it did its integrity checks and reclaim jobs. Now, 3 weeks in, with zero restore points listed and 582GB of slab files, I started the backup schedule again, and within 24 hours I had 618GB of slab files. So VDR is definitely NOT reusing slabs and is creating new ones.
I have now completely started fresh AGAIN and have split my backup into 2 different jobs on 2 different datastores to help curb the growth. I'm also looking into integrating vSphere with Avamar and see if Avamar will be better.
Funny you brought this up... I can confirm this too. I stand corrected: reclaim does work, but if the datastore (slabs) never shrinks, VDR eventually has fits. Our last backup was the 5th. :(
Our policy is to keep only 1 backup per VM, since we use a 1.36TB datastore and have over 30 VMs.
We have datarecovery.ini at MaxBackupRestoreTasks=1 to keep it controlled.
Side note: Take a look at Vmexplorer (http://www.trilead.com/). Their latest beta 2.1.18 (ask to be a beta tester) now does delta backups; vCenter support is still not ready (they are getting closer though), but normal backups are fine for ESX and ESXi, and it also compresses inline.
I've had exactly the same issues as you guys, and also issues with VDR simply stopping working.
Unfortunately we invested in VMware Essentials Plus at several sites in order to use this product. It's really not mature enough to use in a production environment IMHO. Simple things like the lack of any email reporting are perplexing.
I've since been trialing Veeam Backup and Replication; it seems to do everything that I expected VDR to do, albeit at an additional cost. The upside is that it will also replicate VMs to another ESXi server to allow for instant server recovery. Their new product (v5) is due at the end of this month, with native support for remote backups over slow WAN links, something which is crucial for me.
I'll keep my eye on this thread, and VDR in general, but so far I'm pretty disappointed with it, sorry VMware
While I was at first rather content with how more or less stable the recent VDR releases were (at least compared to the atrocities of the 1.0.x stuff), after experiencing the issue being discussed here for some time, I must say this really renders the product completely useless.
Basically, we would have to create 2 (smaller) dedupe stores for the way VDR works now:
Fill the first one with your backups until it is full, then fill the second one. Once that is full too, delete the first dedupe store and create a new one; once that is full, delete the second and create a new one. Repeat endlessly.
Are you serious?
Also, someone should check the following:
Make a backup of a new VM you have never backed up before and see how much space this full backup took (df -h before and after). Now delete this sole restore point of the VM and let the integrity and reclaim jobs run. Then make a new backup of the VM, which will of course be a full backup again, and watch the used disk space increase by the same amount again.
So not only does VDR not free any space up, the existing (aka wasted) space is not even considered for dedupe anymore, even though the same blocks should still be present (or the reclaim job just overwrote them with junk, which makes it even dumber).
I left a voice mail with our contact at VMware today, and forwarded this thread 2 days ago to see if anything is/can/will be done; still nothing back. Maybe they are recovering from VMworld.
Anyway... good point, looks as if we need to rotate/purge our datastore every once in a while until they get this fixed. I bet if I had JUST one 200GB VM and only backed it up to a 1TB drive, over time it would eventually use all the space even though the VM may only have grown by a couple of GBs.
Now... we back up to an NFS datastore (on Openfiler) that was provisioned as a thin disk. We could not make it thick; VMware required/defaulted to thin and the option was greyed out, so we could not change it. Could it be not a VDR issue, but more of a VMware thin/thick disk issue that VDR doesn't jive with when the reclaiming process kicks in? Anyone use a true VMFS thick backup datastore?
They need to focus on a complete backup solution with true email notification (not using scripts), what a concept! and stop focusing on all this Cloud talk or stop promoting VDR. In defense of their VDR developers, I don't know if the big wigs have allocated enough resources to help develop VDR, I know the folks working on it are at least trying.
My last backup was 5 days ago, just now our Integrity Check finished..Yahoo! (ran for 2 days and 22 hours), hopefully tonight I can get some backups.
Maybe they should buy out Veeam, vRanger, PHDvirtual, somebody... and offer it to us, but I'd rather have a VMware solution within vCenter.
Let me wade in...some background first
The VDR de-dupe vault consists of the index tree, which maps SHA1 hashes of all known blocks of data to the locations where they are stored. The data is stored in the Block Store, which consists of a number of slab files. The maximum size of these files is limited to 1 GB. If more than 1 GB of data needs to be stored in the vault, another file is added to the vault. The logical address stored in the index tree consists of the slab file number and the logical block index within the slab file.
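The index-tree idea described above can be sketched in a few lines (this is a toy model with names and block layout of my own invention, not VDR code): blocks are keyed by their SHA1 hash, and the value is a logical address of (slab file number, block index within the slab).

```python
import hashlib

SLAB_BLOCKS = 256  # hypothetical number of blocks per slab file

index = {}         # SHA1 hex digest -> (slab number, block index in slab)
next_block = 0     # next free logical block across all slabs


def store_block(data: bytes):
    """Store a block, deduplicating on its SHA1 hash."""
    global next_block
    key = hashlib.sha1(data).hexdigest()
    if key in index:
        return index[key]            # duplicate block: reuse its address
    addr = (next_block // SLAB_BLOCKS, next_block % SLAB_BLOCKS)
    index[key] = addr
    next_block += 1
    return addr


a = store_block(b"hello")
b = store_block(b"world")
c = store_block(b"hello")            # dedupe hit: same address as 'a'
print(a, b, c, a == c)               # (0, 0) (0, 1) (0, 0) True
```

The point of the sketch is only the mapping direction: hash in, (slab, block) address out, with duplicates collapsing onto one stored block.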
Once a slab file is created, it is never removed.
Slab files are always 1GB in size but do not necessarily have to be 100% full.
When a reclaim operation is done, we check each block's usage counter; if it reaches 0, the block is removed from the dedupe store. This means that block is now ready to be overwritten.
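The usage-counter mechanism described above amounts to reference counting; here is a minimal model of it (my own sketch, not VDR internals): restore points increment the counters of the blocks they reference, deleting a restore point decrements them, and reclaim drops blocks whose counter has reached 0 so their space can be overwritten.

```python
refcount = {}   # block hash -> number of restore points referencing it


def add_restore_point(blocks):
    for h in blocks:
        refcount[h] = refcount.get(h, 0) + 1


def delete_restore_point(blocks):
    for h in blocks:
        refcount[h] -= 1


def reclaim():
    """Drop blocks with zero references; returns the number reclaimed."""
    dead = [h for h, c in refcount.items() if c == 0]
    for h in dead:
        del refcount[h]              # block is now free to be overwritten
    return len(dead)


add_restore_point(["a", "b"])
add_restore_point(["b", "c"])        # block "b" is shared (deduped)
delete_restore_point(["a", "b"])
print(reclaim(), sorted(refcount))   # 1 ['b', 'c']
```

Note that, as the post explains, reclaiming a block frees space inside a slab file; it does not shrink or delete the slab file itself.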
The "free space" reported in the VDR UI can show 0 bytes available even after a reclaim operation. The VDR dedupe store reports the "total number of bytes available on the volume to create additional slab files" as "free space". Once VDR has created the maximum number of slab files on the dedupe store, "0 bytes free" will be reported no matter how much space is actually unallocated inside the slab files. There should be no adverse effect on backups, since there is still free space. There is no way to change that "0 bytes free" reading unless the dedupe store is extended and increased in size. Note that backups will continue, since there is unallocated space inside the slab files. I believe most folks fall into this category: if there is no space, VDR will automatically run a reclaim and rerun the backup job. If the job completes, then there is still free space in the allocated slab files. If it fails, then you are actually out of space.
HOWEVER, there could be a separate issue where the VDR appliance is unable to run any backups even after a reclaim operation has run and restore points have been deleted to free up space on the dedupe store. The VDR appliance looks at the free space on the disk holding the dedupe store by using a Linux syscall and parses the "total number of bytes available to a non-privileged user" parameter. By default, 5% of the space available on a file system is reserved for the root user under Linux. While the VDR engine does run as root, the free space can dip below that 5%, causing the engine to assume that the destination is completely full. This results in backups failing even after reclaim operations are run. The workaround is to log into the VDR appliance and use tune2fs to set the reserved space to 0. We are fixing this in the next release of VDR.
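The "available to a non-privileged user" distinction described above is visible through the statvfs interface; this small illustration (querying "/" as an arbitrary example path) shows the two figures the kernel reports:

```python
import os

st = os.statvfs("/")

free_total = st.f_bfree * st.f_frsize    # free bytes, including the root reserve
free_user = st.f_bavail * st.f_frsize    # free bytes for non-root users

# On an ext filesystem with the default 5% root reserve, free_user can hit 0
# while free_total is still positive; an engine that parses only the
# non-privileged figure will then conclude the destination is completely full.
reserved = free_total - free_user
print(free_total >= free_user, reserved >= 0)
```

The tune2fs workaround mentioned above (`tune2fs -m 0 <device>`) sets that reserved-blocks percentage to zero so the two figures coincide.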
We are working on putting together KBs on both of these issues; they should be posted shortly.
In terms of dedupe management, I'll leave it up to you how you want to manage your datasets. However, we support up to 10 VDR appliances per vCenter Server instance, each with 2 x 1TB dedupe stores. So while you will not get universal dedupe across appliances (note, this is not limited to VDR; it's an industry-wide issue as more and more customers depend on dedupe for secondary storage), I don't subscribe to the "rotate and purge" operational model for the dedupe store. You have the option to spread backups of VMs across multiple VDR appliances that each have multiple dedupe stores; this seems like the more reasonable approach, since you are ensuring that you always have backups.