VMware Cloud Community
PakChan
Enthusiast

Why do I get VDR Integrity Check error on Restore Points?

Situation

I'm running VDR 2.0 backing up about 35 VMs to 1 destination (all SAN-based). The destination size is about 780GB, of which about 75GB is marked as free (deduplicated size is about 505GB, presumably leaving me about 275GB actually free). We have a backup window of 5 hours in the evening, and all VMs are backed up well within that time. After that, we back up the VDR appliance to tape by shutting it down, snapshotting it, and restarting it to allow normal operations whilst we back up the snapshot to tape (this is for disaster recovery purposes in case we lose our SAN). The maintenance window starts on the hour after this snapshotting is done.
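The space figures above can be checked with a little arithmetic. This is only a sketch using the numbers quoted in the post; the function name and dictionary keys are made up for illustration, and "available" here simply means capacity minus deduplicated size, which may not match how VDR accounts for space internally.

```python
# Figures quoted in the post, in GB.
capacity_gb = 780
reported_free_gb = 75
dedup_size_gb = 505


def space_summary(capacity, reported_free, dedup_size):
    """Summarise destination space from the three figures VDR reports."""
    # "Available" is assumed to be capacity minus deduplicated size.
    available = capacity - dedup_size
    return {
        "available_gb": available,
        "reported_free_pct": round(100 * reported_free / capacity, 1),
        "available_pct": round(100 * available / capacity, 1),
    }


summary = space_summary(capacity_gb, reported_free_gb, dedup_size_gb)
print(summary)
```

On these figures the destination reports under 10% free even though roughly a third of its capacity is not occupied by deduplicated data.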

Problem

The problem is that we occasionally get an Integrity Check failure for a restore point, after which the destination is locked until the offending recovery point is deleted. This happened over the weekend.

Questions

What I would like to know is:

1) What is the cause of this? The backup in question seems to have succeeded, and after the VDR appliance restarts, the Reclaim operation that runs succeeds. It's the following Integrity Check that fails, finding a corrupt transaction. Why do these corrupt transactions occur?

2) Given that the only valid recovery procedure for this is to delete the offending recovery point and re-run the Integrity Check, why is there no option to do this automatically?

3) How difficult would it be to script such a recovery procedure in VDR? I'm aware that there is no API, but I'm fairly comfortable with shell scripting on Linux, and so would just need a few pointers to put me on the right track.

4 Replies
CRad14
Hot Shot

To answer at least part of your question: I believe VDR should be deleting the corrupt restore points on its own. However, I have also had times when I needed to mark a restore point for deletion manually.

Conrad www.vnoob.com | @vNoob | If I or anyone else is helpful to you make sure you mark their posts as such! 🙂
PakChan
Enthusiast

Thanks for answering. I would have hoped that VDR would automatically delete them, but experience has indicated that this is not the case. VDR has never deleted a corrupt recovery point. Instead, it simply reruns the Integrity Check at the start of each maintenance window (which in my case is daily from midnight until 6pm), telling me each time that there is a corrupt recovery point and that the destination is locked until that is resolved.

It's really frustrating, and it reduces VDR's utility as a backup solution if it locks up for no apparent reason and requires manual intervention to initiate a 20-hour recovery process, all of which is automated apart from the manual selection of the corrupt recovery point. (Is there any situation where you would not select all corrupt recovery points for deletion?)

cag201110141
Enthusiast

You will continue to get problems with integrity checks because there is not enough free space at the target. VDR needs that free space to deduplicate and run its internal tasks. I am not sure how much VDR needs, but it was more than I had: I lost 11 datastores. The problem always starts once the store fills and reclaim tasks fail. I don't know the exact amount you should have free, but I have found that always keeping 40-50% free keeps things working. It's a waste, but until 2.1 is out there is no fix other than using another product or having plenty of free space.
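One rough way to keep an eye on that 40-50% rule of thumb is to check the filesystem that holds the destination. The following is a minimal Python sketch, not anything VDR provides. It assumes the destination is mounted at a path you can read (the path below is a placeholder, not a VDR default) and that filesystem-level free space is a usable proxy, which it may not be, since VDR's own view of deduplicated size is not visible this way. The function names and the 0.40 default are illustrative.

```python
import shutil


def free_fraction(path):
    """Return free space on the filesystem containing `path`, as a fraction of total."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def below_threshold(free_frac, threshold=0.40):
    """True if the free fraction is under the rule-of-thumb threshold."""
    return free_frac < threshold


if __name__ == "__main__":
    # Placeholder path: replace with wherever the destination is mounted.
    destination_path = "."
    frac = free_fraction(destination_path)
    status = "LOW" if below_threshold(frac) else "ok"
    print(f"{destination_path}: {frac:.0%} free ({status})")
```

Run from cron before the backup window, something like this could at least warn that the store is heading below the level at which reclaim tasks reportedly start to fail.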

PakChan
Enthusiast

That does happen, but less frequently than the random integrity check errors. As it turns out, that's just happened to me this weekend, with some of the VMs failing to back up as a result of running out of space. This is despite having 2.4GB free on the root filesystem, 9% (73GB) actual free space, and about 260GB "available" (but possibly fragmented) space within the destination store (i.e. Capacity - Deduplicated Size).

Unfortunately, the behaviour I have seen previously after extending the destination store (which I have now done) is that the additional storage is rapidly swallowed up, and free space drops back to about 10% for a while, until it "runs out of space" again.

Is there a document anywhere which describes how we might predict when VDR is likely to complain about space? I'm guessing that maintaining the index/hash tables on the chunks takes up a large amount of temporary storage during an actual backup, but there seems to be no guidance as to how much space that process takes.
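In the absence of such a document, one crude approach is to log the reported free space after each nightly backup and extrapolate. The sketch below fits a straight line to daily readings and estimates how many days remain before free space reaches a chosen floor. The function name, the floor value and the sample readings are all hypothetical, and a linear trend is an assumption: the behaviour described in this thread (space being swallowed rapidly and then levelling off) suggests real usage may not be linear at all.

```python
def days_until_floor(samples_gb, floor_gb):
    """Estimate days until free space hits `floor_gb`.

    `samples_gb` is a list of free-space readings, one per day, oldest first.
    Returns None if there are too few readings or free space is not shrinking.
    """
    n = len(samples_gb)
    if n < 2:
        return None
    # Least-squares slope of free space against day index (GB per day).
    mean_x = (n - 1) / 2
    mean_y = sum(samples_gb) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples_gb))
    slope = sxy / sxx
    if slope >= 0:
        return None
    # Project forward from the most recent reading at the fitted rate.
    return max(0.0, (floor_gb - samples_gb[-1]) / slope)


# Hypothetical readings in GB, not taken from a real VDR appliance.
readings = [100, 90, 80, 70]
print(days_until_floor(readings, floor_gb=50))
```

This only gives a warning horizon; it says nothing about how much temporary space a backup or reclaim run actually needs.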

We're also running with 8 threads; would it be better to run with fewer? If so, how would that impact the backup times?
