VDR: Slow Integrity check

Mirko_Huth · ‎06-05-2009

I have a VDR store that is 180 GB in size. The integrity check started for the first time this morning at 01:00 am. Now it is almost 6:00 pm and it is still running, with a progress of 20%!

Why is it so slow? Performance during backup operations to this datastore is quite good.

I see lots of the following messages on the console of the VDR machine:

CIFS VFS: send error in read = -12

Status code returned 0x0000205 NT_STATUS_INSUFF_SERVER_RESOURCES

The "stop" function does not work. It turns grey and does nothing.

On shutdown I get the error:

Unmounting CIFS filesystem: unmount: /10.133.1.20/vmwaredr: device is busy

CIFS VFS: server not responding

CIFS VFS: No response to cmd 46 mid 27233

CIFS VFS: Send error in read = -11

The backup destination is a Windows 2003 Server, and it has enough free space, memory and CPU resources.

Any ideas?

Thanks!

Mirko_Huth · ‎06-05-2009

I was able to fix the performance issue by changing the IrpStackSize value on the backup destination (Windows 2003 Server):

Open Regedit

Edit this key

HKLM\System\CurrentControlSet\Services\LanmanServer\Parameters\IrpStackSize

It was set to 12 (decimal). I changed that to 30 (decimal).

The Server service needs a restart after the change. Then reboot the VDR VM.

It is now at 11% after 15 minutes. Before, it was stuck at 20% after 17 hours!

Mirko_Huth · ‎06-06-2009

Performance improved with the change. Unfortunately, the check stopped again at 50%.

So far I have not been able to complete an integrity check of the 200 GB VDR store.

There are now several corrupt restore points (maybe due to the forced shutdown of the VDR VM; I had no other choice because VDR did not respond to any actions).

Backup and restore operations are prevented until these restore points are removed.

Now guess what... the only way to remove them is a complete integrity check! Uhm, did I mention that this is not working?

Not sure if this product is really meant to be ready for production use.

I'm not able to set this post back to unanswered so you might see this back in another thread....

cjf91jf1l · ‎06-07-2009

I'm seeing the same issue with two separate destinations I set up. One is 280 GB and the other is 150 GB.

Integrity checks of both stores have failed since the first attempt. If I reboot the VDR appliance the check will fail again the next time it runs (Whenever it wants. This function really needs to be schedulable by the end user!). I then tried re-deploying the OVF template and mounting my existing backup locations. Now my 'recatalog' job is hanging just like the 'integrity checks'. The CPU of the appliance is pretty steady at 25% and the network throughput is steady at 10MB/sec but it really doesn't seem to actually be doing anything. I did try modifying the registry key you suggested hoping for a miracle, but it doesn't seem to have made any difference.

The performance of the initial backups seemed pretty poor as well. I gave the job 7 hours to run and only got about 70% of the machines done whereas vRanger would complete a backup of the same VMs in about 4 hours. I chalked it up to the deduplication that VDR was performing but after seeing the checks fail I think there is something else going on.

In case it matters, my ESX datastores are attached via iSCSI to an EqualLogic SAN. Backup destinations are CIFS shares on a Windows 2003 R2 box with an iSCSI-mounted disk on the same EQ SAN, as well as an XP VM with an iSCSI disk on a separate EQ SAN in another building. Both of these destinations were previously being used with vRanger and I didn't have any performance/crash issues.

In general I've seen a lot of strange things happen with VDR, from license errors to I/O errors and the integrity check issue. I was really looking forward to first party backup support but this is really looking like it's still in beta.

tbohmer · ‎06-08-2009

The integrity check seems to make this product somewhat unusable at the moment. I put 60 virtual machines in the backup, some of which are quite large, and while the initial full backup was pretty fast (around two days in total), the integrity check has been running for 4 days nonstop now. It is still only at 30% and increasing very slowly. Local SAN is used for the backup storage. Since no backups take place while the integrity check is running, it seems like we will be getting around one backup every two weeks, with the integrity check running the rest of the time.

Mirko_Huth · ‎06-08-2009

Seems to be related to the size of the VDR store.

I deleted my backup job/VDR store and created a new one with fewer machines in the backup selection.

I started the backup, and after it finished the store had a size of 56 GB (before it was >200 GB).

After that I started a manual integrity check. It took 1 hour and 56 minutes to complete.

This was the first successful integrity check I had during all my testing!

mcwill · ‎06-10-2009

We also experienced this, so we raised an SR.

The response was that this is a known bug and is planned to be fixed in VDR Update 1.

Regards,

Iain

cjf91jf1l · ‎06-10-2009

I have an SR open on this as well and while I suspect the answer will be wait for U1, it's still with engineering.

Mirko_Huth · ‎06-11-2009

Please let me know if you have any news from the SR.

Thanks!

Mirko

mclapsis · ‎06-16-2009

I also have an open SR for this issue. The last response from the tech was that they need another 2 business days to work on the issue.

cjf91jf1l · ‎06-17-2009

I've been given every indication that this is now a 'known bug' and will be fixed in 'Update 1' which is coming soon(?). For me this has been a disappointing debut for a product I was really excited about using.

Mirko_Huth · ‎06-17-2009

I hope they will release it ASAP.

I tried everything with VDR with no success. If there are only two or three machines in the backup job and the store is below 80 GB, it works slowly but without errors.

When the store reaches a certain size it becomes useless. Currently I have a recatalog that has been running since Monday evening and it is now at 61%!

And I just asked for a nice integrated GUI for VCB... I never asked for a new product. VCB compared to VDR is rock solid. I can't understand why they started a new product instead of improving the existing one.

cjf91jf1l · ‎06-26-2009

I was made aware of a serious (in my opinion) bug with VDR during a call with VMware support that I haven't seen discussed anywhere. This is an internally known issue that causes snapshots to build up on VMs that are members of VDR backup jobs. I would urge you to do a quick check to see if you are affected by this issue as well.

During the backup process a new snapshot is created and VDR updates the snapshot descriptor file (vm_name-000001.vmdk) to mark the snapshot as un-removable. The bug occurs when the backup process completes: VDR fails to mark the snapshot as removable again, so the snapshots remain.

The tricky part of the problem is that the snapshots are not visible through the vSphere Client, nor are they listed in apps like 'RVTools' that use the VMware CLI to gather data. They could potentially be listed in the new datastore views but I didn't think to look there before I resolved it in my environment. I ran across them by logging into the service console and running the following command to list all the delta files on the datastores attached to the server.

find /vmfs/volumes/ -name \*delta\*
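To see how large each delta file has become, the find command above can be extended as a rough sketch. The function name list_delta_files is hypothetical, and it assumes that find and ls behave on the service console as they do on a standard Linux system. The stricter pattern '*delta*.vmdk' is an assumption about how the delta files are named.

```shell
#!/bin/sh
# Hypothetical helper: list snapshot delta files together with their sizes.
# The search root defaults to /vmfs/volumes/ (the path used in the post);
# pass another directory as the first argument to narrow the search.
list_delta_files() {
    root="${1:-/vmfs/volumes/}"
    # -type f skips directories; ls -l prints size and path for each match.
    find "$root" -type f -name '*delta*.vmdk' -exec ls -l {} \;
}
```

Calling list_delta_files with no argument scans all datastores; calling it with a single VM directory limits the output to that VM.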

In my environment I noticed numerous VMs with multi-gigabyte delta files that I couldn't account for via snapshots listed in the GUI. Here is the solution I was given by VMware. Via the service console, browse to the location of the VMDK files for the affected VM. Run this command to identify the descriptors that need to be corrected, replacing 'virtual_machine_name' with the actual name of the VM.

grep -I ddb.dele virtual_machine_name-000???.vmdk

This command will quickly identify the snapshot descriptors that are marked as non-deletable. The workaround is to edit the affected VMDK descriptor files and change "ddb.deletable" from "false" to "true". You will probably also need to edit the root VMDK file and change this field as well, otherwise you may be left with one open snapshot. Once you have edited all the files, create a new snapshot for the VM either via the GUI or the command line. Then issue the "Delete All" snapshots command to force ESX to combine all the files and close all the visible and hidden snapshots.
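The descriptor edit can also be scripted instead of done by hand. The following is only a sketch under assumptions, not a VMware-provided procedure. The function name mark_deletable is hypothetical. It assumes the descriptor line has the form ddb.deletable = "false", and that grep, cp and sed are available on the console. It keeps a .bak copy of every descriptor it changes.

```shell
#!/bin/sh
# Hypothetical helper: flip ddb.deletable from "false" to "true" in each
# descriptor file given as an argument. Assumes a line of the form
#   ddb.deletable = "false"
# A .bak copy of the original descriptor is kept next to each changed file.
mark_deletable() {
    for desc in "$@"; do
        # Leave files alone that are not marked as non-deletable.
        grep -q '^ddb.deletable *= *"false"' "$desc" || continue
        cp "$desc" "$desc.bak" || return 1
        # Rewrite the descriptor from the backup with the flag changed.
        sed 's/^\(ddb\.deletable *= *\)"false"/\1"true"/' "$desc.bak" > "$desc" || return 1
        echo "updated: $desc"
    done
}
```

Run from the VM's directory, a call such as mark_deletable virtual_machine_name-000???.vmdk would cover the snapshot descriptors found by the grep above. The root descriptor would have to be passed separately. Only pass the small text descriptor files, never the -delta or -flat data files.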

Mirko_Huth · ‎06-28-2009

Thank you cjf91jf1l. I just checked my datastores and it turned out that I have several VMs with lost snapshots.

Your workaround does not work for me because I use only ESXi hosts. Is there another way to remove these orphaned snapshots?

Thanks!

Mirko

cjf91jf1l · ‎06-28-2009

One solution is to simply clone the VMs, as the new machines would have no open snapshots. This was VMware's first solution for me, but you would experience some downtime. If you are trying to avoid downtime as I was, you can use the unsupported service console on ESXi to edit the descriptor files. This page has instructions on accessing it:

Can I ask how large the snapshots have become? I really think VMware should be making this issue known to its users.

netkombonnet · ‎09-25-2009

Hi

For ESXi, you will find the solution for activating SSH here: http://communities.vmware.com/blogs/Knorrhane/2008/08/17/enable-ssh-on-esxi-3-35

After that you will be able to run the commands via the CLI.