Solved: Re: Why Failed to clone disk: There is not enough ...

IT_Architect · ‎04-14-2013

The Situation:
1. We have two SuperMicro servers running ESXi 5.1, and do cross-backups each night between the servers using vmkfstools. Each server has a Windows 2003 Server VM running Windows Services for UNIX Version 3.5, each with a system virtual hard drive and a data virtual hard drive, which is an NTFS compressed volume that is shared over NFS. All VMs are thin. This combination has worked fine since 2008. Using this method I can store backups in 1/5th of their thin size.

2. Recently, I did a hardware refresh making both servers the same model as the most powerful server we had before, except I added a second 6-core processor, doubled the memory to 12 GB, changed to a RAID-10 with 2 TB of drive space, and moved from ESXi 4.0 to 5.1. I expanded the NFS volumes / vmdk on both servers to 500 GB to allow for more backups.

The Problem:

1. When Server2 backs up to the NFS volume on Server1, the clones go fine. However, when Server1 backs up to Server2, the largest VM (55GB thin and 80GB declared) fails at the ~90% point on each attempt, even if I clear space, which it shouldn't need since the drive only has about 51GB of data on a 500 GB NFS volume, and the ESXi VMware Client shows the same amount of free space no matter which server I check it from.

     Destination disk format: VMFS thin-provisioned
      Cloning disk '/vmfs/volumes/datastore1/my_vm/my_vm.vmdk'...
      Clone: 90% done.Failed to clone disk: There is not enough space on the file system for the selected operation (13).

All of the other VMs after it back up fine. It only fails on the larger VM, but I can back it up fine to a local directory.
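Since the failure only shows up as text in the clone output, here is a minimal Python sketch of how a wrapper script could detect it. The sample text is the output quoted above; the function name and regular expressions are my own, not anything vmkfstools or GhettoVCB provides, and the patterns assume the message format stays as shown.

```python
import re

# Sample text copied from the clone output quoted above.
SAMPLE_OUTPUT = """\
Destination disk format: VMFS thin-provisioned
Cloning disk '/vmfs/volumes/datastore1/my_vm/my_vm.vmdk'...
Clone: 90% done.Failed to clone disk: There is not enough space on the file system for the selected operation (13).
"""

# Assumed message shape: "Failed to clone disk: <reason> (<code>)."
FAILURE_RE = re.compile(
    r"Failed to clone disk: (?P<reason>.*?)\s*\((?P<code>\d+)\)\.?\s*$",
    re.MULTILINE,
)
PROGRESS_RE = re.compile(r"Clone: (\d+)% done")


def find_clone_failure(output):
    """Return (last_percent, reason, code) for a failed clone, else None."""
    failure = FAILURE_RE.search(output)
    if failure is None:
        return None
    percents = [int(p) for p in PROGRESS_RE.findall(output)]
    last_percent = percents[-1] if percents else None
    return last_percent, failure.group("reason"), int(failure.group("code"))


if __name__ == "__main__":
    print(find_clone_failure(SAMPLE_OUTPUT))
```

A wrapper could run this over the captured output of each clone and stop the job as soon as it returns something other than None.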

Observations:
- I made a new secondary vhd at 500 GB. It worked for a while, but when I went to add more I ran into the same problem.
- I made a new secondary vhd at 1024 GB and the same thing happened.
- I ran chkdsk /r on it and it showed bad clusters. Then I did it on the machine that works fine, and it shows the same thing. I can run chkdsk /r 50 times on each machine and it makes no difference. The only place it shows bad clusters is with the thin .vmdk backup files.
- After running chkdsk /r, I can back up again, even though the size used on disk doesn't change. However, one of the features of chkdsk /r is to re-calculate free space.

Thoughts:
- It's odd that one machine has the problem and not the other, and that I haven't had the problem for the past 4 1/2 years. The only notable difference is one of them has a 55 GB thin .vmdk, which is somewhat larger than the others. Other than that, the two Windows machines are the same; in fact, when set up, one was a copy of the other. Even the hardware on the machine that has the problem is unchanged other than larger drives, more memory, and a second processor. The big change was from ESXi 4.0 to 5.1.

- It seems that when I copy thin .vmdks to the Windows 2003 NFS volume, it confuses Windows. I can see where it might, since it provisions a larger size than what it actually uses. However, I didn't have this problem before. When I run chkdsk /r, the last thing it does is recompute free space. That is probably why I can back up again.

Question: Does anyone know if there is a definitive answer somewhere that explains the cause of this phenomenon and if there is a work-around?

IT_Architect · ‎06-06-2013

I found the bug.
Problem recap:

I have a 2003 Server with WSFU, and I have been backing up to it this way since 2008. HOWEVER, I ran into a problem after “upgrading” from vSphere 4.0 to 5.1. While both Windows and vSphere show plenty of disk space, GhettoVCB/vmkfstools now throws an error that it ran out of disk space.

Experiment:
- I decided to isolate the problem by backing up a single VM to a new NFS VHD. I made a new 250GB VHD. To try something different, I made it EagerZeroedThick (usually thin), and initially left compression off. The VM is 80GB provisioned, and 57GB thin.
- I set the rotation to 2, since you need 2 + 1 to do the backup. After the backup finishes it drops the oldest backup.
- I ran many rotations and there were no problems. I watched the disk space and it would go down to 9GB during backup and bounce back after the backup completed. All worked as expected.
- I turned on compression for the volume in Windows, and waited for it to finish. When it finished, I had a ton of free space, and both Windows and the vSphere client total and free space numbers agreed. I had lots of free space.
- Next I cranked the number of copies to 4 in GhettoVCB. The first backup went fine as expected, and left me with lots of free space due to the compression. The next backup, which would have made the 4th clone, failed with an insufficient disk space error. Thus, it failed at the same point as if the volume had no compression. (Changing from thin to EagerZeroedThick made no difference.)
- I then deleted two of the backups, so I was back down to 2 again, which would leave room for a third, and ran the backup. It failed. Huh! Even with the 4th failed backup still there, the volume shows 107GB free.
- Next, I deleted the failed backup, went into the second backup directory, made a sub-directory named junk, and copied the backup files into it. This takes vmkfstools out of the equation for the copy. During the copy, I received the error:

"Windows - Delayed Write Failed

Windows was unable to save all of the data for the file <whatever> The data has been lost. This error may be caused by a failure of your computer hardware or network connections."

Wha...t! Windows and vSphere show tons of space, and even if there were no compression at all, there should have been room. However, this is consistent with what has been happening to people. When it gets to this point, you can delete the files to make space and you still can't do a backup. Something changes in the VHD to where you can't get the space back?
- OK, well I've tried chkdsk in the past, and that didn't fix it so I'll try decompressing the disk and see if that fixes it. Yep! Sooo...it appears Windows is the rat! But why just since I went to 5.1? I suppose it could be that vSphere corrupts the NFS somehow, but for an error like this, that's not very likely. NFS is just a protocol to interface to the host operating system. It's time to see what Windows has to say for itself. After some Googling, I came up with this:
Error message and events are logged in the System log when you try to compress a large file on an N...
The status of the problem is:
"Microsoft has confirmed that this is a problem in the Microsoft products that are listed in the "Applies to" section"

- Applies to:
Windows 2000 until Server 2008. However, the article was last updated in 2007, so I have no confidence that it doesn't also apply to Microsoft's current operating systems.
- Based on what percent complete it fails at, it appears their definition of large is somewhere around 50GB. I have others that have a provisioned size of 80GB, but their thin size is in the 36GB range, and they don't cause any problems. Combining a thin VHD with NTFS compression is a thing of beauty; however, there is a major fly in the ointment. The only theory I have as to why it didn't happen with 4.0 is that with the new server, I have room for more copies. I may not have the exact combination that causes things to fail, but I do know who the rat is, and it's Windows. It's nice to have this bit of misery solved for myself and others using this approach. I don't know if ZFS can do any better, and I don't like the way it sucks up memory like a whirlpool. The newest Windows might work, with its new compression and de-dup features, but I have yet to prove it.
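The numbers in the experiment above line up if each clone is charged its full 80GB provisioned size against the 250GB volume. Here is a rough sketch of that arithmetic in Python. The function names are my own, and it ignores file system overhead and assumes compression buys nothing at allocation time, which is only what the observed failure point suggests, not something I have confirmed.

```python
VOLUME_GB = 250       # size of the test VHD in the experiment
PROVISIONED_GB = 80   # provisioned size of the VM being cloned


def free_after(volume_gb, provisioned_gb, copies):
    """Free space left if every copy costs its full provisioned size."""
    return volume_gb - provisioned_gb * copies


def clones_that_fit(volume_gb, provisioned_gb):
    """How many full-size clones fit on the volume at once."""
    return volume_gb // provisioned_gb


if __name__ == "__main__":
    # Rotation of 2 needs 2 + 1 copies on disk while the backup runs.
    print(free_after(VOLUME_GB, PROVISIONED_GB, 3))    # 10
    # A 4th clone would need more than the volume holds.
    print(free_after(VOLUME_GB, PROVISIONED_GB, 4))    # -70
    print(clones_that_fit(VOLUME_GB, PROVISIONED_GB))  # 3
```

The 10GB left over with three copies is close to the roughly 9GB low point seen during the uncompressed runs, and the 4th clone is exactly where the backup failed once compression was on.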

Other observations:

- Concerning backup speeds, I get vastly different speeds when running with NTFS compression off vs. on. Across a GB Internet connection I get 70MB/second without compression (pretty much wire speed), and about 26MB/second with compression on. The CPU monitor shows that the compression is using only one thread/vCPU.
- One must watch the backup logs carefully for errors. After the backup fails, GhettoVCB still deletes the oldest backup, and retains the failed backup. This means that if left unattended, it will rotate out all of your good backups.
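To illustrate the rotation hazard in the last point, here is a minimal sketch of a pruning rule that never lets a failed run push out a good backup. This is not how GhettoVCB works and is not a patch for it. The function, the tuple layout, and the backup names are all hypothetical.

```python
def plan_deletions(backups, keep):
    """Decide which backups to delete.

    backups: list of (name, succeeded) tuples, oldest first.
    keep: number of good backups to retain (must be at least 1).
    Failed backups are always deleted; good ones only when more than
    `keep` of them exist, oldest first.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    failed = [name for name, ok in backups if not ok]
    good = [name for name, ok in backups if ok]
    surplus_good = good[:-keep]  # empty when len(good) <= keep
    return failed + surplus_good


if __name__ == "__main__":
    history = [("b1", True), ("b2", True), ("b3", False), ("b4", False)]
    # Two failed runs in a row: only the failed copies are removed.
    print(plan_deletions(history, keep=2))  # ['b3', 'b4']
```

With a plain "drop the oldest" rotation, the same history would have cost both good backups and kept the two failed ones.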


IT_Architect · ‎04-23-2013

More...
The problem remains. I use the latest GhettoVCB to do thin clone backups to a Windows 2003 Server compressed volume, projecting the volume as an NFS share. GhettoVCB deletes the oldest backups, so there is no reason for the volume to run out of space, and indeed both Windows and ESXi agree on the amount of free space, and that there is many times what is required for the backup even if the backups were not thin. This is what I've observed:

1. After so many cycles, the same VM, which is also the largest, will not back up anymore. The error from vmkfstools is that it is out of disk space, even though it shows it has almost a terabyte of free space. I'm assuming the reason the problem is with this VM is because it's larger than the rest all put together (80 GB declared, 55 GB thin). I have others declared at 80 GB also, but they are much smaller from a thin standpoint. It will get to 90%+ complete before it fails.

2. When I run into the problem, no matter how many backups I erase, and no matter how much free space the NFS volume shows I have, I can no longer back up the VMs. The error returned in vmkfstools is that there is not enough disk space, even when there is almost a terabyte free.

3. If I run chkdsk, no switches, from within Windows (read only), it shows no errors.

4. If I run chkdsk /r, and reboot, it finds and fixes errors. However, if I repeat the process, I will get the same errors on the same files no matter how many times I do it. This behavior is consistent for the system drive as well, which is not exported NFS and has no thin backups on it, and is not compressed. I also get the same behavior on my other 2003 Server VM. It seems chkdsk doesn't work properly inside of a VM. The error returned is: "Windows replaced bad clusters in file <file number>" for each file.

5. The last thing chkdsk /f does when it runs is recalculate the free space. After that, I can back up again.

Thoughts:
- The fact that Windows and ESXi agree on the amount of free space is not impressive because NFS is not a file system, it is simply a protocol, and ESXi gets its information from Windows through the NFS protocol.
- It seems likely that vmkfstools actually does run out of disk space, even though Windows shows there is far more than enough.
- Since the problem is related to the number of cycles, I would theorize that the space is not being returned for use when a backup is deleted. This is supported by the fact that after chkdsk /r, I can backup again.
- I've been using this method of backing up since 2008 with no problems. The things that have changed recently are moving from ESXi 4.0 to 5.1, and updating to the latest GhettoVCB script. Since GhettoVCB simply uses VMware commands, the chances of it being script related seem remote.
- Taken together with the fact that chkdsk /r does not work properly with the Windows VMs, there appears to be a compatibility issue between Windows and VMware.
- Since I performed a chkdsk /f, a new cycle has started and it has not yet reached the point where it recycles backup space. As soon as the problem surfaces again, I plan to use the article "How to locate and correct disk space problems on NTFS volumes", pointed to by Microsoft, to attempt to determine the cause, and possibly implement a cure.
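Because the reported free space cannot be trusted here, one way to test the theory above is to actually write data before starting a backup instead of just reading the free-space figure. Below is a minimal Python sketch, assuming the script runs somewhere the volume is mounted or local. The probe file name and both function names are my own inventions, and a passing probe does not prove a later multi-GB clone will succeed.

```python
import os
import shutil


def reported_free_bytes(directory):
    """Free space as reported by the OS, which may not be allocatable."""
    return shutil.disk_usage(directory).free


def probe_write(directory, size_bytes, chunk_bytes=1024 * 1024):
    """Try to really write size_bytes of incompressible data.

    Random data is used so NTFS compression cannot shrink the probe.
    Returns True on success, False if the OS raises an error.
    The probe file is removed either way.
    """
    path = os.path.join(directory, ".space_probe.tmp")  # hypothetical name
    remaining = size_bytes
    try:
        with open(path, "wb") as handle:
            while remaining > 0:
                chunk = os.urandom(min(chunk_bytes, remaining))
                handle.write(chunk)
                remaining -= len(chunk)
            handle.flush()
            os.fsync(handle.fileno())  # push the data out of the cache
        return True
    except OSError:
        return False
    finally:
        if os.path.exists(path):
            os.remove(path)
```

Comparing `reported_free_bytes()` with the largest size for which `probe_write()` still succeeds would show whether the free-space number and reality have drifted apart.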

Edit:
It happened again. I decided to see if Windows has the same problem so I did a copy of a backup. It also failed. The error from Windows was:
"Windows - Delayed Write Failed
Windows was unable to save all of the data for the file ???-flat.vmdk. The data has been lost. This error may be caused by a failure of your computer hardware or network connection."
Since it was a copy of a directory on the same local drive, I can safely rule out anything network related. Notice, Windows didn't say it was out of space, like vmkfstools does.
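The copy test can be scripted so it is repeatable and so the result is checked rather than assumed. Here is a minimal Python sketch that copies a file and compares checksums. The function names are my own, and it only shows whether a plain copy on the volume survives intact. It says nothing about why a copy fails.

```python
import hashlib
import shutil


def sha256_of(path, chunk_bytes=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_bytes), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, destination):
    """Copy source to destination and confirm the contents match.

    Returns False if the copy raises an OS error or the digests differ.
    """
    try:
        shutil.copyfile(source, destination)
    except OSError:
        return False
    return sha256_of(source) == sha256_of(destination)
```

Running this against the -flat.vmdk of an existing backup, with the destination in a scratch sub-directory on the same volume, reproduces the manual test without vmkfstools or the network being involved.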

- I went into Windows Explorer and right-clicked on the volume > Properties > Tools tab > Check Now button, and checked both Automatically fix file system errors, and Scan and Attempt recovery of bad sectors. It said it couldn't do it now and asked if I wanted to do it at the next boot, and I answered yes. I believe that is the same as chkdsk /r. After it rebooted and "fixed" the errors, which it states as "Windows replaced bad clusters in file <file number>", I started another backup.

Thanks!
