This was an interesting development today. In my lab, all of the VMDKs are thinly provisioned. iSCSI connects the backend Synology to the ESXi host (6.5.0 Update 1 - build 7526125). The NAS is VAAI compatible, has been completely reliable, stable, isn't overworked or overtaxed, and has TBs of free space if I need to allocate more.
Prior to messing with the lab, I used the web interface to take a snapshot of the systems I wanted to work on, so I wouldn't have to rebuild or undo the damage if I made a mistake. A very typical use case. The datastore has 0.5TB allocated to it and was sitting at 176GB in use of 500GB. After two days of working on 5 VMs (5 VMDKs), it was time to delete the snapshots and commit the changes. I did so, one at a time.
What I didn't realize was that, as the snapshots were being deleted, all of the VMDKs became fully allocated. By some miracle, there was 26MB remaining out of the 500GB possible after the last disk finished. I don't understand why this happened today.
I'm aware I can fix this by taking the affected VMs offline and using vmkfstools to clone the VMDKs to another datastore and move it back - but it's an annoying process that should have never happened in the first place.
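For reference, the clone round-trip I mean looks roughly like this (a sketch only; the VM names and datastore paths here are made up for illustration, and the VM has to be powered off first):

```shell
# Sketch of the clone-and-move-back workaround (paths are made up).
# From an ESXi SSH session, with the VM powered off:
src="/vmfs/volumes/esx-datastore2/MyVM/MyVM.vmdk"
tmp="/vmfs/volumes/esx-datastore3/MyVM/MyVM.vmdk"

# -i clones via the descriptor file; -d thin writes the copy thin-provisioned.
# The command is echo'd here so you can eyeball it before actually running it.
echo vmkfstools -i "$src" -d thin "$tmp"

# Then delete the original, clone back the same way, and re-add the disk to the VM.
```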
Any ideas?
This sounds almost impossible; at least I've never seen something like this. With only 26MB of free disk space on the datastore after deleting a snapshot, that snapshot must have been less than 26MB in size, which is very unlikely. Did you refresh the disk usage for the datastore to ensure that it's not a display issue?
Also, please run ls -lisa for the VMs' .vmdk files, as this command will show the provisioned as well as the used disk space.
André
That's what I thought. I turned off one of the VMs and started moving it to another new LUN. Some additional output:
The consumed space according to the output below agrees 100% with the ESXi UI and with what I'm seeing as consumed on the LUN on the NAS.
[root@esx:/vmfs/volumes/595cb657-6e332364-b2d1-64006a5cf50c/ZLab - 2012] vmkfstools -Ph -v10 /vmfs/volumes/esx-datastore2
VMFS-6.81 (Raw Major Version: 24) file system spanning 1 partitions.
File system label (if any): esx-datastore2
Mode: public ATS-only
Capacity 499.8 GB, 649 MB available, file block size 1 MB, max supported file size 64 TB
Volume Creation Time: Wed Jul 5 09:50:15 2017
Files (max/free): 16384/15841
Ptr Blocks (max/free): 0/0
Sub Blocks (max/free): 16384/15892
Secondary Ptr Blocks (max/free): 256/255
File Blocks (overcommit/used/overcommit %): 0/511095/0
Ptr Blocks (overcommit/used/overcommit %): 0/0/0
Sub Blocks (overcommit/used/overcommit %): 0/492/0
Large File Blocks (total/used/file block clusters): 1000/56/944
Volume Metadata size: 1510866944
UUID: 595cb657-6e332364-b2d1-64006a5cf50c
Logical device: 595cb656-0c30fe47-b131-64006a5cf50c
Partitions spanned (on "lvm"):
naa.6001405ea2cb68fd3a37d4153dac61de:1
Is Native Snapshot Capable: NO
OBJLIB-LIB: ObjLib cleanup done.
WORKER: asyncOps=0 maxActiveOps=0 maxPending=0 maxCompleted=0
The space of these thinly provisioned disks has also expanded. This particular VMDK only contains 7.73GB of data.
[root@esx:/vmfs/volumes/595cb657-6e332364-b2d1-64006a5cf50c/ZLab - 2012] ls -lisa
total 18630784
16777988 128 drwxr-xr-x 1 root root 77824 Mar 16 18:38 .
4 1024 drwxr-xr-t 1 root root 81920 Mar 16 17:34 ..
79694596 0 -rw-r--r-- 1 root root 13 Mar 16 18:38 ZLab - 2012-aux.xml
41945860 524288 -rw------- 1 root root 536870912 Jan 11 14:41 ZLab - 2012-cec26d94.vswp
16780036 17876992 -rw------- 1 root root 64424509440 Mar 16 21:10 ZLab - 2012-flat.vmdk
46140164 1024 -rw------- 1 root root 74232 Mar 16 18:39 ZLab - 2012.nvram
20974340 0 -rw------- 1 root root 577 Mar 16 18:38 ZLab - 2012.vmdk
12585732 0 -rw-r--r-- 1 root root 43 Mar 16 18:38 ZLab - 2012.vmsd
8391428 0 -rwxr-xr-x 1 root root 3044 Mar 16 18:38 ZLab - 2012.vmx
58723076 0 -rw------- 1 root root 0 Mar 14 14:52 ZLab - 2012.vmx.lck
50334468 0 -rw------- 1 root root 3089 Jan 11 14:42 ZLab - 2012.vmxf
62917380 0 -rwxr-xr-x 1 root root 3051 Mar 16 18:38 ZLab - 2012.vmx~
37751556 1024 -rw-r--r-- 1 root root 297852 Jan 17 16:00 vmware-1.log
67111684 1024 -rw-r--r-- 1 root root 271359 Mar 16 18:39 vmware.log
25168644 112640 -rw------- 1 root root 115343360 Jan 11 14:41 vmx-ZLab - 2012-3468848532-1.vswp
54528772 112640 -rw------- 1 root root 115343360 Mar 14 14:52 vmx-ZLab - 2012-3468848532-2.vswp
After moving one of the VMs from datastore2 to the new datastore3, the fully allocated size of the VMDK was maintained even though it shows up as thin. The ESXi interface reports a thin disk, but it remains fully allocated even after the move. The NAS reports the same amount of allocated disk space in the LUN as ESXi does.
Removing a snapshot is essentially a disk consolidation.
Strange that it shows only 649MB as being available on the datastore. According to
16780036 17876992 -rw------- 1 root root 64424509440 Mar 16 21:10 ZLab - 2012-flat.vmdk
the virtual disk has been provisioned with 60GB, but currently consumes ~17GB, i.e. nothing unusual for a thin-provisioned virtual disk.
To get a better overview, run RVTools which will show provisioned vs. used disk space in the vInfo tab.
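In the meantime, you can convert the two numbers on the ls -lisa line by hand. Column 2 is the allocated space in 1K blocks (actual usage), and the size field is the provisioned size in bytes. A quick sketch using the values from your output:

```shell
# Values copied from the ls -lisa line for ZLab - 2012-flat.vmdk:
blocks_kb=17876992       # column 2: allocated 1K blocks (actual usage)
size_bytes=64424509440   # size field: provisioned size in bytes

used_gb=$((blocks_kb / 1024 / 1024))
prov_gb=$((size_bytes / 1024 / 1024 / 1024))
echo "provisioned: ${prov_gb} GB, used: ~${used_gb} GB"
```

which confirms the 60GB provisioned / ~17GB used figures above.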
André
I came across that article as well. But I'm on iSCSI, not NFS. VAAI is enabled on my iSCSI options, so this should be native functionality as far as the tech is concerned. There are zero NFS services in use (never have been configured).
Thanks for the tip about RVTools. Unfortunately, it confirmed the issue on the datastore (as noted in the post above). Even though you're correct about the allocated vs consumed space, the datastore is being reported full. When I'm moving these thin provisioned disks from one datastore to another, they end up fully committed. I didn't get a linear improvement in space in the datastore as the VM moved.
Datastore 2 improved from 0.6GB free to 13.3GB free, which confirms its allocation was likely correct. In the process of moving the 2008R2 VM from DS2 -> DS3, the VMDK shown below as thin provisioned (True) was expanded to full size upon the move. The size of DS3 confirms the command line. The disk, despite having only 7.6GB in it, was fully allocated on DS3 to 60GB.
What's still not answered is why the datastore shows as full while the sizes of the VMs don't add up.
Below is an example:
[root@esx:/vmfs/volumes/5aac2b85-36933f17-7529-64006a5cf50c/ZLab - 2008R2] ls -lisa
total 63031424
1412 128 drwxr-xr-x 1 root root 73728 Mar 16 21:18 .
4 1024 drwxr-xr-t 1 root root 73728 Mar 16 20:41 ..
33554628 0 -rw------- 1 root root 13 Mar 16 21:18 ZLab - 2008R2-aux.xml
196 62914560 -rw------- 1 root root 64424509440 Mar 16 21:18 ZLab - 2008R2-flat.vmdk
4194500 1024 -rw------- 1 root root 74232 Mar 16 21:18 ZLab - 2008R2.nvram
8388804 0 -rw------- 1 root root 579 Mar 16 21:18 ZLab - 2008R2.vmdk
37748932 0 -rw------- 1 root root 43 Mar 16 21:18 ZLab - 2008R2.vmsd
20971716 0 -rw------- 1 root root 2955 Mar 16 21:18 ZLab - 2008R2.vmx
12583108 0 -rw------- 1 root root 3087 Mar 16 21:18 ZLab - 2008R2.vmxf
25166020 1024 -rw------- 1 root root 298554 Mar 16 21:18 vmware-1.log
29360324 1024 -rw------- 1 root root 276444 Mar 16 21:18 vmware.log
16777412 112640 -rw------- 1 root root 115343360 Mar 16 21:18 vmx-ZLab - 2008R2-2299771517-1.vswp
Another example of a thin provisioned disk that is now fully committed - actual in-use is closer to 8GB:
33557316 61364224 -rw------- 1 root root 64424509440 Mar 19 13:48 ZLab - 2012R2-flat.vmdk
What version of VM hardware are you running, and what OS? I was recently doing a little testing to see how thin disks, snapshots, and auto-unmap on VMFS6 would work for me, and while I don't think I tested your specific scenario, here are a few things you might want to look at:
I saw a difference between hardware versions 11 and 13. I copied a file (in Windows), forced Windows to optimize/defrag the disk (when the disk is thin, Windows knows not to do a full defrag), and then deleted the snap. With hardware version 11, the disk expanded to full size; when I deleted my copied data (after the snap delete), it shrank by that amount, but never back down to the original size of the VMDK. Hardware version 13 didn't behave like this, i.e. it didn't expand the disk to full size when I deleted the snap.
If this is Windows, did you disable the disk optimization service or the weekly scheduled task? You might try forcing the optimize if it's disabled, to see if it re-thins the disk once it's already at full size.
What OS? If this is Windows, there might be a difference between 2008 R2 / 2012 R2 / 2016 in how they behave. I think 2008 R2 was a little different than 2012 R2.
Worst case, if you want to re-thin the drive, you can use sdelete (part of the Sysinternals tools) on Windows (or the equivalent in Linux) to write zeros to the free space, and then shut down the VM. Ensure there are no snapshots, and at the ESXi command line run the following: vmkfstools -K "mydisk.vmdk". That will hole-punch the drive and shrink it back down (works for me, anyway, on iSCSI storage backed by VMFS 5 and 6 datastores).
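In script form, that pass over a VM folder might look like this (a sketch only: the path is made up, and note that vmkfstools -K wants the small descriptor .vmdk, not the -flat extent file):

```shell
# Sketch: hole-punch every descriptor .vmdk in a VM folder, after running
# sdelete -z inside the guest and powering the VM off.

# Helper: pick out descriptor files, skipping the extent files
# (-flat/-delta/-sesparse), since -K operates on the descriptor.
is_descriptor() {
  case "$1" in
    *-flat.vmdk|*-delta.vmdk|*-sesparse.vmdk) return 1 ;;
    *.vmdk) return 0 ;;
    *) return 1 ;;
  esac
}

for vmdk in /vmfs/volumes/esx-datastore2/MyVM/*.vmdk; do
  is_descriptor "$vmdk" || continue
  echo vmkfstools -K "$vmdk"   # echo'd for safety; drop the echo to actually run it
done
```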
Fred - thanks. I'm seeing the symptoms you describe, but the affected VM hardware level was inverted for me. The VMs affected by the disks becoming fully allocated were all hardware version 13; the unaffected ones were version 11. Another issue I discovered was that "DC2" still had sesparse disks assigned to it despite no snapshots being detected in the GUI. After snapshotting it and removing the snapshot (which took well over an hour to reconcile), the sparse disks were removed.
For the affected VMs, I had already begun the process of performing the offline shrink using vmkfstools -K "<diskname>.vmdk" from an ESXi SSH session. I'm now up to 128GB free in the datastore from the 24MB (or so) late last week - which is closer to where things were before they stopped working correctly.
I had also rebooted the ESXi host, though it didn't change anything or fix the problem.
I'll take your suggestion about the scheduled tasks to heart. Presently I have not stopped the disk optimization scheduled tasks, and since those servers (2008 / 2008R2 / 2012 / 2012R2 / 2016) are almost 100% Server Core, I'm going to have to work some GPO magic to disable that scheduled task. This is something I've never done in a production environment across my client base, either in their own local (on-prem) private clouds or in my company's private multi-tenant cloud; we've always depended on the tech working for us rather than creating one-offs. However, this is why there are test environments, and fortunately this happened in test.
If the scheduled task is a pain because of Server Core, you can just stop the disk optimization service (defragsvc) instead. Probably easier, and I think it will accomplish the same thing.
Whether there's any downside, I can't say. The only one I can think of is that Windows might not "unmap" all blocks all the time; the optimization pass reclaims whatever wasn't unmapped at delete time, so you might lose a bit of space on the storage side.
I guess this is only really an issue (if disk optimization is the issue here) if you happen to have a snapshot. If you're on a snapshot when it runs, it might screw things up when you delete said snap. And if you have optimization disabled and delete a snapshot, you might lose space, because the unmapped blocks in the snap won't get unmapped unless the scheduled task runs (or you run the optimize yourself).