This was an interesting development today. In my lab, all of the VMDKs are thinly provisioned. iSCSI connects the backend Synology to the ESXi host (6.5.0 Update 1 - build 7526125). The NAS is VAAI compatible, has been completely reliable, stable, isn't overworked or overtaxed, and has TBs of free space if I need to allocate more.
Prior to messing with the lab, I used the web interface to take a snapshot of the systems I wanted to work on, so I wouldn't have to rebuild or undo the damage if I made a mistake. A very typical use case. The datastore has 0.5TB allocated to it and was sitting at 176GB in use of 500GB. After two days of working on 5 VMs (5 VMDKs), it was time to delete the snapshots and commit the changes. I did so, one at a time.
What I didn't realize was that, as the snapshots were being deleted, all of the VMDKs became fully allocated. By some miracle, there was 26MB remaining out of the 500GB possible after the last disk finished. I don't understand why this happened today.
I'm aware I can fix this by taking the affected VMs offline and using vmkfstools to clone the VMDKs to another datastore and move it back - but it's an annoying process that should have never happened in the first place.
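For reference, the clone round-trip I mean looks roughly like this (a sketch only; the VM names and datastore paths here are made up for illustration, and the VM has to be powered off first):

```shell
# Sketch of the clone-and-move-back workaround (paths are made up).
# From an ESXi SSH session, with the VM powered off:
src="/vmfs/volumes/esx-datastore2/MyVM/MyVM.vmdk"
tmp="/vmfs/volumes/esx-datastore3/MyVM/MyVM.vmdk"

# -i clones via the descriptor file; -d thin writes the copy thin-provisioned.
# The command is echo'd here so you can eyeball it before actually running it.
echo vmkfstools -i "$src" -d thin "$tmp"

# Then delete the original, clone back the same way, and re-add the disk to the VM.
```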
Any ideas?
This sounds almost impossible; at least I've never seen something like this. With only 26MB of free disk space on the datastore after deleting a snapshot, that snapshot must have been less than 26MB in size, which is very unlikely. Did you refresh the disk usage for the datastore to ensure that it's not a display issue?
Also, please run ls -lisa for the VMs' .vmdk files, as this command will show the provisioned as well as the used disk space.
André
That's what I thought. I turned off one of the VMs and started moving it to another new LUN. Some additional output:
The consumed space according to the output below agrees 100% with the ESXi UI and with what I'm seeing as consumed on the LUN on the NAS.
[root@esx:/vmfs/volumes/595cb657-6e332364-b2d1-64006a5cf50c/ZLab - 2012] vmkfstools -Ph -v10 /vmfs/volumes/esx-datastore2
VMFS-6.81 (Raw Major Version: 24) file system spanning 1 partitions.
File system label (if any): esx-datastore2
Mode: public ATS-only
Capacity 499.8 GB, 649 MB available, file block size 1 MB, max supported file size 64 TB
Volume Creation Time: Wed Jul 5 09:50:15 2017
Files (max/free): 16384/15841
Ptr Blocks (max/free): 0/0
Sub Blocks (max/free): 16384/15892
Secondary Ptr Blocks (max/free): 256/255
File Blocks (overcommit/used/overcommit %): 0/511095/0
Ptr Blocks (overcommit/used/overcommit %): 0/0/0
Sub Blocks (overcommit/used/overcommit %): 0/492/0
Large File Blocks (total/used/file block clusters): 1000/56/944
Volume Metadata size: 1510866944
UUID: 595cb657-6e332364-b2d1-64006a5cf50c
Logical device: 595cb656-0c30fe47-b131-64006a5cf50c
Partitions spanned (on "lvm"):
naa.6001405ea2cb68fd3a37d4153dac61de:1
Is Native Snapshot Capable: NO
OBJLIB-LIB: ObjLib cleanup done.
WORKER: asyncOps=0 maxActiveOps=0 maxPending=0 maxCompleted=0
The space of these thinly provisioned disks has also expanded. This particular VMDK only contains 7.73GB of data.
[root@esx:/vmfs/volumes/595cb657-6e332364-b2d1-64006a5cf50c/ZLab - 2012] ls -lisa
total 18630784
16777988 128 drwxr-xr-x 1 root root 77824 Mar 16 18:38 .
4 1024 drwxr-xr-t 1 root root 81920 Mar 16 17:34 ..
79694596 0 -rw-r--r-- 1 root root 13 Mar 16 18:38 ZLab - 2012-aux.xml
41945860 524288 -rw------- 1 root root 536870912 Jan 11 14:41 ZLab - 2012-cec26d94.vswp
16780036 17876992 -rw------- 1 root root 64424509440 Mar 16 21:10 ZLab - 2012-flat.vmdk
46140164 1024 -rw------- 1 root root 74232 Mar 16 18:39 ZLab - 2012.nvram
20974340 0 -rw------- 1 root root 577 Mar 16 18:38 ZLab - 2012.vmdk
12585732 0 -rw-r--r-- 1 root root 43 Mar 16 18:38 ZLab - 2012.vmsd
8391428 0 -rwxr-xr-x 1 root root 3044 Mar 16 18:38 ZLab - 2012.vmx
58723076 0 -rw------- 1 root root 0 Mar 14 14:52 ZLab - 2012.vmx.lck
50334468 0 -rw------- 1 root root 3089 Jan 11 14:42 ZLab - 2012.vmxf
62917380 0 -rwxr-xr-x 1 root root 3051 Mar 16 18:38 ZLab - 2012.vmx~
37751556 1024 -rw-r--r-- 1 root root 297852 Jan 17 16:00 vmware-1.log
67111684 1024 -rw-r--r-- 1 root root 271359 Mar 16 18:39 vmware.log
25168644 112640 -rw------- 1 root root 115343360 Jan 11 14:41 vmx-ZLab - 2012-3468848532-1.vswp
54528772 112640 -rw------- 1 root root 115343360 Mar 14 14:52 vmx-ZLab - 2012-3468848532-2.vswp
After moving one of the VMs from datastore2 to the new datastore3, the fully allocated size of the VMDK was maintained even though it shows up as thin. The ESXi interface reports a thin disk, but it remains fully allocated even after the move. The NAS reports the same amount of allocated disk space in the LUN as ESXi does.
Removing a snapshot is essentially a disk consolidation.
Strange that it shows only 649MB as being available on the datastore. According to
16780036 17876992 -rw------- 1 root root 64424509440 Mar 16 21:10 ZLab - 2012-flat.vmdk
the virtual disk has been provisioned with 60GB, but currently consumes ~17GB, i.e. nothing unusual for a thin-provisioned virtual disk.
To get a better overview, run RVTools which will show provisioned vs. used disk space in the vInfo tab.
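In the meantime, you can convert the two numbers on the ls -lisa line by hand. Column 2 is the allocated space in 1K blocks (actual usage), and the size field is the provisioned size in bytes. A quick sketch using the values from your output:

```shell
# Values copied from the ls -lisa line for ZLab - 2012-flat.vmdk:
blocks_kb=17876992       # column 2: allocated 1K blocks (actual usage)
size_bytes=64424509440   # size field: provisioned size in bytes

used_gb=$((blocks_kb / 1024 / 1024))
prov_gb=$((size_bytes / 1024 / 1024 / 1024))
echo "provisioned: ${prov_gb} GB, used: ~${used_gb} GB"
```

which confirms the 60GB provisioned / ~17GB used figures above.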
André
I came across that article as well. But I'm on iSCSI, not NFS. VAAI is enabled on my iSCSI options, so this should be native functionality as far as the tech is concerned. There are zero NFS services in use (never have been configured).
Thanks for the tip about RVTools. Unfortunately, it confirmed the issue on the datastore (as noted in the post above). Even though you're correct about the allocated vs consumed space, the datastore is being reported full. When I'm moving these thin provisioned disks from one datastore to another, they end up fully committed. I didn't get a linear improvement in space in the datastore as the VM moved.
Datastore 2 improved from 0.6GB free to 13.3GB free, which confirms its allocation was likely correct. In the process of moving the 2008R2 VM from DS2 -> DS3, the VMDK shown below as thin provisioned (True) was expanded to full size upon the move. The size of DS3 confirms the command line. The disk, despite having only 7.6GB in it, was fully allocated on DS3 to 60GB.
What's still not answered is why the datastore shows as full while the sizes of the VMs don't add up.
Below is an example:
[root@esx:/vmfs/volumes/5aac2b85-36933f17-7529-64006a5cf50c/ZLab - 2008R2] ls -lisa
total 63031424
1412 128 drwxr-xr-x 1 root root 73728 Mar 16 21:18 .
4 1024 drwxr-xr-t 1 root root 73728 Mar 16 20:41 ..
33554628 0 -rw------- 1 root root 13 Mar 16 21:18 ZLab - 2008R2-aux.xml
196 62914560 -rw------- 1 root root 64424509440 Mar 16 21:18 ZLab - 2008R2-flat.vmdk
4194500 1024 -rw------- 1 root root 74232 Mar 16 21:18 ZLab - 2008R2.nvram
8388804 0 -rw------- 1 root root 579 Mar 16 21:18 ZLab - 2008R2.vmdk
37748932 0 -rw------- 1 root root 43 Mar 16 21:18 ZLab - 2008R2.vmsd
20971716 0 -rw------- 1 root root 2955 Mar 16 21:18 ZLab - 2008R2.vmx
12583108 0 -rw------- 1 root root 3087 Mar 16 21:18 ZLab - 2008R2.vmxf
25166020 1024 -rw------- 1 root root 298554 Mar 16 21:18 vmware-1.log
29360324 1024 -rw------- 1 root root 276444 Mar 16 21:18 vmware.log
16777412 112640 -rw------- 1 root root 115343360 Mar 16 21:18 vmx-ZLab - 2008R2-2299771517-1.vswp
Another example of a thin provisioned disk that is now fully committed - actual in-use is closer to 8GB:
33557316 61364224 -rw------- 1 root root 64424509440 Mar 19 13:48 ZLab - 2012R2-flat.vmdk
What version of VM hardware are you running, and what OS? I was recently doing a little testing to see how thin disks, snapshots, and auto-unmap on VMFS6 would work for me, and while I don't think I tested your specific scenario, here are a few things you might want to look at:
I saw a difference between hardware versions 11 and 13. I copied a file (in Windows), forced Windows to optimize/defrag the disk (when the disk is thin, Windows knows not to do a full defrag), and then deleted the snap. With hardware version 11, the disk expanded to full size; when I deleted my copied data (after the snap delete), it shrank by that amount, but never back down to the original size of the VMDK. Hardware version 13 didn't behave like this, i.e. it didn't expand the disk to full size when I deleted the snap.
If this is Windows, did you disable the disk optimization service or the weekly scheduled task? You might try forcing the optimize if it's disabled, to see if it re-thins the disk once it's already at full size.
What OS? If this is Windows, there might be a difference between 2008 R2 / 2012 R2 / 2016 in how they behave. I think 2008 R2 was a little different than 2012 R2.
Worst case, if you want to re-thin the drive, you can use sdelete (part of the Sysinternals tools) on Windows (or the equivalent in Linux) to write zeros to the free space, and then shut down the VM. Ensure there are no snapshots, and at the ESXi command line run the following: vmkfstools -K "mydisk.vmdk". That will hole-punch the drive and shrink it back down (works for me, anyway, on iSCSI storage backed by VMFS 5 and 6 datastores).
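In script form, that pass over a VM folder might look like this (a sketch only: the path is made up, and note that vmkfstools -K wants the small descriptor .vmdk, not the -flat extent file):

```shell
# Sketch: hole-punch every descriptor .vmdk in a VM folder, after running
# sdelete -z inside the guest and powering the VM off.

# Helper: pick out descriptor files, skipping the extent files
# (-flat/-delta/-sesparse), since -K operates on the descriptor.
is_descriptor() {
  case "$1" in
    *-flat.vmdk|*-delta.vmdk|*-sesparse.vmdk) return 1 ;;
    *.vmdk) return 0 ;;
    *) return 1 ;;
  esac
}

for vmdk in /vmfs/volumes/esx-datastore2/MyVM/*.vmdk; do
  is_descriptor "$vmdk" || continue
  echo vmkfstools -K "$vmdk"   # echo'd for safety; drop the echo to actually run it
done
```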
Fred - thanks. I'm seeing the symptoms you describe, but the affected VM hardware level was inverted for me. The VMs affected by the disks becoming fully allocated were all hardware version 13; the unaffected ones were version 11. Another issue I discovered was that "DC2" still had sesparse disks assigned to it despite no snapshots being detected in the GUI. After snapshotting it and removing the snapshot (which took well over an hour to reconcile), the sparse disks were removed.
For the affected VMs, I had already begun the process of performing the offline shrink using vmkfstools -K "<diskname>.vmdk" from an ESXi SSH session. I'm now up to 128GB free in the datastore from the 24MB (or so) late last week - which is closer to where things were before they stopped working correctly.
I had also rebooted the ESXi host, though it didn't change anything or fix the problem.
I'll take your suggestion about the scheduled tasks to heart. Presently I have not stopped the disk optimization scheduled tasks, and since those servers (2008 / 2008R2 / 2012 / 2012R2 / 2016) are almost 100% Server Core, I'm going to have to work some GPO magic to disable that scheduled task. This is something I've never done in a production environment across my client base, either in their own local (on-prem) private clouds or in my company's private multi-tenant cloud; we've always depended on the tech working for us rather than creating one-offs. However, this is why there are test environments, and fortunately this happened in test.
If the scheduled task is a pain because of Server Core, you can just stop the disk optimization service (defragsvc) instead. Probably easier, and I think it will accomplish the same thing.
Whether there's any downside, I can't say. The only one I can think of is that Windows might not "unmap" all blocks all the time; the optimization pass reclaims whatever wasn't unmapped at delete time, so you might lose a bit of space on the storage side.
I guess this is only really an issue (if disk optimization is the issue here) if you happen to have a snapshot. If you're on a snapshot when it runs, it might screw things up when you delete said snap. And if you have optimization disabled and delete a snapshot, you might lose space, because the unmapped blocks in the snap won't get unmapped unless the scheduled task runs (or you run the optimize yourself).