I posted a question about this over at Duncan Epping's blog, but of course he cannot comment on futures.
So let me see if I can rustle up some useful dialog here instead.
In a nutshell, the problem is as follows:
While VAAI introduced reclamation of deallocated blocks at the VMFS level through the T10 UNMAP and WRITE SAME primitives, this really only solves half the problem. It does reduce overhead on thinly provisioned storage systems (although currently neither by default nor automatically), but only for deleted VMDKs.
Thin reclamation won't be properly sorted until VMFS is able to recoup blocks deallocated within guests. The currently supported way of doing this is to run the VM through Converter. I'm sure most of us can agree that this is a non-option for production servers, and anyway imagine doing that for hundreds or thousands of VMs on a regular basis.
Lately, the filesystems themselves are beginning to provide a solution to the problem. EXT4 as of kernel 2.6.27 supports the mount option "discard", which uses the same mechanism as VMFS to reclaim space, i.e. the T10 UNMAP and WRITE SAME commands. This is primarily in order to be compatible with the SSD TRIM requirement, but it also means that a thinly provisioned storage system will be told when a block is dereferenced and can be moved into the spare pool. NTFS also needs to support SSDs and implements the TRIM command in Windows 2008 R2 (I believe). Server 8 may be smarter and also built for thin provisioning - or not, they seem to be trying yet again to move into the storage space with their server products.
Anyway, recent and future versions of the two most common operating systems deployed within guests will come equipped with various ways of automatically and near-instantly reclaiming deleted blocks, but currently the SCSI controller layer in the VM does not honour UNMAP/TRIM in any useful fashion. And this is a shame, since it would provide the missing piece of the puzzle for end-to-end optimal thin provisioning.
Unfortunately, it's not entirely trivial to solve - since there is no 1-to-1 mapping between a logical block in the guest filesystem and the VMFS blocks, you cannot just pass the UNMAP through all the way from the guest filesystem down to the array. Instead, you must track these unmapped blocks, and only when an entire VMFS block has been deallocated can it be unmapped from the storage array. Or maybe the sub-block addressing in VMFS allows for a partial unmap?
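To illustrate the tracking idea, here is a rough Python sketch, not how VMFS actually does it. It assumes 4 KiB guest blocks, 1 MiB VMFS blocks and a simple linear mapping from guest block number to VMFS block; the class and method names are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

GUEST_BLOCK = 4 * 1024        # assumed guest filesystem block size
VMFS_BLOCK = 1024 * 1024      # assumed VMFS block size
GUEST_PER_VMFS = VMFS_BLOCK // GUEST_BLOCK


@dataclass
class UnmapTracker:
    """Hypothetical tracker: collects guest UNMAPs per VMFS block."""
    # VMFS block index -> offsets of guest blocks unmapped inside it
    pending: Dict[int, Set[int]] = field(default_factory=dict)

    def guest_unmap(self, guest_block: int) -> Optional[int]:
        """Record a guest UNMAP. Returns the VMFS block index once every
        guest block inside it has been unmapped, otherwise None."""
        vmfs_idx, offset = divmod(guest_block, GUEST_PER_VMFS)
        freed = self.pending.setdefault(vmfs_idx, set())
        freed.add(offset)
        if len(freed) == GUEST_PER_VMFS:
            del self.pending[vmfs_idx]
            return vmfs_idx   # safe to unmap this block on the array
        return None

    def guest_write(self, guest_block: int) -> None:
        """A later write makes the guest block live again."""
        vmfs_idx, offset = divmod(guest_block, GUEST_PER_VMFS)
        freed = self.pending.get(vmfs_idx)
        if freed is not None:
            freed.discard(offset)
            if not freed:
                del self.pending[vmfs_idx]
```

With these assumed sizes, 256 guest unmaps have to line up inside the same VMFS block before anything can go to the array, which is why a partial unmap would be so much more useful.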
A possible interim solution would be to optionally translate the unmap into a zero block, but only if there is a way to avoid that becoming an actual write IOp consisting of zeroes hitting the spindles of your storage system. Many storage systems consider an entirely empty block to be equivalent to an unmap (this is how you reclaim disk using vmkfstools -y, after all), but whether it is a penalty-free operation is hard to say.
Anyway, if anyone has further insights they wish to share then please do - for me, this is one of the major storage-related issues which need to be tackled, and soon, but I may be in the minority - and maybe making a big deal out of a non-issue?
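To make the zero-block idea concrete, here is a rough Python sketch of the kind of zero detection an array might apply to incoming writes. The 4 KiB block size and the function names are my own assumptions for illustration; real arrays do this in firmware and the details are vendor-specific.

```python
from typing import List, Tuple


def is_zero_block(block: bytes) -> bool:
    """True if every byte in the block is zero."""
    return block.count(0) == len(block)


def classify_write(data: bytes, block_size: int = 4096) -> Tuple[List[int], List[int]]:
    """Split an incoming write into block-sized chunks and sort them.

    Returns (zero_blocks, data_blocks) as lists of chunk indexes.
    Zero chunks could be treated like an unmap; the rest must be stored.
    """
    zero_blocks: List[int] = []
    data_blocks: List[int] = []
    for start in range(0, len(data), block_size):
        chunk = data[start:start + block_size]
        target = zero_blocks if is_zero_block(chunk) else data_blocks
        target.append(start // block_size)
    return zero_blocks, data_blocks
```

Even in this toy version the zeroes still have to travel to the array and be scanned before they can be discarded, which is the part that may not be free.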
Message was edited by: schistad; added more information for clarity. Vital parts were left in my head and not in original text 😉
For thin-on-thin reclamation I'll break it down into two parts.
First, you need to zero the free space in your guest OS. Under Windows this can be done via SDelete, which is very time consuming and momentarily fills the drive completely before freeing the space again within the guest. However, this method fully balloons the VMDK, which is not ideal by any means. Diskeeper 12 (aka 2012) supports doing this at the block level; I haven't heard yet how it works under the hood, but on a couple of test systems it seems to work as well. Looking forward, it looks like Windows Server 2012 will have a similar function baked into the OS. On the Linux side, zerofree is an option.
Second, you need to reclaim the space at the SAN level. Depending on your SAN, you may or may not have the T10 unmap command available. If it is available and you still see acceptable performance with the feature enabled within vSphere, great. If performance is poor or the feature is not available, you can Storage vMotion your guests around. While this does work, it is very time consuming.
Personally, I have done this by hand on around 100 VMs at a time, and the reclaimed space was generally used up again within 3 to 4 months. So a fully manual process is not viable. SDelete does work for the once-in-a-blue-moon scenario, but it can cause issues with applications when it completely fills a given volume with a single file containing all zeros. I wish VMware would include some automatic functionality as part of the tools package *hint hint, wink wink*. If you don't already have the SAN feature, it may hopefully be coming down the pipeline as a software update; if not, the next time you look at a SAN upgrade, make sure the feature is there.
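For anyone curious what the zero-fill approach boils down to, here is a minimal Python sketch of the idea. It is not what SDelete or zerofree actually do internally; the function name, the scratch file name and the max_bytes cap are my own assumptions. The cap is there so the volume is not filled to the brim.

```python
import os


def zero_fill(directory: str, max_bytes: int, chunk_size: int = 1024 * 1024) -> int:
    """Write up to max_bytes of zeros into a scratch file, then delete it.

    Returns the number of bytes actually written. Stops early if the
    volume runs out of space.
    """
    path = os.path.join(directory, "zerofill.tmp")  # hypothetical scratch file
    zeros = bytes(chunk_size)
    written = 0
    try:
        # Unbuffered, so a full volume shows up as an error on write().
        with open(path, "wb", buffering=0) as f:
            while written < max_bytes:
                n = min(chunk_size, max_bytes - written)
                try:
                    wrote = f.write(zeros[:n])
                except OSError:
                    break  # most likely out of space
                if not wrote:
                    break
                written += wrote
            try:
                os.fsync(f.fileno())  # push the zeros down to the disk
            except OSError:
                pass
    finally:
        # Deleting the file frees the space again inside the guest.
        if os.path.exists(path):
            os.remove(path)
    return written
```

Bear in mind that on a thin VMDK every zero written this way still allocates blocks, so the disk balloons first and only shrinks once the lower layers reclaim the zeroed space.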
The T10 unmap support will hopefully eventually be safe to use on the relevant storage systems.
As you point out, manually reclaiming space inside the VM really isn't viable currently, but it needs to start happening in some fashion or other.
I can't help but wonder whether this may be one of the many headaches which will be solved by vmvols, sometime in a future release. I can certainly see how it poses a challenge to track deleted blocks when you have two layers of abstraction between the guest FS and the storage system.
Maybe one of the brilliant storage engineers watching these forums is able to offer some insight without disclosing too much?
Sorry for the necro.
I was just thinking about VMs sending the trim command to the virtual disk driver, as it could be used to automatically shrink virtual disk images.
It turns out there is no performance benefit on SSDs (because wear leveling will cause new writes/rewrites to have identical performance, so long as the host disk itself isn't full), while on HDDs you could end up with excess fragmentation.
So, the only benefit is that virtual disk images could automatically shrink, but the benefit to most production machines is minimal - disk usage generally increases with time. It would be a convenience feature for a developer or hobbyist, and that's about it.
currently the SCSI controller layer in the VM does not honour UNMAP/TRIM in any useful fashion. And this is a shame since it provides the missing piece of the puzzle for an end-to-end optimal thin provisioning.
Is this still true as of the end of 2016?