Re: vSphere 6.5 In-Guest UNMAP not working

a1exp · ‎08-17-2018

Issue

Recovering disk space from thin-provisioned VMDKs is very hit and miss: some virtual machines shrink their VMDK files without issue, others just never do.

This occurs even if the virtual machines are running on the same host and storage.

In some instances one disk attached to a VM will shrink by a certain amount (not completely) but another attached disk will refuse to shrink at all.

I can see the ZERO counter increasing in ESXTOP when Optimize-Volume is run, but the VMDK never shrinks.

I've also seen the same issue mentioned in this post where a snapshot delete bloats the VMDKs and they won't shrink: Re: vSphere 6.5 VMFS 6 not reclaiming space after snapshot removal

Environment

I'm using thin-provisioned VMDK files on thick LUNs (VMFS 6.81).

The underlying storage is EMC VNX 5400 and, as I'm using thick LUNs, I know automatic UNMAP for the array won't work (the DELETE primitive is not supported).

I've got two clusters and this behaviour happens on both, one is patched up to 6.5.0, 7967591, the other is patched to 6.5.0, 9298722.

These are mostly Windows 2012 R2 virtual machines, hardware version 13 for all using VMware PV SCSI adapter.

There's a slight difference in VMware Tools builds on some machines (10272 vs 10287 vs 10305) but there doesn't seem to be any correlation.

What I've already tried

I've tried a variety of ways to shrink the disks in Windows: Optimize-Volume, the Optimize Drives GUI and defrag /L.

I've tried with EnableBlockDelete set on the host to both 0 and 1; it makes no difference.

I've tried on virtual machines with CBT enabled and disabled; it makes no difference.

Running vsish -e get /storage/scsifw/devices/<DATASTORE>/stats shows me that "total unaligned ats, clone, zero, delete ops" is increasing, but I'm not sure why.

Is anybody seeing the same issues?

I know that there were supposed to be some alignment issues on earlier builds of 6.5 (that needed things like a 64k NTFS block size) but I thought that was all resolved in later builds.

I've been through every article that Cody Hosterman has written and it's given some clues but nothing that resolves the problem consistently.
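For anyone following along, here is a rough sketch of why alignment could matter. It assumes space is reclaimed only in whole 1 MiB blocks (the VMFS 6 block size) and that any part of a trim request not covering a whole block is simply skipped. The function name and the clipping behaviour are illustrative assumptions, not ESXi's actual implementation.

```python
MIB = 1024 * 1024  # assumed reclaim granularity (VMFS 6 uses 1 MiB blocks)


def aligned_portion(offset, length, granularity=MIB):
    """Return (start, length) of the part of [offset, offset + length)
    that covers whole granularity-sized blocks, or None if no whole
    block fits inside the range."""
    end = offset + length
    start = -(-offset // granularity) * granularity  # round offset up
    stop = (end // granularity) * granularity        # round end down
    if stop <= start:
        return None
    return start, stop - start


# A trim that starts in the middle of a block loses its unaligned edges:
print(aligned_portion(512 * 1024, 3 * MIB))  # (1048576, 2097152)

# A small trim that never spans a whole block reclaims nothing:
print(aligned_portion(4096, 512 * 1024))     # None
```

Under these assumptions, a guest with 4k clusters can issue many small or offset trims that free little or nothing, which would be one way for "unaligned" counters to climb while the VMDK stays the same size.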

SupreetK · ‎08-17-2018

If you want to shrink the size of the vmdk, the only supported (and working) way is to use VMware Converter. However, if you want the vmdk used space shown by the ESXi host (or the vCenter) to be the same as that of the guest OS, you need to run SDelete followed by 'vmkfstools -K' to punch zeroes on the affected disks. As per the blog https://blog.purestorage.com/in-guest-unmap-enableblockdelete-and-vmfs-6/, EnableBlockDelete does not work well with VMFS-6. It is auto-unmap that takes care of it. And for your environment, unmap is not an option, as you said.

VMware Knowledge Base


Cheers,

Supreet

a1exp · ‎08-20-2018

Hi Supreet,

Thanks for taking the time to respond. Is that VMware KB still applicable?

In the Pure Storage article you linked to it clearly says that the VMDK shrinks:

"Now if we look at my VMDK, we will see it has shrunk to 400 MB:"

This is also something I can replicate: if I run a trim operation in my guest OS, then on some virtual machines the VMDK shrinks after a minute or two.

The issue is that on some virtual machines this doesn't happen at all and nothing I do can get these to shrink their VMDKs.

I'm aware that EnableBlockDelete is deprecated and that I won't get array-level unmap as I'm using thick-provisioned LUNs.

Given that one VM on the same host will shrink its VMDK after an in-guest TRIM and one won't, it's almost like a guest-specific issue.

I thought I'd narrowed down the failures to guests that had a snapshot taken at one point, but testing with a new VM and taking and removing a snapshot let me shrink the VMDK without issue.

I'm aware that there's a fix in 6.7 that allows VMDKs to shrink if a VM has a snapshot but the VMs that won't shrink for me have no current snapshots.

SupreetK · ‎08-20-2018

1) Are there two VMs with the same hardware version and running the same operating system, where one shrinks and one doesn't while running on the same host? Are these VMs running the same version of VMware Tools? I believe you are testing these things on a 6.5 U2 host.

2) If yes, are there any errors reported in the guest logs when optimize drives (or fstrim) is run within the guest OS?

Cheers,

Supreet

a1exp · ‎08-20-2018

1.) Yes, I've just tested this again and I have two virtual machines running the same OS on the same VM host and with the same version of VMware Tools. These are both running on a host running VMware ESXi, 6.5.0, 9298722. One will shrink its drives and one won't.

I've done some additional testing and I've also got a VM that will shrink one drive and not the other.

Doing some further testing on this host I have an attached drive that shows as ~700GB in Windows but ~840GB on disk in ESXi.

If I copy 10GB of files to this drive the size in both Windows and VMware goes up by 10GB; once I delete the files they both go back down by 10GB.

The VMware disk never recovers the rest of the space though and still shows at ~840GB on disk when it should be ~700GB.

Anecdotally, "newer" VMs (ones that have been created most recently) seem to be better at recovering the space.

It's almost like there's something blocking the recovery of all the space on drives that have been in use a while.

Following this line of enquiry, I disabled VSS on the drive in question (~95GB). This dropped the drive down to ~614GB in Windows and to ~650GB in ESXi, which adds weight to the theory that there's something that blocks the recovery of additional spare space.

I tried disabling VSS on another VM with a drive that wouldn't shrink and that allowed the VMDK to shrink; however, trying to repeat this on yet another VM made no difference.

It looks like VSS contributes to the issue but isn't 100% the cause. (None of the drives that are affected have page files on them.)
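To put numbers on that, here is a quick sketch of the gap between what ESXi reports on disk and what Windows reports as used, before and after disabling VSS. It assumes the approximate figures quoted above all refer to the same drive; the helper name is just for illustration.

```python
def unreclaimed_gb(on_disk_gb, guest_used_gb):
    """Space the VMDK still holds on disk beyond what the guest uses."""
    return on_disk_gb - guest_used_gb


before_vss_off = unreclaimed_gb(840, 700)  # ~840GB in ESXi vs ~700GB in Windows
after_vss_off = unreclaimed_gb(650, 614)   # ~650GB in ESXi vs ~614GB in Windows

print(before_vss_off)                  # 140
print(after_vss_off)                   # 36
print(before_vss_off - after_vss_off)  # 104
```

On these rough figures the gap narrows from about 140GB to about 36GB, so disabling VSS released roughly 104GB of previously stuck space, yet some space remains unreclaimed.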

2.) There's just the error about slab consolidation:

The volume DATA (E:) was not optimized because an error was encountered: Neither Slab Consolidation nor Slab Analysis will run if slabs are less than 8 MB. (0x8900002D)

I understand that this is a by-product of using Optimize-Volume and that running defrag /L skips the slab consolidation whilst still trimming the space.

If I look at the results of Optimize-Volume, the amount of space that's been re-trimmed tallies with what is expected, and the error appears on both working and non-working VMs.

SupreetK · ‎08-20-2018

My curiosity is growing exponentially with every message of yours.

On the VM where one drive works fine and the other doesn't, do both disks reside in the same datastore? Is the 'Allocation Unit Size' the same for both NTFS partitions?

Cheers,

Supreet

a1exp · ‎08-21-2018

Yes, on the VM that had one drive shrink and one drive not, they're both on the same datastore with the same allocation unit size (4k).

The other affected VMs are split across various datastores and two SANs, but they're identical SANs and the datastores are all identical as far as VMFS version/size/underlying storage etc.

I've been focusing on a specific VM this morning. It had a snapshot for a few months, which was then deleted, but the VMDKs have bloated to full size at some point.

There's no differencing VMDK so I'm fairly confident there are no remnants of a snapshot left, and I've disabled VSS, but nothing I do can get these VMDKs to shrink down.