I have a small fleet of ESXi 6.0 hypervisors running a mix of Linux and Windows VMs. The ESXi hosts are Dell C6100s, mostly configured with an SSD boot / ESXi disk and two spinning-disk datastores. There is no RAID; instead we maintain very regular backups of the VMs on an external system. The spinning disks are mostly Western Digital datacentre (Gold) drives and usually have very good performance.
Recently I found that a former VM, which was shut down but hadn't yet been deleted, was occupying 4TB on a 5.5TB datastore. It looks as if at some point while the VM was in use we took a snapshot and never removed it, so a delta file grew huge over time. As we're moving VMs onto that HV, I wanted to delete the out-of-use VM to free space. It was already shut down (and had been for months), so I did the following:
1. In vSphere made sure I'd properly identified the data storage location for the VM's disks.
2. SSH'd to the HV and changed into the correct directory, double checking against vSphere.
3. In vSphere, removed the VM from the inventory.
4. In the SSH session, issued time rm * to delete the files.
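In hindsight, a quick check for leftover snapshot files before step 4 would have shown what I was about to delete. This is only a minimal sketch: find_snapshot_extents is a made-up helper name, and it assumes the usual ESXi naming, where snapshot data lives in *-delta.vmdk files (or *-sesparse.vmdk for disks larger than 2TB, as far as I know).

```shell
# List snapshot extents in a VM directory, one per line, as "<bytes> <name>".
# Assumption: snapshot data files are named *-delta.vmdk or *-sesparse.vmdk.
# find_snapshot_extents is a hypothetical helper, not an ESXi command.
find_snapshot_extents() {
    dir=${1:-.}
    for f in "$dir"/*-delta.vmdk "$dir"/*-sesparse.vmdk; do
        # An unmatched glob stays literal, so skip anything that is not a file.
        [ -f "$f" ] || continue
        # Field 5 of "ls -ln" is the size in bytes; this avoids reading the file.
        size=$(ls -ln -- "$f" | awk '{ print $5 }')
        printf '%s %s\n' "$size" "${f##*/}"
    done
}
```

Called with the VM's directory on the datastore as its only argument, it prints nothing if there are no snapshot extents.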
I expected the delete to take "a while", as removing a redundant VM via vSphere's "Remove From Disk" on a previous occasion took some time; hence the time command, to tell us how long it takes. But it has now been running for almost 24 hours and has locked all access to that disk, to the point that the other VMs on it are unreachable. At first they stopped pinging; they now ping but don't respond on any of their operational ports. Attempting to right-click on any VM on the affected disk in vSphere just produces a timeout error.
The HV as a whole is still up and VMs on the other datastore are operating happily, and the load average shown by uptime is reasonably low, most recently:
18:00:38 up 153 days, 11:03:24, load average: 0.57, 0.81, 0.55
I know that with ESXi the best thing to do is generally to let it run to completion, so I haven't attempted to cancel the command.
I'm left wondering:
1. Why does an rm command in ESXi take so long in general - what is it actually doing under the hood?
2. Is there any faster way to remove redundant files from an ESXi datastore?
Deleting a thin provisioned 4TB vmdk on a 5.5TB datastore can easily be a long-winded job consisting of billions of single operations - keep in mind that every fragment of the 4TB vmdk needs to be unreferenced and cleaned.
If you want to cheat - reduce the size before deleting it: this makes a big difference!
Or even better - reformat the datastore.
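If reformatting is not an option, the supported way to delete a virtual disk from the ESXi shell is vmkfstools -U on the descriptor file, rather than rm on the individual files. Whether it is any faster on a disk this size is an assumption I have not measured, and I have not checked how it behaves with a leftover snapshot chain. A minimal sketch, where delete_vmdks is a made-up helper name and the default mode only prints what it would run:

```shell
# Print (default) or run a "vmkfstools -U" call for each VMDK descriptor
# in a directory. delete_vmdks is a hypothetical helper, not an ESXi command.
# Extent files (-flat, -delta, -sesparse, -ctk) are skipped, on the
# assumption that vmkfstools -U is given the descriptor, not the extent.
delete_vmdks() {
    dir=$1
    mode=${2:-dry-run}
    for f in "$dir"/*.vmdk; do
        [ -f "$f" ] || continue
        case $f in
            *-flat.vmdk|*-delta.vmdk|*-sesparse.vmdk|*-ctk.vmdk) continue ;;
        esac
        if [ "$mode" = "run" ]; then
            vmkfstools -U "$f"
        else
            printf 'vmkfstools -U %s\n' "$f"
        fi
    done
}
```

Review the printed commands first; only pass run as the second argument once the list is exactly what should go.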
Thanks for the response, and understood about the complexity of the operations. We always create our VMs as thick provisioned; however, I'm guessing an unnoticed delta file is effectively thin provisioned, so I can imagine it is quite convoluted.
The rm has now been running for three weeks solid, and other VMs on that datastore are still completely inaccessible. Oddly, if I'm reading the time given by ps properly, the actual command has only been running on the CPU for 1.24 seconds:
9946205 9946205 rm 1.242968
(via ps -cT, excess data snipped)
I realise that disk operations don't involve the processor much but it still seems a very low figure. Is there any way to see how much progress it's made? While I'd hate to interrupt it if it's five minutes from completion, I can't really let it run for months.
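The only approach I can think of is to sample the datastore's free space twice (for example the Available column from df) and extrapolate. A rough sketch of the arithmetic follows; eta_seconds is a name I made up, and it assumes VMFS releases space gradually while the rm is still running, which I don't know to be true. If the free space never moves, this tells me nothing.

```shell
# Estimate how many seconds remain until `remaining` more units are freed,
# from two free-space samples taken `interval` seconds apart.
# All sizes must use the same unit (e.g. MiB). Prints "unknown" when no
# space was freed between the samples. eta_seconds is a hypothetical helper.
eta_seconds() {
    before=$1
    after=$2
    interval=$3
    remaining=$4
    freed=$((after - before))
    if [ "$freed" -le 0 ]; then
        echo unknown
    else
        echo $((remaining * interval / freed))
    fi
}
```

For instance, if free space rose from 1000 to 1600 MiB in 60 seconds and 6000 MiB were still to be freed, eta_seconds 1000 1600 60 6000 would give 600 seconds.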
It also seems odd that it has entirely locked the disk for this long. Is there any reliable way, other than a physical reboot of the HV, to stop the process? And if we do reboot, what state is the datastore likely to be in - will the 4TB file be gone, or still present? If possible I think I'd prefer to stop the process, move the other VMs off that disk, and reformat it, or possibly replace it.