"our "VM overreserved" is running around 3.6TB"
You should check whether anything other than vswp Objects is Thick-provisioned, i.e. has OSR=100 (Object Space Reservation).
This can be checked from the Storage Policies that are in use, and also from the output of #esxcli vsan debug object list
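If the object list is long, a small script can do the filtering. The following is only a rough sketch: it assumes each record starts with an "Object UUID:" line and carries "proportionalCapacity:" and "Type:" fields, with "vmswap" as the type of vswp Objects. The sample text and UUIDs are made up for illustration, so compare the field names against the real output on your build before relying on it.

```python
import re

# Hypothetical excerpt; the layout and field names are assumptions,
# not captured output from a real host.
SAMPLE = """\
Object UUID: 11111111-aaaa
   Policy:
      stripeWidth: 1
      proportionalCapacity: 0
   Type: vdisk
Object UUID: 22222222-bbbb
   Policy:
      stripeWidth: 1
      proportionalCapacity: 100
   Type: vmswap
Object UUID: 33333333-cccc
   Policy:
      stripeWidth: 1
      proportionalCapacity: 100
   Type: vdisk
"""


def fully_reserved_non_swap(text):
    """Return UUIDs of objects with proportionalCapacity 100 that are not vmswap."""
    hits = []
    # Split the text so that each chunk begins at an "Object UUID:" line.
    for record in re.split(r"(?m)^(?=Object UUID:)", text):
        uuid = re.search(r"^Object UUID:\s*(\S+)", record, re.M)
        osr = re.search(r"proportionalCapacity:\s*(\d+)", record)
        kind = re.search(r"^\s*Type:\s*(\S+)", record, re.M)
        if not (uuid and osr):
            continue  # empty chunk or a record without a policy value
        is_swap = kind is not None and kind.group(1) == "vmswap"
        if int(osr.group(1)) == 100 and not is_swap:
            hits.append(uuid.group(1))
    return hits


print(fully_reserved_non_swap(SAMPLE))
```

On a host you would save the command output to a file and pass the file contents to fully_reserved_non_swap() instead of SAMPLE.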
"some VMs have > 32GB of RAM."
"- We don't over-commit RAM, so presumably regardless of the above, this is a fairly low-risk change for us"
If you are not over-provisioning physical memory, you could also consider allocating memory reservations. Thick-provisioned vswp Objects are sized as allocated memory minus reservation, so the more memory is reserved, the less space they consume (and it doesn't have to be a full reservation if you are over-provisioning).
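To put numbers on the sizing rule above, here is a minimal sketch. The VM sizes are hypothetical, and the replicas=2 default is my assumption that the vswp Objects are mirrored with FTT=1; adjust it to whatever applies in your cluster.

```python
def vswp_reserved_gb(configured_gb, reservation_gb, replicas=2):
    """Space reserved on the datastore by one thick-provisioned vswp Object.

    Object size is configured memory minus memory reservation; replicas is
    an assumed number of mirror copies (2 corresponds to FTT=1 mirroring).
    """
    if not 0 <= reservation_gb <= configured_gb:
        raise ValueError("reservation must be between 0 and configured memory")
    return (configured_gb - reservation_gb) * replicas


# Hypothetical 64 GB VM with increasing reservations.
print(vswp_reserved_gb(64, 0))   # no reservation
print(vswp_reserved_gb(64, 48))  # partial reservation
print(vswp_reserved_gb(64, 64))  # full reservation
```

Summing this over the powered-on VMs gives a rough idea of how much of the "VM overreserved" figure could be attributed to vswp.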
"- With only a single copy of swap - what happens if the storage in that host dies and that object is inaccessible? Presumably the VM must crash as it can't access the pages of RAM that have been swapped out?"
"Sparse" or thin-provisioned vswp doesn't reduce the FTT of the Objects; it only stops reserving the space that would be required if your VM needed to swap data to vswp due to memory contention. In effect it makes the Objects Thin-provisioned, so if the datastore ran out of available space they could be impacted, causing the VM to be stunned or to crash (which is why reservations might be beneficial).
"- When disabling this cluster-wide, presumably it will result in a large amount of data being removed from VSAN, which will necessitate a disk rebalance. Are there any other considerations?"
This requires changing the setting on all hosts (vswp attributes are dictated by host policies, not normal Storage Policies) and power-cycling the VMs for it to take effect. vswp Objects are transient and are deleted when a VM is powered off, so little or no rebalance should be required; they are also relatively small and thus should be fairly well distributed amongst the cluster's disks.
This is super useful, thanks very much.
I tried to run "esxcli vsan debug object list" but I get an unknown command error. Is there something I need to enable first?
Happy to help and you are most welcome.
esxcli vsan debug was only added in ~6.6 (6.5 U1), so if you are on an earlier version it is not available.
You can always use RVC instead, e.g. from cluster level (with no resource pools in use): >vsan.vm_object_info ./resourcePool/vms/*
Or generate the data on a host using:
# python /usr/lib/vmware/vsan/bin/vsan-health-status.pyc > /tmp/healthOut.txt