Solved: vsan physical disk capacity full

andvm · ‎05-29-2019

Hi,

One of the physical disks forming the vSAN is full and shows an error in the vSAN capacity health check.

This is the likely cause of a VM which is complaining that its vmdk has no free space (yellow bar with click retry after adding more space on the datastore) - VMware Knowledge Base

There is a vSAN re-sync going on which is almost complete but the specific physical disk usage has not changed (still fully used). Should this fail to free up disk space from the specific physical disk, is there any manual intervention that can be done?

As additional info - There is around 12% free disk space on the vSAN datastore, one host forming the same vSAN had been placed in maintenance and its data had been fully migrated.

Thanks

TheBobkin · ‎05-30-2019

Hello andvm,

"migrated a few others to another storage"

Apologies for not mentioning this least intrusive (but not fastest) option - a lot of the time in support it isn't available, because so many of our customers (the ones that have space issues anyway!) have vSAN-only storage and/or only a single vSAN per site.

"Since % free disk is not enough, what should be checked before placing a host in MM with FDM to ensure no impact on the vSAN environment?"

Check the Health check for space usage 'After 1 additional host failure'; it should indicate the free space after one host is in MM with FDM. If usage there is 90%+ then I would advise freeing up some space before the MM. I'm fairly sure later builds show better metrics for usage, calculated per MM mode when you select the MM option. Again it comes down to where you have the free space, not just how much. This gets more complicated if you have any relatively large Objects that are already using large chunks of separate Fault Domains (and thus limiting where the evacuated data replica can be moved to). In 6.7 we have much improved logic for coping with these factors, especially striping LSOM-Objects adaptively to fit in whatever spaces are available.
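As a rough back-of-the-envelope version of that check, the sketch below estimates cluster usage once one host's capacity is taken away by MM with FDM. It assumes all data can be spread freely over the remaining hosts, which ignores Fault Domain and large-Object placement limits, so treat the Health check as the authoritative number. The function name and the 10 TB per host figure are illustrative assumptions, not anything from vSAN itself.

```python
def projected_usage_after_evacuation(used_tb, host_capacities_tb, evacuated_host):
    """Estimate the used fraction once one host's capacity is removed.

    Simplification: assumes evacuated data can land anywhere on the
    remaining hosts (no Fault Domain or component-size constraints).
    """
    remaining = sum(
        cap for i, cap in enumerate(host_capacities_tb) if i != evacuated_host
    )
    if remaining <= 0:
        raise ValueError("no capacity left after evacuation")
    return used_tb / remaining


# Hypothetical cluster: 6 hosts of 10 TB each, 12% free overall (88% used).
capacities = [10.0] * 6
used = 0.88 * sum(capacities)
ratio = projected_usage_after_evacuation(used, capacities, evacuated_host=0)

# 90%+ projected usage is the point at which to free space first.
needs_cleanup = ratio >= 0.90
print(f"projected usage: {ratio:.1%}, free space first: {needs_cleanup}")
```

With these numbers the data would not fit on five hosts at all, which is consistent with a disk filling up during the evacuation.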

"Am I right that vSAN will do automatic re-sync when any physical disk is over 80% of usage (vsan 6.2)"

Yes with the default settings (ClomRebalanceThreshold - can be read or set via esxcfg-advcfg or the Advanced host settings in the GUI). One thing to note is that if a disk becomes 100% full during this sort of operation, it is generally not just due to the initial placement of the MM components but also to growth of the components already on that disk (and the 6.2 logic was not as smart as the current logic).

"can I remove and force this to be re-created with ver3 until the upgrade to ver 5 is properly planned?"

Yes, legacy virsto will work for this - just don't forget to change it back to the default for the build version later.

Edit: 6.6 and 6.7 have better logic for smart resync but from my experience it works better in 6.7, good info here:

Intelligent Rebuilds using Enhanced Rebalancing | Intelligent Rebuilds in vSAN 6.6 | VMware

Bob


TheBobkin · ‎05-29-2019

Hello andvm,

This can be a very problematic issue and I would advise opening a Support Request with VMware support immediately if you have not done so already.

"This is the likely cause of a VM which is complaining that its vmdk has no free space (yellow bar with click retry after adding more space on the datastore)"

The VMs can't write to disk if even one of their data-components resides on the full capacity-disk - stop clicking this until you free up adequate space and have the situation under control.

"There is a vSAN re-sync going on which is almost complete but the specific physical disk usage has not changed (still fully used)."

What specifically is the reason for the resync, e.g. is it attempting a reactive-rebalance to move data off the full disk, or is it still moving data as a result of putting the node in maintenance mode with FDM (i.e. the host won't have entered MM yet)?

In later versions the resync intent should be indicated in the Health checks.

"Should this fail to free up disk space from the specific physical disk, is there any manual intervention that can be done?"

Yes. Delete any test or unneeded VMs you have in inventory, and identify anything on the datastore that wasn't removed (e.g. unregistered VMs that are no longer needed). Consolidate snapshots (start with relatively small vmdk snapshots/disks and don't attempt more than 3-4 at once or you may just slow it down). If anything is Thick/OSR=100/proportionalCapacity=100, intentionally or otherwise, then consider thinning it, but don't do this unless you know what you are doing or you could incur more resync (as a result of deep-reconfig of the Object(s)). Changing some unimportant data to FTT=0 could be a last option, but again it is not something to do unless you understand SPBM (e.g. if you try to change an FTT=1,FTM=RAID5 Object to FTT=0 it will temporarily create a new FTT=0,FTM=RAID1 Object and only remove the RAID5 Object once complete).
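To put numbers on the RAID5 to FTT=0 example, here is a minimal sketch of the space an Object occupies before, during and after such a policy change. It assumes the usual raw-capacity multipliers (1x for FTT=0, 2x for FTT=1 RAID1, about 1.33x for FTT=1 RAID5), ignores witness components, and assumes the new copy is fully built before the old one is removed. The names are illustrative only.

```python
# Raw-capacity multipliers per (FTM, FTT); witness components are ignored.
FOOTPRINT = {
    ("RAID1", 0): 1.0,
    ("RAID1", 1): 2.0,
    ("RAID5", 1): 4.0 / 3.0,
}


def reconfig_space_gb(size_gb, old_policy, new_policy):
    """Return raw space used before, at peak, and after a deep reconfig.

    Assumption: the new layout is built in full alongside the old one,
    and the old one is only deleted once the new one is complete.
    """
    old = size_gb * FOOTPRINT[old_policy]
    new = size_gb * FOOTPRINT[new_policy]
    return {"before": old, "peak": old + new, "after": new}


# Hypothetical 100 GB Object going from FTT=1/RAID5 to FTT=0/RAID1.
usage = reconfig_space_gb(100, ("RAID5", 1), ("RAID1", 0))
print({stage: round(gb, 1) for stage, gb in usage.items()})
```

So a change that eventually saves about 33 GB first needs roughly another 100 GB free, which is why it is a last resort on a cluster that is already full.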

"There is around 12% free disk space on the vSAN datastore"

It's not about how much you have, it's where you have it - if you have 0% free on a disk then it can't update the components on that disk and everyone else will end up waiting on it.
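A small sketch of that point, using made-up numbers: six capacity disks that add up to 12% free on the datastore while one of them has nothing left. The disk names and sizes are hypothetical, and the 80% figure is the default rebalance threshold mentioned elsewhere in this thread.

```python
def datastore_free_fraction(disks):
    """Aggregate free fraction across all disks: {name: (capacity_gb, used_gb)}."""
    total = sum(cap for cap, _ in disks.values())
    used = sum(u for _, u in disks.values())
    return (total - used) / total


def disks_over(disks, threshold):
    """Names of disks whose used fraction is strictly above the threshold."""
    return [name for name, (cap, u) in disks.items() if u / cap > threshold]


def full_disks(disks):
    """Names of disks with no free space left."""
    return [name for name, (cap, u) in disks.items() if u >= cap]


# Hypothetical cluster: one 1000 GB capacity disk per host, unevenly filled.
disks = {
    "disk-1": (1000, 1000),
    "disk-2": (1000, 900),
    "disk-3": (1000, 880),
    "disk-4": (1000, 860),
    "disk-5": (1000, 840),
    "disk-6": (1000, 800),
}

free = datastore_free_fraction(disks)
print(f"datastore free: {free:.0%}")
print("over 80% used:", disks_over(disks, 0.80))
print("completely full:", full_disks(disks))
```

The datastore-level figure looks survivable, but any Object with a component on disk-1 is blocked until space is freed on that specific disk.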

"one host forming the same vSAN had been placed in maintenance and its data had been fully migrated."

How many nodes with how many Disk-Groups are in the cluster, and what FTM (Fault Tolerance Method) and FTT are in use?

Why did you put a host in MM with FDM when you had inadequate free space (considering you should always have adequate overhead)?

If it is not in MM due to some failure/critical maintenance on it, then you should consider taking it out of MM (and it really should have been put in MM with Ensure Accessibility otherwise).

Bob

andvm · ‎05-30-2019

Thank you for the detailed reply, my feedback below:

vSAN resync was attempting reactive-rebalance (Host was already in MM)

Host placed in MM with FDM in preparation for a firmware/ESXi upgrade

Deleted unneeded VMs, migrated a few others to another storage and vSAN resync eventually migrated data from the full disk

6 nodes, 1 disk group and FTT 1 (default vSAN storage policy)

A few questions:

Since % free disk is not enough, what should be checked before placing a host in MM with FDM to ensure no impact on the vSAN environment?

Am I right that vSAN will do automatic re-sync when any physical disk is over 80% of usage (vsan 6.2)

Following a host upgrade (and change of Storage controller from RAID0 to HBA mode) the newly re-created disk group on the host has vsan disk format ver 5 while the rest are on ver 3, can I remove and force this to be re-created with ver3 until the upgrade to ver 5 is properly planned?

Regarding my last point, it looks like this is possible by changing an advanced setting on the host before re-creating the disk group. I need to force it to ver 3 so that it is the same as on the other hosts: https://kb.vmware.com/s/article/2146221

TheBobkin · ‎05-30-2019