VMware Cloud Community
Bruticusmaximus
Enthusiast

Not enough space to power on VM

I have a 3-node vSAN cluster. Total capacity is 22TB and I have 4.9TB free, using an FTT of 1. I'm not able to power on a 500GB thin-provisioned VM because of a lack of disk space; the VM is using 405GB on disk. VMware tells me that there should be enough free space to power on the VM, but it's v5.5, so they won't troubleshoot it.

- there are no bad disks

- there are no snapshots on any VMs

- there are no orphaned VMs

- I have deleted old log files

Any ideas as to what can be going on?

Any ways for me to free up more space?

If I let the VM sit for a few hours, it will power on even though the free space is still 4.9TB.

Thank you in advance

5 Replies
TheBobkin
Champion

Hello Bruticusmaximus,

It's not just a matter of how much space you have free on vsanDatastore, but where that free space is and whether there is sufficient free space on enough Fault Domains (nodes, here) to create the .vswp. In this case the required space would be 2x the size of the VM's allocated memory (minus reservation), plus 16MB on a 3rd node for witness metadata.
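That sizing rule can be sketched as quick shell arithmetic; the 4GB memory figure below is just a hypothetical example, not a value from this thread:

```shell
# Rough space estimate for a vSAN FTT=1 VM's .vswp object.
# Sizes in MB; the witness component is ~16MB of metadata on a third node.
MEM_MB=4096          # example: VM with 4GB of allocated memory
RESERVATION_MB=0     # example: no memory reservation
VSWP_MB=$((MEM_MB - RESERVATION_MB))
# FTT=1 mirrors the .vswp across two nodes, plus the witness metadata:
TOTAL_MB=$((VSWP_MB * 2 + 16))
echo "Each replica: ${VSWP_MB} MB; total vSAN space needed: ${TOTAL_MB} MB"
```

So a 4GB VM with no reservation would need roughly 8GB of free space spread across two nodes, plus the small witness, just to power on.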

However, you should clarify the exact reason for the failure to power on the VM; there can be a number of causes, including:

1. Failure to create the .vswp (as per above).

2. Failure to write to any of the VM's vmdks due to the capacity-tier disks that these reside on being out of space.

3. Unsupported configurations with -flat.vmdk or other 'file' data residing in the VM's namespace.

4. Cluster is partitioned at the time of attempted power on (but this would likely be obvious from other symptoms e.g. VMs becoming inaccessible).

5. Component limit has been reached (the max for this was far lower in 5.5 compared to modern versions of vSAN).

The cause can be identified from the error message in vSphere, the vmware.log of the VM, and the vmkernel.log and clomd.log from the host the VM is registered on.

Please log into the vCenter managing this cluster and share the output from vsan.disks_stats <pathToCluster> and vsan.check_limits <pathToCluster> when the VM is failing to power on.

https://www.virten.net/2015/01/manage-virtual-san-with-rvc-complete-guide/
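For anyone unfamiliar with RVC, a rough sketch of how those commands might be run; the credentials and inventory path below are examples, not values from this environment:

```shell
# Connect RVC to the vCenter managing the cluster (typically run on the
# vCenter server itself; username/path here are hypothetical):
rvc administrator@localhost
# Then, inside the RVC shell, run the commands against the cluster path:
#   vsan.disks_stats /localhost/MyDatacenter/computers/MyCluster
#   vsan.check_limits /localhost/MyDatacenter/computers/MyCluster
```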

Bob

Bruticusmaximus
Enthusiast

VMware did the calculations and said "Yup, there's enough space to power on the VM". That being said, they first tried to tell me that the space issue was within the OS of the VM, until I said "How would ESX know if the OS has a full disk? The VM isn't even running VMware Tools". While they did get back to me very quickly and the tech was very nice, I'm not sure their heart was in it once they saw it was 5.5.

1. Failure to create the .vswp (as per above).

          The VM is 4GB with no reservation.

2. Failure to write to any of the VM's vmdks due to the capacity-tier disks that these reside on being out of space.

          VMware checked this out and calculated that there should be enough space.

3. Unsupported configurations with -flat.vmdk or other 'file' data residing in the VM's namespace.

     The error message specifically calls out the one thin-provisioned disk.

4. Cluster is partitioned at the time of attempted power on (but this would likely be obvious from other symptoms e.g. VMs becoming inaccessible).

          We haven't had any other issues that I'm aware of

5. Component limit has been reached (the max for this was far lower in 5.5 compared to modern versions of vSAN).

          I'm not sure what this means, but there are only 8 VMs running in this cluster, all very low-workload VMs: file server, DNS, DHCP, print server.

Thanks for your input.  I'm going to run those commands and post the output here.

TheBobkin
Champion

You have disks there that are completely full - if data-components of a thin-provisioned vmdk reside on these, then it won't be able to power on (and a powered-on VM would show other signs, e.g. messages stating it can't write to disk, and/or the guest becoming read-only). As for why this works at one time and not another: reactive rebalance may have moved something else off the disk and thus freed up space, or other transient data was removed (e.g. the VM was power-cycled, which removes its .vswp, or snapshots were consolidated).

The disks that show 0% used are cache-tier devices; this is entirely expected, as we don't store data-components on these (also, I think you mean 'rebalance' not 'resync', plus I am fairly sure Proactive Rebalance doesn't exist in 5.5).

The other thing to note (from the Reserved %) is the vast majority of the data in this cluster looks to be Thick-Provisioned either from using disk-primitives (e.g. Thick Eager Zeroed) or from using a Storage Policy with 'Object Space Reservation=100' in its rules.

You should consider thinning some of the Objects that have data residing on the worst-impacted disks (the ones that say 99.92% used). Figuring out which Data Objects these are isn't exactly straightforward (in 6.2 and later it is 4 clicks in the UI), so the easiest way to determine this yourself is likely to print all the VM info out via RVC, copy it into a notepad and ctrl+F for the capacity-tier disks in question (make sure your session window is extended to a couple thousand lines):

# vsan.vm_object_info <pathToVMs>*

(e.g. from the cluster level it would be # vsan.vm_object_info ./resourcePools/vms/* )
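As an alternative to the copy-into-notepad approach, the output can be captured to a file and searched directly. A sketch, assuming RVC's -c option for running commands non-interactively; the inventory path, vCenter address and disk UUID are all hypothetical examples:

```shell
# Dump the object info for all VMs in the cluster to a file:
rvc -c "vsan.vm_object_info /localhost/MyDC/computers/MyCluster/resourcePools/vms/*" \
    -c "quit" administrator@vcenter.example.com > vm_objects.txt
# Then search it for the full capacity-tier disk's identifier:
grep -n "naa.600508b1001c0000" vm_objects.txt
```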

As an aside: I would advise obfuscating your info there (or editing your comment to remove output) as this is a public website and there is lots of potentially sensitive information there (host names, company name, cluster location etc.).

Bob

Bruticusmaximus
Enthusiast

Good point about the info I posted.  Thanks.

I may manually kick off a rebalance.

Usually, to go from thick-provisioned to thin-provisioned, I'd do a Storage vMotion.  With only 1 datastore, how can I easily get a thick-provisioned VM to thin-provisioned?

TheBobkin
Champion

"I may manually kick off a rebalance."

I don't think you can - as I said above, I'm fairly sure this was only added in 6.0.

"Usually, to go from thick provisioned to thin provisioned, I'd do a storage vmotion.  With only 1 datastore, how can I easily get a thick provisioned VM to thin provisioned?"

This depends on whether it is Thick from the disk-primitive setting, Thick from a Storage Policy, or unintentionally Thick from using the C# Client for migration (which is not SPBM-aware). If it is the first, you could use vmkfstools -i to clone the vmdks with the Thin flag (and -W vsan), but unfortunately this requires the VM to be powered off. If it is the second, clone the Storage Policy that has 'Object Space Reservation=100', change the rule to OSR=0 and apply it to some data - this shouldn't cause a resync, but err on the safe side: apply it to a few Objects at a time and validate it isn't causing a resync, as this would temporarily use more space. If it is the last possibility - the VM is not Thick at the primitive level, has an OSR=0 Storage Policy (and is compliant), but the vmdk shows as proportionalCapacity=100 in RVC (check via vsan.vm_object_info or vsan.object_info) - then let me know and I can provide a solution.
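A rough sketch of the vmkfstools route described above; all paths and file names are examples, and the VM must be powered off first:

```shell
# Clone the thick vmdk to a new thin-provisioned vmdk on the vSAN datastore
# (-d thin requests the thin format; -W vsan targets the vSAN object store):
vmkfstools -i /vmfs/volumes/vsanDatastore/myvm/myvm.vmdk \
           -d thin -W vsan \
           /vmfs/volumes/vsanDatastore/myvm/myvm_thin.vmdk
# After verifying the new disk, attach it to the VM in place of the old one
# and delete the original thick vmdk to reclaim the space.
```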

Bob
