Re: vSAN 6.5.0d out of space scenario

Finikiez · ‎02-19-2018

Hi!

Yesterday I ran into a bad situation with vSAN that I want to share.

This is not a production environment, but still.

We have:

- 9 ESXi hosts 6.5.0d build 5969303 in a single cluster

- 1 VCSA 6.5.0 build 6816762

- one vSAN datastore on top of the 9 ESXi hosts

No external storage arrays.

The vSAN datastore ran out of disk space. At the same time we experienced the following:

1. Because vCenter Server runs inside the cluster, vpxd stopped and we lost the Web Client.

2. SSH on VC was disabled as well.

3. Logging in through the VAMI page failed with an "Authentication failure" error, although the user name and password were definitely correct.

4. We couldn't connect to the Host Client on any host; login failed with a timeout.

5. Login through the DCUI was extremely slow (it took about 10 minutes). It also wasn't possible to enable SSH on the hosts from there; the task hung forever.

Luckily we had SSH enabled on 3 hosts. From the logs I saw that hostd had hung, but restarting the management agents on these hosts didn't help much. We still couldn't log in with the Host Client.

Also, 'ls' inside /vmfs/volumes/vSandatastore hung.

What helped us:

1. As we had SSH enabled, I could enable thin swap:

esxcfg-advcfg -s 1 /VSAN/SwapThickProvisionDisabled

2. Then I powered off and powered on the VMs on these 3 hosts. This freed some space so vSAN could finish the resync.
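A rough way to see why the power cycle frees space: the swap object is created at power-on and removed at power-off, so only a VM that is powered on after the setting change gets a thin swap object. The sketch below estimates how much capacity one thick swap object holds. It assumes the usual vSphere rule that swap size is configured memory minus memory reservation, and it assumes the swap object is stored as 2 copies on vSAN (a mirror with one failure to tolerate). The function name and the sample numbers are only for illustration.

```shell
#!/bin/sh
# swap_footprint_gb MEM_GB RESERVED_GB [COPIES]
# Estimates the raw vSAN capacity held by one thick-provisioned swap object.
# COPIES defaults to 2 (assumption: mirrored swap object, one failure tolerated).
swap_footprint_gb() {
    mem_gb=$1
    reserved_gb=$2
    copies=${3:-2}
    echo $(( (mem_gb - reserved_gb) * copies ))
}

# Example: a 16 GB VM with no memory reservation
swap_footprint_gb 16 0   # prints 32
```

With thin swap the object only consumes what is actually swapped out, so most of that estimate comes back after the VM is powered off and on again.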

After these steps I managed to log in to the hosts with the Host Client, restart vCenter Server, and finish the other tasks.
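For anyone repeating this, here is a minimal sketch of step 1 that also reads the option before and after setting it. It uses only the -g (get) and -s (set) switches of esxcfg-advcfg. The function name and the ADVCFG variable are hypothetical helpers added here so the command can be swapped out for a dry run; they are not part of ESXi.

```shell
#!/bin/sh
# Sketch: switch newly created vSAN swap objects to thin provisioning on one host.
OPTION=/VSAN/SwapThickProvisionDisabled
ADVCFG=${ADVCFG:-esxcfg-advcfg}   # hypothetical override, handy for dry runs

enable_thin_swap() {
    # Stop early when the tool is missing (for example, not on an ESXi shell).
    if ! command -v "$ADVCFG" >/dev/null 2>&1; then
        echo "$ADVCFG not found; run this on an ESXi host" >&2
        return 1
    fi
    "$ADVCFG" -g "$OPTION"      # show the current value
    "$ADVCFG" -s 1 "$OPTION"    # 1 = create swap objects thin
    "$ADVCFG" -g "$OPTION"      # confirm the change
}
```

Run enable_thin_swap over SSH on each host. The setting is per host and only affects swap objects created afterwards, which is why the VMs had to be power cycled.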

Conclusion:

1. To me it looks really strange that a full vSAN datastore causes trouble with access to the hosts. Even when it's full, I expect hostd to keep working properly.

2. I doubt we could have recovered from this issue if SSH had been disabled on all hosts.

sarikrizvi · ‎02-19-2018

Yes, it's because the vSAN datastore ran out of space.

1. Check disk state in RVC.

vsan.disks_stats ~cluster~

2. Check the vpxd log and try to find the error that explains why it stopped.

/var/log/vmware/vpx/vpxd.log

And share a screenshot if possible.
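If you can get a shell on the vCenter appliance, a quick filter over that log might look like the sketch below. The search patterns are guesses at typical out-of-space wording, not messages confirmed from vpxd, and scan_log is a made-up helper name.

```shell
#!/bin/sh
# scan_log [FILE]
# Prints the last 20 lines of FILE that look like out-of-space errors.
# The patterns are assumptions; adjust them to what the log really contains.
scan_log() {
    log=${1:-/var/log/vmware/vpx/vpxd.log}
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 1
    fi
    grep -iE 'no space left|out of space|disk full' "$log" | tail -n 20
}
```

Calling scan_log with no argument reads the path given above; pass another file name to check a rotated log.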

Finikiez · ‎02-19-2018

1. Check disk state in RVC.
vsan.disks_stats ~cluster~

2. Check the vpxd log and try to find the error that explains why it stopped.
/var/log/vmware/vpx/vpxd.log

I know why vpxd stopped; that's obvious.

The main problem was that I couldn't manage vCenter in that situation: hostd on the hosts was unresponsive, which broke the Host Client. I also had SSH disabled on the vCenter Server, so I couldn't do anything with it at all.

I'm totally confused that a full vSAN datastore causes such problems with managing the whole infrastructure.

sarikrizvi · ‎02-19-2018

It's a vSAN bug: whenever any disk group fills up and reaches 99%, it affects all other disks in all disk groups, which also fill up to 99%. vSAN then becomes full and creates a bad situation like yours.

To fix this, you have to add more hosts with enough capacity or power off unneeded VMs so that the resync can complete.
