While it is not ideal to shut down vSAN nodes while there is an active resync, doing this moderately cleanly certainly beats them losing power and going down hard.
Thus in this case the best option is to power down all VMs and put all nodes in Maintenance Mode with 'No Action' and power them down gracefully - try and ensure that all hosts are put in MM within a reasonable time-frame of each other and ensure ALL hosts are booted before taking any of them out of MM (together) after the outage.
Resyncing will resume following this.
To enter MM with the 'No Action' option, and to exit it again, via CLI:
# esxcli system maintenanceMode set -e true -m noAction
# esxcli system maintenanceMode set -e false
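For the power-up side, the "ensure ALL hosts are booted before taking any of them out of MM" advice could be scripted roughly as below. This is only a sketch under assumptions: the host names, the `RUN_CMD` dry-run switch and the `CHECK_CMD` reachability probe are all hypothetical, and key-based SSH as root to each host is assumed. By default it only prints what it would run.

```shell
#!/usr/bin/env bash
# Sketch only: HOSTS, RUN_CMD and CHECK_CMD are hypothetical names.
HOSTS=(esx1 esx2 esx3 esx4)      # hypothetical host names

# Dry-run by default (prints the commands); set RUN_CMD=ssh to really run them.
RUN_CMD=${RUN_CMD:-echo}
# Hypothetical probe taking a host name and returning 0 when that host is up.
# The default 'true' treats every host as up; replace it with your own check.
CHECK_CMD=${CHECK_CMD:-true}

all_hosts_up() {
  local host
  for host in "${HOSTS[@]}"; do
    "$CHECK_CMD" "$host" || return 1
  done
}

exit_maintenance_mode() {
  local host
  if ! all_hosts_up; then
    echo "Not all hosts are up yet; leaving Maintenance Mode untouched." >&2
    return 1
  fi
  # Take all hosts out of MM together.
  for host in "${HOSTS[@]}"; do
    "$RUN_CMD" "root@${host}" "esxcli system maintenanceMode set -e false" &
  done
  wait
}

exit_maintenance_mode
```

The point of the gate is that no host leaves MM unless every host passed the probe, so the cluster does not start resyncing against members that are still booting.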
Thank you! I wasn't able to find this info anywhere. Yes, I am doing the MM with No Action. I have 4 nodes and am currently logging into each one and shutting it down, so they are going down fairly close to each other anyway... though I was thinking of forking off 3 processes to do the first 3 at the same time. My UPS-monitoring system runs on the fourth, so when that goes down the VM just dies with it, but it needs to go down last of course. Thanks again for your help, much appreciated.
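The "fork off 3 processes, then do the fourth last" idea could look something like the following. It is a sketch, not a tested shutdown procedure: the host names and the `RUN_CMD` dry-run switch are hypothetical, key-based SSH as root is assumed, and it does not power off any VMs for you. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch only: host names and RUN_CMD are hypothetical.
FIRST_HOSTS=(esx1 esx2 esx3)   # hypothetical: hosts that can go down in parallel
LAST_HOST=esx4                 # hypothetical: host running the UPS-monitoring VM

# Dry-run by default (prints the commands); set RUN_CMD=ssh to really run them.
RUN_CMD=${RUN_CMD:-echo}

shutdown_host() {
  local host=$1
  # Enter Maintenance Mode with 'No Action', then power off gracefully.
  "$RUN_CMD" "root@${host}" \
    "esxcli system maintenanceMode set -e true -m noAction && esxcli system shutdown poweroff -r 'planned power outage'"
}

shutdown_cluster() {
  local host
  for host in "${FIRST_HOSTS[@]}"; do
    shutdown_host "$host" &      # one background process per host
  done
  wait                           # all parallel shutdowns have been issued
  shutdown_host "$LAST_HOST"     # the monitoring host always goes last
}

shutdown_cluster
```

Running the first three in the background keeps the hosts entering MM within a short window of each other, and the `wait` guarantees the last host is only touched once the others have been dealt with.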
I actually had this happen to me a month or so ago, with a resync still occurring when we were shutting down our cluster for a power cut. I also ended up issuing the same commands listed above to force maintenance mode before the UPSes ran out of power.
On power-up after power was restored, I ran into the issue described in "Diskgroups fail to mount due to heap exhaustion (2150566)" on one of our nodes, and we couldn't get the diskgroup to read at all. It would get stuck at VSAN Initializing for 14 hrs before throwing a heap error. We never got that diskgroup to read again: even after applying the new heap size and sitting through another 14 hr reboot, it gave us a new error about the diskgroup being invalid and not being able to initialize it. We ended up just restoring the data lost from that diskgroup from backups, as that was going to be faster than almost any other resolution through support.
Short story: if you shut down the cluster during a resync, it will be in a degraded state. I believe that if I had not encountered the heap exhaustion issue on startup I would've been fine, but with it I had some data loss, since I was already in a partially unprotected state.
This isn't the first time I've seen the heap exhaustion issue either. I really think VMware should set the defaults much higher, or put in logic to detect when large disk groups are being used, set the heap size higher and alert that a host restart is necessary.
"I wasn't able to find this info anywhere"
Sorry to hear that!
A major reason I decided to help on vSAN Communities is the unfortunate fact that some vSAN troubleshooting info can be very hard to find publicly online. 99% of the time I research questions on here using external resources only, so these resources are out there; you just need to know the context and/or syntax.
Another thing I would like to point out is to check via DCUI/out-of-band management that the hosts DO actually go down fully once power-off has been initiated. I have often seen that the vSAN layers can take longer than regular ESXi to go down, depending on the state of the host/cluster, so using as much of the UPS battery time as possible to allow for this may be more beneficial than other things such as keeping vCenter/other VMs up until the last minutes (getting everything into MM and down as close together as possible is key).
"since I was already in a partially unprotected state"
What happened with the cluster preceding this and what build was the cluster running on?
"I really think VMWare should set the defaults much higher or put in logic to detect if large disk groups are being used and to set the heap size higher"
Unfortunately the heaps and disk-group heaps involved here are not so linear. For a start, increasing this base LSOM heap will consume an additional 1.75GB of memory per DG. Sure, this is relatively little for most typical set-ups, but there are set-ups where this footprint may add a proportionally significant memory reservation/consumption, which isn't ideal.
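To put rough numbers on that, here is a quick back-of-the-envelope calculation. It uses only the 1.75GB-per-DG figure quoted above; the host memory sizes and disk-group counts are hypothetical examples, not recommendations.

```shell
#!/usr/bin/env bash
# Extra memory consumed by raising the LSOM heap, using the 1.75 GB per
# disk group figure quoted above. Host sizes below are hypothetical.
heap_overhead() {
  local disk_groups=$1 host_mem_gb=$2
  awk -v dg="$disk_groups" -v mem="$host_mem_gb" 'BEGIN {
    extra = dg * 1.75
    printf "%.2f GB extra (%.1f%% of %d GB)\n", extra, 100 * extra / mem, mem
  }'
}

heap_overhead 2 512   # large host with two disk groups
heap_overhead 5 64    # small host with five disk groups
```

On the large host the extra reservation is well under 1% of memory, while on the small host it is over a tenth of it, which is the kind of set-up where a higher default would hurt.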
It is also my understanding that it is not just about the size of the capacity-disks (in older builds this was more so an issue) but the state of the cache-tier fill and the amount of data of the data-components placed for resync/rebuild on that DG (and whether the capacity-tier drives have any other issues such as reaching 100% full).
A lot of changes are iteratively being made in the background to how vSAN handles many of the factors involved in these corner situations. Most of these changes don't appear in the external change logs on updates unless they are associated with a more common issue, which disk-group mount failure hasn't really been (and in my experience these failures only seem to occur as a result of other issues).
This was on a VSAN 6.2 (ESXi 6.0U3) cluster, 6 nodes, all objects FTT=1. It happened to hit 82% capacity on the day of a planned power outage and the cluster was rebalancing space when we had to force it into maintenance mode and power off.
On power-on, one node took almost half a day to boot (stuck on VSAN Initializing). The cluster was put back online and there were already some objects inaccessible (we figured not to worry, let the resync finish and things should be fine). However, a few hours in, a node dropped its disk group (with the memory heap error message), and at this point we saw even more things go inaccessible. Resync then went nuts and we had 60TB+ of resync traffic, which would take days. We more or less gave up waiting at this point: we would have needed to wait days for the resync to finish and then reboot the problem node to see if the disk group would come back, when we could spend a few hours just restoring the lost data.
So we opted to down the problem node and do data restorations. After the restorations, we tried to bring the downed node and its disk group back online just to see if VSAN would be able to read the old data, even though we didn't need it. However, nothing we tried would get the disk group back. At this point we didn't need the DG, so we didn't bother with VMware support and just deleted/recreated it.
Which brings up something unrelated: can the VSAN disk initialization on boot show a better status, so that if it's going to take 12+ hours I have some idea of that? Even displaying a status every 10 minutes or so would be okay. Right now, even in 6.6, using alt-f12 to view vmkernel.log will literally show no change for hours even though it is processing disk log entries. I understand it's not supposed to take 12+ hrs (I see most nodes take no longer than 15 minutes to VSAN initialize), but each time I've engaged with support about this topic I have just been told it happens with larger disk groups and cache sizes. I'm using 2TB cache disks on hybrid with a 10% flash/capacity ratio. While rare, I've seen this happen enough times during upgrades and also after some cluster shutdowns/startups, and I have no idea what causes such a large difference in boot times. Not knowing the initialization status becomes problematic, as I don't expect cluster startups to take this long (did it crash? did it get stuck? is it still doing work?).
Just saw new release notes for ESXi 6.5 Patch2:
During the reboot of a vSAN enabled ESXi host, the host screen displays a message VSAN: Initializing SSD: <...> Please wait... and it does not change although processes run in the background so it might seem like initialization hangs. With this fix, periodical status messages on the vmkernel.log will be available to monitor background work.
Yay! No more wondering and waiting to see whether it's stuck or really doing something.
ESXi/vSAN is fairly vocal when there is an issue, so generally one can safely assume any job in progress *is* working in the absence of any indication of a problem. But sure, this is a nice addition, as I have seen a few cases of impatience getting the better of people, who rebooted and went back to square one of a longer-than-expected DG-init caused by disk issues.
Link to patch notes for anyone interested as there are a couple of other minor vSAN fixes: