vSAN Cache getting stuck on ESXi boot

mune3b · ‎04-17-2019

Greetings,

The hypervisor is hanging during boot. The displayed message is:

vSAN Cache 52d1b6d6-61cc-aaec-afc5-16321f89c298: Log Recovery: 282588 of -282532 (99%) blocs. 34s so far, -

It stays in this state for quite a long time. Any help would be appreciated.

TheBobkin · ‎04-17-2019

Hello mune3b

Please provide more information about what you consider 'a long time' here - have you left it to try to complete for hours, days?

What build of ESXi is installed?

Are you rebooting this node following some issue with disks/controller and/or power-outage?

If you press Alt+F12 it will display the vmkernel logging for further information.
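If you can copy the logging out as text, a short script can show whether the recovery counter is still moving or is truly stalled. This is only a minimal sketch: the pattern is modelled on the console message quoted in the first post, and it is an assumption that the vmkernel log uses the same wording. The function names are hypothetical.

```python
import re

# Pattern modelled on the console message quoted in this thread.
# Assumption: lines in the vmkernel logging use the same wording.
LOG_RECOVERY = re.compile(
    r"vSAN Cache (?P<uuid>[0-9a-f-]+): Log Recovery: "
    r"(?P<done>-?\d+) of (?P<total>-?\d+) \((?P<pct>\d+)%\)"
)


def parse_recovery_lines(lines):
    """Return (uuid, done, total, pct) for every line that matches."""
    samples = []
    for line in lines:
        m = LOG_RECOVERY.search(line)
        if m:
            samples.append(
                (m["uuid"], int(m["done"]), int(m["total"]), int(m["pct"]))
            )
    return samples


def is_advancing(samples):
    """True if the 'done' counter differs between first and last sample."""
    return len(samples) >= 2 and samples[0][1] != samples[-1][1]


if __name__ == "__main__":
    captured = [
        "vSAN Cache 52d1b6d6-61cc-aaec-afc5-16321f89c298: "
        "Log Recovery: 282588 of -282532 (99%) blocs. 34s so far, -",
    ]
    samples = parse_recovery_lines(captured)
    print(samples)
    print("advancing:", is_advancing(samples))
```

Feed it lines captured a few minutes apart. If the first number never changes, recovery is most likely stuck rather than slow.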

Bob

mune3b · ‎04-17-2019

It has been about 5 hours and it's still in this state. ESXi 6.7 is installed.

"Are you rebooting this node following some issue with disks/controller and/or power-outage?"

It was rebooted because vCenter was hanging. The VMs associated with vSAN weren't able to be powered on.

I'll do Alt+F12 shortly

mune3b · ‎04-17-2019

Can you point it towards the root cause of it? Why is it happening?

TheBobkin · ‎04-17-2019

Hello mune3b,

If this is a Production environment and/or you need assistance with this right away, I would advise opening a Support Request with the GSS vSAN team.

"Can you point it towards the root cause of it? Why is it happening?"

There are a multitude of potential causes of why a Disk-Group cannot complete PLOG recovery so I am not going to make assumptions nor conjecture relating to what is the cause without any substantial information/background.

Bob

mune3b · ‎04-18-2019

This is a Dev/Testing environment. Is there any way around it?

TheBobkin · ‎04-18-2019

Hello mune3b

Yes there are ways *around* it but the impact of this depends on the state of the data without the data from this node/whichever Disk-Group cannot complete recovery (if more than one in use and/or only one stalling).

If the data is all healthy without the data from this node then it would be possible to boot the host with vSAN modules disabled, destroy the Disk-Group(s) (wipe partitions on the vSAN devices), reboot the host normally then recreate the Disk-Group(s). I covered the procedure for this here:

Re: How to erase vSAN disk?

Disclaimer for anyone reading this potentially out of context: perform this at your own risk, your cluster and your data are your responsibility alone.

However, if the data is NOT healthy without this node's data then you should really consider other options, such as trying to figure out what the problem is - PLOG recovery can fail for a number of reasons, including inadequate LSOM memory (typical in nested hosts or physical hosts with a small amount of RAM) and SSD corruption (which can be common with nested/consumer-grade disks).

Bob

erjaki · ‎10-10-2020

Hi

One of the server's RAM modules may be faulty. Replace it with new RAM and try again.

TheBobkin · ‎10-10-2020

Hello erjaki,

Welcome to Communities.

Unaware of why you are necroing this thread - OP asked this question a year and a half ago and thus I highly doubt they are still struggling with this issue (they also have not logged on since Oct 2019).

Also, from dealing with such issues for a living I strongly doubt there is any possibility that the issue was due to a bad DIMM module.

Bob
