vSAN EPD Status Abnormal

KildeDK · 01-22-2020

Heya all,

I just enabled vSAN, and it's currently failing hard.

EPD won't start on my ESXI-01 server (see picture below)

If I try to start the service manually via SSH, I get the error below:

Any tips on how to fix this? There is extremely little to be found on Google.

Thanks in advance

TheBobkin · 01-22-2020

Hello KildeDK,

This is the relevant KB article for this health check:

VMware Knowledge Base

It appears that you already found this, though, and restarting the service is likely failing for the same reason it did not start automatically in the first place (e.g. the /scratch partition, which is required for this service to start, is not available).

I would advise investigating whether /scratch is full (or out of inodes) and/or there is some other reason it cannot be written to.

Bob

Edit: I will look into getting further troubleshooting/investigation steps documented from my side for cases where restarting the service does not resolve the issue.

KildeDK · 01-22-2020

Hi Bob, thank you for the reply.

I'm still very new to vSphere in general. How would I check whether /scratch is full?

TheBobkin · 01-22-2020

Hello KildeDK,

Welcome to vSphere and vSAN.

Start by SSHing to the host and checking whether /scratch even exists, e.g.:

# cd /scratch

The prompt's directory path will then change to show where scratch points, e.g. if it points to /tmp it will look like:

[root@esx-01:/tmp/scratch]

If it is stored on a persistent or vfat partition, check whether any of these are out of space (e.g. Use% is 100%):

# df -h

If it is stored in a ramdisk (e.g. /tmp), check the available space on the ramdisks:

# vdf -h

So basically, first validate that /scratch exists (the corner case being hosts without it, such as diskless Auto-Deploy hosts, which are not supported for use with vSAN), then validate that it has free space and inodes. If it doesn't, figure out what is consuming the space (e.g. something logging to the same parent directory), free up space, and redirect whatever was filling it to a more appropriate location where it won't cause problems.
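As a rough illustration of those checks, here is a minimal shell sketch. The helper name `flag_full` and the 90% threshold are made up for this example, and `esxcli system visorfs ramdisk list` is an addition not mentioned in the steps, included as one way to see ramdisk inode usage. Treat it as a starting point under those assumptions rather than official tooling.

```shell
#!/bin/sh
# Sketch only: flag_full and the 90% threshold are illustrative, not VMware tooling.

# Print every line of df/vdf-style input whose Use% is at or above the threshold.
flag_full() {
    threshold="${1:-90}"
    awk -v t="$threshold" '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^[0-9]+%$/) {        # first field that looks like "74%"
                pct = $i
                sub(/%/, "", pct)
                if (pct + 0 >= t + 0) print
                break
            }
        }
    }'
}

# 1. Does /scratch exist, and where does it really point?
if [ -d /scratch ]; then
    echo "/scratch resolves to: $(cd /scratch && pwd -P)"
else
    echo "/scratch is missing or not a directory"
fi

# 2. Persistent/vfat partitions that are (nearly) full.
df -h | flag_full 90

# 3. Ramdisks (vdf exists on ESXi only, so guard it).
if command -v vdf >/dev/null 2>&1; then
    vdf -h | flag_full 90
fi

# 4. Ramdisk inode usage (assumption: esxcli is available, as on a normal ESXi host).
if command -v esxcli >/dev/null 2>&1; then
    esxcli system visorfs ramdisk list
fi
```

Any line the script prints in steps 2 or 3 is a filesystem or ramdisk at or above the threshold and worth a closer look; no output from those steps means nothing crossed it.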

Most of the above and further related troubleshooting steps are covered here:

VMware Knowledge Base

Bob

Cioby · 01-09-2023

One thing I noticed is that even though the scratch partition was not full (location was in tmp\...), there was not enough space in there because ESXi was installed on an SD card. I needed to provide a partition that was big enough and change the scratch location. After a restart everything worked fine. I hope this helps.

ManivelR · 01-10-2023

Hi All,

Yes, I also faced the same kind of issue.

I sorted it out the same way for vSAN, from the command line:

vim-cmd hostsvc/advopt/update ScratchConfig.ConfiguredScratchLocation string /vmfs/volumes/f9d66946-bdd7efac/scratchlogs/ESXi-01

f9d66946-bdd7efac is the ID of my external NFS datastore, where I created a folder called "scratchlogs/ESXi-01".

After configuring this and rebooting ESXi, the error message was gone from Skyline Health (vSAN Monitor).
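For anyone repeating this, the sequence can be sketched roughly as below. The helper `scratch_commands` is hypothetical and only prints the commands for review instead of running them. The `mkdir` step and the `vim-cmd hostsvc/advopt/view` check are assumptions added around the single `update` command shown above, and the path must be replaced with a datastore folder that exists in your environment.

```shell
#!/bin/sh
# Sketch only: scratch_commands is a hypothetical helper that PRINTS the steps.
# Nothing is changed on the host until you run the printed commands yourself.
scratch_commands() {
    target="$1"    # per-host folder on a persistent datastore (no spaces assumed)
    # Create the per-host folder first (assumed prerequisite).
    printf 'mkdir -p %s\n' "$target"
    # Point the configured scratch location at it (command from the post above).
    printf 'vim-cmd hostsvc/advopt/update ScratchConfig.ConfiguredScratchLocation string %s\n' "$target"
    # Read the setting back to confirm it was stored.
    printf 'vim-cmd hostsvc/advopt/view ScratchConfig.ConfiguredScratchLocation\n'
}

# Example using the path from the post above; a reboot is needed afterwards.
scratch_commands /vmfs/volumes/f9d66946-bdd7efac/scratchlogs/ESXi-01
```

Giving each host its own folder (here ESXi-01) keeps hosts that share a datastore from writing into the same scratch directory.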

Thank you,

Manivel RR

Cioby · 02-25-2023

Hello,

Can you please provide a little more depth on the consequences of the EPD service not running properly (short and long term)? From my understanding, it is like a garbage collector for objects in vSAN. Does this mean that we might encounter problems like objects leaking or not being disposed of properly?

Thank you!

TheBobkin · 02-25-2023

@Cioby, if this service is not running then it is imperative that the root cause be determined.

Short term (e.g. hours to a few days), it not running is generally not going to cause issues, but this assumes the environment doesn't have a huge amount of data churn (e.g. hundreds to thousands of objects/VMs being created and deleted on a daily basis).

Long term (weeks to months), it can result in serious issues, as CMMDS is not designed to manage millions of references. If enough DISCARDED_COMPONENT entries are amassed (e.g. high hundreds of thousands to millions), this can cause issues such as constant cluster membership flapping and all data being inaccessible until resolved.

Which build version are the ESXi hosts you are asking about running? I ask because there are more recent issues that can result in EPD problems, which should be assessed and resolved with the assistance of VMware GS if occurring (https://kb.vmware.com/s/article/88815).

Note as well that the Skyline Health check for EPD and the other daemons can throw false positives if there is an issue with vsanmgmtd and/or a general ESXi issue on the node (e.g. boot device is hosed).