We had a host in a 3-node vSAN cluster (VC is 7.0.0d and all hosts are at 7.0b) that stopped responding while its VMs continued to run. After some troubleshooting, we got it back up, but I moved the VMs off it and rebooted it just for good measure. After the reboot, the host came back up, but now vSAN is having issues...
Specifically, vSAN Health is reporting that "EPD Status" under vSAN Daemon Liveness is "Abnormal". Everything else is reporting as "Healthy".
When SSHed into the host, I get the following:
Running /etc/init.d/epd status gives a response of:
epd is not running
CLOMD is running, however.
If I try starting epd using /etc/init.d/epd start, I get:
INIT: EPD uses a ramdisk for the db file
INIT: No persistent storage found to backup the DB into.
Thinking one of my disks might be full, I tried checking the disk space. Very oddly, running df -h (or, really, any variant of the df command) gives:
VmFileSystem: Slow refresh failed: Cannot open volume: /vmfs/volumes/5efa1a50-890ef4b7-dce3-001b21baacac
Error when running esxcli, return status was: 1
Error getting data for filesystem on '/vmfs/volumes/5efa1a50-890ef4b7-dce3-001b21baacac': Cannot open volume: /vmfs/volumes/5efa1a50-890ef4b7-dce3-001b21baacac, skipping.
I never actually get to see the free space of the physical disks.
Running vdf -h gives a very long output (attached), but I notice no scratch partition.
Any thoughts? Is epd as a separate service a new thing with vSphere 7? If so, that's likely why I cannot find a ton of troubleshooting information via Google.
EPD has been a vSAN service for quite a while but was only added to the vSAN-specific daemon health checks relatively recently.
EPD requires /scratch to be available to start - can you check the following location is available?
# cd /scratch/
If not, then the next step is to identify where /scratch is configured and/or why it is not available.
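For what it's worth, a rough sketch of that check is below. It only tests whether a path resolves to a directory that can actually be written to. The function name check_dir_usable and the probe file name are made up for illustration, and this is not a description of what EPD does internally.

```shell
# check_dir_usable DIR
# Hypothetical helper (not part of ESXi): prints OK / MISSING / NOT WRITABLE
# for DIR and returns 0 only when a file can be created inside it.
check_dir_usable() {
    dir="$1"
    # -d follows symlinks, so a dangling /scratch symlink shows up as MISSING.
    if [ ! -d "$dir" ]; then
        echo "MISSING: $dir"
        return 1
    fi
    probe="$dir/.scratch_probe.$$"
    # Try to create an empty file in a subshell; a read-only or dead volume
    # fails here without aborting the calling shell.
    if ( : > "$probe" ) 2>/dev/null; then
        rm -f "$probe"
        echo "OK: $dir"
        return 0
    fi
    echo "NOT WRITABLE: $dir"
    return 1
}
```

On the host this would be run as check_dir_usable /scratch. Running ls -ld /scratch shows where the symlink points, and esxcli system settings advanced list -o /ScratchConfig/CurrentScratchLocation should show the location the host is currently using.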
Not getting a return from df -h is likely more concerning - do you get a return from this when run from other hosts in the cluster?
Is this a lab or a production environment?
Do you see anything else significant alerted in the vSAN Health UI?
What is /vmfs/volumes/5efa1a50-890ef4b7-dce3-001b21baacac the path to? (This could be some other device causing the df command to time out.)
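To help answer that, the volume UUID can be pulled out of the df errors and then matched against the names under /vmfs/volumes. A small sketch, assuming the error lines look like the ones posted above; the function name volume_uuids_from_errors is only an illustration:

```shell
# volume_uuids_from_errors
# Hypothetical helper: reads error text on stdin and prints each distinct
# volume UUID that follows /vmfs/volumes/. Only the last UUID on a given
# line is reported, and lines without a UUID-shaped name are ignored.
volume_uuids_from_errors() {
    sed -n 's#.*/vmfs/volumes/\([0-9a-f]\{8\}-[0-9a-f]\{8\}-[0-9a-f]\{4\}-[0-9a-f]\{12\}\).*#\1#p' | sort -u
}
```

Usage would be along the lines of df -h 2>&1 | volume_uuids_from_errors, followed by ls -l /vmfs/volumes/ to see which datastore label (if any) is a symlink to that UUID.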
Thanks for updating us with the outcome.
Probably for the best to reinstall (though only after validating the install media was functional and not the cause of issues) as from what you posted above there were clearly issues beyond just /scratch being unavailable.
But yes, same in vSAN 7.0 as it was in previous versions - EPD requires /scratch to be available so it can write its DB; if it is not, EPD won't start.
Such issues can be confirmed from /var/log/epd.log.
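A quick way to skim that log is something like the sketch below. The keyword list is a guess at what is worth looking for (based on the init messages quoted earlier in this thread), not a documented set of EPD log messages, and recent_log_hits is a made-up name.

```shell
# recent_log_hits FILE [COUNT]
# Hypothetical helper: prints the last COUNT (default 20) lines of FILE that
# mention a failure or the storage EPD needs. The keywords are assumptions.
recent_log_hits() {
    file="$1"
    count="${2:-20}"
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 1
    fi
    # Case-insensitive match, then keep only the most recent hits.
    grep -iE 'error|fail|scratch|persistent storage|ramdisk' "$file" | tail -n "$count"
}
```

On the affected host this would be called as recent_log_hits /var/log/epd.log.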
Well, this is concerning... A little over a week later, we are starting to have EPD issues with one of the other nodes in that vSAN cluster. This kinda points me away from thinking it was a hardware issue and instead that it might be a vSphere 7 bug...
If you can, please open a Support Request with vSAN GSS for deeper analysis. If I had a euro for every time I initially thought something might be a bug that ended up being a configuration or hardware issue, I would be extremely wealthy by now.
If all of these servers are booting from SD cards that were purchased around the same time, then the cards could be burning out in a similar timeframe, more so if something that claims it doesn't write to the boot device (or doesn't write much) actually does. Anyway, I wouldn't really start speculating without some form of evidence (one way or the other).