psilan
Contributor

vSAN lab hosed after an unexpected power outage. EPD not starting, USB stick (I know) absent

Hi all 🙂

 

Hoping somebody can help me with my lab issue. A server hung after a VM had been under high load for a day; I rebooted both hosts (not my witness host, which is on another NUC) and I am left with a broken vSAN.

 

This lab was built to prepare for VxRail at work about a year ago and has been somewhat neglected since, apart from security updates.

 

2-node vSAN running on NUC11i7 hosts, ESXi 7.0 U3 (maybe U3e).

NVMe flash for cache, SATA for capacity. 32GB SanDisk Ultra USB stick for boot. Scratch pointed to the vSAN datastore.

After I booted up, all of my VMs were invalid, including my vCenter, which lives on the vSAN.

Cluster health showed some issues:

 

Health Test Name                                    Status
--------------------------------------------------  ------
Overall health                                      red (Network misconfiguration)
Network                                             red
  Hosts with connectivity issues                    red
Cluster                                             red
  vSAN daemon liveness                              red
Physical disk                                       yellow
  Operation health                                  yellow
Performance service                                 yellow
  Performance service status                        yellow

 

I believe networking is OK. I checked everything; it's all tagged correctly and vmkpings are successful.
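
For reference, this is roughly how I checked (vmk1 is just an example; esxcli vsan network list confirms which vmkernel port is actually tagged for vSAN, and -s 1472 -d assumes a standard 1500 MTU):

# esxcli vsan network list
# vmkping -I vmk1 -s 1472 -d <other-node-vSAN-IP>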

Daemon liveness pointed to the EPD service:

 

[root@nuc101:/var/log]  /etc/init.d/epd status
epd is not running
[root@nuc101:/var/log] /etc/init.d/epd start
INIT: EPD uses a ramdisk for the db file
INIT: No persistent storage found to backup the DB into.
[root@nuc101:/var/log] df -h

Filesystem    Size   Used Available Use% Mounted on
NFS41        70.4T  53.6T     16.8T  76% /vmfs/volumes/datastore200_NAS (1)
vfat       1023.8M 211.9M    811.9M  21% /vmfs/volumes/BOOTBANK1
vfat       1023.8M 211.9M    811.9M  21% /vmfs/volumes/BOOTBANK2
vsan          0.0B   0.0B      0.0B   0% /vmfs/volumes/vsanDatastore
[root@nuc101:/var/log]

 

The above is also missing the VMFS-L (OSData) volume, which I think is due to a corrupted USB stick. For comparison, from the other (working) host:

 

VMFS-L 26.5G 1.6G 24.9G 6% /vmfs/volumes/LOCKER-6********

 


I can open /scratch, but it is missing the epd-store*.* files.

 

[root@nuc101:~] cd scratch/
[root@nuc101:/tmp/_osdatao370q0xf] ls
cache      core       downloads  locker     log        store      tmp        var        vdtc       vmware
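
For completeness, the mounted-volume view tells the same story on the broken host, with no VMFS-L/OSData volume listed:

# esxcli storage filesystem list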

 

Cluster UUIDs and memberships all look good.

My questions:

  1. Can I recover any of these VMs? I don't care if the vSAN is destroyed.
  2. Why, if one node is fine, is the vSAN still broken?

I have backups for most VMs, but not the one I want most: because it has GPU passthrough, Veeam couldn't snapshot it.

If I back up the ESXi configuration from my working host and restore it to the broken host, is there any point in that? Since the vSAN is still down, I'm guessing this wouldn't fix it.

 

Thanks all!

TheBobkin
Champion

@psilan 

"Scratch pointed to vSan datastore." - you shouldn't do that, thoroughly unsupported, that is the likely reason EPD isn't working as it backs DB into /scratch which is not available - you can repoint /scratch to somewhere available (can even create a RAMdisk for it etc.) and restart EPD but this very likely isn't the cause of your availability issues here.

 

"vsan 0.0B 0.0B 0.0B 0% /vmfs/volumes/vsanDatastore"
Are any/all of the nodes in Maintenance Mode? (And by that I mean vSAN decom state, not necessarily just ESXi MM.)

Checkable via:
(run on each node)
# esxcli vsan cluster get
or
(run on any node if cluster is formed)
# cmmds-tool find -t NODE_DECOM_STATE -f json

What is the returned output of the following?
# esxcli vsan health cluster get -t "Network"
# esxcli vsan health cluster get -t "Hosts with connectivity issues"
# esxcli vsan health cluster get -t "Operation health"

psilan
Contributor

Hi TheBobkin, thanks for the response.

 

Yeah, that's why I underlined scratch. Bit of a mistake! 🙂

Yes, I agree. I don't think EPD is the fault: the 'working' node doesn't have this issue and should still be able to access the vsanDatastore, yet it can't.

 

No nodes were in maintenance mode; I entered and exited it to check.

esxcli vsan cluster get output looked completely normal on both nodes.

DECOM_STATE was all healthy and looked OK on both nodes.

Unfortunately, I tested a cluster leave and am now unable to join the node back into the vSAN cluster UUID. Before I removed the node to attempt the rejoin, I did run the operation health command, and everything looked perfect.
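
For reference, this is roughly what I attempted, using the Sub-Cluster UUID reported by the healthy node:

(on the healthy node, note the Sub-Cluster UUID)
# esxcli vsan cluster get
(on the removed node)
# esxcli vsan cluster join -u <Sub-Cluster-UUID>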

 

Last night, before I left the cluster, both nodes (even the one working perfectly) showed a few errors in the ESXi GUI under datastore > Monitor. Because I had dedupe and compression enabled, I think this makes it unrecoverable...

xxxxxx is under propagated permanent error.
xxxxxx is under permanent failure.
xxxxxx has gone offline.
xxxxxx is being repaired due to I/O failures, and will be out of service until the repair is complete. If the device is part of a dedup disk group, the entire disk group will be out of service until the repair is complete.

Strange, since the disks on both hosts are perfectly fine, but those errors might have been the beginning of the end for my datastore.

TheBobkin
Champion

@psilan, with such an issue (assuming the cluster is fully formed and no nodes are in decom state), I would start by looking at the state of the components and where they live, e.g. were there absent/stale components on node1 while node2 had a disk failure, resulting in those components being degraded? This information can be generated from CMMDS in many ways, but the easiest approach is to run the following to dump the data to a file, then use less/cat/grep to look at some objects (feel free to share some example object output):

# esxcli vsan debug object list --all > /tmp/objout123
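
For a quick first pass over that file, something like this gives a count of object health states (the exact field wording can vary between builds, so treat the grep pattern as a starting point):

# grep "Health:" /tmp/objout123 | sort | uniq -c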

psilan
Contributor

Nice command, I'll give that a try and investigate further 🙂 Thank you for your help.

jamiegravatte
VMware Employee

Hi @psilan, did this answer your question? If so, please check the "verified solution" button in order to better help your peers find this information!

Thanks!

Jamie - Digital Support Team
