VMware Cloud Community
lspin
Enthusiast

Stale vSAN disk group Operation Health error

We recently had a few vSAN cache disks fail in 3 of our ESXi hosts (at separate times).

To replace them, we placed each host in maintenance mode with "Full data migration", removed the capacity disks from the disk group, then removed the disk group itself. Once we had replaced the failed cache SSD, we recreated the disk group with the remaining 7 capacity SSDs.

However, we are now seeing the vSAN "Operation Health" alarm reporting the following for each host:

Host = (host IP address), one line for each ESXi host where we replaced the cache SSD and recreated the disk group.

Disk = "Unknown"

Overall health = "Red !"

Metadata health = "Red !"

Operational health = "Red !"

In CMMDS/VSI = "No/No"

Operational State Description = "Unknown disk health state"

UUID = UUIDs of the old disk groups we removed when replacing the cache SSD disks.

 

Attached is the rvc output from when we tried to identify the UUID locations. None of our VMs utilizing the vSAN cluster show as inaccessible or orphaned. Any idea how we can get rid of what seem to be phantom vSAN disk groups?

 

6 Replies
TheBobkin
Champion

@lspin, Just to confirm - when you run either of the following on the ESXi hosts, do you see any disks showing with just a UUID (no naa/t10/mpx/eui identifier)?:

# vdq -Hi

# esxcli vsan storage list

 

If you do, then remove them with:

# esxcli vsan storage remove -u UUIDHere
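
As a rough sketch, and assuming any stale entries report their device as "Unknown" (adjust the grep pattern if the output on your build differs), something like this run on the host should print just the orphaned UUIDs, which you can then feed back into the remove command above one at a time:

# esxcli vsan storage list | grep -A 3 "Device: Unknown" | awk '/VSAN UUID:/ {print $3}'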

 

If you don't, then it is perhaps a cached reference in vsan-health on vCenter, or less likely in vsanmgmtd on the nodes.

 

Both of these can be easily restarted:

On vCenter:

# service-control --stop vmware-vsan-health

# service-control --start vmware-vsan-health

On ESXi:

# /etc/init.d/vsanmgmtd restart
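
If you want to sanity-check that both came back up cleanly afterwards (assuming the vsanmgmtd init script supports a status action like the other ESXi daemon scripts), something like:

On vCenter:

# service-control --status vmware-vsan-health

On ESXi:

# /etc/init.d/vsanmgmtd status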

lspin
Enthusiast

@TheBobkin I see. 

The # vdq -Hi command output listed all of the disks without the stale disk group UUID, and I was able to match up the listed naa. disks with their respective disk group UUIDs.

The # esxcli vsan storage list command does list one unknown UUID entry as follows (the output is the same on all affected hosts, just with different UUIDs):

Unknown
Device: Unknown
Display Name: Unknown
Is SSD: false
VSAN UUID: 527ce5f5-5716-1a0b-d15b-3abbba6cc027
VSAN Disk Group UUID:
VSAN Disk Group Name:
Used by this host: false
In CMMDS: false
On-disk format version: -1
Deduplication: false
Compression: false
Checksum:
Checksum OK: false
Is Capacity Tier: false
Encryption Metadata Checksum OK: true
Encryption: false
DiskKeyLoaded: false
Is Mounted: false
Creation Time: Unknown

 

If I understand correctly, running # esxcli vsan storage remove -u UUIDHere should work to remove the UUID from the host?

I will have to run this command against the UUID displayed as unknown in the vSAN Operation Health for each identified host, correct?

Once I remove the UUID on the hosts, do I need to do both suggested restarts or just the ESXi # /etc/init.d/vsanmgmtd restart on the host?

lspin
Enthusiast

Hi @TheBobkin , 

I tried the # esxcli vsan storage remove -u UUIDHere command on all 3 hosts to remove the offending UUIDs. I then restarted the vmware-vsan-health service on vCenter and ran # /etc/init.d/vsanmgmtd restart on the ESXi hosts as well.

However, the alarm is still present in vSAN Skyline Health.

The Unknown UUIDs are also still displayed when I run the esxcli vsan storage list command on the hosts, and they still have no naa. disks associated with them.

TheBobkin
Champion
Accepted Solution

@lspin, Odd one, but I've seen it a few times.

Were the disks hot-swapped/hot-removed, or replaced with the server powered off?

If it wasn't done with the server powered off, then there is potentially still some process trying to do something with the disk that is no longer there. If you can test with one node, put it in Maintenance Mode and cold reboot it (e.g. from iLO/iDRAC, not from the vSphere client).

lspin
Enthusiast

The disks were hot-swapped with the host in "Full data migration" maintenance mode. I will run the test on one of the hosts later today and let you know how it goes; I will do a full data migration first.

lspin
Enthusiast

@TheBobkin The cold reboot worked. Thanks for your help!
