Hi all,
I'm having a problem with a single-node vSAN host (homelab setup). The host had a power outage and now boots all the way through until it PSODs at the very end of the boot process with:
Failed at bora/modules/vmkernel/virsto/map/vs_map_cache.c:324 -- VMK_ASSERT(((mbh->state == MB_INVALID_DATA
Is it possible to recover the data on the vSAN partition, or is it all lost? Thanks in advance!
The PSOD happens right after the Log Recovery step...
Some additional info: I was able to boot the host with vSAN disabled by adding "jumpstart.disable=vsan,lsom,plog,virsto,cmmds" to the boot parameters.
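For anyone else who needs it, this is how I applied the option (Shift+O is the standard ESXi boot-loader key for editing boot options; the setting only applies to that one boot):

```shell
# Press Shift+O at the ESXi loader screen, then append this to the
# existing boot options line and press Enter to boot:
jumpstart.disable=vsan,lsom,plog,virsto,cmmds
```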
The disks and the vSAN cluster configuration appear to be intact.
Would it help to delete the partitions on the NVMe cache disk and re-add it to the vSAN cluster? Or are there other things worth trying before deleting the cache partitions?
[root@esx03:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2019-07-07T17:59:14Z
Local Node UUID:
Local Node Type: NORMAL
Local Node State: DISCOVERY
Local Node Health State: HEALTHY
Sub-Cluster Master UUID:
Sub-Cluster Backup UUID:
Sub-Cluster UUID: 52949dd0-8cb7-f3c4-3f5d-54461b2d65d3
Sub-Cluster Membership Entry Revision: 0
Sub-Cluster Member Count: 0
Sub-Cluster Member UUIDs:
Sub-Cluster Member HostNames:
Sub-Cluster Membership UUID:
Unicast Mode Enabled: false
Maintenance Mode State: OFF
Config Generation: 7d82f990-57d8-4bf6-965e-1df18c8d1ac5 12 2019-07-06T03:47:14.88
[root@esx03:~] esxcli vsan storage list
naa.6b8ca3a0ed773d0023abd26814ad2eeb
Device: naa.6b8ca3a0ed773d0023abd26814ad2eeb
Display Name: naa.6b8ca3a0ed773d0023abd26814ad2eeb
Is SSD: true
VSAN UUID: 5268e51a-a5ba-2a6c-5887-952e845fc964
VSAN Disk Group UUID: 52b36e14-771e-6711-c074-88ab86b3ac9a
VSAN Disk Group Name: t10.NVMe____Samsung_SSD_970_PRO_512GB_______________2E37B28159382500
Used by this host: false
In CMMDS: false
On-disk format version: 7
Deduplication: true
Compression: true
Checksum: 16614367127871676916
Checksum OK: true
Is Capacity Tier: true
Encryption Metadata Checksum OK: true
Encryption: false
DiskKeyLoaded: false
Is Mounted: true
Creation Time: Fri Dec 21 17:22:58 2018
naa.6b8ca3a0ed773d0023abd278159d8547
Device: naa.6b8ca3a0ed773d0023abd278159d8547
Display Name: naa.6b8ca3a0ed773d0023abd278159d8547
Is SSD: true
VSAN UUID: 5271c531-3fe7-dca2-097e-9cb6abb82cd3
VSAN Disk Group UUID: 52b36e14-771e-6711-c074-88ab86b3ac9a
VSAN Disk Group Name: t10.NVMe____Samsung_SSD_970_PRO_512GB_______________2E37B28159382500
Used by this host: false
In CMMDS: false
On-disk format version: 7
Deduplication: true
Compression: true
Checksum: 3856994230136079572
Checksum OK: true
Is Capacity Tier: true
Encryption Metadata Checksum OK: true
Encryption: false
DiskKeyLoaded: false
Is Mounted: true
Creation Time: Fri Dec 21 17:22:58 2018
t10.NVMe____Samsung_SSD_970_PRO_512GB_______________2E37B28159382500
Device: t10.NVMe____Samsung_SSD_970_PRO_512GB_______________2E37B28159382500
Display Name: t10.NVMe____Samsung_SSD_970_PRO_512GB_______________2E37B28159382500
Is SSD: true
VSAN UUID: 52b36e14-771e-6711-c074-88ab86b3ac9a
VSAN Disk Group UUID: 52b36e14-771e-6711-c074-88ab86b3ac9a
VSAN Disk Group Name: t10.NVMe____Samsung_SSD_970_PRO_512GB_______________2E37B28159382500
Used by this host: false
In CMMDS: false
On-disk format version: 7
Deduplication: true
Compression: true
Checksum: 12214449757955944003
Checksum OK: true
Is Capacity Tier: false
Encryption Metadata Checksum OK: true
Encryption: false
DiskKeyLoaded: false
Is Mounted: true
Creation Time: Fri Dec 21 17:22:58 2018
[root@esx03:~] partedUtil getptbl /dev/disks/t10.NVMe____Samsung_SSD_970_PRO_512GB_______________2E37B28159382500
gpt
62260 255 63 1000215216
1 2048 6143 381CFCCC728811E092EE000C2911D0B2 vsan 0
2 6144 1000215182 77719A0CA4A011E3A47E000C29745A24 virsto 0
[root@esx03:~] partedUtil getptbl /dev/disks/naa.6b8ca3a0ed773d0023abd278159d8547
gpt
121534 255 63 1952448512
1 2048 6143 381CFCCC728811E092EE000C2911D0B2 vsan 0
2 6144 1952448478 77719A0CA4A011E3A47E000C29745A24 virsto 0
[root@esx03:~] partedUtil getptbl /dev/disks/naa.6b8ca3a0ed773d0023abd26814ad2eeb
gpt
121534 255 63 1952448512
1 2048 6143 381CFCCC728811E092EE000C2911D0B2 vsan 0
2 6144 1952448478 77719A0CA4A011E3A47E000C29745A24 virsto 0
Were you able to resolve this? I am experiencing the same issue.
Thanks
@Brad2911 You mentioned that you are "experiencing the same issue" as OP - do you mean just in the generic sense that you have a recursive PSOD on ESXi boot, or that you see the exact same backtrace?
The backtrace OP shared indicates corruption on the Cache-tier of a Disk-Group - this can happen for a variety of reasons, both logical and physical. The short-term solution is what OP likely did (assuming the data is accessible without the impacted Disk-Group): reboot the node with the vSAN modules disabled, remove the partitions from the impacted Cache-tier device, reboot in normal mode, remove the remainders of the Disk-Group, and recreate it. If the node has multiple Disk-Groups, the impacted one can be identified by monitoring the logs before the PSOD (Alt+F12) and/or by checking whether the device naa/UUID is listed in the backtrace.
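For anyone searching later, a rough sketch of those steps as commands. The device name and UUIDs below are taken from OP's output, so substitute your own, and double-check everything before deleting partitions - this destroys the Disk-Group and any data on it that is not replicated elsewhere:

```shell
# Boot with the vSAN stack disabled first: press Shift+O at the ESXi
# boot screen and append:  jumpstart.disable=vsan,lsom,plog,virsto,cmmds

# Identify the Cache-tier device of the impacted Disk-Group
# (it shows "Is Capacity Tier: false" in the output):
esxcli vsan storage list

# Delete the two vSAN partitions on the Cache-tier device
# (device name from OP's output - substitute your own):
DEV=/dev/disks/t10.NVMe____Samsung_SSD_970_PRO_512GB_______________2E37B28159382500
partedUtil getptbl $DEV
partedUtil delete $DEV 2
partedUtil delete $DEV 1

# Reboot normally, remove what is left of the Disk-Group by its vSAN
# UUID, then recreate it (-s cache device, one -d per capacity device):
esxcli vsan storage remove -u 52b36e14-771e-6711-c074-88ab86b3ac9a
esxcli vsan storage add -s t10.NVMe____Samsung_SSD_970_PRO_512GB_______________2E37B28159382500 \
    -d naa.6b8ca3a0ed773d0023abd278159d8547 \
    -d naa.6b8ca3a0ed773d0023abd26814ad2eeb
```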
The long-term solution for such an issue is much more varied - if it re-occurs on the same devices then there is likely a physical cause, e.g. a bad Cache-tier device or a misconfiguration (such as a random/unsupported controller, or a driver/firmware that does not faithfully pass data unaltered to the disks), though it can also be a logical issue that is fixed in code in a later build of ESXi/vSAN.
Hi Bob,
Thanks for the response.
I took a photo of the error - below. Sorry for the photo - the machine I could bring up the remote console on was one of the machines I lost...
This is a home lab, and I had a good backup of the data, so it wasn't a huge deal, but I did have 2 machines get inaccessible objects, and was poking around to see if they would recover / be recoverable. Some posts I found on possibly recovering the data looked like way more work than restoring the data, so I opted to do as you noted, remove the partitions from the impacted Cache-tier device, and recreate the Disk-Group. I found the log drive on the vCenter was full when I was investigating solutions, and I cleared that up too. Who knew - this thing doesn't get much attention because normally it just works.
This cluster is ESXi 6.5 U3, vSAN 6.5 - 4 x Dell R710 hosts, each with 2 disk groups: 1 all-flash, 1 hybrid. Both groups' Cache-tier is an NVMe drive on a PCIe expansion card. When I first put the NVMe drives in as Cache-tier devices they would overheat and just disappear. I never lost any machines/data when that happened. I added heat sinks to the NVMe drives and have been stable since.
The 2 machines that were lost were hammering the vSAN pretty hard with lots of file-copy jobs. The Hybrid Disk-Group on one of the hosts had the issue. Not sure whether that had something to do with the stress I put on it or not. Once I get back to it, I will run that same job on those 2 machines again. I will let you know if I run into another issue, but no news is good news. 🙂
Next upgrade is a 10GbE backbone for the cluster...
Thanks,
Brad
I forgot to add that I replaced all of the fans in the hosts with Noctua fans as the hosts sit right next to my desk and the fan noise before the switch was intolerable. Now they are whisper quiet.