vmsysadmin20111
Enthusiast

ESXi 6.7 PSOD in VSAN module - invalid cache data?

Hi all,

I'm having a problem with a single-node vSAN host (homelab setup). The host had a power outage and now boots all the way up, then PSODs at the very end of the boot process with:

Failed at bora/modules/vmkernel/virsto/map/vs_map_cache.c:324 -- VMK_ASSERT(((mbh->state == MB_INVALID_DATA

Is it possible to recover the data on the vSAN partition, or is all the data lost? Thanks in advance!

pastedImage_0.png

PSOD right after the Log Recovery...

pastedImage_0.png

5 Replies
vmsysadmin20111
Enthusiast

Some additional info: I was able to boot the host with vSAN disabled by adding "jumpstart.disable=vsan,lsom,plog,virsto,cmmds" to the boot parameters.

The disks and the vsan cluster appear to be intact.

Would it help to delete the partitions on the NVMe cache disk and re-add it to the vSAN cluster? Or are there other things to try before deleting the cache partition?

[root@esx03:~] esxcli vsan cluster get

Cluster Information

   Enabled: true

   Current Local Time: 2019-07-07T17:59:14Z

   Local Node UUID:

   Local Node Type: NORMAL

   Local Node State: DISCOVERY

   Local Node Health State: HEALTHY

   Sub-Cluster Master UUID:

   Sub-Cluster Backup UUID:

   Sub-Cluster UUID: 52949dd0-8cb7-f3c4-3f5d-54461b2d65d3

   Sub-Cluster Membership Entry Revision: 0

   Sub-Cluster Member Count: 0

   Sub-Cluster Member UUIDs:

   Sub-Cluster Member HostNames:

   Sub-Cluster Membership UUID:

   Unicast Mode Enabled: false

   Maintenance Mode State: OFF

   Config Generation: 7d82f990-57d8-4bf6-965e-1df18c8d1ac5 12 2019-07-06T03:47:14.88

[root@esx03:~] esxcli vsan storage list

naa.6b8ca3a0ed773d0023abd26814ad2eeb

   Device: naa.6b8ca3a0ed773d0023abd26814ad2eeb

   Display Name: naa.6b8ca3a0ed773d0023abd26814ad2eeb

   Is SSD: true

   VSAN UUID: 5268e51a-a5ba-2a6c-5887-952e845fc964

   VSAN Disk Group UUID: 52b36e14-771e-6711-c074-88ab86b3ac9a

   VSAN Disk Group Name: t10.NVMe____Samsung_SSD_970_PRO_512GB_______________2E37B28159382500

   Used by this host: false

   In CMMDS: false

   On-disk format version: 7

   Deduplication: true

   Compression: true

   Checksum: 16614367127871676916

   Checksum OK: true

   Is Capacity Tier: true

   Encryption Metadata Checksum OK: true

   Encryption: false

   DiskKeyLoaded: false

   Is Mounted: true

   Creation Time: Fri Dec 21 17:22:58 2018

naa.6b8ca3a0ed773d0023abd278159d8547

   Device: naa.6b8ca3a0ed773d0023abd278159d8547

   Display Name: naa.6b8ca3a0ed773d0023abd278159d8547

   Is SSD: true

   VSAN UUID: 5271c531-3fe7-dca2-097e-9cb6abb82cd3

   VSAN Disk Group UUID: 52b36e14-771e-6711-c074-88ab86b3ac9a

   VSAN Disk Group Name: t10.NVMe____Samsung_SSD_970_PRO_512GB_______________2E37B28159382500

   Used by this host: false

   In CMMDS: false

   On-disk format version: 7

   Deduplication: true

   Compression: true

   Checksum: 3856994230136079572

   Checksum OK: true

   Is Capacity Tier: true

   Encryption Metadata Checksum OK: true

   Encryption: false

   DiskKeyLoaded: false

   Is Mounted: true

   Creation Time: Fri Dec 21 17:22:58 2018

t10.NVMe____Samsung_SSD_970_PRO_512GB_______________2E37B28159382500

   Device: t10.NVMe____Samsung_SSD_970_PRO_512GB_______________2E37B28159382500

   Display Name: t10.NVMe____Samsung_SSD_970_PRO_512GB_______________2E37B28159382500

   Is SSD: true

   VSAN UUID: 52b36e14-771e-6711-c074-88ab86b3ac9a

   VSAN Disk Group UUID: 52b36e14-771e-6711-c074-88ab86b3ac9a

   VSAN Disk Group Name: t10.NVMe____Samsung_SSD_970_PRO_512GB_______________2E37B28159382500

   Used by this host: false

   In CMMDS: false

   On-disk format version: 7

   Deduplication: true

   Compression: true

   Checksum: 12214449757955944003

   Checksum OK: true

   Is Capacity Tier: false

   Encryption Metadata Checksum OK: true

   Encryption: false

   DiskKeyLoaded: false

   Is Mounted: true

   Creation Time: Fri Dec 21 17:22:58 2018
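A long listing like the one above is easier to read once it is reduced to one line per device. The following is only a minimal sketch: `vsan_tiers` is a made-up helper name, and it assumes the field labels `Device:` and `Is Capacity Tier:` appear exactly as in the output above. It reads the listing on stdin and labels each device as cache or capacity.

```shell
#!/bin/sh
# Hypothetical helper: summarize `esxcli vsan storage list` output read from stdin.
# Prints one line per device: "<tier><TAB><device>".
# Assumption: a device whose "Is Capacity Tier" is not "true" is the cache device.
vsan_tiers() {
    awk -F': ' '
        /^[ \t]+Device:/           { dev = $2 }
        /^[ \t]+Is Capacity Tier:/ {
            tier = ($2 ~ /true/) ? "capacity" : "cache"
            print tier "\t" dev
        }
    '
}

# Example on the host (not run here):
#   esxcli vsan storage list | vsan_tiers
```

For the listing above this would label the two naa.* devices as capacity and the Samsung NVMe device as cache.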

[root@esx03:~] partedUtil getptbl /dev/disks/t10.NVMe____Samsung_SSD_970_PRO_512GB_______________2E37B28159382500

gpt

62260 255 63 1000215216

1 2048 6143 381CFCCC728811E092EE000C2911D0B2 vsan 0

2 6144 1000215182 77719A0CA4A011E3A47E000C29745A24 virsto 0

[root@esx03:~] partedUtil getptbl /dev/disks/naa.6b8ca3a0ed773d0023abd278159d8547

gpt

121534 255 63 1952448512

1 2048 6143 381CFCCC728811E092EE000C2911D0B2 vsan 0

2 6144 1952448478 77719A0CA4A011E3A47E000C29745A24 virsto 0

[root@esx03:~] partedUtil getptbl /dev/disks/naa.6b8ca3a0ed773d0023abd26814ad2eeb

gpt

121534 255 63 1952448512

1 2048 6143 381CFCCC728811E092EE000C2911D0B2 vsan 0

2 6144 1952448478 77719A0CA4A011E3A47E000C29745A24 virsto 0
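To sanity-check the partition tables above, the sector numbers can be converted to sizes. This is a rough sketch, not an official tool: `ptbl_sizes` is a made-up name, and it assumes 512-byte sectors and the `partedUtil getptbl` layout shown above (a geometry line with four numbers, then one six-field line per partition with start and end sectors in fields 2 and 3).

```shell
#!/bin/sh
# Hypothetical helper: convert `partedUtil getptbl` output (stdin) into sizes.
# Assumption: sectors are 512 bytes.
ptbl_sizes() {
    awk '
        # Geometry line: cylinders heads sectors-per-track total-sectors
        NF == 4 && $1 ~ /^[0-9]+$/ {
            printf "disk\t%.1f GB\n", $4 * 512 / 1e9
        }
        # Partition line: number start end type-GUID type attribute
        NF == 6 && $1 ~ /^[0-9]+$/ {
            printf "part %s (%s)\t%.1f MiB\n", $1, $5, ($3 - $2 + 1) * 512 / 1048576
        }
    '
}

# Example on the host (not run here):
#   partedUtil getptbl /dev/disks/<device> | ptbl_sizes
```

On all three disks the small "vsan" partition spans sectors 2048 to 6143, which is 4096 sectors or 2 MiB; the "virsto" partition takes up the rest of each device.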

Brad2911
Contributor

Were you able to resolve this?  I am experiencing the same issue.

Thanks

TheBobkin
VMware Employee

@Brad2911 You mentioned that you are "experiencing the same issue" as OP. Do you mean that in the generic sense (a recurring PSOD on ESXi boot), or do you have the exact same backtrace displayed?

The backtrace OP shared indicates corruption on the Cache-tier of a Disk-Group. This can happen for a variety of reasons, both logical and physical. The short-term solution (assuming the data is accessible without the impacted Disk-Group) is what OP likely did: reboot the node with the vSAN modules disabled, remove the partitions from the impacted Cache-tier device, reboot in normal mode, then remove what remains of the Disk-Group and recreate it. If the node has multiple Disk-Groups, the impacted one can be identified by monitoring the logs before the PSOD (Alt+F12) and/or by checking whether the device naa/UUID is listed in the backtrace.
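As a rough illustration of the short-term steps, here is a dry-run sketch that only prints candidate commands and executes nothing. `print_dg_rebuild_plan` is a made-up name. Partition numbers 1 and 2 are an assumption taken from the `partedUtil getptbl` output earlier in the thread, and the `esxcli vsan storage remove`/`add` invocations should be checked against `esxcli vsan storage --help` on the actual build before anything is run. These steps destroy the data on the Disk-Group.

```shell
#!/bin/sh
# Dry run only: prints commands for review, runs nothing.
# Usage: print_dg_rebuild_plan CACHE_DEVICE CAPACITY_DEVICE...
# Assumption: the cache device has partitions 1 (vsan) and 2 (virsto).
print_dg_rebuild_plan() {
    cache=$1
    shift
    echo "# 1. Booted with the vSAN modules disabled: inspect, then wipe the cache device"
    echo "partedUtil getptbl /dev/disks/$cache"
    echo "partedUtil delete /dev/disks/$cache 1"
    echo "partedUtil delete /dev/disks/$cache 2"
    echo "# 2. After a normal reboot: remove what is left of the Disk-Group"
    add_args="-s $cache"
    for disk in "$@"; do
        echo "esxcli vsan storage remove -d $disk"
        add_args="$add_args -d $disk"
    done
    echo "# 3. Recreate the Disk-Group"
    echo "esxcli vsan storage add $add_args"
}

# Example (prints the plan, changes nothing):
#   print_dg_rebuild_plan <cache-device> <capacity-device-1> <capacity-device-2>
```

Printing rather than executing is deliberate: each line can be compared against the device names from `esxcli vsan storage list` before it is pasted into the shell.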

The long-term solution for such an issue is much more varied. If it recurs on the same devices, there is likely a physical cause, e.g. a bad Cache-tier device or a misconfiguration (such as an unsupported controller, driver or firmware that does not pass data to the disks unaltered). It can also be a logical issue that is fixed in a later build of ESXi/vSAN.

Brad2911
Contributor

Hi Bob,

Thanks for the response.

I took a photo of the error - below.  Sorry for the photo - the machine I could bring up the remote console on was one of the machines I lost...

This is a home lab, and I had a good backup of the data, so it wasn't a huge deal, but I did have 2 machines get inaccessible objects, and was poking around to see if they would recover / be recoverable.  Some posts I found on possibly recovering the data looked like way more work than restoring the data, so I opted to do as you noted, remove the partitions from the impacted Cache-tier device, and recreate the Disk-Group.  I found the log drive on the vCenter was full when I was investigating solutions, and I cleared that up too.  Who knew - this thing doesn't get much attention because normally it just works.

This cluster is ESXi 6.5 U3, vSAN 6.5 - 4 x Dell R710 hosts, each host has 2 disk groups: 1 all-flash, 1 hybrid. Both groups' Cache-tier is an NVMe drive on a PCI expansion card. When I first put the NVMe drives in as Cache-tier devices, they would overheat and just disappear. I never lost any machines / data when this happened. I added heat sinks to the NVMe drives and have been stable since.

The 2 machines that were lost were hammering the vSAN pretty good with lots of file copying jobs.  The Hybrid Disk Group on one of the hosts had the issue.  Not sure if that had something to do with the stress I gave it or not.  Once I get back to it, I will be doing that same job on those 2 machines again.  I will let you know if I run into another issue, but no news is good news.  🙂

Next upgrade is a 10 GbE backbone for the cluster...

Thanks,

Brad

20220211_121139 (2).jpg

Brad2911
Contributor

I forgot to add that I replaced all of the fans in the hosts with Noctua fans as the hosts sit right next to my desk and the fan noise before the switch was intolerable.  Now they are whisper quiet.
