Hi all,
I'm having a problem with a single-node vSAN host (homelab setup). The host had a power outage and now boots all the way through until it PSODs at the very end of the boot process with:
Failed at bora/modules/vmkernel/virsto/map/vs_map_cache.c:324 -- VMK_ASSERT(((mbh->state == MB_INVALID_DATA
Is it possible to recover the data on the vSAN partition, or is it all lost? Thanks in advance!
The PSOD happens right after the Log Recovery step...
Some additional info: I was able to boot the host with vSAN disabled by adding "jumpstart.disable=vsan,lsom,plog,virsto,cmmds" to the boot parameters.
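For anyone else who needs it, this is how I applied the option (Shift+O is the standard ESXi boot-loader key for editing boot options; the setting only applies to that one boot):

```shell
# Press Shift+O at the ESXi loader screen, then append this to the
# existing boot options line and press Enter to boot:
jumpstart.disable=vsan,lsom,plog,virsto,cmmds
```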
The disks and the vSAN cluster configuration appear to be intact.
Would it help to delete the partitions on the NVMe cache disk and re-add it to the vSAN cluster? Or are there other things worth trying before deleting the cache partitions?
[root@esx03:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2019-07-07T17:59:14Z
Local Node UUID:
Local Node Type: NORMAL
Local Node State: DISCOVERY
Local Node Health State: HEALTHY
Sub-Cluster Master UUID:
Sub-Cluster Backup UUID:
Sub-Cluster UUID: 52949dd0-8cb7-f3c4-3f5d-54461b2d65d3
Sub-Cluster Membership Entry Revision: 0
Sub-Cluster Member Count: 0
Sub-Cluster Member UUIDs:
Sub-Cluster Member HostNames:
Sub-Cluster Membership UUID:
Unicast Mode Enabled: false
Maintenance Mode State: OFF
Config Generation: 7d82f990-57d8-4bf6-965e-1df18c8d1ac5 12 2019-07-06T03:47:14.88
[root@esx03:~] esxcli vsan storage list
naa.6b8ca3a0ed773d0023abd26814ad2eeb
Device: naa.6b8ca3a0ed773d0023abd26814ad2eeb
Display Name: naa.6b8ca3a0ed773d0023abd26814ad2eeb
Is SSD: true
VSAN UUID: 5268e51a-a5ba-2a6c-5887-952e845fc964
VSAN Disk Group UUID: 52b36e14-771e-6711-c074-88ab86b3ac9a
VSAN Disk Group Name: t10.NVMe____Samsung_SSD_970_PRO_512GB_______________2E37B28159382500
Used by this host: false
In CMMDS: false
On-disk format version: 7
Deduplication: true
Compression: true
Checksum: 16614367127871676916
Checksum OK: true
Is Capacity Tier: true
Encryption Metadata Checksum OK: true
Encryption: false
DiskKeyLoaded: false
Is Mounted: true
Creation Time: Fri Dec 21 17:22:58 2018
naa.6b8ca3a0ed773d0023abd278159d8547
Device: naa.6b8ca3a0ed773d0023abd278159d8547
Display Name: naa.6b8ca3a0ed773d0023abd278159d8547
Is SSD: true
VSAN UUID: 5271c531-3fe7-dca2-097e-9cb6abb82cd3
VSAN Disk Group UUID: 52b36e14-771e-6711-c074-88ab86b3ac9a
VSAN Disk Group Name: t10.NVMe____Samsung_SSD_970_PRO_512GB_______________2E37B28159382500
Used by this host: false
In CMMDS: false
On-disk format version: 7
Deduplication: true
Compression: true
Checksum: 3856994230136079572
Checksum OK: true
Is Capacity Tier: true
Encryption Metadata Checksum OK: true
Encryption: false
DiskKeyLoaded: false
Is Mounted: true
Creation Time: Fri Dec 21 17:22:58 2018
t10.NVMe____Samsung_SSD_970_PRO_512GB_______________2E37B28159382500
Device: t10.NVMe____Samsung_SSD_970_PRO_512GB_______________2E37B28159382500
Display Name: t10.NVMe____Samsung_SSD_970_PRO_512GB_______________2E37B28159382500
Is SSD: true
VSAN UUID: 52b36e14-771e-6711-c074-88ab86b3ac9a
VSAN Disk Group UUID: 52b36e14-771e-6711-c074-88ab86b3ac9a
VSAN Disk Group Name: t10.NVMe____Samsung_SSD_970_PRO_512GB_______________2E37B28159382500
Used by this host: false
In CMMDS: false
On-disk format version: 7
Deduplication: true
Compression: true
Checksum: 12214449757955944003
Checksum OK: true
Is Capacity Tier: false
Encryption Metadata Checksum OK: true
Encryption: false
DiskKeyLoaded: false
Is Mounted: true
Creation Time: Fri Dec 21 17:22:58 2018
[root@esx03:~] partedUtil getptbl /dev/disks/t10.NVMe____Samsung_SSD_970_PRO_512GB_______________2E37B28159382500
gpt
62260 255 63 1000215216
1 2048 6143 381CFCCC728811E092EE000C2911D0B2 vsan 0
2 6144 1000215182 77719A0CA4A011E3A47E000C29745A24 virsto 0
[root@esx03:~] partedUtil getptbl /dev/disks/naa.6b8ca3a0ed773d0023abd278159d8547
gpt
121534 255 63 1952448512
1 2048 6143 381CFCCC728811E092EE000C2911D0B2 vsan 0
2 6144 1952448478 77719A0CA4A011E3A47E000C29745A24 virsto 0
[root@esx03:~] partedUtil getptbl /dev/disks/naa.6b8ca3a0ed773d0023abd26814ad2eeb
gpt
121534 255 63 1952448512
1 2048 6143 381CFCCC728811E092EE000C2911D0B2 vsan 0
2 6144 1952448478 77719A0CA4A011E3A47E000C29745A24 virsto 0
Were you able to resolve this? I am experiencing the same issue.
Thanks
@Brad2911 You mentioned that you are "experiencing the same issue" as OP - do you mean just in the generic sense that you have a recursive PSOD on ESXi boot, or that you see the exact same backtrace?
The backtrace OP shared indicates corruption on the Cache-tier of a Disk-Group - this can happen for a variety of reasons, both logical and physical. The short-term solution is what OP likely did (assuming the data is accessible without the impacted Disk-Group): reboot the node with the vSAN modules disabled, remove the partitions from the impacted Cache-tier device, reboot in normal mode, remove the remainders of the Disk-Group, and recreate it. If the node has multiple Disk-Groups, the impacted one can be identified by monitoring the logs before the PSOD (Alt+F12) and/or by checking whether the device naa/UUID is listed in the backtrace.
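For anyone searching later, a rough sketch of those steps as commands. The device name and UUIDs below are taken from OP's output, so substitute your own, and double-check everything before deleting partitions - this destroys the Disk-Group and any data on it that is not replicated elsewhere:

```shell
# Boot with the vSAN stack disabled first: press Shift+O at the ESXi
# boot screen and append:  jumpstart.disable=vsan,lsom,plog,virsto,cmmds

# Identify the Cache-tier device of the impacted Disk-Group
# (it shows "Is Capacity Tier: false" in the output):
esxcli vsan storage list

# Delete the two vSAN partitions on the Cache-tier device
# (device name from OP's output - substitute your own):
DEV=/dev/disks/t10.NVMe____Samsung_SSD_970_PRO_512GB_______________2E37B28159382500
partedUtil getptbl $DEV
partedUtil delete $DEV 2
partedUtil delete $DEV 1

# Reboot normally, remove what is left of the Disk-Group by its vSAN
# UUID, then recreate it (-s cache device, one -d per capacity device):
esxcli vsan storage remove -u 52b36e14-771e-6711-c074-88ab86b3ac9a
esxcli vsan storage add -s t10.NVMe____Samsung_SSD_970_PRO_512GB_______________2E37B28159382500 \
    -d naa.6b8ca3a0ed773d0023abd278159d8547 \
    -d naa.6b8ca3a0ed773d0023abd26814ad2eeb
```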
The long-term solution for such an issue is much more varied - if it re-occurs on the same devices then there is likely a physical cause, e.g. a bad Cache-tier device or a misconfiguration (such as a random/unsupported controller, or a driver/firmware that does not faithfully pass data unaltered to the disks), though it can also be a logical issue that is fixed in code in a later build of ESXi/vSAN.
Hi Bob,
Thanks for the response.
I took a photo of the error - below. Sorry for the photo - the machine I could bring up the remote console on was one of the machines I lost...
This is a home lab, and I had a good backup of the data, so it wasn't a huge deal, but I did have 2 machines get inaccessible objects, and was poking around to see if they would recover / be recoverable. Some posts I found on possibly recovering the data looked like way more work than restoring the data, so I opted to do as you noted, remove the partitions from the impacted Cache-tier device, and recreate the Disk-Group. I found the log drive on the vCenter was full when I was investigating solutions, and I cleared that up too. Who knew - this thing doesn't get much attention because normally it just works.
This cluster is ESXi 6.5 U3, vSAN 6.5 - 4 x Dell R710 hosts, each with 2 disk groups: 1 all-flash, 1 hybrid. Both groups' Cache-tier is an NVMe drive on a PCIe expansion card. When I first put the NVMe drives in as Cache-tier devices they would overheat and just disappear. I never lost any machines/data when that happened. I added heat sinks to the NVMe drives and have been stable since.
The 2 machines that were lost were hammering the vSAN pretty hard with lots of file-copy jobs. The Hybrid Disk-Group on one of the hosts had the issue. Not sure whether that had something to do with the stress I put on it or not. Once I get back to it, I will run that same job on those 2 machines again. I will let you know if I run into another issue, but no news is good news. 🙂
Next upgrade is a 10GbE backbone for the cluster...
Thanks,
Brad
I forgot to add that I replaced all of the fans in the hosts with Noctua fans as the hosts sit right next to my desk and the fan noise before the switch was intolerable. Now they are whisper quiet.