VMware Cloud Community
croit55
Contributor

ESXi PSOD

Hi,

For the past few days, I have been troubleshooting an issue with VMware ESXi 8.0U1a installed on an HPE ProLiant DL385 Gen10+. The host is connected to a vCenter server but is not part of a cluster.

The main error is NOT_IMPLEMENTED. According to the KB article "Understanding ASSERT and NOT_IMPLEMENTED purple diagnostic screens" (1019956), it means that a component asked the VMkernel to perform an operation it was not designed to handle. Other discussions of this error have not helped in my case, since I have already tried reinstalling and upgrading ESXi itself.

The error traceback is as follows:

2023-07-15T10:42:52.185Z cpu0:2097242)@BlueScreen: NOT_IMPLEMENTED bora/vmkernel/main/world.c:2294
2023-07-15T10:42:52.185Z cpu0:2097242)Code start: 0x420017400000 VMK uptime: 11:13:33:15.324
2023-07-15T10:42:52.185Z cpu0:2097242)0x453882d1bc00:[0x420017514d31]PanicvPanicInt@vmkernel#nover+0x1f5 stack: 0x100
2023-07-15T10:42:52.186Z cpu0:2097242)0x453882d1bcb0:[0x4200175153a0]Panic_NoSave@vmkernel#nover+0x4d stack: 0x453882d1bd10
2023-07-15T10:42:52.186Z cpu0:2097242)0x453882d1bd10:[0x4200175158ad]Panic_OnAssertAt@vmkernel#nover+0xba stack: 0x8f600000000
2023-07-15T10:42:52.186Z cpu0:2097242)0x453882d1bd90:[0x42001756855f]Int6_UD2Assert@vmkernel#nover+0x260 stack: 0x0
2023-07-15T10:42:52.187Z cpu0:2097242)0x453882d1bdc0:[0x420017561067]gate_entry@vmkernel#nover+0x68 stack: 0x0
2023-07-15T10:42:52.187Z cpu0:2097242)0x453882d1be80:[0x420017547136]World_DestroyHeap@vmkernel#nover+0x4e stack: 0x4310dc600000
2023-07-15T10:42:52.187Z cpu0:2097242)0x453882d1bea0:[0x420017547251]WorldGroupCleanup@vmkernel#nover+0xe6 stack: 0x453882d1bef0
2023-07-15T10:42:52.188Z cpu0:2097242)0x453882d1bec0:[0x4200174f1dee]InitTable_Cleanup@vmkernel#nover+0x27 stack: 0x430f4ec01220
2023-07-15T10:42:52.188Z cpu0:2097242)0x453882d1bee0:[0x42001754cd46]World_TryReap@vmkernel#nover+0x3d3 stack: 0x45389e01f000
2023-07-15T10:42:52.189Z cpu0:2097242)0x453882d1bfa0:[0x420017517582]ReaperWorkerWorld@vmkernel#nover+0xaf stack: 0x453882c9f100
2023-07-15T10:42:52.189Z cpu0:2097242)0x453882d1bfe0:[0x420017828eca]CpuSched_StartWorld@vmkernel#nover+0x7b stack: 0x0
2023-07-15T10:42:52.189Z cpu0:2097242)0x453882d1c000:[0x4200174d788b]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
2023-07-15T10:42:52.191Z cpu0:2097242)base fs=0x0 gs=0x420040000000 Kgs=0x0
2023-07-15T10:42:52.116Z cpu0:2097242)Heap: 2746: Unable to complete wait for non-empty heap (worldGroup.2101762): Timeout
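
For anyone who wants to compare this backtrace with one from another host, here is a minimal Python sketch that pulls the function name, module, and offset out of each stack-frame line. The regular expression is an assumption based only on the line layout shown above, not on any documented VMware log format, and `parse_frames` is a hypothetical helper name. The sample uses four frames copied from the trace.

```python
import re

# Matches one stack-frame line of the backtrace, for example:
# "...)0x453882d1be80:[0x420017547136]World_DestroyHeap@vmkernel#nover+0x4e stack: 0x..."
# Assumption: the layout is inferred from the lines in this thread only.
FRAME_RE = re.compile(
    r"\)(?P<sp>0x[0-9a-f]+):\[(?P<addr>0x[0-9a-f]+)\]"
    r"(?P<func>\w+)@(?P<module>[^#\s]+)#\S*\+(?P<offset>0x[0-9a-f]+)"
)


def parse_frames(log_text):
    """Return a (function, module, offset) tuple for every stack-frame line."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append(
                (match["func"], match["module"], int(match["offset"], 16))
            )
    return frames


SAMPLE = """\
2023-07-15T10:42:52.186Z cpu0:2097242)0x453882d1bd90:[0x42001756855f]Int6_UD2Assert@vmkernel#nover+0x260 stack: 0x0
2023-07-15T10:42:52.187Z cpu0:2097242)0x453882d1bdc0:[0x420017561067]gate_entry@vmkernel#nover+0x68 stack: 0x0
2023-07-15T10:42:52.187Z cpu0:2097242)0x453882d1be80:[0x420017547136]World_DestroyHeap@vmkernel#nover+0x4e stack: 0x4310dc600000
2023-07-15T10:42:52.187Z cpu0:2097242)0x453882d1bea0:[0x420017547251]WorldGroupCleanup@vmkernel#nover+0xe6 stack: 0x453882d1bef0
"""

frames = parse_frames(SAMPLE)
for func, module, offset in frames:
    print(f"{func} ({module}) +{offset:#x}")
```

Running the same function over the full backtrace from two hosts and comparing the resulting lists is a quick way to check whether both PSODs go through the same World_DestroyHeap / WorldGroupCleanup path.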

 

Besides that, sometimes we get notifications (errors?) from the lsi_mr3 driver installed on the HBA controlling our local array of disks:

 

2023-07-15T10:28:04.646Z cpu9:2097729)lsi_mr3_0000:c4:00.0: megasas_hotplug_work: 498: event code: 0x5e.
2023-07-15T10:28:05.638Z cpu9:2097729)lsi_mr3_0000:c4:00.0: mfiReadMaxEvents: 378: Event:From SeqNum 18714 to 18714. Count 1
2023-07-15T10:28:05.638Z cpu9:2097729)lsi_mr3_0000:c4:00.0: megasas_hotplug_work: 498: event code: 0x5e.
2023-07-15T10:28:22.646Z cpu9:2097729)lsi_mr3_0000:c4:00.0: mfiReadMaxEvents: 378: Event:From SeqNum 18715 to 18715. Count 1
2023-07-15T10:28:22.646Z cpu9:2097729)lsi_mr3_0000:c4:00.0: megasas_hotplug_work: 498: event code: 0x5e.
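
To see how often these controller events occur, and whether they cluster before a PSOD, a rough Python sketch like the following can tally the event codes per controller. The pattern is an assumption derived only from the lines above, `count_events` is a hypothetical helper name, and the thread does not say what event code 0x5e means, so the script only counts occurrences.

```python
import re
from collections import Counter

# Matches the lsi_mr3 hotplug lines shown above, for example:
# "...)lsi_mr3_0000:c4:00.0: megasas_hotplug_work: 498: event code: 0x5e."
# Assumption: the layout is inferred from the lines in this thread only.
EVENT_RE = re.compile(
    r"lsi_mr3_(?P<pci>[0-9a-f:.]+): megasas_hotplug_work: \d+: "
    r"event code: (?P<code>0x[0-9a-f]+)"
)


def count_events(log_text):
    """Count hotplug events per (PCI address, event code) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EVENT_RE.search(line)
        if match:
            counts[(match["pci"], match["code"])] += 1
    return counts


SAMPLE = """\
2023-07-15T10:28:04.646Z cpu9:2097729)lsi_mr3_0000:c4:00.0: megasas_hotplug_work: 498: event code: 0x5e.
2023-07-15T10:28:05.638Z cpu9:2097729)lsi_mr3_0000:c4:00.0: mfiReadMaxEvents: 378: Event:From SeqNum 18714 to 18714. Count 1
2023-07-15T10:28:05.638Z cpu9:2097729)lsi_mr3_0000:c4:00.0: megasas_hotplug_work: 498: event code: 0x5e.
2023-07-15T10:28:22.646Z cpu9:2097729)lsi_mr3_0000:c4:00.0: mfiReadMaxEvents: 378: Event:From SeqNum 18715 to 18715. Count 1
2023-07-15T10:28:22.646Z cpu9:2097729)lsi_mr3_0000:c4:00.0: megasas_hotplug_work: 498: event code: 0x5e.
"""

counts = count_events(SAMPLE)
for (pci, code), total in counts.items():
    print(f"controller {pci}: event {code} seen {total} time(s)")
```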

 

I would be really grateful for any clue about what else I could try before opening a support request with VMware.

 

Thank you once again in advance!

9 Replies
SiddSalman
Contributor

This issue is under investigation:

Heap: 2746: Unable to complete wait for non-empty heap (worldGroup.2101762): Timeout

VMware and HPE are investigating the cause of the issue.

Please log a case with HPE and VMware.

 

I work for HPE

pashnal
Enthusiast

Hi,

Can you try the workaround below and check whether it helps? Increase the value of storageMaxDevices to 1024, as this issue occurs because the devfs heap is full.

To increase the value using the vSphere Client, go to Software > Advanced settings > VMKernel > vmkernel.boot.StorageMaxDevices.

 

Thanks,

Pramod Ashnal

Please mark this comment as the solution and give it a thumbs up if it solved your problem.

 

croit55
Contributor

Hi,

Yes, I have already tried that recommendation, but unfortunately it did not help.

maksym007
Expert

A PSOD is in most cases hardware related.

So update the iLO/iDRAC and BIOS firmware.

Patch ESXi to the latest version, and update the drivers and firmware of all PCI cards.

Check that the CPU and RAM are OK.

GregoryCann
Contributor

OK, thanks, I will check it. I will go to Software > Advanced settings > VMKernel > vmkernel.boot.StorageMaxDevices and ask if I run into any issues.

croit55
Contributor

Did this fix the issue in your case?

microy
Contributor

We have the exact same PSOD. 
ESXi, 8.0.1, 22088125
ProLiant DL325 Gen10 Plus, AMD EPYC 7542 32-Core Processors

It has happened 3 times now.

VMware Support points towards HPE.

croit55
Contributor

Yes, we have also faced this PSOD multiple times. Sometimes it happens every 2-3 days, but now the host has been OK for over 50 days. Can you please post an update if HPE has any useful information on this?

Norbertel
Contributor

We have the exact same PSOD.
VMware ESXi, 8.0.2, 22380479
ProLiant DL385 Gen11, AMD EPYC 9474F 48-Core Processor

It has already happened a few times on our cluster of 4 servers.
I hope VMware and HPE find a solution together.
