VMware Cloud Community
croit55
Contributor

ESXi PSOD

Hi,

For the past few days, I have been troubleshooting a specific issue that we are encountering with VMware ESXi 8.0 U1a installed on an HPE ProLiant DL385 Gen10 Plus. The host is connected to the vCenter Server but is not part of a cluster.

The main error is NOT_IMPLEMENTED, and according to VMware KB 1019956 (Understanding ASSERT and NOT_IMPLEMENTED purple diagnostic screens), it means that a component is requesting some activity from the VMkernel that it was not designed to perform. Other discussions of this error have not been helpful in my case, since I have already tried reinstalling and upgrading ESXi itself.

The error traceback is as follows:

2023-07-15T10:42:52.185Z cpu0:2097242)@BlueScreen: NOT_IMPLEMENTED bora/vmkernel/main/world.c:2294
2023-07-15T10:42:52.185Z cpu0:2097242)Code start: 0x420017400000 VMK uptime: 11:13:33:15.324
2023-07-15T10:42:52.185Z cpu0:2097242)0x453882d1bc00:[0x420017514d31]PanicvPanicInt@vmkernel#nover+0x1f5 stack: 0x100
2023-07-15T10:42:52.186Z cpu0:2097242)0x453882d1bcb0:[0x4200175153a0]Panic_NoSave@vmkernel#nover+0x4d stack: 0x453882d1bd10
2023-07-15T10:42:52.186Z cpu0:2097242)0x453882d1bd10:[0x4200175158ad]Panic_OnAssertAt@vmkernel#nover+0xba stack: 0x8f600000000
2023-07-15T10:42:52.186Z cpu0:2097242)0x453882d1bd90:[0x42001756855f]Int6_UD2Assert@vmkernel#nover+0x260 stack: 0x0
2023-07-15T10:42:52.187Z cpu0:2097242)0x453882d1bdc0:[0x420017561067]gate_entry@vmkernel#nover+0x68 stack: 0x0
2023-07-15T10:42:52.187Z cpu0:2097242)0x453882d1be80:[0x420017547136]World_DestroyHeap@vmkernel#nover+0x4e stack: 0x4310dc600000
2023-07-15T10:42:52.187Z cpu0:2097242)0x453882d1bea0:[0x420017547251]WorldGroupCleanup@vmkernel#nover+0xe6 stack: 0x453882d1bef0
2023-07-15T10:42:52.188Z cpu0:2097242)0x453882d1bec0:[0x4200174f1dee]InitTable_Cleanup@vmkernel#nover+0x27 stack: 0x430f4ec01220
2023-07-15T10:42:52.188Z cpu0:2097242)0x453882d1bee0:[0x42001754cd46]World_TryReap@vmkernel#nover+0x3d3 stack: 0x45389e01f000
2023-07-15T10:42:52.189Z cpu0:2097242)0x453882d1bfa0:[0x420017517582]ReaperWorkerWorld@vmkernel#nover+0xaf stack: 0x453882c9f100
2023-07-15T10:42:52.189Z cpu0:2097242)0x453882d1bfe0:[0x420017828eca]CpuSched_StartWorld@vmkernel#nover+0x7b stack: 0x0
2023-07-15T10:42:52.189Z cpu0:2097242)0x453882d1c000:[0x4200174d788b]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
2023-07-15T10:42:52.191Z cpu0:2097242)base fs=0x0 gs=0x420040000000 Kgs=0x0
2023-07-15T10:28:04.116Z cpu0:2097242)Heap: 2746: Unable to complete wait for non-empty heap (worldGroup.2101762): Timeout

 

Besides that, we sometimes get notifications (errors?) from the lsi_mr3 driver for the HBA that controls our local disk array:

 

2023-07-15T10:28:04.646Z cpu9:2097729)lsi_mr3_0000:c4:00.0: megasas_hotplug_work: 498: event code: 0x5e.
2023-07-15T10:28:05.638Z cpu9:2097729)lsi_mr3_0000:c4:00.0: mfiReadMaxEvents: 378: Event:From SeqNum 18714 to 18714. Count 1
2023-07-15T10:28:05.638Z cpu9:2097729)lsi_mr3_0000:c4:00.0: megasas_hotplug_work: 498: event code: 0x5e.
2023-07-15T10:28:22.646Z cpu9:2097729)lsi_mr3_0000:c4:00.0: mfiReadMaxEvents: 378: Event:From SeqNum 18715 to 18715. Count 1
2023-07-15T10:28:22.646Z cpu9:2097729)lsi_mr3_0000:c4:00.0: megasas_hotplug_work: 498: event code: 0x5e.

 

I would really be grateful if any of you have a clue about what else I could try before opening a support request with VMware.

 

Thank you in advance!

25 Replies
SiddSalman
Contributor

This issue is under investigation:

Heap: 2746: Unable to complete wait for non-empty heap (worldGroup.2101762): Timeout

VMware and HPE are investigating the cause of the issue.

Please log a case with HPE and VMware.

 

I work for HPE

pashnal
Enthusiast

Hi,

Can you try the workaround below and check whether it helps? Increase the value of storageMaxDevices to 1024; this issue occurs when the devfs heap is full.

To increase the value using the vSphere Client, go to Software > Advanced Settings > VMkernel > VMkernel.Boot.storageMaxDevices.
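If the host is reachable over SSH, the same boot option can also be inspected and changed from the ESXi shell. This is just a sketch; verify the exact option name on your build with the `list` command first, and note that the change only takes effect after a reboot:

```shell
# Show the current value of the boot-time storage device limit
esxcli system settings kernel list -o storageMaxDevices

# Raise the limit to 1024 (applied at next reboot)
esxcli system settings kernel set -s storageMaxDevices -v 1024
```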

 

Thanks , 

Pramod Ashnal 

Pls mark this comment as solution provided and give a thumbs up if you have got your solution !!

 

croit55
Contributor

Hi,

Yes, I have already tried that recommendation, but unfortunately it did not help.

maksym007
Expert

A PSOD is in most cases hardware related.

So update iLO/iDRAC and the BIOS.

Patch ESXi to the latest version, and update the drivers and firmware of all PCI cards.

Check that the CPU and RAM are OK.
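To compare what is currently installed against HPE's firmware/driver recipes, the running build and driver versions can be pulled from the ESXi shell. A rough sketch, assuming SSH access (VIB names can vary between image profiles):

```shell
# Running ESXi version and build number
esxcli system version get

# Installed driver VIBs; the MegaRAID driver ships as lsi-mr3
esxcli software vib list | grep -i lsi
```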

GregoryCann
Contributor

OK, thanks, I will check it. I will go to Software > Advanced Settings > VMkernel > VMkernel.Boot.storageMaxDevices, and if I face any issue, I will ask.

croit55
Contributor

Did this fix the issue in your case?

microy
Contributor

We have the exact same PSOD.
VMware ESXi, 8.0.1, build 22088125
ProLiant DL325 Gen10 Plus, AMD EPYC 7542 32-Core Processors

It has happened 3 times now.

VMware Support points towards HPE.

croit55
Contributor

Yes, we have also faced this PSOD multiple times. Sometimes it happened every 2-3 days, but it has now been stable for over 50 days. Can you please post an update if HPE has any useful information on this?

Norbertel
Contributor

We have the exact same PSOD.
VMware ESXi, 8.0.2, 22380479
ProLiant DL385 Gen11, AMD EPYC 9474F 48-Core Processor

It has already happened a few times on our cluster of 4 servers.
I hope VMware and HPE find a solution together.

KingNST
Contributor

Has anyone tried downgrading to ESXi 7? A client of mine is having the exact same issue, and neither VMware nor HPE seems to have any answers.

DanRobinsonHP
Enthusiast

Are all of you having this issue running MegaRAID cards?
HPE MR216 / MR416 / MR408 ?

One change we made from Gen10 to Gen10 Plus was switching the default RAID card vendor to LSI (Broadcom).
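For anyone unsure which controller and driver their host is actually using, a quick check from the ESXi shell (a sketch, assuming shell access):

```shell
# List HBAs with the driver bound to each; lsi_mr3 indicates a MegaRAID-family controller
esxcli storage core adapter list

# Show the PCI device description for the RAID controller
lspci | grep -i raid
```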

croit55
Contributor

Yes, we are using MR416i cards.

TallonZek
Contributor

Hello, we got the same PSOD yesterday, and we are also running an HPE ProLiant DL385 Gen10 Plus with an MR416i-a. All drivers are on the September SPP. Have you gotten anywhere with HPE support?

TallonZek
Contributor

We got another PSOD on a different host with identical hardware. Here's our latest info from VMware support:

This issue is caused by an object being leaked in the world heap of the "smad" process; when we try to clean up this world, it results in a PSOD.

HPE servers are impacted by this and may crash with a PSOD showing the backtrace mentioned above.

Currently there is no resolution. The HPE engineering team is working on a code fix in their iLO driver to resolve the issue.

At this time it would be best to contact HPE to see if they have an updated iLO driver; however, the info I have was just published internally today, so I would not expect them to have anything just yet.

We had already updated iLO to the latest version prior to the PSOD. HPE support gave me a cryptic promise that their developers are looking at it, with no ETA. Has anyone else gotten better info on this?

virtualqc
Enthusiast

Did you use the HPE ESXi custom ISO or the standard ESXi 8.0 U1a? If you did not use the custom ISO, you should switch to it; you can download it from the HPE website.

Another possible cause is that the ESXi host has some incompatible or unsupported third-party software installed that interferes with the installation.

TallonZek
Contributor

We're using HPE's version.

lamax1976
Contributor

Exact same behaviour here: 4 hosts HPE DL385 Gen10 Plus and 2 hosts DL385 Gen10 Plus v2 | VMware ESXi, 8.0.1, build 22088125.

One PSOD every 4 days since we upgraded to vSphere 8.

We opened cases with HPE and VMware support, and they said it is linked to a bug between vSphere and the iLO; we have to wait for vSphere 8.0 U3 to have it fixed.

NateNateNAte
Hot Shot

I had the same issue with ESXi 6.5 years ago. It was fixed by the 6.5 U3 upgrade, but it was a cascading failure tied to how the database (SQL Enterprise at the time) was configured; a runaway bug led to essentially a SQL log buffer overflow... and a PSOD.

Good times.

I'm surprised that this is still an issue, though.

BC_Daniel
Contributor

HPE DL385 Gen10 Plus with no local controller (FC SAN connection).

We updated everything to the newest version, but it failed again afterwards.

A cluster with 3 hosts failed within 30 minutes, one after the other.

Overall, 5 PSODs to date.

HPE and VMware do not have any solution.

