PSOD no Heartbeat in ESXi 5.5 (PCPU 12: no heartbeat)

PatricioZ · ‎09-22-2021

Hello, a few days ago one of our Dell PowerEdge R730 servers, which has ESXi 5.5 installed, suddenly froze and showed a purple screen (PSOD) with the following error:

PCPU 12: no heartbeat (2/2 IPIs received)
cr0=0x80010031 cr2=0xb925310 cr3=0x156df9000 cr4=0x42768
*PCPU22:34002/rhttpproxy-work
PCPU 0: VVVVVVVVVVVVSUSSSSHSSSSUS
Code start: 0x418008800000 VMK uptime: 594:06:57:43.407
Saved backtrace from: pcpu 12 Heartbeat NMI
0x41238a9dda00: [0x418008bae6b8]MemNode_NUMANodeMask2MemNodeMask@vmkernel#nover+0x48 stack: 0x=
0x41238a9ddb30: [0x418008b7ce5b]MemDistributeNUMAPolicy@vmkernel#nover+0x107 stack: 0x136be09b
0x41238a9ddca0: [0x418008b7c0cc]MemDistribute_Alloc@vmkernel#nover+0x1fc stack: 0x41238a9ddd4c
0x41238a9ddd00: [0x418008a9e15f]SchedKmem_Alloc@vmkernel#nover+0x67 stack: 0x203a353400000200
0x41238a9ddd80: [0x4180088246a1]vmk_MemPoolAlloc@vmkernel#nover+0x181 stack: 0x41095c541f01
0x41238a9ddeb0: [0x418008f63646]fusion_get_seq_num@<None>#<None>+0xb6 stack:0x91
0x41238a9ddf20: [0x418008f5b804]megasas_hotplug_work@<None>#<None>+0xc0 stack: 0x0
0x41238a9ddfd0: [0x418008827c6f]VmkTimerQueueWorldFunc@vmkernel#nover+0x40b stack: 0x0
0x41238a9ddff0: [0x418008a55452]CpuSched_StartWorld@vmkernel#nover+0xfa stack: 0x0
base fs=0x0 gs=0x418045800000 Kgs=0x0
2021-09-11T20:08:39.270Z cpu12:33447)NMI: IPI received. Was eip(base):ebp:cs [0x148b7(0x418008800000):0x41238a9dd970:0x

-----

After this, the server had to be restarted to get it working again.

However, I am having trouble identifying the error, what the failure was due to, and how to prevent it from happening again.

I would appreciate it if you could help me with this.

Thanks.

christianZ · ‎09-22-2021

Hi,

See the following points:

NumaNode: this concerns memory distribution across the NUMA nodes. Is there any VM-level NUMA configuration, and is any VM configured with a large amount of RAM?

megasas_...: have you checked for a newer driver for your RAID controller?

On Dell servers, you can check your hardware by running the hardware tests (boot with F10), especially for the CPUs and RAM.

Just my ideas on that.

Regards,

Christian Z.
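
The points above come from reading the function names in the backtrace. A minimal Python sketch of that reading follows, assuming each frame line has the shape `stack_addr: [code_addr]symbol@module#version+offset`, as in the lines posted here. The helper names (`parse_frames`, `non_vmkernel_frames`) are hypothetical, and the sample is four frames copied from the first PSOD in this thread. Frames whose module is not `vmkernel` (shown as `<None>` on the screen) are the ones that point toward a driver, here the `megasas_` and `fusion_` functions.

```python
import re
from typing import List, NamedTuple


class Frame(NamedTuple):
    symbol: str   # function name, e.g. megasas_hotplug_work
    module: str   # "vmkernel" or "<None>" when no module name was resolved
    offset: int   # offset into the function


# Assumed line shape: 0xSTACK: [0xCODE]symbol@module#version+0xOFFSET stack: ...
FRAME_RE = re.compile(
    r"^0x[0-9a-f]+:\s*\[0x[0-9a-f]+\](\w+)@([^#]+)#\S+?\+(0x[0-9a-f]+)"
)

# Four frames copied from the first PSOD posted in this thread.
SAMPLE = """\
0x41238a9ddd80: [0x4180088246a1]vmk_MemPoolAlloc@vmkernel#nover+0x181 stack: 0x41095c541f01
0x41238a9ddeb0: [0x418008f63646]fusion_get_seq_num@<None>#<None>+0xb6 stack:0x91
0x41238a9ddf20: [0x418008f5b804]megasas_hotplug_work@<None>#<None>+0xc0 stack: 0x0
0x41238a9ddfd0: [0x418008827c6f]VmkTimerQueueWorldFunc@vmkernel#nover+0x40b stack: 0x0
"""


def parse_frames(text: str) -> List[Frame]:
    """Return every line of text that matches the assumed frame shape."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(
                Frame(match.group(1), match.group(2), int(match.group(3), 16))
            )
    return frames


def non_vmkernel_frames(frames: List[Frame]) -> List[Frame]:
    """Keep only frames that are not attributed to the vmkernel itself."""
    return [frame for frame in frames if frame.module != "vmkernel"]


if __name__ == "__main__":
    for frame in non_vmkernel_frames(parse_frames(SAMPLE)):
        print(f"{frame.symbol} (module {frame.module}, offset {frame.offset:#x})")
```

This only sorts the frames. It does not prove the RAID driver caused the hang, but it shows which component was running on the stuck CPU when the heartbeat NMI fired.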

PatricioZ · ‎02-04-2022

Hello, I am writing again, because after 6 months the server failed again, generating an error on the screen.

This time the error is different but I can't figure out what is causing it.

I have checked the server hardware through the BIOS and it does not report any errors. I have also ruled out temperature problems, since the server is in a refrigerated room and an internal cleaning was carried out only a few months ago.

I would appreciate any help you can give me to solve this problem.

Thanks.

This is the text:

cr0=0x8001003d cr2=0x1475000 cr3=0x73209000 cr4=0x216c
PCPU0:33054/memMap-0
PCPU 0: SSUUVSUSSUVSSSUUSSSSSSS
Code start: 0x418011800000 VMK uptime: 77:19:39:31.997
0x41238479d220:[0x41801188d0a9]PanicvPanicInt@vmkernel#nover+0x575 stack: 0x412300000008
0x41238479d280:[0x41801188d2ed]Panic_NoSave@vmkernel#nover+0x49 stack: 0x41238479d320
0x41238479d290:[0x4180118888f8]NMICheckLint1Bottom@vmkernel#nover+0x50 stack: 0x41238479d2d0
0x41238479d320:[0x41801182e9ef]BH_DrainAndDisableInterrupts@vmkernel#nover+0xf3 stack: 0x41238479d3
0x41238479d360:[0x4180118641c3]IDT_IntrHandler@vmkernel#nover+0x1af stack: 0x41238479d480
0x41238479d370:[0x4180118f1064]gate_entry@vmkernel#nover+0x64 stack: 0x4018
0x41238479d480:[0x418011ba655a]Power_HaltPCPU@vmkernel#nover+0x1fe stack: 0x0
0x41238479d4f0:[0x418011a50a69]CpuSchedIdleLoopInt@vmkernel#nover+0x4bd stack: 0x412300000002
0x41238479d650:[0x418011a56b40]CpuSchedDispatch@vmkernel#nover+0x1630 stack: 0x6e0
0x41238479d6c0:[0x418011a57e75]CpuSchedWait@vmkernel#nover+0x245 stack: 0x1412300000001
0x41238479d740:[0x418011a587d4]CpuSched_TimeWait@vmkernel#nover+0xec stack: 0x0
0x41238479dfd0:[0x418011a587d4]PageCacheAdjustSize@vmkernel#nover+0x448 stack: 0x0
0x41238479dff0:[0x418011a55452]CpuSched_StartWorld@vmkernel#nover+0xfa stack:0x0
base fs=0x0 gs=0x418040000000 Kgs=0x0
Coredump to disk. Slot 1 of 1.
Diskdump: Failed: Couldn’t dump header: 0xbad0001
file configured to dump data.
Debugger waiting(world 33054) -- no port for remote debugger. “Escape” for local debugger.
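
To compare the two crashes, the `VMK uptime` values from both screens can be converted into durations. This is a small sketch that assumes the field is formatted as days:hours:minutes:seconds. The function name `parse_vmk_uptime` is hypothetical, and the crash timestamp is the one printed on the last line of the first screen.

```python
from datetime import datetime, timedelta, timezone


def parse_vmk_uptime(value: str) -> timedelta:
    """Convert 'D:HH:MM:SS.mmm' (assumed layout) into a timedelta."""
    days, hours, minutes, seconds = value.split(":")
    return timedelta(
        days=int(days),
        hours=int(hours),
        minutes=int(minutes),
        seconds=float(seconds),
    )


# Values copied from the two PSOD screens in this thread.
first_uptime = parse_vmk_uptime("594:06:57:43.407")
second_uptime = parse_vmk_uptime("77:19:39:31.997")

# Timestamp from the first screen (2021-09-11T20:08:39.270Z).
first_crash = datetime(2021, 9, 11, 20, 8, 39, 270000, tzinfo=timezone.utc)

if __name__ == "__main__":
    print("First crash after", first_uptime.days, "days of uptime")
    print("Second crash after", second_uptime.days, "days of uptime")
    # Approximate boot time of the host before the first crash.
    print("Host booted around", (first_crash - first_uptime).date())
```

Under that assumption, the host had been running for well over a year before the first crash, and for a much shorter time before the second one.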

e_espinel · ‎02-06-2022

Hello.

I recommend updating the firmware of the Dell PowerEdge R730. This should be part of the equipment's maintenance at least once a year, for as long as new firmware levels are released. After updating the firmware, run the internal memory and CPU diagnostics again.

It is also recommended to update VMware vSphere to its latest patch level. Here is a link where you can get the patches:

https://customerconnect.vmware.com/patch

Another option would be to upgrade to version 6.0, if your hardware supports it and you can upgrade the license from version 5 to version 6.

Enrique Espinel
Senior Technical Support on IBM, Lenovo, Veeam Backup and VMware vSphere.
VSP-SV, VTSP-SV, VTSP-HCI, VTSP
