I just updated one of my ESXi servers from 5.0 to 5.5. Everything seemed to go OK during the update, until the ESXi server rebooted, when it came up with a PSOD.
If I cold boot the server from a power-off state, it boots fine with no errors and comes up and reports "compliant" to vCenter.
If I reboot it via vCenter, it comes up with the PSOD again.
The details of the PSOD are below. Has anyone else seen anything similar, or know what the issue is?
I got the PSOD text by taking a camera-phone picture of the PSOD screen, running it through OCR, and fixing the errors, so if something doesn't make sense, there may still be some typos in it. I've attached the original PSOD screenshot for reference.
VMware ESXi 5.5.0 [Releasebuild-1331820 x86_64]
PANIC bora/vmkernel/main/dlmalloc.c:4892 - Usage error in dlmalloc
cr0=0x8001003d cr2=0x164e8080 cr3=0x800f0000 cr4=0x216c
PCPU 0: SHSSSSSHSHS
Code start: 0x41803ba00000 VMK uptime: 0:15:35:38.411
0x412380d1dbf0:[0x41803ba8ccd9]PanicvPanicInt@vmkernel#nover+0x575 stack: 0x412300000008
0x412380d1dc50:[0x41803ba8cf1d]Panic_NoSave@vmkernel#nover+0x49 stack: 0x4109509b4000
0x412380d1dc70:[0x41803ba422ef]DLM_free@vmkernel#nover+0x67b stack: 0x20
0x412380d1dcc0:[0x41803ba58017]Heap_Free@vmkernel#nover+0x107 stack: 0x410954b060b0
0x412380d1dce0:[firstname.lastname@example.org#220.127.116.11+0x3d stack: 0x410954b0
0x412380d1dd20:[0x41803c1a93e5]email@example.com*18.104.22.168+0x16d stack: 0x412380d1dda0
0x412380d1dd50:[0x41803c1a674a]firstname.lastname@example.org#22.214.171.124+0x102 stack: 0x0
0x412380d1ddb0:[0x41803c1ad8a2]email@example.com#126.96.36.199+0x1a6 stack: 0x1
0x412380d1de50:[0x41803c193b0c]firstname.lastname@example.org#188.8.131.52+0x7c stack: 0x1
0x412380d1de90:[0x41803c194784]email@example.com#184.108.40.206+0x20 stack: 0x412300001018
0x412380d1ded0:[0x41803c0b06b4]firstname.lastname@example.orgAPI#9.2+0xa5 stack: 0x41088ec56280
0x412380d1def0:[0x41003c8b06b4]CloseNetDev@com.vmware.driverAPI#9.2+0x7c stack: 0x4108a0e28840
0x412380d1df30:[0x41803bc3a923]UplinkAsyncProcessCallsHelperCB@vmkernel#nover+0x223 stack: 0x0
0x412380d1dfd0:[0x41803ba60f8a]helpFunc@vmkernel#nover+0x6b6 stack: 0x0
0x412380d1df10:[0x41803bc53242]CpuSched_StartWorld@vmkernel#nover+0xfa stack: 0x0
base fs=0x0 gs=0x418041800000 Kgs=0x0
Coredump to disk. Slot 1 of 1.
Finalized dump header (12/12) DiskDump: Successful.
Debugger waiting(world 32820) -- no port for remote debugger. "Escape" for local debugger.
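For anyone who wants to sift through a trace like this without reading it frame by frame, here is a minimal Python sketch that pulls the symbol and module out of each backtrace line. The regular expression is only an assumption based on the frame layout visible in this dump (frame pointer, return address in brackets, symbol@module#version+offset, stack value); the function names are made up for illustration, and the sample uses a few frames from above with the obvious OCR slips corrected. Lines that still contain OCR damage are simply skipped.

```python
import re

# Assumed frame layout, inferred from the dump in this thread:
#   <frame ptr>:[<return addr>]<symbol>@<module>#<version>+<offset> stack: <value>
FRAME_RE = re.compile(
    r"^(?P<frame>0x[0-9a-f]+):\[(?P<addr>0x[0-9a-f]+)\]"
    r"(?P<symbol>[^@\s]+)@(?P<module>[^#\s]+)#(?P<version>\S+?)"
    r"\+(?P<offset>0x[0-9a-f]+)\s+stack:\s+(?P<stack>0x[0-9a-f]+)$"
)


def parse_frames(text):
    """Return one dict per backtrace line that matches the assumed layout."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(match.groupdict())
    return frames


def modules_in_order(frames):
    """List each module once, in the order it first appears in the trace."""
    seen = []
    for frame in frames:
        if frame["module"] not in seen:
            seen.append(frame["module"])
    return seen


# Three cleaned-up frames from the dump, plus one line left with its
# OCR damage ("Ox" instead of "0x") to show that it is skipped.
SAMPLE = """\
0x412380d1dbf0:[0x41803ba8ccd9]PanicvPanicInt@vmkernel#nover+0x575 stack: 0x412300000008
0x412380d1dc70:[0x41803ba422ef]DLM_free@vmkernel#nover+0x67b stack: 0x20
0x412380d1def0:[0x41003c8b06b4]CloseNetDev@com.vmware.driverAPI#9.2+0x7c stack: 0x4108a0e28840
Ox412380d1dcc0:[0x41803ba58017]Heap_Free@vmkernel#nover+Ox107 stack: 0x410954b060b0
"""

if __name__ == "__main__":
    for frame in parse_frames(SAMPLE):
        print(frame["symbol"], "in", frame["module"])
```

Any module other than vmkernel that shows up between the uplink callback and the DLM_free call would be the first place to look for the offending driver.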
It looks like you are using a CNA. I suspect this PSOD is due to a driver issue with the CNA (the bnx2 driver). Check whether you are using the latest driver for the CNA / HBA / NIC; if not, update it, and open a support request with the hardware vendor and VMware to investigate further.
On further investigation, it turns out that the DL360 G6 isn't on the HCL for ESXi 5.5, even though it's on the list for 5.1.
I wasn't expecting it to have been dropped on a sub-release.
According to the VMware Compatibility Guide as well as http://h18004.www1.hp.com/products/servers/vmware/supportmatrix/hpvmware.html the DL360 G6 models are supported for ESXi 5.5. Maybe it's just a firmware issue!? Did you upgrade the host's firmware already?
a.p is correct: the DL360 G6 is supported with ESXi 5.5. What is the processor series?
Yeah, the guy that checked the HCL for me selected "DL320" instead of "DL360". Nevermind!
My hardware guy says that the firmware was updated to the latest release last week.
DL360 G6 is Xeon socket 1366 based. I have two similar systems (one with 2 Westmere CPUs and one with 1 such CPU) that had stability issues after upgrading to ESXi 5.5 (have been rock stable under ESXi 5.1). What solved the problem for me was turning CPU C-state support off in the BIOS. Do you mind trying this? I'm collecting evidence that there is a problem with C-state support on older Intel CPUs, would love to hear your feedback!
We have the identical problem with a DELL R710. At the moment DELL is investigating the problem. I'll give an update.
Dell R710 is again an Intel socket 1366-based system. Do you mind trying to switch the CPU C-state support off in the BIOS and report the results?
Dell replaced the mainboard in this server today, on the assumption that it was an issue with the NICs. No effect. I disabled the C-states in the BIOS, no effect. I did a fresh installation of ESXi 5.5 on this server and restored only the configuration (following KB 2042141). No effect.
So I am in the same situation as before.
I think I can open a support case with VMware later this week. Until then, I need to spend some time on other problems.
Changing C-State support had no effect.
I suspect this is a driver issue with the CNA you are using. Either upgrade the driver, if a newer one is available, or file a support request with the CNA vendor and VMware.
From the stack, it does look like a driver issue. But it still has to be confirmed whether it is the bnx2 or the CNA driver that is causing the PSOD. Earlier, such DLM_free panics used to happen on HP servers, and the workaround was to update the drivers.
However, just to confirm, are you using the customized ISO provided by HP?
Yes, we are using the custom HP ISO.
My hardware guy is looking for updated drivers now.
The installed drivers were already the latest.
Any progress on this? Our DL360 G6 is PSOD'ing as well. Not immediately on boot, however; it will happily run for about a week and then PSOD.
HP ISO too btw.
There are, at least now, newer bnx drivers. VMware vSphere 5: Private Cloud Computing, Server and Data Center Virtualization
You can refer to my support request 13398046111. We use jumbo frames on the Broadcom NICs for iSCSI. So the solution was to lower the MTU, turn off the offloading features, and then turn them back on again. Of course, we are using jumbo frames (MTU 9000) again.
No PSOD since then.
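In case it helps anyone repeating this, here is a rough Python sketch that only builds (does not run) the esxcli lines for the MTU down/up cycle described above. The vSwitch and VMkernel port names (vSwitch1, vmk1) and the low MTU of 1500 are placeholders, not values from the support request, and the function name is made up. The support request does not say which offload features were toggled, so that step is left as a comment line. The ordering (VMkernel port first when lowering, vSwitch first when raising) is my own assumption, chosen so the port MTU never exceeds the vSwitch MTU.

```python
def mtu_cycle_commands(vswitch, vmk, low_mtu=1500, high_mtu=9000):
    """Build (not run) the esxcli lines for an MTU down/up cycle.

    vswitch and vmk are placeholder names for a standard vSwitch and
    the VMkernel interface attached to it.
    """
    def vswitch_cmd(mtu):
        return (
            "esxcli network vswitch standard set "
            f"--vswitch-name={vswitch} --mtu={mtu}"
        )

    def vmk_cmd(mtu):
        return (
            "esxcli network ip interface set "
            f"--interface-name={vmk} --mtu={mtu}"
        )

    return [
        # Lower the VMkernel port first so it never exceeds the vSwitch MTU.
        vmk_cmd(low_mtu),
        vswitch_cmd(low_mtu),
        # The thread does not say which offload features were toggled.
        "# toggle the NIC offload features off and on here",
        # Raise the vSwitch first, then the VMkernel port.
        vswitch_cmd(high_mtu),
        vmk_cmd(high_mtu),
    ]


if __name__ == "__main__":
    for line in mtu_cycle_commands("vSwitch1", "vmk1"):
        print(line)
```

Review the printed lines before pasting them into an ESXi shell, and run them from the console rather than over a connection that depends on the interface being changed.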
Can you give me a URL for that support request? I can't find it.
Some more info from my network guy:
Device  Speed      Configured  Switch    MAC Address        Obse..  Wake on LAN Supported
Broadcom Corporation NC382i Integrated Multi Port PCI Express Gigabit Server Adapter
vmnic1  1000 Full  Negotiate   vSwitch1  00:26:55:xx:xx:xx  172...  Yes
vmnic0  1000 Full  Negotiate   vSwitch0  00:26:55:xx:xx:xx  172...  Yes
Intel Corporation 82571EB Gigabit Ethernet Controller (Copper)
vmnic5  100 Full   Negotiate   vSwitch3  00:24:81:xx:xx:xx  172...  No
vmnic4  100 Full   Negotiate   vSwitch2  00:24:81:xx:xx:xx  172...  No
vmnic3  1000 Full  Negotiate   vSwitch1  00:24:81:xx:xx:xx  172...  No
vmnic2  1000 Full  Negotiate   vSwitch0  00:24:81:xx:xx:xx  172...  No
HP SPP 2013.9.0 shows all firmware up to date.
HP VMware ESXi Release 5.5.0, Build 1331820
After reinstalling everything from scratch, we are still having issues with this machine.
vSwitch0 reboots fine with jumbo frames (MTU=9000).
vSwitch1 reboots fine with standard frames but changing to jumbo frames causes the PSOD when a reboot is triggered from vCenter.
When testing jumbo frames on vmnic1 and vmnic3 separately, the system rebooted without a hitch.