Hey guys, I have four identically configured BL460c G7s (ESX 4.1 Build 348481), and recently one of them purple-screened on a bnx2 Broadcom issue. I brought the host back up thinking it was a one-off, but the following afternoon a couple of the machines lost their network connection. Not all of the machines on the host, but most, including a Windows and a Unix machine.
I’m thinking the card might be going bad, because if it were a driver issue, wouldn't I be having the problem on all four blades?
Does anyone know how to test the card? Here’s the ethtool output for the bnx2 cards:
… ethtool -i vmnic0
driver: bnx2
version: 2.0.7d-3vmw
firmware-version: 5.2.3
bus-info: 0000:06:00.0
… ethtool -i vmnic1
driver: bnx2
version: 2.0.7d-3vmw
firmware-version: 5.2.3
bus-info: 0000:06:00.1
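Since the blades are meant to be identically configured, one quick sanity check is to compare the `ethtool -i` fields across ports and hosts. The following is only a minimal sketch: the helper names `parse_ethtool_i` and `mismatches` are made up for illustration, and it assumes the output has been saved as plain text in the `key: value` layout shown above. Output from the other three blades could be added to the same dictionary under labels of your choosing.

```python
def parse_ethtool_i(text):
    """Parse `ethtool -i` style output into a dict of field -> value."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so "bus-info: 0000:06:00.0" stays intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def mismatches(reports, fields=("driver", "version", "firmware-version")):
    """Return {field: {label: value}} for every field that differs across reports."""
    diffs = {}
    for field in fields:
        values = {label: parse_ethtool_i(text).get(field)
                  for label, text in reports.items()}
        if len(set(values.values())) > 1:
            diffs[field] = values
    return diffs


# The two outputs posted above, labeled by vmnic name.
reports = {
    "vmnic0": "driver: bnx2\nversion: 2.0.7d-3vmw\nfirmware-version: 5.2.3\nbus-info: 0000:06:00.0",
    "vmnic1": "driver: bnx2\nversion: 2.0.7d-3vmw\nfirmware-version: 5.2.3\nbus-info: 0000:06:00.1",
}
print(mismatches(reports))  # an empty dict means driver and firmware match
```

Matching versions would not rule out a driver bug, but a mismatch between blades would be worth raising with support.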
dump below:
345:20:05:44.376 cpu22:4118)NMP: nmpCompleteRetryForPath: Retry world recovered device "naa.600508b1001cd1a3cb7fcd11a57c0061"
345:20:06:11.426 cpu16:4112)Backtrace for current CPU #16, worldID=4112, ebp=0x417f80087558
345:20:06:11.427 cpu16:4112)0x417f80087558:[0x41802c455685]PanicLogBacktrace@vmkernel:nover+0x18 stack: 0x4100021ac080, 0x417f8
345:20:06:11.427 cpu16:4112)0x417f80087698:[0x41802c4558c4]PanicvPanicInt@vmkernel:nover+0x1ab stack: 0x3000000010, 0x417f80087
345:20:06:11.428 cpu16:4112)0x417f80087778:[0x41802c455d46]Panic_ExceptionMsg@vmkernel:nover+0xa5 stack: 0x41802c947f94, 0x80,
345:20:06:11.428 cpu16:4112)0x417f80087878:[0x41802c455dcc]Panic_Exception@vmkernel:nover+0x83 stack: 0x417f80087ab8, 0x41802c9
345:20:06:11.429 cpu16:4112)0x417f800878c8:[0x41802c42dd81]IDTReturnPrepare@vmkernel:nover+0x254 stack: 0x417f80087af8, 0x41802
345:20:06:11.429 cpu16:4112)0x417f800878d8:[0x41802c4ded47]gate_entry@vmkernel:nover+0x46 stack: 0x4018, 0x4018, 0x41000cc07ed0
345:20:06:11.430 cpu16:4112)0x417f80087af8:[0x41802c947f94]bnx2_poll_work@esx:nover+0x11f stack: 0x417f80087b48, 0x402c4e9321,
345:20:06:11.430 cpu16:4112)0x417f80087b48:[0x41802c949490]bnx2_poll@esx:nover+0x143 stack: 0x417f80087c64, 0x41000cc075e8, 0x4
345:20:06:11.431 cpu16:4112)0x417f80087bc8:[0x41802c85319a]napi_poll@esx:nover+0x10d stack: 0x417fecca9f78, 0x41000cc18810, 0x4
345:20:06:11.431 cpu16:4112)0x417f80087c98:[0x41802c4d77eb]WorldletBHHandler@vmkernel:nover+0x442 stack: 0x417fecbc84a0, 0x0, 0
345:20:06:11.432 cpu16:4112)0x417f80087cf8:[0x41802c4063b6]BHCallHandlers@vmkernel:nover+0xc5 stack: 0x100410002408000, 0x13767
345:20:06:11.432 cpu16:4112)0x417f80087d38:[0x41802c4066b0]BH_Check@vmkernel:nover+0xcf stack: 0x417f80087de8, 0x1000000002, 0x
345:20:06:11.433 cpu16:4112)0x417f80087e48:[0x41802c5cdee5]CpuSchedIdleLoopInt@vmkernel:nover+0x6c stack: 0x417f80087e88, 0x418
345:20:06:11.433 cpu16:4112)0x417f80087e58:[0x41802c5d3fce]CpuSched_IdleLoop@vmkernel:nover+0x15 stack: 0x10, 0x4, 0x10, 0x4, 0
345:20:06:11.434 cpu16:4112)0x417f80087e88:[0x41802c432c57]Init_SlaveIdle@vmkernel:nover+0x11e stack: 0x0, 0x200000000, 0x0, 0x
345:20:06:11.434 cpu16:4112)0x417f80087fe8:[0x41802c6a5668]SMPSlaveIdle@vmkernel:nover+0x45f stack: 0x0, 0x0, 0x0, 0x0, 0x0
345:20:06:11.435 cpu16:4112)VMware ESX 4.1.0 [Releasebuild-348481 X86_64]
Hi,
Could be a memory problem.
Thanks
Sa
well I guess anything could be a memory problem lol - what makes you think so?
I ran the full HP diagnostics off the SmartStart CD and everything passed. Now I'm running Memtest86+ and so far so good: 53% through pass 1. It takes a while with 64GB.
I'm thinking maybe I should put the host in a separate cluster, move a few test machines there, and do some network load testing?
Can you engage VMware support to decode the core dump properly? The log file is only part of the story. It points toward hardware, which can include firmware and BIOS, or toward the device driver (BH = bottom half handler, which is part of the device driver). The real decode will show what world ID 4112 actually was. If it was idle, that indicates hardware. If it was a function in the kernel or the device driver, that could be a bug. Don't assume that because it only happened to one host it can't be a driver issue. It could be a bug in the driver triggered by some set of events that only this server, running a specific set of VMs, experienced. One thing to note: if you leave the host completely idle over the weekend, does it PSOD again, or does it require a load to crash? Do all the PSODs have the same kernel stack trace, or does it move around? That is another way to determine whether it is software or hardware. It could also be something the NICs are talking to, such as a port on the switch, a bad cable, or a setting on the network switch. It's too early to tell from this info alone.
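On the question of whether every PSOD shows the same kernel stack trace, a small script can reduce each backtrace to its function names so that two dumps are easy to compare. This is a rough sketch, not a substitute for a proper decode by VMware support. It assumes the frame layout seen in the log posted above (`[address]function@module:nover+offset`), and the list of frames to skip (the `Panic*` functions, `IDTReturnPrepare`, `gate_entry`) is my own guess at what belongs to the panic path rather than to the interrupted call chain.

```python
import re

# Matches frames such as:
#   ...)0x417f80087af8:[0x41802c947f94]bnx2_poll_work@esx:nover+0x11f stack: ...
FRAME_RE = re.compile(r"\[0x[0-9a-fA-F]+\](\w+)@(\w+):\w+\+0x[0-9a-fA-F]+")

# Assumption: these prefixes mark panic/exception-entry frames, not the code that was running.
SKIP_PREFIXES = ("Panic", "IDTReturnPrepare", "gate_entry")


def frames(log_text):
    """Return (function, module) pairs in the order they appear in the log."""
    return FRAME_RE.findall(log_text)


def signature(log_text):
    """Function names of the backtrace with the assumed panic-path frames dropped."""
    return tuple(func for func, _module in frames(log_text)
                 if not func.startswith(SKIP_PREFIXES))


def same_crash(dump_a, dump_b):
    """True when two dumps reduce to the same sequence of function names."""
    return signature(dump_a) == signature(dump_b)


# A few lines taken from the dump posted above.
sample = """\
345:20:06:11.429 cpu16:4112)0x417f800878d8:[0x41802c4ded47]gate_entry@vmkernel:nover+0x46 stack: 0x4018, 0x4018, 0x41000cc07ed0
345:20:06:11.430 cpu16:4112)0x417f80087af8:[0x41802c947f94]bnx2_poll_work@esx:nover+0x11f stack: 0x417f80087b48, 0x402c4e9321,
345:20:06:11.430 cpu16:4112)0x417f80087b48:[0x41802c949490]bnx2_poll@esx:nover+0x143 stack: 0x417f80087c64, 0x41000cc075e8, 0x4
345:20:06:11.431 cpu16:4112)0x417f80087bc8:[0x41802c85319a]napi_poll@esx:nover+0x10d stack: 0x417fecca9f78, 0x41000cc18810, 0x4
"""
print(signature(sample))  # ('bnx2_poll_work', 'bnx2_poll', 'napi_poll')
```

If a later PSOD reduces to a different sequence, `same_crash` returns False, which is the "moving around" case described above.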
I recommend that all my customers who get PSODs open support cases with both VMware and the hardware vendor. There may be things that only support can tell you, such as a firmware or BIOS bug or a known issue with a device driver, or perhaps there really is bad hardware that isn't showing up in the diagnostics you have.
In my experience ESX is one of the most sensitive hardware diagnostic tools out there 😉 It is quite sensitive to issues that even true hardware diagnostic tools or other operating systems can't detect. That makes it harder, but the sooner you can get the proper support engaged, the better your chance of getting it resolved. Make sure to take screenshots, or even a picture with your camera, of the actual purple screen of death; it has info that you sometimes can't get from the logs. Also make sure to record them all, not just one, since each can have detailed information that is not obvious to people who don't look at PSODs every day like some of us do 🙂
Hanna, fantastic reply, thanks for that. I did open a case with VMware, and I have a screenshot of the PSOD! I took a shot at decoding the MCE, but didn't have much luck there. I haven’t opened a case with HP yet (but I will), and I’ll go back and see if VMware can provide me with the full decode and the idle/operation info. That's valuable insight.
You are welcome. I am glad to help when I have free time. The screenshot clearly shows the CPU was in the idle loop at the time of the PSOD. This means the CPU was not executing any code at the time, which indicates hardware (including firmware and BIOS). The device driver is software that would have to be executing, so it may not be the device driver, but it is still hard to tell at this point. Definitely push HP to look at firmware/BIOS as well, not just a blatant hardware error; it might be subtle and may not show up right away.