Can anyone identify the possible root cause of these purple screen errors? It appears to me that PCPU4,5,6,7 (CPU #2) may be failing, but before I start swapping hardware I would like to understand the error a little better.
Meant to add, server is the following:
Asus RS162-E4/RX4
2x E5405 Intel Xeon processors; 4 core 2.0 GHz
32 GB ECC memory
Asus ZCR raid add-on card
4x HDD running in RAID5
The array reports good in the ZCR BIOS; the system will run for nearly a week, then purple screen again.
Hi,
Not sure, but this might be similar to a problem I ran into earlier,
where my ESXi host's PCPU got locked up at reboot. Some HP servers experience a situation where the PCC (Processor Clocking Control, or Collaborative Power Control) communication between the VMware ESXi kernel (VMkernel) and the server BIOS does not function correctly.
As a result, one or more PCPUs may remain in SMM (System Management Mode) for many seconds. When the VMkernel notices a PCPU has been unavailable for an extended period of time, a purple diagnostic screen occurs.
Looking at your ESXi host's screen, this looks like the same kind of issue. Try swapping the CPUs, or else reinstall ESXi on the host.
You may dig through: VMware KB: Decoding Machine Check Exception (MCE) output after a purple screen error
But to me it looks like there's something wrong with the 2nd CPU. I would check whether the cooling stops working properly after some runtime.
Also check that the heatsink is properly installed. Finally, you could swap both CPUs and see if the PCPUx errors move to PCPU0...3.
Hello,
I did a quick debug of the MCE you got (0xf200001044100e0f) and the debug output is:
Observer: Generic while processing Generic Error during Other transaction on Generic Cache. Request Did Not Time Out.
It is highly probable that your second processor's (CPU #2) cache is failing. You can also see in the error stack that the crash always happens right after the CPU scheduler tries to allocate resources. Contact your hardware vendor and have them replace the CPU, then run some extended stress testing on the host. You can also confirm the diagnosis by stress testing it in its current state and waiting for it to fail.
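For anyone curious how that decode works: the raw value on the purple screen is an IA32_MCi_STATUS machine-check register, whose architectural bit layout (valid, uncorrected, PCC, error codes, etc.) is documented in the Intel SDM. Here is a rough Python sketch that pulls apart the value from this thread; note the full interpretation of the compound error code is model-specific, which is what the VMware KB tool handles:

```python
def decode_mci_status(status):
    """Split an IA32_MCi_STATUS value into its architectural fields
    (Intel SDM Vol. 3B, Machine-Check Architecture)."""
    return {
        "VAL":        bool(status >> 63 & 1),  # this bank's error is valid
        "OVER":       bool(status >> 62 & 1),  # an earlier error was overwritten
        "UC":         bool(status >> 61 & 1),  # uncorrected error
        "EN":         bool(status >> 60 & 1),  # error reporting was enabled
        "MISCV":      bool(status >> 59 & 1),  # IA32_MCi_MISC holds extra info
        "ADDRV":      bool(status >> 58 & 1),  # IA32_MCi_ADDR holds an address
        "PCC":        bool(status >> 57 & 1),  # processor context corrupt
        "mca_code":   status & 0xFFFF,         # architectural MCA error code
        "model_code": (status >> 16) & 0xFFFF, # model-specific error code
    }

# The value reported in this thread:
fields = decode_mci_status(0xF200001044100E0F)
print(fields)
```

For this particular value, UC and PCC both come out set (an uncorrected error that corrupted processor context), which is consistent with the VMkernel choosing to purple screen rather than continue.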