Can anyone identify the possible root cause of these purple screen errors? It appears to me that PCPU4,5,6,7 (CPU #2) may be failing, but before I start swapping hardware I would like to understand the error a little better.
Meant to add, server is the following:
Asus RS162-E4/RX4
2x E5405 Intel Xeon processors; 4 core 2.0 GHz
32 GB ECC memory
Asus ZCR raid add-on card
4x HDD running in RAID5
The array reports good in the ZCR BIOS; the system will run for nearly a week, then purple screen again.
Hi,
Not sure, but this might be similar to a problem I ran into earlier,
where my ESXi host's PCPU got locked up at reboot. Some HP servers experience a situation where the PCC (Processor Clocking Control, or Collaborative Power Control) communication between the VMware ESXi kernel (VMkernel) and the server BIOS does not function correctly.
As a result, one or more PCPUs may remain in SMM (System Management Mode) for many seconds. When the VMkernel notices a PCPU has been unavailable for an extended period of time, a purple diagnostic screen occurs.
Looking at your ESXi host's screen, this looks like the same kind of issue. Try swapping the CPUs, or else reinstall ESXi on the host.
You may dig through: VMware KB: Decoding Machine Check Exception (MCE) output after a purple screen error
But to me it looks like there's something wrong with the 2nd CPU. I would check whether the cooling stops working properly after some runtime.
Also check that the heatsink is properly installed. Finally, you could swap both CPUs and see if the PCPUx errors move to PCPU0...3.
Hello,
I did a quick debug of the MCE you got (0xf200001044100e0f) and the debug output is:
Observer: Generic while processing Generic Error during Other transaction on Generic Cache. Request Did Not Time Out.
It is highly probable that your second processor's (CPU #2) cache is failing. You can also see in the error stack that the crash always happens right after the CPU scheduler tries to allocate resources. Contact your hardware vendor and have them replace the CPU, then run some extended stress testing on the host. You can also confirm the diagnosis by stress testing it in its current state and waiting for it to fail.
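For anyone curious how that decode works: the raw value on the purple screen is an IA32_MCi_STATUS machine-check register, whose architectural bit layout (valid, uncorrected, PCC, error codes, etc.) is documented in the Intel SDM. Here is a rough Python sketch that pulls apart the value from this thread; note the full interpretation of the compound error code is model-specific, which is what the VMware KB tool handles:

```python
def decode_mci_status(status):
    """Split an IA32_MCi_STATUS value into its architectural fields
    (Intel SDM Vol. 3B, Machine-Check Architecture)."""
    return {
        "VAL":        bool(status >> 63 & 1),  # this bank's error is valid
        "OVER":       bool(status >> 62 & 1),  # an earlier error was overwritten
        "UC":         bool(status >> 61 & 1),  # uncorrected error
        "EN":         bool(status >> 60 & 1),  # error reporting was enabled
        "MISCV":      bool(status >> 59 & 1),  # IA32_MCi_MISC holds extra info
        "ADDRV":      bool(status >> 58 & 1),  # IA32_MCi_ADDR holds an address
        "PCC":        bool(status >> 57 & 1),  # processor context corrupt
        "mca_code":   status & 0xFFFF,         # architectural MCA error code
        "model_code": (status >> 16) & 0xFFFF, # model-specific error code
    }

# The value reported in this thread:
fields = decode_mci_status(0xF200001044100E0F)
print(fields)
```

For this particular value, UC and PCC both come out set (an uncorrected error that corrupted processor context), which is consistent with the VMkernel choosing to purple screen rather than continue.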