Erik67
Contributor

ESXi crashes with PSOD, what hardware to replace

I am running four nearly identical servers with Intel S3210SHLC motherboards and quad-core Xeons. After six weeks of service, one of them suddenly crashed with a PSOD. It was the newest of the four. After a reset, it ran for another week and crashed again. I replaced the memory in the server, but it crashed again. Then I swapped the whole motherboard, including the CPU, with the one from a test server I use myself, and the problem followed the motherboard/CPU. The motherboards have identical BIOS versions. The servers use local storage (AHCI) and boot from USB sticks. The attached file is a picture of the third PSOD from the customer site.

I know that the hardware is not on the HCL, but three of the four servers run flawlessly, and I have narrowed the problem down to either the motherboard or the CPU.

My question is: Do I need to replace the motherboard or the CPU? Both are under warranty.

Erik

6 Replies
Lightbulb
Virtuoso

Go with replacing the CPU first. If the problem persists, swap out the MB (it is possible there is a socket issue).

weinstein5
Immortal

I would start with the CPU. It is the most likely culprit, but if you want to be safe, replace them both.

If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful

Erik67
Contributor

Thank you for your suggestion. What information on the PSOD leads you to believe that the CPU is at fault? In my experience, CPUs very seldom fail. It is usually the MB.

Attached is the latest PSOD. This one has a different error message. Could it be related to the USB controller on the MB? This happened during boot, and it booted fine after I moved the USB stick to a different USB port (I had been using an internal port on the MB). I have to load it with some VMs to see if it will stay up.

Erik

Lightbulb
Virtuoso

From the first PSOD, it was the CPU errors that made me think CPU. If they are both under warranty, swap out both, then test to determine which component it is. It is bound to be one of the two :)

In my experience I have had more MB issues than CPU issues (and memory more often than both of those, but you already tried that), but I have caught a bad CPU now and again.

Good luck.

kooltechies
Expert

Hi,

This PSOD is entirely different from the first one. The first one looks more like a CPU issue, as you can see the errors come through vcpu0, which means the userworld, i.e. the VM running on vcpu0. The second PSOD is in a different userworld and was caused by the sfcbd daemon, which is the CIM agent running on the ESX box; CIM helps in getting the hardware details to the VI client.

I would also suggest going through the normal CPU change first, but you can do it this way: replace this machine's CPU with a new one, then run a CPU burn utility on the CPU taken out of the first machine to check whether it really has any issues.
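
As a rough illustration of what a burn-style check does, here is a minimal Python sketch. It is not a replacement for a dedicated CPU burn utility, and the function names `burn` and `run_check` are made up for this example. It runs the same deterministic hashing workload on every core and compares the results; on a healthy CPU they should all match, while a faulty one may produce a mismatch or simply crash under the load.

```python
import hashlib
import multiprocessing as mp
from typing import Optional


def burn(rounds: int) -> str:
    """Hash a fixed seed repeatedly; the result is deterministic on healthy hardware."""
    digest = b"cpu-burn-seed"  # arbitrary seed chosen for this sketch
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


def run_check(rounds: int = 200_000, workers: Optional[int] = None) -> bool:
    """Run the same workload in one process per core and report whether all results agree."""
    workers = workers or mp.cpu_count()
    with mp.Pool(workers) as pool:
        results = pool.map(burn, [rounds] * workers)
    return len(set(results)) == 1


if __name__ == "__main__":
    print("all cores agree:", run_check())
```

To load the CPU for longer, increase `rounds` or call `run_check` in a loop; a real burn-in would normally run for hours.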

Thanks,

kooltechies(samir)

P.S.: Please consider awarding points if you think it's helpful.

Blog : http://thinkingloudoncloud.com || Twitter : @kooltechies || P.S : If you think that the answer is correct/helpful please consider rewarding points.
Erik67
Contributor

The first PSOD was from the customer server and happened during normal operation. The second happened during the boot process, after the MB/CPU had been placed in my lab server. I then moved the USB stick to a different USB port, started all 14 VMs available on the server, and waited. After about two hours it crashed again with a PSOD similar to the first.

Due to the recommendations that the CPU be replaced, I borrowed a Celeron E1500 CPU (dual-core, 2.2 GHz) from a new computer and placed it in the server. Because the Celeron lacks VT support, I cannot run the three 64-bit VMs, but the 11 other VMs should be plenty to stress the server. After running for about an hour, it looks like every VM has started and I can log on to them. If this setup runs for 24 hours, the original CPU will be returned to Intel.

Erik
