VMware Cloud Community
kroerig
Contributor

Interpreting PSOD

Hello Experts,

Can someone please help me find out what's wrong with my ESXi 4.1 server?

I attached a photo.

I cannot find a vmkdump, so I will try to generate one manually.

Thx,

Klaus

6 Replies
idle-jam
Immortal

Hi,

If you refer to the following KB article, you should be able to find the root cause.

http://kb.vmware.com/kb/1005184

a_p_
Leadership

Welcome to the Community,

I'd suggest you run a hardware diagnostic on your server. If the system manufacturer does not provide such a tool, then at least run a memory check. A lot of errors are caused by defective memory. (http://kb.vmware.com/kb/831)

André

kroerig
Contributor

Hi,

I tried to decode the MCA, but I don't think I got it right. Perhaps someone could help me. I have attached the kernel dump.

It would be nice if we could do this together. This is the first time I have done this.

Thx for your help.

Some System information for you:

System: ICO (Intel S5000PSL)

CPU: Dual Intel Xeon E5410 @ 2.33 GHz

RAM: 16 GB

Running VMs: 6 (all Windows 2003 R2 32bit)

I told my colleagues to run Memtest asap.

Klaus

Troy_Clavell
Immortal

The dumps aren't always the easiest to decode. MCEs are typically hardware related, usually DIMMs or CPU. The first thing I would do is ensure all your firmware is current. Then take the VMkernel dump and send it to your hardware vendor.

I would also suggest running a vm-support dump, which will include the VMkernel dump, and sending that to your hardware vendor as well.

kroerig
Contributor

Hi,

I ran Memtest86+ for 72 hours. No errors.

Now I'm searching for a (freeware) CPU stress test utility.

Do you know a good one?

Thx,

Klaus

fakber
VMware Employee

Hi,

If you look carefully at the second line of the PSOD screen itself, it explains to a certain degree what the problem is.

Your server has experienced what is known as a Machine-Check exception (#MC). Near the bottom of the screen you should see the information from the registers of the Machine-Check Architecture of the CPU that generated the exception.

Going back to the second line, we also see that VMware ESX has decoded the "MCA Error Code" field of the status register. It says that a Bus and Interconnect error was seen. Memory tests alone may not help in identifying the problematic hardware. Use the data from the screen and provide it to your hardware vendor to review so they can take the required action to correct the hardware problem that caused this crash in the first place.
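For anyone who wants to try the decoding by hand, here is a rough Python sketch of how the status register of a machine-check bank can be split into its fields. The bit positions and the "Bus and Interconnect" encoding (000F 1PPT RRRR IILL) are taken from my reading of the machine-check chapter of the Intel Software Developer's Manual; the function name and the sample value are made up for illustration and are not the value from the photo in this thread. Treat it as a starting point, not as a replacement for the vendor's analysis.

```python
# Sketch: split an IA32_MCi_STATUS value into its architectural fields.
# SAMPLE_STATUS is a hypothetical value, NOT the one from this PSOD.

FLAG_BITS = {"VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
             "MISCV": 59, "ADDRV": 58, "PCC": 57}

PARTICIPATION = {0: "SRC (local processor originated the request)",
                 1: "RES (local processor responded to the request)",
                 2: "OBS (local processor observed the error)",
                 3: "generic"}
REQUEST = {0: "ERR (generic error)", 1: "RD (generic read)",
           2: "WR (generic write)", 3: "DRD (data read)",
           4: "DWR (data write)", 5: "IRD (instruction fetch)",
           6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
MEM_IO = {0: "M (memory access)", 1: "reserved",
          2: "IO", 3: "other transaction"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "LG (generic)"}


def decode_mci_status(status):
    """Return a dict with the flags and error-code fields of a status value."""
    decoded = {name: bool((status >> bit) & 1)
               for name, bit in FLAG_BITS.items()}
    code = status & 0xFFFF                       # bits 15:0, MCA error code
    decoded["mca_error_code"] = code
    decoded["model_specific_code"] = (status >> 16) & 0xFFFF  # bits 31:16

    # Bus and Interconnect errors: bits 15:13 clear and bit 11 set.
    # Bit 12 (the "F" bit) is ignored here.
    if (code & 0xE800) == 0x0800:
        decoded["type"] = "Bus and Interconnect"
        decoded["participation"] = PARTICIPATION[(code >> 9) & 0x3]
        decoded["timeout"] = bool((code >> 8) & 0x1)
        decoded["request"] = REQUEST.get((code >> 4) & 0xF, "unknown")
        decoded["memory_or_io"] = MEM_IO[(code >> 2) & 0x3]
        decoded["level"] = LEVEL[code & 0x3]
    else:
        decoded["type"] = "other (not decoded by this sketch)"
    return decoded


SAMPLE_STATUS = 0xB200000000000E0F  # hypothetical example value

if __name__ == "__main__":
    for field, value in decode_mci_status(SAMPLE_STATUS).items():
        print(f"{field}: {value}")
```

If the UC and PCC flags are set, the error was uncorrected and the processor context may be corrupt, which is consistent with the host halting instead of carrying on. The participation and request fields only say which side of the bus saw the problem, so they narrow things down but do not by themselves point to a specific part.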

I hope this helps.

Faisal Akber