Solved: Re: MCE error - Purple screen of Death - HP DL360 ...

sslaz · ‎08-10-2009

We have a new HP DL360 G6 running vSphere (ESX) 4.0. It is populated with two Intel Nehalem 5770 CPUs.

Twice now we have received machine check errors along with a purple screen of death, for example:

Aug 9 02:36:02 testsys vmkernel: 10:16:49:10.774 cpu8:4767)MCE: 866: MCE on cpu8 bank8: Status:0x88000040000200cf Misc:0x5dd880400001100 Addr:0x0: Valid.Misc valid.

and

Jul 29 06:27:59 testsys vmkernel: 20:15:27:34.025 cpu10:4281)MCE: 866: MCE on cpu10 bank8: Status:0x8c0000400001009f Misc:0x5dd880400005840 Addr:0x578975940: Valid.Misc valid.Addr valid.

Jul 29 06:27:59 testsys vmkernel: 20:15:27:34.025 cpu9:4288)MCE: 866: MCE on cpu9 bank8: Status:0x8c0000400001009f Misc:0x0 Addr:0x0: Valid.Misc valid.Addr valid.

Jul 29 06:27:59 testsys vmkernel: 20:15:27:34.025 cpu8:4279)MCE: 866: MCE on cpu8 bank8: Status:0x8c0000400001009f Misc:0x0 Addr:0x0: Valid.Misc valid.Addr valid.

We don't have a support contract with HP (just warranty support). They asked us to take the server down and run full diagnostics on it (ugh), as well as ensure we are at current firmware (we were). Why we have to run diagnostics on the disks for this kind of problem escapes me, though.

Well, the diagnostics are happily cranking away. I'm just glad that we did not put this system into production yet.

Questions:

Has anyone else had problems with Machine check errors with Intel's new Nehalem chips under ESX?

Does anyone know how I can map the cpu number in the vmkernel message back to a physical cpu? HP actually said that we should "swap cpus" to diagnose. In my mind this is clearly a hardware problem, and they should figure out which cpu it is themselves and replace it.

Clearly, we need some kind of support contract, since warranty support for this is somewhat inadequate. This is our first HP server (our past experience is with Sun), and so far I'm none too happy.

I guess this is off topic for VMware ESX, but if anyone has suggestions as to how much we have to pay to get a decent level of support, I'd like to know.

mcowger · ‎09-30-2009

The errors you see in the log combined with the machine check strongly suggest bad memory.

--Matt

VCP, vExpert, Unix Geek

--Matt VCDX #52 blog.cowger.us

sslaz · ‎08-10-2009

oops. 5570 cpus

gregh123 · ‎09-30-2009

Hi sslaz,

These log messages appear to be from "/var/log/vmkernel". They do not show any fatal errors. Fatal errors would end up in the crash dump, but not in /var/log/vmkernel. The logs posted show errors that are not fatal. They lack the "UC" (uncorrected) bit and also lack the "PCC" (processor context corrupt) bit.

These messages are just telling you that the hardware corrected memory errors.

Status:0x88000040000200cf says it was a "memory scrubbing error".

Status:0x8c0000400001009f says it was a "memory read error".

But, more importantly for you, you need to see what the error was when it actually crashed.
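For anyone who wants to check the decoding themselves, here is a minimal Python sketch. It assumes the Status value follows the architectural IA32_MCi_STATUS layout described in Intel's Software Developer's Manual (flag bits 63 to 57, MCA error code in bits 15:0, memory controller errors encoded as 0000 0000 1MMM CCCC). The function and table names are made up for illustration; this is not a VMware tool.

```python
# Flag bits of IA32_MCi_STATUS (architectural layout, per the Intel SDM).
FLAGS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # Misc register is valid
    "ADDRV": 58,  # Addr register is valid
    "PCC": 57,    # processor context corrupt
}

# MMM field of a memory controller error code (0000 0000 1MMM CCCC).
MEM_OPS = {
    0b000: "generic undefined request",
    0b001: "memory read error",
    0b010: "memory write error",
    0b011: "address/command error",
    0b100: "memory scrubbing error",
}


def decode_status(status):
    """Decode the flag bits and, if present, the memory controller error type."""
    flags = [name for name, bit in FLAGS.items() if (status >> bit) & 1]
    code = status & 0xFFFF  # MCA error code, bits 15:0
    result = {"flags": flags, "mca_code": code, "kind": None, "channel": None}
    # Memory controller errors have bit 7 set and bits 15:8 clear.
    if code & 0xFF80 == 0x0080:
        result["kind"] = MEM_OPS.get((code >> 4) & 0x7, "reserved")
        chan = code & 0xF
        result["channel"] = None if chan == 0xF else chan  # 1111 = not specified
    return result


for s in (0x88000040000200CF, 0x8C0000400001009F):
    d = decode_status(s)
    print(hex(s), d["flags"], d["kind"])
# 0x88000040000200cf ['VAL', 'MISCV'] memory scrubbing error
# 0x8c0000400001009f ['VAL', 'MISCV', 'ADDRV'] memory read error
```

Neither value has UC or PCC set, which matches the "Valid.Misc valid." and "Valid.Misc valid.Addr valid." text that vmkernel printed after each entry.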

fakber · ‎09-30-2009

> Does anyone know how I can map the cpu number in the vmkernel message back to a physical cpu? HP actually said that we should "swap cpus" To diagnose. In my mind this is clearly a hardware problem and they should figure out which cpu it is themselves and replace.

> Aug 9 02:36:02 testsys vmkernel: 10:16:49:10.774 cpu8:4767)MCE: 866: MCE on cpu8 bank8: Status:0x88000040000200cf Misc:0x5dd880400001100 Addr:0x0: Valid.Misc valid.

> Jul 29 06:27:59 testsys vmkernel: 20:15:27:34.025 cpu10:4281)MCE: 866: MCE on cpu10 bank8: Status:0x8c0000400001009f Misc:0x5dd880400005840 Addr:0x578975940: Valid.Misc valid.Addr valid.

> Jul 29 06:27:59 testsys vmkernel: 20:15:27:34.025 cpu9:4288)MCE: 866: MCE on cpu9 bank8: Status:0x8c0000400001009f Misc:0x0 Addr:0x0: Valid.Misc valid.Addr valid.

> Jul 29 06:27:59 testsys vmkernel: 20:15:27:34.025 cpu8:4279)MCE: 866: MCE on cpu8 bank8: Status:0x8c0000400001009f Misc:0x0 Addr:0x0: Valid.Misc valid.Addr valid.

sslaz,

You're correct in saying that it is a hardware issue. MCEs are generated by the Machine Check Architecture (MCA) of the CPU when it sees an issue. To determine which package (physical socket) is reporting the issue, start from the CPU number in the logs you included above (for example, "cpu8"). From that you need to look at how many cores are in each package and whether or not HyperThreading (HT) is enabled and in use.

Thus, if you have a 4-socket system with 4 cores per socket and you are not using HT, then CPUs 8-11 are in the third physical socket. So you can say "CPU 3" on the board is reporting the issues you're seeing.
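The arithmetic can be sketched in a few lines of Python. This assumes the logical CPU numbers are assigned contiguously socket by socket, which is a rule of thumb and not something the hypervisor guarantees; `socket_for_cpu` is a hypothetical helper name.

```python
def socket_for_cpu(cpu, cores_per_socket, threads_per_core=1):
    """Return the 0-based socket index for a logical CPU number.

    Assumes logical CPUs are numbered contiguously per socket
    (a rule of thumb, not a guarantee).
    """
    return cpu // (cores_per_socket * threads_per_core)


# 4 cores per socket, HT off: cpu8..cpu11 land in socket index 2 (the third socket).
print(socket_for_cpu(8, cores_per_socket=4))                       # 2

# 4 cores per socket, HT on (8 logical CPUs per socket): cpu8 lands in socket index 1.
print(socket_for_cpu(8, cores_per_socket=4, threads_per_core=2))   # 1
```

The returned index is 0-based, so add 1 if the board labels its sockets starting from "CPU 1".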

I hope this helps.

Faisal Akber

sslaz · ‎09-30-2009

ding ding ding! mcowger, we have a winner!! We think that it was bad memory. On the other hand, HP ended up sending us a replacement cpu as well for good measure, so we will never be sure.

By the way, someone said this would be in the third slot? Actually, this is a two-socket machine with only two cpus.

I still think HP was off base asking us to run full diags on the disks, when it was clearly not that.

sslaz · ‎10-04-2009

Oh, by the way, they said that since we have a 2 socket system, the first half of the vcpus are running on the first physical cpu, and the second half on the second.

Thus if we have vcpus 0 - 15, then 0-7 run on physical cpu0, and 8-15 run on physical cpu1.

This was a rule of thumb per the tech support.

The errors all occurred on the second physical cpu, and on the same memory bank (8). Thus the problem was probably memory, but possibly cpu1.
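To make that tally concrete, here is a small Python sketch that pulls the cpu and bank numbers out of the vmkernel lines from the first post and groups them by socket. The regular expression is written against the log format shown in this thread only, and the 8 logical CPUs per socket figure is the tech support rule of thumb above, not a verified topology.

```python
import re
from collections import Counter

# Matches the "MCE on cpuN bankM: Status:0x..." part of the lines posted in this thread.
MCE_RE = re.compile(
    r"MCE on cpu(?P<cpu>\d+) bank(?P<bank>\d+): Status:(?P<status>0x[0-9a-fA-F]+)"
)

LOG = """\
Aug 9 02:36:02 testsys vmkernel: 10:16:49:10.774 cpu8:4767)MCE: 866: MCE on cpu8 bank8: Status:0x88000040000200cf Misc:0x5dd880400001100 Addr:0x0: Valid.Misc valid.
Jul 29 06:27:59 testsys vmkernel: 20:15:27:34.025 cpu10:4281)MCE: 866: MCE on cpu10 bank8: Status:0x8c0000400001009f Misc:0x5dd880400005840 Addr:0x578975940: Valid.Misc valid.Addr valid.
Jul 29 06:27:59 testsys vmkernel: 20:15:27:34.025 cpu9:4288)MCE: 866: MCE on cpu9 bank8: Status:0x8c0000400001009f Misc:0x0 Addr:0x0: Valid.Misc valid.Addr valid.
Jul 29 06:27:59 testsys vmkernel: 20:15:27:34.025 cpu8:4279)MCE: 866: MCE on cpu8 bank8: Status:0x8c0000400001009f Misc:0x0 Addr:0x0: Valid.Misc valid.Addr valid.
"""


def tally(log_text, logical_cpus_per_socket=8):
    """Count MCE entries per (socket index, MCA bank) pair."""
    counts = Counter()
    for m in MCE_RE.finditer(log_text):
        # Rule-of-thumb mapping: contiguous numbering, 8 logical CPUs per socket.
        socket = int(m["cpu"]) // logical_cpus_per_socket
        counts[(socket, int(m["bank"]))] += 1
    return counts


print(tally(LOG))  # Counter({(1, 8): 4})
```

All four entries fall on socket index 1 (the second physical cpu) and bank 8.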

gregh123 · ‎10-05-2009

> The errors all occurred on the second physical cpu, and on the same memory bank (8). Thus the problem was probably memory, but possibly cpu1.

One should note that "bank 8" refers to a particular set of CPU registers (MSRs), and not to a particular bank of memory. The problem could've been in the memory attached to that physical CPU.
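To illustrate what a "bank" means here: each MCA bank is a group of four consecutive MSRs (CTL, STATUS, ADDR, MISC). A minimal Python sketch, assuming the architectural numbering from Intel's Software Developer's Manual where IA32_MC0_CTL sits at MSR 0x400 and each bank takes four addresses; `bank_msrs` is a name made up for this example.

```python
MC0_CTL = 0x400  # architectural address of IA32_MC0_CTL


def bank_msrs(bank):
    """Return the MSR addresses that make up one machine-check bank."""
    base = MC0_CTL + 4 * bank
    return {"CTL": base, "STATUS": base + 1, "ADDR": base + 2, "MISC": base + 3}


print({name: hex(addr) for name, addr in bank_msrs(8).items()})
# {'CTL': '0x420', 'STATUS': '0x421', 'ADDR': '0x422', 'MISC': '0x423'}
```

The Status, Addr and Misc values in the vmkernel lines are the contents of those registers for bank 8, which is why the number says nothing about which DIMM is involved.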
