VMware Cloud Community
Brado23
Contributor

Frequent PSODs on DL380

Hi,

I have been battling a problem where an ESX server crashes with a PSOD at unpredictable intervals; the longest uptime I have achieved so far is 38 days (lol). The PSOD reports a Machine Check Exception, which I understand is an unrecoverable hardware error. Even so, I would like to know which part of the hardware is causing the issue so that I can replace it. The hardware I am using is as follows:

HP DL380 G3, 6GB RAM (6x1GB chips), dual Xeon 3.06GHz CPUs, NC7170 dual-port NIC in slot 3 (top slot), Smart Array 6402 in slot 1 (bottom slot). The system is configured with 2 logical drives: one in RAID1 for the ESX installation, one in RAID5 for VM files. Both were running off the onboard 5i controller on the same SCSI channel until last night (I know this was far from optimal, so no need to point it out unless you know it is definitely the source of my issues :) ). I now have the RAID1 set for the ESX installation running off the 5i alone on its own channel, and the RAID5 set for VMs running on its own channel on the 6402, but I haven't had the system up long enough to see if the issue still exists.

It has the latest system ROMPaq installed, and the HP Firmware Maintenance CD 7.8 was run over the hardware, with all components upgraded, before the server was built.

I originally built the system with ESX 3.0.1, where I first experienced the problem, and had the same issue after upgrading to 3.0.2 and then 3.5. As it appears to be a hardware issue, I wasn't expecting it to be corrected in any of the newer releases, but I upgraded for other reasons. I have also gone through several HP Management Agents versions, from 7.7 to the current 7.9.1. Before I understood what a machine check exception was, I tried removing the storage agents and performance agents, but this didn't fix the issue either.

I have run HP SmartStart Diagnostics over the server and no hardware errors were reported. After viewing the vmkdump output (which I will post later in this message), I figured it was a RAM error, so I ran memtest86 on the system for 8 hours, and no errors were reported.

I originally had an NC3130 NIC in the server and thought that might be the problem, because the server rebooted when I pulled one of the network cables while it was running. After replacing that card with the NC7170, the PSODs still occur (although I don't think pulling the cables causes a reboot anymore).

I have attached a copy of the vmkdump output (last 100 lines). As I said, I thought it was a RAM issue after seeing the ECC/Parity error messages, but both HP Diags and memtest86 report that the RAM is OK.

If anyone has any ideas on what it could be from the output, or can advise me how to troubleshoot this further, it would be very much appreciated. Thanks.

8 Replies
kimono
Expert

Sounds hard to diagnose; perhaps the CPU.

Warranty/Support Call? Can you take it out of the cluster and rebuild it?

/kimono/

Brado23
Contributor

I took your suggestion and started looking at it from a CPU perspective, and noticed something interesting. After looking through other vmkdumps from previous crashes, I found that the following 2 lines are always associated with cpu0 and cpu1:

0:09:01:32.433 cpu1:1092)WARNING: MCE: 313: Physical Address 0x445b9c44 generated machine check error(ECC/Parity)

0:09:01:32.433 cpu0:1090)WARNING: MCE: 313: Physical Address 0x445b9c44 generated machine check error(ECC/Parity)
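
For comparing several dumps, a small script can tally which logical CPU reported each machine check and at which physical address. This is only a sketch: the regular expression is written against the two lines quoted above, and the names `MCE_LINE`, `tally_mce` and `sample` are made up for illustration. Other ESX builds may format these warnings differently.

```python
import re
from collections import Counter

# Pattern written against the two MCE warning lines quoted in this post.
# Assumption: other dumps use the same layout.
MCE_LINE = re.compile(
    r"cpu(?P<cpu>\d+):\d+\)WARNING: MCE: \d+: "
    r"Physical Address (?P<addr>0x[0-9a-fA-F]+) "
    r"generated machine check error\((?P<kind>[^)]*)\)"
)


def tally_mce(lines):
    """Count MCE warnings per (logical cpu, physical address, error kind)."""
    counts = Counter()
    for line in lines:
        match = MCE_LINE.search(line)
        if match:
            key = (
                int(match.group("cpu")),
                int(match.group("addr"), 16),
                match.group("kind"),
            )
            counts[key] += 1
    return counts


# The two lines from the dump above, used as sample input.
sample = [
    "0:09:01:32.433 cpu1:1092)WARNING: MCE: 313: Physical Address 0x445b9c44 generated machine check error(ECC/Parity)",
    "0:09:01:32.433 cpu0:1090)WARNING: MCE: 313: Physical Address 0x445b9c44 generated machine check error(ECC/Parity)",
]

for (cpu, addr, kind), n in sorted(tally_mce(sample).items()):
    print(f"cpu{cpu} addr={addr:#x} kind={kind} count={n}")
```

Feeding it the saved output of each crash would show whether the same address keeps coming up (which would point more towards one memory location) or whether only the reporting CPUs stay constant.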

I assume cpu0 and cpu1 represent the first physical CPU with hyperthreading, and cpu2 and cpu3 represent the second physical CPU with hyperthreading. Is this correct? If so, I might try replacing the first CPU in the server and see what happens.
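
To spell out the numbering I am assuming: if hyperthread siblings get consecutive logical CPU numbers, and each of these Xeons exposes 2 logical CPUs, the mapping would look like the sketch below. Both points are assumptions that have not been confirmed in this thread, and `package_of` is a hypothetical helper, not an ESX tool.

```python
def package_of(logical_cpu, threads_per_package=2):
    """Return the physical CPU index for a logical CPU number.

    ASSUMPTION: sibling hyperthreads are numbered consecutively,
    so cpu0/cpu1 share package 0 and cpu2/cpu3 share package 1.
    """
    return logical_cpu // threads_per_package


for cpu in range(4):
    print(f"cpu{cpu} -> physical CPU {package_of(cpu)}")
```

If the numbering were interleaved instead (cpu0 and cpu2 on one package), cpu0 and cpu1 would sit on different physical CPUs, and replacing the first CPU would be a much weaker test.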

Maybe a CPU cache issue?

Thanks

jhanekom
Virtuoso

I realise it's hard to get downtime for that long a period, but I've always been told to run memtest for at least 48 hours.

I'm worried that, on a G3 machine, 6GB of RAM is a lot to cover in 8 hours with all the different tests memtest runs. Were you able to see how many full passes were performed when you tested?

You may also want to check the Integrated Management Log (hardware-based log) of the system to see if there are any clues there. This can usually be viewed from within the Insight Agents homepage (https://host:2381) by clicking on Logs.

kimono
Expert

If you have the Insight agents installed but don't run the homepage:

/opt/compaq/utils/hplog -v

/opt/compaq/utils/hpimlview

from the service console are two other ways to do it.

/kimono/
jhanekom
Virtuoso

Now that's useful! Thanks kimono. Not having used HP servers for running Linux much before, I wasn't aware of those tools. I'm definitely going to include them in my weekly health checks now. (Viewing the log through the GUI was far too cumbersome in the past.)

ewannema
Enthusiast

There is an SMP version of memtest that runs checks against the memory from the point of view of each processor. I received it from HP during a support call, so I don't know where it is publicly available. Also, I have had situations where memtest reported no problems but the diagnostic fault lights on the chassis lit up and we had the memory replaced.

http://wannemacher.us
Brado23
Contributor

8hrs was one complete pass with memtest. It was about 3 or 4% on the 2nd pass when I stopped it. I couldn't afford an outage any longer on that occasion.
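
As a rough estimate of what the suggested 48-hour run would buy: assuming every pass takes about as long as the first one did (roughly 8 hours on this box, which may not hold for later passes), the number of complete passes works out as below. `full_passes` is just an illustrative helper.

```python
def full_passes(run_hours, hours_per_pass=8.0):
    """Number of complete memtest passes that fit in run_hours.

    ASSUMPTION: each pass takes the same time as the first (about 8 hours).
    """
    return int(run_hours // hours_per_pass)


print(full_passes(8))   # the run described here: 1 complete pass
print(full_passes(48))  # the suggested 48-hour run: 6 complete passes
```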

Nothing bad is reported in the IML.

I'm really interested in that other version of memtest if someone has a download link for it. EDIT: Damn! Just checked the memtest86 site and they released a new version on the 27th Dec 2007 which supports SMP. Wish I had checked before I ran the test. Need another outage now :(

Thanks for the other responses guys. Much appreciated.

Does anyone know the answer to my cpu0/1 vs cpu2/3 question?

Daniel_Gurrola
Contributor

I have been working on a CPU Panic issue which appears to have been addressed for me in the following: http://kb.vmware.com/selfservice/viewContent.do?language=en_US&externalId=1081

Not sure whether you have done the things mentioned there or not. I just saw this post and thought this might help. It is probably too late for you, but maybe not for those who, like me, use Google to search the VMware support forums.
