Contributor

PSOD - ESX 4.1 on HP ProLiant DL385 G7

I have a very strange problem on seven new HP ProLiant DL385 G7 servers (latest firmware and updates) with ESX 4.1 installed. At random times and without warning, the system crashes with a PSOD. The first line reads:

- Uncorrected ECC error in L3 Cache LRU on CPU 0 cache index 0x25b PCPU0 in world 4120:idle

I'm afraid the hardware is at fault ...

Thanks

PSOD is attached to this thread
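
With seven hosts crashing at random, it may help to pull the fields out of each PSOD's first line and compare them across hosts. Below is a minimal Python sketch for that. The regular expression is modelled only on the single line quoted above, so it is an assumption that other PSODs follow the same layout, and the function name `parse_psod_line` is made up for illustration.

```python
import re
from typing import Optional

# Pattern derived from the one PSOD line quoted in this post.
# Assumption: other machine-check lines use the same word order.
PSOD_LINE = re.compile(
    r"(?P<kind>\w+) ECC error in (?P<unit>.+?) "
    r"on CPU (?P<cpu>\d+) cache index (?P<index>0x[0-9a-fA-F]+) "
    r"PCPU(?P<pcpu>\d+) in world (?P<world_id>\d+):(?P<world_name>\S+)"
)


def parse_psod_line(line: str) -> Optional[dict]:
    """Return the fields of a PSOD first line, or None if it does not match."""
    match = PSOD_LINE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields so they can be compared across hosts.
    fields["cpu"] = int(fields["cpu"])
    fields["pcpu"] = int(fields["pcpu"])
    fields["world_id"] = int(fields["world_id"])
    fields["index"] = int(fields["index"], 16)
    return fields


if __name__ == "__main__":
    sample = ("- Uncorrected ECC error in L3 Cache LRU on CPU 0 "
              "cache index 0x25b PCPU0 in world 4120:idle")
    print(parse_psod_line(sample))
```

If the same cache unit shows up on all hosts while the CPU and cache index vary, that would point away from a single bad part, though this sketch only extracts the fields and draws no conclusion by itself.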

78 Replies
Immortal

The patch I referenced is specifically for the AMD Opteron 6100 series processors, and I used the BL465c G7. Check the HCL for yourself: http://vmware.com/go/hcl

-- David -- VMware Communities Moderator
Contributor

That seems to mention the Sept BIOS. I don't see a newer one. I have had that one loaded for months, and unfortunately it didn't fix the problem.

Contributor

That note appears to be for 4.0, not 4.1. In any case, all available updates are applied.

Also, it seems old enough that VMware support would have mentioned it during my support call (I would hope so, anyway).

This is ESXi 4.1 installable.

Contributor

Hmm, I do see it for the DL385 G7 (not my blade), but it doesn't seem to make sense, and the date in the release seems old...?

Immortal

It applies to both 4.0 and 4.1. I would run the update scan to see whether it is already installed or still applies. It can't hurt to check.

-- David -- VMware Communities Moderator
Immortal

The HCL entries for the servers which use the 61xx CPUs do reference a required patch for ESX/ESXi 4.0 Update 1. This is the first ESX patch level which supported the 61xx CPUs, which is why the HCL says it is required.

This patch predates the discovery of this issue and cannot be installed on ESX/ESXi 4.1 (which already has the support).

If you encounter the specific issue described in http://kb.vmware.com/kb/1030509 please file a support request with VMware Technical Support and with your server hardware vendor. VMware recommends that you contact your server vendor directly for more information on how to proceed.

Immortal

The HCL should be changed to remove the reference to 4.1 if that is the case.

Edit: It looks like the HCL has been changed.

-- David -- VMware Communities Moderator
Immortal

The HCL entry for the DL385 G7 still shows the reference to the patch. It needs to be updated.

-- David -- VMware Communities Moderator
Contributor

Thanks. I have opened an SR, spoken to tech support and our BCS primary contact, and sent an email to our HP Enterprise support rep, who has forwarded it to the appropriate expert.

I will see where this goes.

Thanks,

Ron

Enthusiast

I've had this issue open with HP and VMware for quite some time.

HP just posted two important items!

#1 - they have a BIOS fix for the G7 series.

#2 - the release notes name the ESX update levels that resolve the logging issue: VMware ESX 4.0 Update 3 (Hooray for ESX Classic!!) and also ESX 4.1 U1.

We are hoping that this resolves the frequent crashes we have seen. Please read for details:

http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?lang=en&cc=us&prodTypeId=37...

Problems Fixed:

Resolved an issue in which uncorrectable memory errors (or other fatal system errors) will not be logged to the Integrated Management Log (IML) when using some revisions of VMware ESX Server. These errors will result in a fatal error (Purple Screen of Death - PsoD) under VMware ESX, but there will not be any indication of the error type (including no indication of an uncorrectable memory error or what DIMM has failed). The VMware ESX Server issue which causes this is addressed in VMware ESX 4.1 U1 and VMware ESX 4.0 U3.

Resolved an extremely intermittent issue that can result in uncorrectable memory errors being reported. This will result in a system reset and an error logged to the Integrated Management Log (IML). If using certain revisions of VMware ESX Server, this issue will result in a fatal error (Purple Screen of Death - PsoD). This is a result of a processor issue and is NOT an issue with the DIMM or the system board hardware. The processor issue is completely resolved by the work-around included in this revision of the System ROM. The issue does not occur if the Minimum Processor Idle State option in the ROM-Based Setup Utility (RBSU) is configured for No C-states.

Resolved an issue where the system will not properly handle an Online Spare memory failover event. HP recommends that customers who configure the Advanced Memory Protection option for Online Spare Mode upgrade to this system ROM version to ensure proper operation.

Contributor

This is great news (even though I wish HP or VMware had mentioned it as part of my current open issues, as someone there must have been aware of this).

I see the ESXi 4.1 updates also came out last night, so they are ready to go.

I will be looking to apply this (given change control approvals) as soon as possible.

Thanks very much for your post!

Ron

Enthusiast

I may as well post this advisory from HP here too, as it is similar enough.

Another BIOS update is required.

  • Getting the BIOS updated using ESXi is quite a bother. If you have trouble getting these updated, perhaps you should mention this to both HP and VMware at your next opportunity.

  • Doing the same through ESX classic is straightforward and fast: just run the Linux variant of the BIOS ROM flash.

Advisory Link:

http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c02259129&lang=en&cc=us&taskI...

From the Advisory:

Advisory: (Revision) ProLiant G7-Series Servers - Some ProLiant G7-Series Servers Using the AMD Opteron 6000-Series Processors May Log an Additional Critical Error Message in the Integrated Management Log (IML) When Sourcing an Uncorrectable Memory Event

Some ProLiant AMD G7-series servers using AMD Opteron 6000-series processors may display an "Internal Chipset Error Detected" message during system boot and log one or more "An Unrecoverable System Error has occurred" entries into the Integrated Management Log (IML) when an Uncorrectable Memory Error event occurs.

[much deleted, please review the link above]

RESOLUTION

The Systems ROM for the affected AMD Opteron-series servers have been updated to prevent the invalid "Internal Chipset Error detected" message from being displayed on the screen and also to prevent the erroneous IML entry from being logged.

.... upgrade to System ROM Version 2011.01.29 (or later):

[downloads are available from HP by following the link above.]
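
To see which hosts still fall below the 2011.01.29 System ROM named in the advisory, a small Python check along these lines might help. It is only a sketch: it assumes the ROM date reaches you either in the advisory's `YYYY.MM.DD` form or as `MM/DD/YYYY` with an optional ROM family prefix, and the host names and the `A00` prefix in the example are hypothetical placeholders.

```python
from datetime import date, datetime

# Minimum System ROM version named in the advisory (2011.01.29).
MIN_ROM_DATE = date(2011, 1, 29)


def parse_rom_date(text: str) -> date:
    """Parse a ROM date such as '2011.01.29' or 'A00 01/29/2011'.

    Assumption: the date is the last whitespace-separated token.
    """
    parts = text.split()
    if not parts:
        raise ValueError("empty ROM version string")
    token = parts[-1]
    for fmt in ("%Y.%m.%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(token, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised ROM date: {text!r}")


def rom_is_current(text: str, minimum: date = MIN_ROM_DATE) -> bool:
    """True if the ROM date is on or after the minimum date."""
    return parse_rom_date(text) >= minimum


if __name__ == "__main__":
    # Hypothetical inventory; replace with values read from your own hosts.
    inventory = {"host-01": "A00 09/30/2010", "host-02": "2011.01.29"}
    for host, rom in inventory.items():
        status = "ok" if rom_is_current(rom) else "needs flash"
        print(f"{host}: {rom} -> {status}")
```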

Contributor

We are using ESXi 4.1 U1, and with the USB key BIOS update it's quite easy.

Format a USB stick using the USBKey executable that is part of the download.

Plug the USB stick into your workstation.

Open the iLO Integrated Remote Console and tick the checkbox that connects the USB key to the server (it has a little USB stick icon).

Restart the server and it will boot into the BIOS flash. Proceed as normal to complete the flash.

The downside is that it takes quite a while to update a lot of hosts.

Pretty easy.

Ron

Enthusiast

I'm glad that's working well for you.

But consider a larger environment (think over 1000 ESXi servers). Now what? Log into iLO 1000 times? I'd rather not.

With ESX classic, we can script the Linux-based online BIOS flash to go out to each server, check a few things, apply the flash, and then do a rolling reboot through the clusters to apply it. Compared to logging into iLO, which is a great tool for what it is, this is much easier.
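
To make the rolling part concrete, here is a minimal Python sketch of the ordering logic only. It is not HP's or VMware's tooling: `precheck` and `flash_and_reboot` are hypothetical callables standing in for whatever actually checks and flashes a host, and the cluster and host names are invented. The idea is to group hosts into waves so that no wave touches more than one host per cluster.

```python
from itertools import zip_longest
from typing import Callable, Dict, List


def rolling_waves(clusters: Dict[str, List[str]]) -> List[List[str]]:
    """Group hosts into waves with at most one host per cluster in each wave."""
    waves = []
    for wave in zip_longest(*clusters.values()):
        # zip_longest pads shorter clusters with None; drop the padding.
        waves.append([host for host in wave if host is not None])
    return waves


def run_rolling_flash(
    clusters: Dict[str, List[str]],
    precheck: Callable[[str], bool],
    flash_and_reboot: Callable[[str], None],
) -> List[str]:
    """Flash hosts wave by wave and return the hosts skipped by the precheck.

    Hosts inside a wave are handled one after another here; a wave is
    simply the largest group that could safely be done at the same time.
    """
    skipped = []
    for wave in rolling_waves(clusters):
        for host in wave:
            if not precheck(host):
                skipped.append(host)
                continue
            flash_and_reboot(host)
    return skipped


if __name__ == "__main__":
    # Hypothetical clusters and no-op actions, for illustration only.
    demo = {"cluster-a": ["a1", "a2"], "cluster-b": ["b1"]}
    print(rolling_waves(demo))
    run_rolling_flash(demo, lambda h: True, lambda h: print("flashing", h))
```

A real run would also need to put each host into maintenance mode and wait for it to come back before starting the next wave, which this sketch leaves to the callables.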

I'm sorry to see ESX Classic go because VMware is really tying my hands.

I'm looking for a good replacement solution for applying critical firmware updates like the ones listed in this thread. I'm open to suggestions!


Thanks.

Contributor

I thought that's what you might be getting at.

Either vMA or vCLI is what you would have to look at.

vMA is pretty much the replacement for the console.

We don't have 1000 hosts, so I haven't taken the time yet to investigate further.

Let me know if you find a way to use these tools to do this.

Another option is PowerCLI, but I don't think that will do it.

Regards,

Ron

Enthusiast

I don't see how either of those would work for a firmware flash.  I'm prepared to be enlightened if someone has a good idea.

Contributor

I have a customer with four DL360 G7 servers that have PSOD issues. I have already applied the firmware fix, which seems to help, but I still get random Machine Check Errors. Power save mode is on; could that be the issue? It only seems to happen under very heavy use.

Server Name  

Product Name   ProLiant DL360 G7

UUID     34353935-3239-584D-5131-313630575A54

Server Serial Number     MXQ1160WZT

Product ID           595492-002

System ROM      P68 05/05/2011

Backup System ROM      08/16/2010

Integrated Remote Console        .NET    Java

License Type      iLO 3 Standard

iLO Firmware Version     1.28 Jan 13 2012

IP Address          192.168.0.128

iLO Hostname    DL360G7ILO.

Status

System Health   [OK]  OK

Server Power    [ON]  ON

UID Indicator     [UID OFF]  UID OFF

TPM Status         Not Present

iLO Date/Time   Tue Aug 21 14:12:59 2012

Internal Parts:

493802-001         CPQ PRL DL360G6/DL360G7 PCI-E RISER BOARD

397740-001         HP PCI-E 2-PORT FC-4GB HBA   

602512-001         HP PROLIANT DL360G7 I/O SYSTEM BOARD  Rev 0a sn YJ08MQ5977

501536-001         CPQ 8GB 1X8GB PC3-10600R 2RX4 ECC DIMM (qty 14)

507672-001         HP PROLIANT DL360G6 CPU HEATSINK

594883-001         2.80-GHz Intel® Xeon® processor X5660 (qty 2)

Enthusiast

>I have a customer with 4 DL360G7 that have PSOD issues.

Hi, this thread is about the DL385 G7. Your box is significantly different, as it uses Intel instead of AMD. Please start a different thread, as the information in this thread is not relevant to your server.

Thanks.

Contributor

Earlier I saw that someone had asked whether an issue like this had been reported on other platforms. I will post a new thread once I have completed some troubleshooting of my own.
