I have a very strange problem on seven new HP ProLiant DL385 G7 servers (latest firmware and upgrades) with ESX 4.1 installed: at random times and without warning, the system crashes with a PSOD. The first line reads:
- Uncorrected ECC error in L3 Cache LRU on CPU 0 cache index 0x25b PCPU0 in world 4120:idle
I'm afraid the hardware is at fault ...
The PSOD is attached to this thread.
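
For anyone collecting these across several hosts, here is a minimal sketch of pulling the fields out of that first PSOD line so crashes can be tallied per host and CPU. The pattern is modeled only on the single line quoted above, so other ESX builds or other error types may not match. `parse_psod_line` and `tally_by_host_cpu` are illustrative names, not part of any VMware tool.

```python
import re
from collections import Counter

# Pattern modeled only on the one PSOD line quoted in this thread (assumption):
# other ESX builds may word the message differently.
PSOD_RE = re.compile(
    r"Uncorrected ECC error in (?P<unit>.+?) on CPU (?P<cpu>\d+) "
    r"cache index (?P<index>0x[0-9a-fA-F]+) PCPU(?P<pcpu>\d+) "
    r"in world (?P<world>\d+):(?P<name>\S+)"
)


def parse_psod_line(line):
    """Return the fields of a PSOD first line as a dict, or None if no match."""
    match = PSOD_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("cpu", "pcpu", "world"):
        fields[key] = int(fields[key])
    fields["index"] = int(fields["index"], 16)  # cache index is printed in hex
    return fields


def tally_by_host_cpu(records):
    """Count crashes per (host, cpu); records are (host, psod_line) pairs."""
    counts = Counter()
    for host, line in records:
        parsed = parse_psod_line(line)
        if parsed is not None:
            counts[(host, parsed["cpu"])] += 1
    return counts
```

If the same host and CPU keep turning up, that points at one part; if the crashes are spread evenly across all seven servers, a single bad part is less likely.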
That note appears to be for 4.0, not 4.1. In any case, all available updates are applied.
It also seems old enough that my VMware support contact would have mentioned it during the support call (I would hope so, anyway).
This is ESXi 4.1 installable.
The HCL entries for the servers that use the 61xx CPUs do reference a required patch for ESX/ESXi 4.0 Update 1. This is the first ESX patch level that supported the 61xx CPUs, which is why the HCL says it is required.
This patch predates the discovery of this issue and cannot be installed on ESX/ESXi 4.1 (which already has the support).
If you encounter the specific issue described in http://kb.vmware.com/kb/1030509, please file a support request with VMware Technical Support and with your server hardware vendor. VMware recommends that you contact your server vendor directly for more information on how to proceed.
Thanks, I have opened an SR, spoken to tech support and our BCS primary contact, and sent an email to our HP Enterprise support rep, who has forwarded it to the appropriate expert.
I will see where this goes.
I've had this issue open with HP and VMware for quite some time.
HP just posted two important pieces of information!
#1 - They have a BIOS fix for the G7 series.
#2 - The release notes indicate patches for ESX which resolve the issue: VMware ESX 4.0 Update 3 (Hooray for ESX Classic!!) and also ESXi 4.0 U1.
We are hoping that this resolves the frequent crash issues we have seen. Please read for details:
Resolved an issue in which uncorrectable memory errors (or other fatal system errors) will not be logged to the Integrated Management Log (IML) when using some revisions of VMware ESX Server. These errors will result in a fatal error (Purple Screen of Death - PsoD) under VMware ESX, but there will not be any indication of the error type (including no indication of an uncorrectable memory error or what DIMM has failed). The VMware ESX Server issue which causes this is addressed in VMware ESX 4.1 U1 and VMware ESX 4.0 U3.
Resolved an extremely intermittent issue that can result in uncorrectable memory errors being reported. This will result in a system reset and an error logged to the Integrated Management Log (IML). If using certain revisions of VMware ESX Server, this issue will result in a fatal error (Purple Screen of Death - PsoD). This is a result of a processor issue and is NOT an issue with the DIMM or the system board hardware. The processor issue is completely resolved by the work-around included in this revision of the System ROM. The issue does not occur if the Minimum Processor Idle State option in the ROM-Based Setup Utility (RBSU) is configured for No C-states.
Resolved an issue where the system will not properly handle an Online Spare memory failover event. HP recommends that customers who configure the Advanced Memory Protection option for Online Spare Mode upgrade to this system ROM version to ensure proper operation.
This is great news (even though I wish HP or VMware had mentioned it as part of my current open issues, as someone there must have been aware of it).
I see the ESXi 4.1 updates also came out last night, so they are ready to go.
I will be looking to apply this (given change control approvals) as soon as possible.
Thanks very much for your post!
I may as well post this advisory from HP here too; it is similar enough.
Another BIOS update is required.
From the Advisory:
Advisory: (Revision) ProLiant G7-Series Servers - Some ProLiant G7-Series Servers Using the AMD Opteron 6000-Series Processors May Log an Additional Critical Error Message in the Integrated Management Log (IML) When Sourcing an Uncorrectable Memory Event
Some ProLiant AMD G7-series servers using AMD Opteron 6000-series processors may display an "Internal Chipset Error Detected" message during system boot and log one or more "An Unrecoverable System Error has occurred" entries into the Integrated Management Log (IML) when an Uncorrectable Memory Error event occurs.
[much deleted, please review the link above]
The System ROMs for the affected AMD Opteron-series servers have been updated to prevent the invalid "Internal Chipset Error detected" message from being displayed on the screen and also to prevent the erroneous IML entry from being logged.
.... upgrade to System ROM Version 2011.01.29 (or later):
[downloads are available from HP by following the link above.]
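
When checking a batch of servers against the "2011.01.29 (or later)" requirement, a small date comparison is enough. This is a rough sketch, not an HP tool: it assumes you already have each server's System ROM date as a string, and that the date is written either as YYYY.MM.DD (as in the advisory) or MM/DD/YYYY (as some iLO pages show it). The comparison only makes sense between ROMs of the same server family. `parse_rom_date` and `rom_needs_update` are made-up helper names.

```python
from datetime import date, datetime

# Minimum System ROM version quoted in the advisory above.
REQUIRED_ROM = date(2011, 1, 29)


def parse_rom_date(text):
    """Parse a ROM date given as YYYY.MM.DD or MM/DD/YYYY (assumed formats)."""
    for fmt in ("%Y.%m.%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized ROM date: {text!r}")


def rom_needs_update(installed, required=REQUIRED_ROM):
    """Return True if the installed ROM date is older than the required one."""
    return parse_rom_date(installed) < required
```

A ROM dated exactly 2011.01.29 counts as current, matching the "or later" wording.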
We are using ESXi 4.1 U1, and with the USB key BIOS update it's quite easy.
Just format a USB stick using the USBKey executable that is part of the download.
Plug the USB stick into your workstation.
Open iLO with the Integrated Remote Console and click the checkbox that connects the USB key to the server. (It has a little USB stick icon.)
Restart the server and it will boot into the BIOS flash. Proceed as normal to complete the BIOS flash.
The downside is that it takes quite a while to update a lot of hosts.
I'm glad that's working well for you.
If you consider a larger environment (think over 1000 ESXi servers), now what? Log into iLO 1000 times? I'd rather not.
With ESX Classic, we can script the Linux-based online BIOS flash to go out to each server, check a few things, apply the flash, and then do a rolling reboot through the clusters to apply it. Compared to logging into iLO, which is a great tool for what it is, this is much easier.
I'm sorry to see ESX Classic go because VMware is really tying my hands.
I'm looking for good replacement solutions for applying critical firmware updates like the ones listed in this thread. I'm open to suggestions!
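
For reference, the ESX Classic workflow described above (check each server, apply the flash, then roll through the clusters one host at a time) looks roughly like this as orchestration logic. This is only a sketch under assumptions: the four callables (`needs_flash`, `flash`, `reboot`, `is_healthy`) are hypothetical hooks you would have to implement with whatever tooling you end up using. Nothing here talks to a real host, iLO, or vCenter.

```python
def rolling_update(clusters, needs_flash, flash, reboot, is_healthy):
    """Flash and reboot hosts one at a time within each cluster.

    clusters: dict mapping cluster name -> list of host names.
    needs_flash, flash, reboot, is_healthy: hypothetical caller-supplied
    hooks, each taking a host name.
    Returns (updated, skipped, failed) lists of host names.
    """
    updated, skipped, failed = [], [], []
    for cluster, hosts in clusters.items():
        for host in hosts:
            if not needs_flash(host):
                skipped.append(host)  # already on a good ROM
                continue
            flash(host)
            reboot(host)
            if is_healthy(host):
                updated.append(host)
            else:
                failed.append(host)
                # Stop this cluster so one bad flash cannot take out
                # more than one host; remaining hosts are left untouched.
                break
    return updated, skipped, failed
```

Handling one host per cluster at a time keeps the rest of the cluster available to carry the load, and stopping a cluster on the first failure limits the damage if the flash itself turns out to be the problem.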
I thought that's what you might be getting at.
Either vMA or vCLI is what you would have to look at.
vMA is pretty much the replacement for the console.
We don't have 1000 hosts, so I haven't taken the time to investigate further yet.
Let me know if you find a way to use these tools to do this.
Another option is PowerCLI, but I don't think that will do it.
I have a customer with 4 DL360 G7 servers that have PSOD issues. I have already applied the firmware fix, which seems to help, but I still get random Machine Check Errors. Power save mode is on; could that be an issue? It only seems to happen under very heavy use.
Product Name ProLiant DL360 G7
Server Serial Number MXQ1160WZT
Product ID 595492-002
System ROM P68 05/05/2011
Backup System ROM 08/16/2010
Integrated Remote Console .NET Java
License Type iLO 3 Standard
iLO Firmware Version 1.28 Jan 13 2012
IP Address 192.168.0.128
iLO Hostname DL360G7ILO.
System Health OK
Server Power ON
UID Indicator UID OFF
TPM Status Not Present
iLO Date/Time Tue Aug 21 14:12:59 2012
493802-001 CPQ PRL DL360G6/DL360G7 PCI-E RISER BOARD
397740-001 HP PCI-E 2-PORT FC-4GB HBA
602512-001 HP PROLIANT DL360G7 I/O SYSTEM BOARD Rev 0a sn YJ08MQ5977
501536-001 CPQ 8GB 1X8GB PC3-10600R 2RX4 ECC DIMM (qty 14)
507672-001 HP PROLIANT DL360G6 CPU HEATSINK
594883-001 2.80-GHz Intel® Xeon® processor X5660 (qty 2)
>I have a customer with 4 DL360G7 that have PSOD issues.
Hi, this thread is about the DL385 G7. Your server is significantly different, as it uses Intel instead of AMD processors. Please start a different thread, as the information in this thread is not relevant to your server.