I have a very strange problem on seven new HP ProLiant DL385 G7 servers (latest firmware and updates) with ESX 4.1 installed. At random and unpredictable times, the system crashes with a PSOD. The first line reads:
- Uncorrected ECC error in L3 Cache LRU on CPU 0 cache index 0x25b PCPU0 in world 4120:idle
I'm afraid that hardware is the concern ...
Thanks
PSOD is attached to this thread
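For anyone collecting these screens from several hosts, here is a minimal sketch of pulling the CPU number and cache index out of that first PSOD line so crashes can be compared. The pattern is built only from the single line quoted above, so treat it as an assumption; other ESX builds or other error types may word the message differently. The function name `parse_psod_line` is made up for this example.

```python
import re

# Pattern derived from the one PSOD line quoted in this thread (assumption:
# other builds or error types may use different wording).
PSOD_RE = re.compile(
    r"Uncorrected ECC error in (?P<unit>.+?) on CPU (?P<cpu>\d+) "
    r"cache index (?P<index>0x[0-9a-fA-F]+) "
    r"PCPU(?P<pcpu>\d+) in world (?P<world>\d+):(?P<name>\S+)"
)

def parse_psod_line(line):
    """Return the fields of the ECC error line as a dict, or None if no match."""
    m = PSOD_RE.search(line)
    if m is None:
        return None
    return {
        "unit": m.group("unit"),                  # e.g. "L3 Cache LRU"
        "cpu": int(m.group("cpu")),
        "cache_index": int(m.group("index"), 16),  # hex string to integer
        "pcpu": int(m.group("pcpu")),
        "world_id": int(m.group("world")),
        "world_name": m.group("name"),
    }

line = ("- Uncorrected ECC error in L3 Cache LRU on CPU 0 "
        "cache index 0x25b PCPU0 in world 4120:idle")
print(parse_psod_line(line))
```

If the same CPU and cache index keep showing up on one host, that points more toward a bad part; if they vary across hosts, a common firmware or settings cause seems more likely.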
I believe our BIOS is at A18 (06/10/2010). At the time, Firmware CD 9.10C was not even available from the website. An HP rep gave us a link to download it.
Yes, you could download the latest 9.20, but I already did that and it said everything was already up to date. It doesn't have the newer BIOS or ILO, which was disappointing.
Your DL385 G7 is pretty much the same and needs the same updates:
It may help or even solve your issue. Hard to tell.
I am also having the same problem with a NEW HP DL385 G7 and an AMD CPU. In my case it is a single-CPU system with 64GB of RAM. It would appear that core 6 is bad. This is the second time this has happened with this server. It was running 3 Windows XP test VMs with no workload and it just failed after about 1 day.
Information
Server Name: ProLiant DL385 G7
UUID: 30333735-3838-4D32-3230-333930305436
Server Serial Number / Product ID: 2M203900T6 / 573088-001
System ROM: A18 06/24/2010
Backup System ROM: 06/24/2010
Last Used Remote Console: Java Integrated Remote Console
License Type: iLO 3 Standard license is installed.
iLO 3 Firmware Version: 1.10 Jul 26 2010
IP Address: 10.10.1.47
iLO3 Hostname: ILO2M203900T6.
Status
System Health: OK
Server Power: ON
UID Indicator: UID OFF
TPM Status: Not Present
iLO 3 Date/Time: Fri Dec 03 15:32:33 2010
Which AMD processor do you have?
6174
6128
Have you disabled any cores in the BIOS setup?
I have the AMD Opteron 6172
I just flashed the BIOS on the system to A18 9/13/2010 from A18 6/24/2010. I sure wish that the 9.20 DVD had the latest BIOS and that HP would have incremented the BIOS to A19 instead of just changing the date. We will see if this helps the system remain stable. I have 3 other identical servers that are running the A18 6/24/2010 BIOS and they have not had an issue since they were installed 3 weeks ago. Only one of my four has a problem.
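Since both releases are labelled A18 and only the date differs, comparing the version letter alone will not tell you which servers are behind. A rough sketch of comparing System ROM strings by date follows. It assumes the "A18 MM/DD/YYYY" format shown on the iLO page in this thread, and the function names are made up for this example.

```python
from datetime import datetime

def parse_system_rom(text):
    """Split a string like 'A18 06/24/2010' into (family, date)."""
    family, date_text = text.split()
    return family, datetime.strptime(date_text, "%m/%d/%Y").date()

def is_newer(candidate, installed):
    """True if candidate is a later build of the same ROM family."""
    cand_family, cand_date = parse_system_rom(candidate)
    inst_family, inst_date = parse_system_rom(installed)
    if cand_family != inst_family:
        # Dates from different ROM families are not comparable here.
        raise ValueError("different ROM families: %s vs %s"
                         % (cand_family, inst_family))
    return cand_date > inst_date

# The two A18 builds mentioned in this thread.
print(is_newer("A18 9/13/2010", "A18 06/24/2010"))
```

Running this over the System ROM line from each server's iLO page would show which hosts are still on the June build.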
I also have the 6172. Thanks for trying the new BIOS. You should do the new ILO 1.15 also. Just read what it says it fixes!
Let me know your results.
I think it is important to remember that the latest BIOS/firmware is not necessarily the RIGHT BIOS/firmware. Changes to the BIOS or firmware, especially those between CD releases, are meant to correct specific issues. Those issues may not be relevant to your OS, and the changes may in fact cause issues. I would go by the Software and Drivers section specifically for ESX(i) for the specific model of server. It is also possible that you may need to downgrade firmware to match what has been tested for compatibility with the OS.
The link to the BIOS earlier in the post specifically lists several Windows versions. No Netware, Solaris, Linux or ESX(i).
Forum Upgrade Notice - the VMware Communities forums will be upgraded the weekend of December 12th. The forum will be in read-only mode from Friday, December 10th 6 PM PST until Sunday, December 12th 2 AM PST.
Well, for one thing, they are already at the bottom level of BIOS because these blades are so new. And if you look around, you will see the BIOS update for all the OSes. With the problems being seen, and having talked to VMware, HP engineering and Emulex about the issues with these blades, I feel confident that everyone should upgrade. Look at it like this: the version that came with the blades is version 1.0. The updates now are fixes for version 1.0.
I have seen no further issues after updating (as far as this PSOD CPU issue) and look forward to others reporting their experiences. At this time there is no better answer from HP BIOS engineering, or AMD or VMware on this issue.
Here is a link to the ESX version, easy to find:
Although I specifically pointed to the reference link in the earlier post for the BIOS, I was making a general statement about BIOS and firmware updates. Far too often we, and I include myself, update without making sure the update applies.
Maybe a fix for the DL385 G7 PSOD on L3 CPU cache...
One of our DL385 G7 servers that had its CPUs replaced went belly up again today. I called VMware and got in touch with an SE who found an internal VMware technical document that talks about VMware vSphere 4.1, 4.0U1, or 4.0U2 failures with PSOD on AMD Opteron 6100-processor based systems. He said that VMware would be releasing a document soon to address the issue.
There were about 4 symptoms, one of them being the "Uncorrected ECC error in L3 Cache LRU on CPU..." message.
From what I can gather, it appears that these new HP servers with the Opteron 61XX processors have a power-saving feature that puts the processors in a low power state when not in use. The workaround is to go into the BIOS of the server and set the Power Profile to be Maximum Performance. In doing so you also need to look at this HP article and make sure that the corresponding settings are applied as well (they should be).
I set my "problem server" to this new Maximum Performance mode this morning and so far so good. It will probably take a few weeks to validate that this could be the fix. I provide this to you guys as a courtesy and not as a guaranteed fix. At this point, we are willing to throw some chicken bones on the floor and dance around them looking for a fix.
Roll Tide!
Fred
The power management has added a lot of issues for us as well.
The newer BIOS has some fixes in this area and some of the ILO updates may also play in.
Did you try the newer BIOS and ILO by chance?
I am hoping it solves the issue for me.
I am aware of what VMware and HP have said about power management.
That article is from the same time as the original BIOS.
My situation is a little different because I also have the power management of the blade enclosure to think about, and how the two play together.
Thanks for the teamwork,
Ron
Hopefully the monthly bill for power and cooling in the Maximum Performance mode won't cause too much grief.
Please see http://kb.vmware.com/kb/1030509 which discusses this issue.
If you experience this issue, file a support request with VMware Support and with your server hardware vendor.
Well, it's been about 10 days now and still no PSOD on any updated servers. ILO 1.15 and the September BIOS. No BIOS settings changed. Still not totally convinced, but so far so good.
BL465c G7, 2 procs, 12 cores, AMD 2.1 GHz.
As of now these updates are applied to about 21 of these blades with no further PSODs yet, but it's only been about 2 weeks. Still, it's promising.
After 2 weeks of no issues, we had 3 servers go down on us all at once.
We seem to have issues when our SQL servers kick off larger jobs, e.g. backup and restore.
This is now becoming a MASSIVE issue as our production servers are unreliable...
I'm about to do some checks on the BIOS and other bits, but I doubt there will be any saving grace here.
Behold, did you implement the Maximum Performance workaround mentioned previously before the crash happened?
Well, I had another PSOD on a fully updated server.
I have since reseated the server (the only way to fully reset it) and set the power profile to Maximum Performance.
The server wasn't terribly busy, and I figure it's time to add this into the mix since it failed with all current updates applied.
Sept BIOS
Sept HP ESXi 4.1 Image
Emulex be2net 517 firmware and 518 driver.
ILO 1.15
All current ESXi 4.1 updates.
No other problems currently seem to exist. Just this pesky PSOD issue.
I will report if it happens on a server with the power profile updated.
I'm having the same issues, everyone. I started another thread because I didn't see this one.
http://communities.vmware.com/message/1665820#1665820
I have five DL385 G7 servers (2 processors, 12 cores, 128GB DRAM each).
I've had to replace one main board, two 8GB memory sticks, and one processor so far. My office mate is on the phone right now with HP discussing the second PSOD.
We are running BIOS version A18 on all servers. They are all running ESXi 4.1.
--Patrick
Looks like HP has acknowledged this as an issue. See this link: http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?locale=en_US&objectID=c02641719
Some of the workarounds have already been mentioned in this thread, but it looks like they (HP) are working on the issue with VMware.
Glen
Thanks for the link Glen. We passed this along to the tech that is helping us. I'll be watching this thread closely. Hopefully a fix is found soon.
--Patrick