VMware Cloud Community
edox77
Contributor

PSOD - Esx4.1 HP Proliant DL 385 G7

I have a very strange problem on seven new HP ProLiant DL385 G7 servers (latest firmware and updates) with ESX 4.1 installed. At random and at different times, the systems crash unexpectedly with a PSOD. The first line of the PSOD follows:

- Uncorrected ECC error in L3 Cache LRU on CPU 0 cache index 0x25b PCPU0 in world 4120:idle

I'm afraid the hardware is the cause...

Thanks

The PSOD is attached to this thread.
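For anyone collecting these crashes across several hosts, the first line of the PSOD quoted above can be split into fields so that the CPU, cache index, and world can be compared between incidents. This is only a minimal sketch: the pattern is inferred from the single line quoted in this post, the function name `parse_psod_line` is made up for illustration, and other ESX builds or error types may word the message differently.

```python
import re

# Pattern inferred only from the one PSOD line quoted in this thread (assumption);
# it is not an official description of the ESX message format.
PSOD_LINE = re.compile(
    r"Uncorrected ECC error in (?P<unit>.+?) on CPU (?P<cpu>\d+) "
    r"cache index (?P<index>0x[0-9a-fA-F]+) "
    r"PCPU(?P<pcpu>\d+) in world (?P<world_id>\d+):(?P<world_name>\S+)"
)


def parse_psod_line(line):
    """Return a dict of fields from the PSOD first line, or None if it does not match."""
    match = PSOD_LINE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "unit": fields["unit"],                    # e.g. "L3 Cache LRU"
        "cpu": int(fields["cpu"]),                 # CPU number reported in the message
        "cache_index": int(fields["index"], 16),   # hex cache index as an integer
        "pcpu": int(fields["pcpu"]),               # physical CPU the world was running on
        "world_id": int(fields["world_id"]),
        "world_name": fields["world_name"],
    }


if __name__ == "__main__":
    sample = ("- Uncorrected ECC error in L3 Cache LRU on CPU 0 "
              "cache index 0x25b PCPU0 in world 4120:idle")
    print(parse_psod_line(sample))
```

Keeping the parsed fields per incident makes it easier to see whether the same CPU or cache index keeps failing, which is the kind of detail a hardware vendor is likely to ask for.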

0 Kudos
78 Replies
URDaddy
Contributor

I believe our BIOS is at A18 (06/10/2010). At the time, Firmware CD 9.10C was not even available from the website. An HP rep gave us a link to download it.

0 Kudos
ronsexton
Contributor

Yes, you could download the latest 9.20, but I already did that and it said everything was already up to date. It doesn't include the newer BIOS or iLO, which was disappointing.

Your DL385 G7 is pretty much the same and needs the same updates:

http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareIndex.jsp?lang=en&cc=us&prodNameId=4132957&...

It may help or even solve your issue. Hard to tell.

0 Kudos
MikeZirbes
Contributor

I am also having the same problem with a NEW HP DL385 G7 and an AMD CPU. In my case it is a single-CPU system with 64GB of RAM. It would appear that core 6 is bad. This is the second time this has happened with this server. It was running 3 Windows XP test VMs with no workload, and it just failed after about 1 day.

Information

Server Name: ProLiant DL385 G7
UUID: 30333735-3838-4D32-3230-333930305436
Server Serial Number / Product ID: 2M203900T6 / 573088-001
System ROM: A18 06/24/2010
Backup System ROM: 06/24/2010
Last Used Remote Console: Java Integrated Remote Console
License Type: iLO 3 Standard license is installed.
iLO 3 Firmware Version: 1.10 Jul 26 2010
IP Address: 10.10.1.47
iLO3 Hostname: ILO2M203900T6.

Status

System Health: OK
Server Power: ON
UID Indicator: UID OFF
TPM Status: Not Present
iLO 3 Date/Time: Fri Dec 03 15:32:33 2010

0 Kudos
cantabrana
Contributor

Which AMD processor do you have?

6174

6128

Have you disabled any cores in the BIOS setup?

0 Kudos
MikeZirbes
Contributor

I have the AMD Opteron 6172.

I just flashed the BIOS on the system from A18 6/24/2010 to A18 9/13/2010. I sure wish that the 9.20 DVD had the latest BIOS and that HP had incremented the BIOS to A19 instead of just changing the date. We will see if this helps the system remain stable. I have 3 other identical servers that are running the A18 6/24/2010 BIOS, and they have not had an issue since they were installed 3 weeks ago. Only one of my four has a problem.
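Because the ROM family stays at A18 and only the date changes, comparing the family string alone will not tell you which hosts have the newer BIOS. A rough sketch of a date-aware comparison follows; it assumes the "family MM/DD/YYYY" layout seen in the strings quoted in this thread, and the helper names `parse_rom` and `is_newer` are made up for illustration.

```python
from datetime import datetime


def parse_rom(rom):
    """Split a ROM string such as 'A18 06/24/2010' into (family, release date)."""
    family, stamp = rom.split()
    # strptime accepts both zero-padded and unpadded months/days here.
    return family, datetime.strptime(stamp, "%m/%d/%Y").date()


def is_newer(candidate, installed):
    """True if candidate has the same ROM family as installed but a later date."""
    c_family, c_date = parse_rom(candidate)
    i_family, i_date = parse_rom(installed)
    if c_family != i_family:
        # Different families are not ordered by date alone; check these by hand.
        raise ValueError("different ROM families; compare manually")
    return c_date > i_date


if __name__ == "__main__":
    print(is_newer("A18 9/13/2010", "A18 06/24/2010"))
```

With the two strings from this post, the September ROM is reported as newer than the June one even though both are labeled A18.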

0 Kudos
ronsexton
Contributor

I also have the 6172. Thanks for trying the new BIOS. You should apply the new iLO 1.15 as well. Just read what it says it fixes!

Let me know your results.

0 Kudos
DSTAVERT
Immortal

I think it is important to remember that the latest BIOS/firmware is not necessarily the RIGHT BIOS/firmware. Changes to the BIOS or firmware, especially those between CD releases, are made to correct specific issues. Those issues may not be relevant to your OS, and the changes may in fact cause problems. I would go by the Software and Drivers section specifically for ESX(i) for the specific model of server. It is also possible that you may need to downgrade firmware to match the versions tested for OS compatibility.

The BIOS link earlier in the thread specifically lists several Windows versions, but no NetWare, Solaris, Linux, or ESX(i).






Forum Upgrade Notice - the VMware Communities forums will be upgraded the weekend of December 12th. The forum will be in read-only mode from Friday, December 10th 6 PM PST until Sunday, December 12th 2 AM PST.

-- David -- VMware Communities Moderator
0 Kudos
ronsexton
Contributor

Well, for one thing, they are already at the lowest BIOS level because these blades are so new. And if you look around, you will see the BIOS update for all the OSes. With the problems being seen, and after talking to VMware, HP engineering, and Emulex about the issues with these blades, I feel confident that everyone should upgrade. Look at it like this: the version that came with the blades is version 1.0, and the updates now are fixes for version 1.0.

I have seen no further issues after updating (as far as this PSOD CPU issue goes) and look forward to others reporting their experiences. At this time there is no better answer from HP BIOS engineering, AMD, or VMware on this issue.

Here is a link to the ESX version, easy to find:

http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareIndex.jsp?lang=en&cc=us&prodNameId=4132827&...

0 Kudos
DSTAVERT
Immortal

Although I specifically pointed to the BIOS link in the earlier post, I was making a general statement about BIOS and firmware updates. Far too often we, and I include myself, update without making sure the update applies.







-- David -- VMware Communities Moderator
0 Kudos
URDaddy
Contributor

***** Maybe a fix for the DL385 G7 PSOD on L3 CPU Cache *****

One of our DL385 G7 servers that had its CPUs replaced went belly up again today. I called VMware and got in touch with an SE who found an internal VMware technical document that talks about VMware vSphere 4.1, 4.0U1, or 4.0U2 failures with PSOD on AMD Opteron 6100-processor based systems. He said that VMware would be releasing a document soon to address the issue.

There were about 4 symptoms with one of them being the *Uncorrected ECC error in L3 Cache LRU on CPU..blah...blah...blah*

From what I can gather, it appears that these new HP servers with the Opteron 61XX processors have a power-saving feature that puts the processors in a low power state when not in use. The workaround is to go into the BIOS of the server and set the Power Profile to be Maximum Performance. In doing so you also need to look at this HP article and make sure that the corresponding settings are applied as well (they should be).

http://bizsupport1.austin.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c02249081&lang=en&cc=u...

I set my "problem server" to this new Maximum Performance mode this morning and so far so good. It will probably take a few weeks to validate that this could be the fix. I provide this to you guys as a courtesy and not as a guaranteed fix. At this point, we are willing to throw some chicken bones on the floor and dance around them looking for a fix.

Roll Tide!

Fred

0 Kudos
ronsexton
Contributor

The power management has added a lot of issues for us as well.

The newer BIOS has some fixes in this area, and some of the iLO updates may also play a part.

Did you try the newer BIOS and iLO by chance?

I am hoping it solves the issue for me.

I am aware of VMware's and HP's thinking on power management.

That article is from the same time as the original BIOS.

My situation is a little different because I also have the power management of the blade enclosure to think about, and how the two play together.

Thanks for the teamwork,

Ron

0 Kudos
DSTAVERT
Immortal

Hopefully the monthly bill for power and cooling in the Maximum Performance mode won't cause too much grief. ;)







-- David -- VMware Communities Moderator
0 Kudos
admin
Immortal

Please see http://kb.vmware.com/kb/1030509 which discusses this issue.

If you experience this issue, file a support request with VMware Support and with your server hardware vendor.

0 Kudos
ronsexton
Contributor

Well, it's been about 10 days now and still no PSOD on any updated servers, which have iLO 1.15 and the September BIOS. No BIOS settings changed. Still not totally convinced, but so far so good.

BL465c G7, 2 procs, 12 cores, AMD 2.1 GHz.

As of now these updates are applied to about 21 of these blades with no further PSODs yet, but it's only been about 2 weeks. Still, it's promising.

0 Kudos
Behold
Contributor

After 2 weeks of no issues, we had 3 servers go down on us all at once.

We seem to have issues when our SQL servers kick off larger jobs, e.g. backup and restore.

This is now becoming a MASSIVE issue, as our production servers are unreliable...

I'm about to do some checks on the BIOS and other bits, but I doubt there will be any saving grace here.

0 Kudos
bizdps
Contributor

Behold, did you implement the Maximum Performance workaround mentioned previously before the crash happened?

0 Kudos
ronsexton
Contributor

Well, I had another PSOD on a fully updated server.

I have since reseated the server (to fully reset it, as that is the only way) and set the power profile to Maximum Performance.

The server wasn't terribly busy, and I figure it's time to add this into the mix since it failed with all current updates applied:

Sept BIOS

Sept HP ESXi 4.1 Image

Emulex be2net 517 firmware and 518 driver.

iLO 1.15

All current ESXi 4.1 updates.

No other problems currently seem to exist. Just this pesky PSOD issue.

I will report if it happens on a server with the power profile updated.

0 Kudos
Hypnotoad
Contributor

I'm having the same issues, everyone. I started another thread because I didn't see this one.

http://communities.vmware.com/message/1665820#1665820

I have five DL385 G7 (2 proc 12 core 128GB DRAM each)

I've had to replace one main board, two 8GB memory sticks, and one processor so far. My office mate is on the phone right now with HP discussing the second PSOD.

We are running BIOS version A18 on all servers. They are all running ESXi 4.1.

--Patrick

0 Kudos
ThompsG
Virtuoso

Looks like HP has acknowledged this as an issue. See this link: http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?locale=en_US&objectID=c02641719

Some of the workarounds have already been mentioned in this thread, but it looks like they (HP) are working on the issue with VMware.

Glen

0 Kudos
Hypnotoad
Contributor

Thanks for the link Glen. We passed this along to the tech that is helping us. I'll be watching this thread closely. Hopefully a fix is found soon.

--Patrick

0 Kudos