Contributor

PSOD - ESX 4.1 on HP ProLiant DL385 G7

I have a very strange problem on seven new HP ProLiant DL385 G7 servers (latest firmware and upgrades) with ESX 4.1 installed: at random times and without warning, the system crashes with a PSOD. The first line reads:

- Uncorrected ECC error in L3 Cache LRU on CPU 0 cache index 0x25b PCPU0 in world 4120:idle

I'm afraid the hardware is the problem ...

Thanks

PSOD is attached to this thread
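
For anyone collecting these screens from several hosts, it may help to pull the fields out of the first PSOD line and compare them across crashes (the same CPU and cache index every time might point at one bad part, while scattered values might suggest something systemic). Below is a minimal Python sketch, not anything from VMware or HP: the regular expression is an assumption modeled only on the single line quoted above, and `parse_psod_line` is a hypothetical helper name.

```python
import re

# Assumption: the pattern is modeled on the one PSOD line quoted in this post.
# Other PSOD messages (or other ESX versions) may be worded differently.
PSOD_RE = re.compile(
    r"Uncorrected ECC error in (?P<unit>.+?) on CPU (?P<cpu>\d+) "
    r"cache index (?P<index>0x[0-9a-fA-F]+) PCPU(?P<pcpu>\d+) "
    r"in world (?P<world_id>\d+):(?P<world_name>\S+)"
)


def parse_psod_line(line):
    """Return the fields of an ECC PSOD line as a dict, or None if it does not match."""
    match = PSOD_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields so they can be compared or counted later.
    fields["cpu"] = int(fields["cpu"])
    fields["pcpu"] = int(fields["pcpu"])
    fields["world_id"] = int(fields["world_id"])
    fields["index"] = int(fields["index"], 16)  # int() accepts the "0x" prefix with base 16
    return fields


if __name__ == "__main__":
    sample = ("- Uncorrected ECC error in L3 Cache LRU on CPU 0 "
              "cache index 0x25b PCPU0 in world 4120:idle")
    print(parse_psod_line(sample))
```

Run it over the first line of each PSOD you have and compare the `cpu` and `index` values per host.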

78 Replies
Contributor

Hello, in 2 weeks we are going to update our ESX farm following HP's instructions ...

By the way, I have verified, for example, that one of my seven ESX hosts already has the configuration applied ...

Contributor

What workload was the system doing when you saw the PSOD?

Contributor

It was light, I think. I only had six Windows servers running on this host at the time of the PSOD.

--Patrick

Contributor

Which servers are you swapping for?

Contributor

Light load. It was pretty much always happening on a Monday morning when people first got into the office. Just set your BIOS Power Profile to Maximum Power. The CPUs go into a power-saving state and crash when a VM tries to access them.

Thanks,

Fred

Contributor

Well, no further PSOD issues yet. It's going on a week.

Strangely enough, we have some hosts on a DR site that are pretty much idle but never seem to encounter this issue.

It must be intermittent activity.

Just curious but are any of you running the vMA also on this cluster? What hardware version?

Contributor

I think this is a serious AMD issue with the 12-core chips.

I've gotten a similar but different error from the ECC error you guys are showing. My PSOD refers to a "Failed to ack TLB invalidate", and this has happened on 3 different AMD 12-core systems over the course of 4 months. I have the power saving set to Maximum Performance in the BIOS by default, so this does not fix it for us. Hardware is Dell R815.

My guess is AMD did some "funny stuff" outside the standard x86 chip architecture to get 12 cores on a CPU. I've got nothing to back this up, but these chips have been buggy with vSphere since within a month of these servers going live. I've opened a few tickets with VMware, and on the last ticket the VMware engineer said there is an issue with these chips on vSphere, and on all ESX for that matter. No idea when they are going to have a fix, as VMware is in the investigative stage on this bug. We are about to try to give the R815s back to Dell and get back to the rock-solid stable Intel chips we have running.

Contributor

On my 4 DL385 G7s there have been no more PSODs in the last 6-8 weeks. How about the rest of the people in this thread?

I have just checked hp.com and there are no new firmware updates.

I find it very weird that there are no more posts in this thread about new DL385 G7 servers having PSODs.

Thanks

Mikel G. Cantabrana

Contributor

We were on the latest firmware and BIOS and also all ESX updates, and were getting the error less regularly.

Since changing the BIOS settings to Max Power we have not had a PSOD for 6-7 weeks. We have refused to close our call for the moment, as this is being classed as a workaround, not a fix.

Contributor

I forgot to tell you that we have also changed the BIOS settings to max performance on 3 out of the 4 servers.

We have not had any PSODs on any of them either.

Thanks

Contributor

No PSODs after setting the BIOS to the max power setting.

Thanks,

Fred

Contributor

No further PSOD after power settings applied per advisories. See my earlier posts for my updates.

Contributor

Same here. No further problems after deploying the workaround. I'd like to get a real fix for this.

Contributor

Well, bad news. I got another PSOD. The server was up for about 52 days. We need a fix.

Contributor

HP has released a new BIOS that is supposed to fix the issue.

Thanks,

Fred

Contributor

With the power settings in the BIOS, did you also go into the processor settings and disable the C1E state?

I was told by VMware that this is also part of the fix, depending on the hardware you are using. It's not enough to make sure the power profile is set to Max. This C1E setting was in my Dell servers' BIOS and I have disabled it. Not sure if HP also has this C1E option in its processor BIOS settings.

Contributor

You wouldn't happen to have a link, would you? I don't see it. Maybe it's not generally released yet.

Power profile was set per the HP advisory for these HP blades: BL465c G7, AMD 12-core, 2.1 GHz.

Immortal

Just a look at the HCL for this server (processor):

1. This server uses a processor series that requires a 4.0 U1 patch (Release Name ESX400-201002001, Bulletin ID ESX400-201002401-BG / Release Name ESXi400-201002001, Bulletin ID ESXi400-201002401-BG) or newer for full support.

The patch applies to 4 and 4.1

-- David -- VMware Communities Moderator
Contributor

I have the latest of everything applied. Also, that update seems to refer to the Intel processor, DL380.

Think it's time to call HP again.

Contributor

Here is the link to the HP ROM update for the DL385 G7 server.

http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c02712052&dimid=1017907226&di...

Thanks,

Fred
