I have a very strange problem on seven new HP ProLiant DL385 G7 servers (latest firmware and updates) with ESX 4.1 installed: at random times and without warning, the systems crash with a PSOD. Here is the first line:
- Uncorrected ECC error in L3 Cache LRU on CPU 0 cache index 0x25b PCPU0 in world 4120:idle
I'm afraid that the hardware is the problem ...
The PSOD is attached to this thread.
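For anyone collecting these from several hosts, here is a minimal Python sketch for pulling the fields out of that first PSOD line. The pattern is modeled only on the single line quoted above, so it is an assumption that other hosts print the same layout; `parse_psod_line` and the field names are made up for illustration.

```python
import re

# Pattern built from the one PSOD line quoted in this thread.
# Assumption: other hosts use the same wording; other PSOD types will not match.
PSOD_ECC = re.compile(
    r"Uncorrected ECC error in (?P<unit>.+?) on CPU (?P<cpu>\d+) "
    r"cache index (?P<index>0x[0-9a-fA-F]+) PCPU(?P<pcpu>\d+) "
    r"in world (?P<world_id>\d+):(?P<world_name>\S+)"
)


def parse_psod_line(line):
    """Return a dict of fields from an ECC PSOD line, or None if it does not match."""
    match = PSOD_ECC.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "unit": fields["unit"],                    # e.g. "L3 Cache LRU"
        "cpu": int(fields["cpu"]),
        "cache_index": int(fields["index"], 16),   # hex string to integer
        "pcpu": int(fields["pcpu"]),
        "world_id": int(fields["world_id"]),
        "world_name": fields["world_name"],
    }


if __name__ == "__main__":
    sample = ("- Uncorrected ECC error in L3 Cache LRU on CPU 0 "
              "cache index 0x25b PCPU0 in world 4120:idle")
    print(parse_psod_line(sample))
```

Comparing the CPU and cache index across crashes may help show whether it is always the same socket or whether it moves around, which is useful to have when talking to support.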
Light load. It was pretty much always happening on a Monday morning when people first got into the office. Just set your BIOS Power Profile to Maximum Power. The CPUs go into a power-saving state and crash when a VM tries to access them.
Well, no further PSOD issues yet. It's going on a week.
Strangely enough, we have some hosts on a DR site that are pretty much idle but never seem to encounter this issue.
It must be intermittent activity.
Just curious, but are any of you also running the vMA on this cluster? What hardware version?
I think this is a serious AMD issue with the 12 core chips.
I've gotten a similar but different error from the ECC error you guys are showing. My PSOD refers to a "Failed to ack TLB invalidate", and it has happened on 3 different AMD 12 core chip systems over the course of 4 months. I have the power saving set to Maximum Performance in the BIOS by default, so that does not fix it for us. Hardware is Dell R815.
My guess is AMD did some "funny stuff" outside the standard x86 chip architecture to get 12 cores on a CPU. I've got nothing to back this up, but these chips have been buggy with vSphere since within a month of these servers going live. I've opened a few tickets with VMware, and on the last ticket the VMware engineer said there is an issue with these chips on vSphere, and on all ESX versions for that matter. No idea when they are going to have a fix, as VMware is still in the investigative stage on this bug. We are about to try to give the R815 back to Dell and go back to the rock-solid Intel chips we have running.
On my 4 DL385 G7 servers there have been no more PSODs in the last 6 to 8 weeks. How about the rest of the people in this thread?
I have just checked hp.com and there are no new firmware updates.
I find it very weird that there are no more posts in this thread from people with new DL385 G7 servers having PSODs.
Mikel G. Cantabrana
We were on the latest firmware and BIOS, with all ESX updates applied, and were still getting the error, though less regularly.
Since changing the BIOS settings to Max Power we have not had a PSOD for 6-7 weeks. We have refused to close our call for the moment, as this is being classed as a workaround, not a fix.
With the power settings in the BIOS, did you also go into the processor settings and disable the C1E state?
I was told by VMware that this is also part of the fix, depending on the hardware you are using. It's not enough to make sure the power profile is set to Max. This C1E setting was in my Dell servers' BIOS and I have disabled it. Not sure if HP also has this C1E option in its processor BIOS settings.
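To keep track of which hosts actually have both settings applied, something like the following Python sketch could help. It assumes you have already noted each host's BIOS settings by hand or from your own inventory; the key names (`power_profile`, `c1e`) and the value strings are hypothetical labels, not real BIOS field names, so adjust them to whatever your vendor's setup screen shows.

```python
# Workaround reported in this thread: max power profile plus C1E disabled.
# Keys and values are hypothetical labels for illustration, not vendor field names.
WORKAROUND = {
    "power_profile": "max",
    "c1e": "disabled",
}


def missing_workarounds(host_settings):
    """Return {setting: (expected, actual)} for every setting that differs.

    A setting that is absent from host_settings is reported with actual=None.
    """
    return {
        name: (expected, host_settings.get(name))
        for name, expected in WORKAROUND.items()
        if host_settings.get(name) != expected
    }


if __name__ == "__main__":
    # Example inventory, filled in by hand; host names are made up.
    hosts = {
        "esx01": {"power_profile": "max", "c1e": "disabled"},
        "esx02": {"power_profile": "max", "c1e": "enabled"},
        "esx03": {"power_profile": "balanced"},
    }
    for host, settings in hosts.items():
        print(host, missing_workarounds(settings) or "ok")
```

This only compares what you type in; it does not read anything from the BIOS itself.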
You wouldn't happen to have a link, would you? I don't see it. Maybe it's not generally released yet.
Power profile was set per the HP advisory for these HP blades: BL465c G7, AMD 12 core, 2.1 GHz.
Just take a look at the HCL for this server (processor):
1. This server uses a processor series that requires a 4.0 U1 patch (Release Name - ESX400-201002001, Bulletin ID ESX400-201002401-BG / Release Name ESXi400-201002001, Bulletin ID ESXi400-201002401-BG) or newer for full support.
The patch applies to 4.0 and 4.1.