I have a very strange problem on seven new HP ProLiant DL385 G7 servers (latest firmware and updates) with ESX 4.1 installed. At random times and without warning, the system crashes with a PSOD. Here is the first line:
- Uncorrected ECC error in L3 Cache LRU on CPU 0 cache index 0x25b PCPU0 in world 4120:idle
I'm afraid that hardware is the concern ...
Thanks
PSOD is attached to this thread
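If you are collecting these first lines from several hosts, something like the sketch below can help compare them. It is only a rough sketch: the regular expression is an assumption based on the single PSOD line quoted above, and the names `PSOD_RE`, `parse_psod_line` and `summarize` are made up for illustration, not part of any VMware tool.

```python
import re
from collections import Counter

# Assumption: the layout matches the one PSOD line quoted in this thread.
# Other ESX builds or other error types may word this differently.
PSOD_RE = re.compile(
    r"Uncorrected ECC error in (?P<unit>.+?) on CPU (?P<cpu>\d+) "
    r"cache index (?P<index>0x[0-9a-fA-F]+) PCPU(?P<pcpu>\d+) "
    r"in world (?P<world_id>\d+):(?P<world_name>\S+)"
)


def parse_psod_line(line):
    """Return the fields of an ECC PSOD first line, or None if it does not match."""
    match = PSOD_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "unit": fields["unit"],
        "cpu": int(fields["cpu"]),
        "cache_index": int(fields["index"], 16),  # hex string to integer
        "pcpu": int(fields["pcpu"]),
        "world_id": int(fields["world_id"]),
        "world_name": fields["world_name"],
    }


def summarize(lines):
    """Count PSOD lines per (failing unit, world name) pair, skipping non-matches."""
    counts = Counter()
    for line in lines:
        parsed = parse_psod_line(line)
        if parsed is not None:
            counts[(parsed["unit"], parsed["world_name"])] += 1
    return counts


if __name__ == "__main__":
    sample = ("- Uncorrected ECC error in L3 Cache LRU on CPU 0 "
              "cache index 0x25b PCPU0 in world 4120:idle")
    print(parse_psod_line(sample))
```

If most of the crashes across hosts land in the `idle` world, that would at least be consistent with a problem around idle or power-saving states, but a count like this is a hint, not a diagnosis.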
Hello, in 2 weeks we are going to update our ESX farm following HP's instructions ...
By the way, I have verified, for example, that one of my seven ESX hosts already has the configuration applied ...
What workload was the system doing when you saw the PSOD?
It was light, I think. I only had six Windows servers running on this host at the time of the PSOD.
--Patrick
Which servers are you swapping for?
Light load. It was pretty much always happening on a Monday morning when people first got into the office. Just set your BIOS Power Profile to Maximum Power. The CPUs go into a power saving state and crash when a VM tries to access them.
Thanks,
Fred
Well, no further PSOD issues yet. It's going on a week.
Strangely enough, we have some hosts on a DR site that are pretty much idle but never seem to encounter this issue.
It must be intermittent activity.
Just curious but are any of you running the vMA also on this cluster? What hardware version?
I think this is a serious AMD issue with the 12 core chips.
I've gotten a similar but different error from the ECC error you guys are showing. My PSOD refers to a "Failed to ack TLB invalidate", and it has happened on 3 different AMD 12 core chip systems over the course of 4 months. I have the power saving set to Maximum Performance in the BIOS by default, so this does not fix it for us. Hardware is Dell R815.
My guess is AMD did some "funny stuff" outside the standard x86 chip architecture to get these 12 cores on a CPU. I've got nothing to back this up, but these chips have been buggy with vSphere within a month of these servers going live. I've opened a few tickets with VMware, and on this last ticket the VMware engineer said there is an issue with these chips on vSphere, and on all ESX for that matter. No idea when they are going to have a fix, as VMware is in the investigative stage on this bug. We are about to try to give the R815 back to Dell and go back to the rock solid stable Intel chips we have running.
On my 4 DL385 G7 servers there have been no more PSODs in the last 6-8 weeks. How about the rest of the people in this thread?
I have just checked hp.com and there are no new firmware updates.
I find it very weird that there are no more posts in this thread from people with new DL385 G7 servers having PSODs.
Thanks
Mikel G. Cantabrana
We were on the latest firmware and BIOS, with all ESX updates applied, and were still getting the error, though less regularly.
Since changing the BIOS settings to Max Power we have not had a PSOD for 6-7 weeks. We have refused to close our call for now, as this is being classed as a workaround, not a fix.
I forgot to tell you that we have also changed the BIOS settings to max performance on 3 out of the 4 servers.
We have not had a PSOD on any of them either.
Thanks
No PSODs after setting the BIOS to the max power setting.
Thanks,
Fred
No further PSOD after power settings applied per advisories. See my earlier posts for my updates.
Same here. No further problems after deploying the workaround. I'd like to get a real fix for this.
Well, bad news. I got another PSOD. The server was up for about 52 days. We need a fix.
HP has released a new BIOS that is supposed to fix the issue.
Thanks,
Fred
With the power settings in the BIOS, did you also go into the processor settings and disable the C1E state?
I was told by VMware that this is also part of the fix, depending on the hardware you are using. It's not enough to make sure the power profile is set to Max. This C1E setting was in my Dell servers' BIOS and I have disabled it. Not sure if HP also has this C1E option in its processor BIOS settings.
You wouldn't happen to have a link, would you? I don't see it. Maybe it's not generally released yet.
Power profile was set per the HP advisory for these HP blades: BL465c G7 AMD 12 core 2.1 GHz.
Just take a look at the HCL for this server (processor):
1. This server uses a processor series that requires a 4.0 U1 patch (Release Name - ESX400-201002001, Bulletin ID ESX400-201002401-BG / Release Name ESXi400-201002001, Bulletin ID ESXi400-201002401-BG) or newer for full support.
The patch applies to 4.0 and 4.1.
I have the latest of everything applied. Also, that update seems to refer to the Intel processor, DL380.
Think it's time to call HP again.
Here is the link to the HP ROM update for the DL385 G7 server.
Thanks,
Fred