Contributor

PSOD - ESX 4.1 on HP ProLiant DL385 G7

I have a very strange problem on seven new HP ProLiant DL385 G7 servers (latest firmware and upgrades) with ESX 4.1 installed. At random and unpredictable times, the system crashes with a PSOD. The first line reads:

- Uncorrected ECC error in L3 Cache LRU on CPU 0 cache index 0x25b PCPU0 in world 4120:idle
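
For anyone collecting these screens from several hosts, here is a rough Python sketch for pulling the fields out of that first line so they can be compared. The regular expression is only an assumption based on the single line quoted above; other PSOD messages may be laid out differently, and the function name `parse_psod_line` is made up for this example.

```python
import re

# Pattern modelled on the one PSOD line quoted in this thread.
# Assumption: other machine-check messages may use a different layout.
PSOD_LINE = re.compile(
    r"Uncorrected ECC error in (?P<unit>.+?) on CPU (?P<cpu>\d+) "
    r"cache index (?P<index>0x[0-9a-fA-F]+) "
    r"PCPU(?P<pcpu>\d+) in world (?P<world_id>\d+):(?P<world_name>\S+)"
)


def parse_psod_line(line):
    """Return the fields of the PSOD line as a dict, or None if it does not match."""
    match = PSOD_LINE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields so they can be compared across hosts.
    for key in ("cpu", "pcpu", "world_id"):
        fields[key] = int(fields[key])
    fields["index"] = int(fields["index"], 16)
    return fields


line = ("- Uncorrected ECC error in L3 Cache LRU on CPU 0 "
        "cache index 0x25b PCPU0 in world 4120:idle")
print(parse_psod_line(line))
```

If the same CPU number shows up on every crash of a given host, that at least points at one socket rather than at the memory modules in general.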

I'm afraid that hardware is the concern ...

Thanks

A screenshot of the PSOD is attached to this thread.

78 Replies
Enthusiast

Indeed, this looks like a memory problem.

Maybe this helps: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=831

Paul Grevink

Twitter: @PaulGrevink

If you find this information useful, please award points for "correct" or "helpful".

Expert

KB article "Decoding Machine Check Exception (MCE) output after a purple screen error" might help in resolving your issue.

Regards,

Arun Pandey

Contributor

I'm having a very similar PSOD on my DL385 G7, ESXi 4.1. A photo of it is attached. We have another server with the same exact setup in a different location that is not having the problem.

Contributor

Hi there,

Same error with a Dell R815. At first I thought it was an isolated error, but it recurred after one week.

All the hardware diagnostics seem to run fine.

I've attached the screenshot.

Contributor

We also have a customer with two ProLiant DL385 G7 performance models, each with two 12-core AMD processors and 32 GB of memory, that are getting the same PSOD screens. The first one happened at night with only backups running, another happened during a P2V conversion with no other load on the server, and the last one also happened at night. I called HP tech support and they asked me to boot from the SmartStart CD and run all the tests.

Contributor

The common factor is the AMD CPU.

Has anyone seen this PSOD with Intel?

Contributor

Hey,

I'm now trying some minor configuration changes. First, I've set the system power management to full performance (it was OS-controlled), and second, I've disabled the C1E processor state.

I'll keep you posted on how it works.

PS: I'm not a VMware employee, I just want the system up and stable so I can start deploying VMs.

Contributor

We have a call raised with VMware and HP on this, and it seems to be related to ALL AMD 61xx processors, at least on HP servers.

We have four DL385s running 4.1 at the moment and three of them are failing with errors. One fails regularly, every day. Since this issue started, we have upgraded the BIOS on the HP servers to the latest version for ESXi and also installed the latest hotfixes. So far we have managed over 24 hours with no failures.

I will keep this thread updated on our reliability. The issue is that the PSOD reports a memory/hardware failure and is not very detailed about where the problem lies.

We have also run extensive hardware checks with no issues found. Like some others here, we have one server that just plods on with no issues and has IDENTICAL hardware and VMware setup. We have not checked BIOS settings and hardware layout yet.

Contributor

Well, this is with the same ESXi 4.1 and the same processor but a different server:

BL465c G7

2 x 12-core 2.1 GHz

64 GB of memory using 8 GB memory modules.

Running HP Insight Diagnostics (complete test, 2 loops) turns up no issues. This is using SmartStart 8.50 x64.

Attached is my screenshot.

This is the only server affected so far.

I will add that this is with iLO 3 1.10, BIOS ROM A19 06/02/2010, and without the November 15 VMware updates for ESXi 4.1.

Contributor

Just to update - After working with HP, we replaced the affected CPU and have been running solid for a bit over two days now.

Contributor

We have just received three of the new DL385 G7 servers with the AMD 12-core 2.2 GHz processors (6174). Two of these servers are experiencing the same issues you are all seeing. After running for a random amount of time, we get a pink screen crash with an Uncorrected ECC error in the L3 Cache LRU on CPU 0. We have replaced the CPUs in one of the boxes and it has not crashed since. Our second box had this error yesterday and we are getting HP to replace its CPUs. At this point we are not sure whether this is a VMware/AMD issue or whether there is a bad batch of AMD processors out there.

Contributor

Well, the one server that had not crashed has finally crashed on us after weeks of running fine.

This is really starting to be an issue for us, as this is a live environment and I can't afford to remove the servers from the farm. I really hope we get a solution soon.

Contributor

Has anyone gotten a fix for this issue yet???

Contributor

Thus far, no PSODs on the machines that have had their CPUs replaced. BTW, we have gone through several bad CPUs among the replacements. We have never had this many problems with HP before.

Contributor

What BIOS are you on? Did any of the machines that got the PSOD have the newer BIOS dated 2010.09.30 (15 Oct 2010)?

See the release notes (http://bit.ly/epe1bI ) for the latest System ROM update.

It's a shortened version of an HP link.

This would be good to know, as I have had a couple of PSODs but have not yet experienced one again on the server I updated with this BIOS.

Contributor

Hello,

I will upgrade one of the four DL385 G7s to the latest firmware this evening, and the rest in two days.

I opened a case with HP support two weeks ago. They asked me to run the diagnostics for the Smart Array and 7 loops of Insight Diagnostics. I will send them the results for this server today.

Did they suggest replacing the AMD processor?

How did you manage to get the processor replaced?

Did they send a technician or just the processor?

Thanks

Contributor

Like I originally said, we just purchased three new DL385s. One of them was experiencing a random power-off issue. We called HP and they sent out an HP CSE (we have known him for years) to help us resolve it. While he was here, one of the other servers had the CPU PSOD crash and we showed it to him. He said that it looked like a bad CPU and ordered us two new ones. We fixed the power issue on the original server, and a few days later it had a CPU PSOD as well, so he ordered us two replacement CPUs for it too. Funny thing is, some of the replacement CPUs we are getting have already been opened. On another order, one of the replacement CPUs was DOA. I am not sure that replacing the CPUs is the ultimate fix, but it has been so far. I think that they could PSOD any day. I have no confidence in the servers at this point. When we first started having these problems, HP had me download the Firmware CD 9.10C and flash all servers with the latest firmware. I did this and it made no difference.

Contributor

Also, I ran numerous comprehensive system tests with their Insight tool and they always passed at 100%, with no issues.

Contributor

Like I asked, what BIOS are you running?

That DVD is old already. I ran the newest 9.20 DVD and it still doesn't have the newer BIOS or iLO 1.15 on it. Useless. The BIOS has to be updated from a USB key, and that is what I did. iLO is updated through the web administration interface using the BIN file. I don't know if it will help, but it might, and any clues are good clues. We bought 29 BL465c G7 blades and have seen this on only 2 of them. On the one I updated it has not happened again, and none of the other updated blades (most are updated) have shown this error. I know this is by no means conclusive or even indicative that this is a fix, but it should be tried.

In short: BIOS 2010.09.30 (from Oct 15), iLO 1.15 and, if you can, the newer NIC driver should be applied: NIC 2.102.517/518. This is already out for Windows (see the drivers and firmware downloads for the BL465c G7) and should be out 'soon' for ESX 4.1 from VMware (not yet on the HP site as of this writing).

Hope this helps as these problems have been trying!
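
With this many blades, a small script can flag which hosts are still below the versions in the "In short" line. This is only a sketch: the BIOS and iLO baselines are the ones named above, but the inventory values are invented for illustration, and it assumes you record the BIOS as a dotted date (2010.09.30 style) and the iLO firmware as major.minor.

```python
# Baseline versions taken from the post above.
BASELINE = {"bios": "2010.09.30", "ilo": "1.15"}


def version_key(version):
    """Turn a dotted version such as '1.15' into a tuple of ints for comparison."""
    return tuple(int(part) for part in version.split("."))


def outdated(inventory):
    """Return {component: (installed, baseline)} for components below the baseline."""
    return {
        component: (installed, BASELINE[component])
        for component, installed in inventory.items()
        if component in BASELINE
        and version_key(installed) < version_key(BASELINE[component])
    }


# Hypothetical inventory; host names and versions are made up for this example.
hosts = {
    "blade01": {"bios": "2010.06.02", "ilo": "1.10"},
    "blade02": {"bios": "2010.09.30", "ilo": "1.15"},
}

for name, inventory in hosts.items():
    print(name, outdated(inventory) or "up to date")
```

Comparing tuples of ints rather than raw strings avoids "1.9" sorting above "1.10".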
