I have a very strange problem on seven new HP ProLiant DL385 G7 servers (latest firmware and upgrades) with ESX 4.1 installed: at random times and without warning, the system crashes with a PSOD. The first line follows:
- Uncorrected ECC error in L3 Cache LRU on CPU 0 cache index 0x25b PCPU0 in world 4120:idle
I'm afraid the hardware is the problem ...
The PSOD screen is attached to this thread.
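For anyone collecting these screens from several hosts, here is a minimal sketch of pulling the fields out of the first PSOD line so crashes can be compared side by side. The pattern is modelled only on the single line quoted above, so it is an assumption that other ESX builds word the message the same way; `parse_psod_line` is a hypothetical helper name, not a VMware tool.

```python
import re

# Pattern based only on the one PSOD line quoted in this thread (assumption:
# other hosts print the same wording and field order).
PSOD_RE = re.compile(
    r"Uncorrected ECC error in (?P<unit>.+?) on CPU\s*(?P<cpu>\d+)"
    r"\s+cache index (?P<index>0x[0-9a-fA-F]+)"
    r"\s+PCPU\s*(?P<pcpu>\d+) in world (?P<world_id>\d+):(?P<world_name>\S+)"
)


def parse_psod_line(line):
    """Return the fields of the PSOD line as a dict, or None if it does not match."""
    match = PSOD_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "unit": fields["unit"],                 # e.g. "L3 Cache LRU"
        "cpu": int(fields["cpu"]),              # CPU number as reported
        "cache_index": int(fields["index"], 16),  # hex index converted to int
        "pcpu": int(fields["pcpu"]),            # physical CPU number
        "world_id": int(fields["world_id"]),
        "world_name": fields["world_name"],
    }


if __name__ == "__main__":
    sample = ("Uncorrected ECC error in L3 Cache LRU on CPU 0 "
              "cache index 0x25b PCPU0 in world 4120:idle")
    print(parse_psod_line(sample))
```

If the same CPU and cache index keep showing up on one host, that would point more towards a specific part than towards a software issue, though a handful of samples proves little either way.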
Indeed, it looks like a problem with the memory.
We also have a customer with two ProLiant DL385 G7 performance-model servers, each with two 12-core AMD processors and 32 GB of memory, who is getting the same PSOD screens. The first one happened at night with only backups running, another happened during a P2V conversion with no other load on the server, and the last one also happened at night. I called HP tech support and they asked me to boot from the SmartStart CD and run all the tests.
Now I'm trying some minor configuration changes. First, I've set the system power management to full performance (it was on OS controlled), and second, I've disabled the C1E processor state.
I'll keep you posted on how it works.
PS: I'm not a VMware employee; I just want the system up and stable so I can start deploying VMs.
We have a call raised with VMware and HP on this, and it seems to be related to ALL 61xx AMD chipsets, at least on HP servers.
We have four DL385s running 4.1 at the moment, and three of them are failing with errors. One fails regularly, every day. Since this issue began, we have upgraded the BIOS on the HP servers to the latest version for ESXi and also installed the latest hotfixes. So far we have managed over 24 hours with no failures.
I will keep this thread updated on how reliable the servers are. The problem is that the PSOD reports a memory/hardware failure but gives little detail about where the fault lies.
We have also run extensive hardware checks, which found no issues. Like some on here, we have one server that just plods on with no issues, and it has IDENTICAL hardware and VMware setup. We have not checked the BIOS settings and hardware layout yet.
Well, this is with the same ESXi 4.1 and the same processor but a different server:
2 x 12-core 2.1 GHz
64 GB of memory using 8 GB memory modules.
Running HP Insight Diagnostics, complete test, 2 loops, turns up no issues. This is using SmartStart 8.50 x64.
Attached is my contributing screen.
This is the only server so far.
I will add that this is with iLO 3 1.10 and BIOS ROM A19 06/02/2010, and without the November 15 VMware updates for ESXi 4.1.
We have just received 3 of the new DL385 G7 servers with the AMD 12-core 2.2 GHz processors (6174). Two of these servers are experiencing the same issues you guys are seeing. After running for a random amount of time, we get a pink screen crash with an Uncorrected ECC error in the L3 Cache LRU on CPU 0. We have replaced the CPUs on one of the boxes and it has not crashed since. Our second box had this error yesterday and we are getting HP to replace its CPUs. At this point we are not sure if this is a VMware / AMD issue or if there is a bad batch of AMD processors out there.
Well, the one server that had not crashed finally crashed on us after weeks of running fine.
This is really starting to be an issue for us, as this is a live environment and I can't afford to remove the servers from the farm. I really hope we get a solution soon.
Thus far there have been no PSODs on the machines that have had their CPUs replaced. BTW, we have gone through several bad CPUs among the replacements. We have never had this many problems from HP before.
What BIOS are you on? Did any of the machines that got the PSOD have the newer updated BIOS dated 2010.09.30 (15 Oct 2010)?
See the release notes (http://bit.ly/epe1bI) for the latest System ROM update.
It's a short version of an HP link.
This would be good to know, as I have had a couple of PSODs but have not yet experienced one again on the server I updated with this BIOS.
I will upgrade one of the four DL385 G7s to the latest firmware this evening, and the rest in two days.
I opened a case with HP support two weeks ago. They asked me to run diagnostics on the Smart Array and 7 loops of Insight Diagnostics. I will send them the results for this server today.
Did they suggest replacing the AMD processor?
How did you manage to get the processor replaced?
Did they send a technician or just the processor?
Like I originally said, we just purchased 3 new DL385s. One of them was experiencing a random power-off issue. We called HP and they sent out an HP CSE (we have known him for years) to help us resolve it. While he was here, one of the other servers had the CPU PSOD crash and we showed it to him. He said that it looked like a bad CPU and ordered us 2 new ones. Well, we fixed the power issue on the original server and a few days later it had a CPU PSOD. He ordered us 2 replacement CPUs for it too. Funny thing is, some of the replacement CPUs that we are getting have been opened already. On another order, one of the replacement CPUs was DOA. I am not sure that replacing the CPUs is the ultimate fix, but it has worked so far. I think that the servers could PSOD any day, and I have no confidence in them at this point. When we first started having these problems, HP had me download the Firmware CD 9.10C and flash all servers with the latest firmware. I did this and it made no difference.
Like I asked, what BIOS are you running?
That DVD is old already. I ran the newest 9.20 DVD and it still doesn't have the newer BIOS or iLO 1.15 on it. Useless. The BIOS needs to be updated from a USB key, which is what I did. iLO is updated through the web administration interface using the BIN file. I don't know if it will help, but it might, and any clues are good clues. It hasn't happened since I updated the BIOS. We bought 29 BL465c G7 blades and have seen this on only 2 of them; on the one I updated it has not recurred, and none of the other updated blades have seen this error (most are updated). I know this is by no means conclusive or even indicative that this is a fix, but it should be tried.
In short: BIOS 2010.09.30 from Oct 15, iLO 1.15 and, if you can, the newer NIC driver (2.102.517/518) should be applied. The NIC driver is already out for Windows (see the drivers and firmware downloads for the BL465c G7) and should be out 'soon' for ESX 4.1 from VMware (not yet on the HP site as of this writing).
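If you have a lot of boxes to check, a rough sketch like the one below can flag which ones are still behind the versions listed above. The baseline numbers come from this post; the host inventory is entirely hypothetical and would have to be filled in by hand or from whatever inventory you keep, and I'm assuming the versions compare correctly as dotted numbers (for the NIC driver I used 2.102.517, the lower of the two).

```python
# Baseline taken from the versions named in this post.
BASELINE = {
    "bios": "2010.09.30",   # System ROM date, written as yyyy.mm.dd
    "ilo": "1.15",
    "nic": "2.102.517",     # assumption: lower of the 517/518 pair
}


def version_tuple(text):
    """Turn a dotted version such as '2.102.517' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def outdated_components(host):
    """Return a sorted list of components on this host that are below the baseline.

    Components missing from the host record are skipped rather than flagged.
    """
    return sorted(
        name
        for name, minimum in BASELINE.items()
        if name in host and version_tuple(host[name]) < version_tuple(minimum)
    )


if __name__ == "__main__":
    # Hypothetical inventory; host names and versions are made up for illustration.
    inventory = {
        "esx-host-a": {"bios": "2010.06.02", "ilo": "1.10", "nic": "2.102.517"},
        "esx-host-b": {"bios": "2010.09.30", "ilo": "1.15", "nic": "2.102.518"},
    }
    for hostname, versions in inventory.items():
        print(hostname, outdated_components(versions))
```

An empty list for a host only means it meets these three versions, not that it is safe from the PSOD.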
Hope this helps as these problems have been trying!