VMware Cloud Community
MK22
Contributor

Dell R805 Uncorrectable ECC memory error - crashed ESXi host

We have five Dell R805 two-socket servers with dual-core AMD 2222 processors, spread across two ESXi host clusters, some with 32 GB and some with 64 GB of RAM, plus two Dell R805 two-socket servers with quad-core AMD 2360 processors and 64 GB of RAM in a third ESXi host cluster. Three times now an "uncorrectable ECC memory error" has crashed and restarted an ESXi host at the hardware level, each time on a different dual-core 2222 server; it has not happened on the quads. Dell had us replace the memory after the first incident and flash the BIOS and BMC firmware after the second (after two weeks of meetings with account reps, tech managers, etc.). The third incident happened today, after three months of running fine. The first two incidents happened under ESXi 3.5; since then we have upgraded all the hosts to ESXi 4. The memory tests fine with the VMware-recommended utility, http://www.memtest.org. Has anybody else experienced "uncorrectable ECC memory errors"?

VCP

15 Replies

MK22
Contributor

I cannot believe nobody has replied to this. We can't be the only ones with this issue, as it has occurred on three of our seven R805 machines. Are we the only ones in the world running production VMs on Dell R805s with AMD 2200 procs?

VCP

vm_arch
Enthusiast

Suggest you check the BIOS logs on the servers that have had the problem.

We get that every so often on a Dell 2950 with quad-core Intels. I looked in the BIOS log the other day and found a note saying that DIMM 8 had been 'disabled' because it failed an ECC check. After ESX fails, we reboot the server and it runs for ages; then the BIOS locks out DIMM 8 and things go hinky again.

It doesn't seem to matter which physical DIMM is in socket 8. Even when the memory is swapped out we get the same issue: an uncorrectable memory error in ESX and a "DIMM8 disabled due to ECC failure" entry in the BIOS.

I ended up leaving the last pair of DIMMs out, and the error hasn't occurred for about 3 months now.

My suspicion comes from something I saw with HP DL servers once. Each DIMM can be seen as single rank per side or dual rank per side, and some servers can only take so many ranks. The DL380 G4 had a limit (I seem to recall) of 8 ranks total and no more than 2 ranks per DIMM, so a dual-sided, dual-rank DIMM counted as four ranks (and thus wouldn't work, or wouldn't function as expected).

With the Dell, I believe that the 4 GB DIMMs I had in it (which counted as dual-sided AND dual-rank) might somehow exceed whatever the rank limit is. I haven't found a published limit; this is just a theory based on experience.
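
The rank arithmetic behind this theory can be sketched in a few lines of Python. This is only an illustration: the 8-rank limit is the figure recalled above for the DL380 G4 (not a published Dell limit), counting a dual-sided, dual-rank DIMM as four ranks follows the reasoning above, and the function name and the eight-DIMM layout are made up for the example.

```python
def total_ranks(dimms):
    """Sum the rank counts; each DIMM is given as (sides, ranks_per_side)."""
    return sum(sides * ranks_per_side for sides, ranks_per_side in dimms)


# Hypothetical layout: eight dual-sided, dual-rank DIMMs (4 ranks each).
dimms = [(2, 2)] * 8

# Assumption: the limit recalled for the DL380 G4, not a documented Dell value.
RANK_LIMIT = 8

ranks = total_ranks(dimms)
print(f"{ranks} ranks populated, limit {RANK_LIMIT}, over limit: {ranks > RANK_LIMIT}")
```

If the theory holds, pulling DIMMs until the total drops to the limit (whatever it really is on a given board) should make the errors stop, which would match what happened when the last pair was left out.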

sr01
Enthusiast

Wow, I have experienced these errors too, on a Dell R805. It happened once in August and again last week. I also ran memtest, but it couldn't find any errors. Dell told me to swap the RAM sticks to other slots, so I moved them to A3 and A4. Let's see what happens next. Disappointing, though. I was hoping the RAM could correct itself since it's ECC.

ECC Uncorr Err: Memory sensor, uncorrectable ECC ( DIMM_B3 DIMM_B4 ) was asserted.
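
If you are collecting these events from several hosts, a small script can pull the slot names out of the message text so you can see which slots keep recurring. A minimal Python sketch, assuming the slot names always look like DIMM_B3 (one letter followed by digits); the function name is made up for the example.

```python
import re
from collections import Counter

# Assumption: slot names are "DIMM_" + one capital letter + digits, e.g. DIMM_B3.
SLOT_RE = re.compile(r"DIMM_[A-Z]\d+")


def dimm_slots(message):
    """Return the DIMM slot names mentioned in one log message, in order."""
    return SLOT_RE.findall(message)


events = [
    "ECC Uncorr Err: Memory sensor, uncorrectable ECC ( DIMM_B3 DIMM_B4 ) was asserted.",
]

# Tally how often each slot shows up across all collected events.
counts = Counter(slot for event in events for slot in dimm_slots(event))
for slot, n in counts.most_common():
    print(slot, n)
```

Paste in the messages from each crash; a slot that keeps showing up after the DIMM in it has been swapped would point at the slot or the board rather than the stick.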

MK22
Contributor

Yes, that sounds like a pain, but at least the ECC is doing what it is supposed to and the BMC locks out that DIMM. We are actually getting hard resets because it is not locking out the DIMM. I think there is a problem with the way the R805 BIOS code talks to these AMD 2222 duals; our AMD 2360 quads never have this problem.

VCP

MK22
Contributor

This never got resolved. We are getting new servers to replace all the AMD 2222 dual-core machines and keeping the R805s with AMD 2360s.

VCP

sr01
Enthusiast

Question: did you have the same type of RAM in each slot? Same manufacturer, etc.? My RAM is a mix of Dell and Crucial, but with the same rank, timing, etc. Thanks.

MK22
Contributor

Yes, all our RAM is exactly the same. Some sticks were replaced after the original purchase with identical ones, one after each crash.

VCP

sr01
Enthusiast

Thanks, that's good to know. I hope this issue is fixed in a future BIOS/hardware update. It sucks when it randomly reboots without warning.

Cyberfed27
Hot Shot

We are seeing the same exact issue on two Dell M905 blades with the AMD processors.

They were bulletproof when we bought them with 4 GB DIMMs. We recently upgraded two M905s to 128 GB of RAM each, using all 8 GB ECC DIMMs (Dell-certified memory purchased from Dell).

BOTH blades have failed multiple times, logging the same ECC memory error you posted and rebooting the system. All the DIMMs have been replaced multiple times, and we even had the system board replaced on one of the servers. No luck whatsoever. Every week or so the system would crash and reboot.

We've done all the BIOS/firmware updates recommended by Dell, etc.

Dell has been useless up to this point and has not provided a resolution. If you ever found a solution please let me know!

VirtualManTR
Contributor

Hello all,

We had the same problem with a brand-new PowerEdge server. Dell replaced the motherboard, but the situation stayed the same. I moved the RAM modules to different slots, and the server still crashed. After 3 weeks of hard work, I found the solution:

The reason the server keeps crashing is the processor's power-save mode. When the processor wakes from power-save mode, the RAM module can't wake up as fast as the processor.

Disable the processor's C-states in the BIOS and your server will no longer crash with this error.

wb2
Contributor

I'm seeing the exact same thing with an R900 server. Did you ever figure out what the problem was on your server?

ElTech
Contributor

I have a Dell 2850 and we got the EB10C UNCOR ERR. We just replaced the memory with some known-good memory, and it booted just fine with the new memory. When you take the top panel off the server, the memory in question is not the DIMM (memory module) in the top left corner (ours was a 256 MB module); that was not our problem. If you look in the middle, a black plastic lid covers the kernel memory, and you can replace those modules just like you would on your PC. We had two sticks of memory and we replaced both. Hope this helps. It worked for us.

kurtd
Enthusiast

I'm getting the same error with two Dell R715s. They are about 2 years old and I've been getting the error since they were new. So far I've had two DIMMs and the motherboard replaced. I'm still getting the errors, but have not seen them on the DIMMs that were replaced. Our systems have a mix of Samsung and Hynix RAM. Most of the time the errors are on the Hynix DIMMs. These are the errors:

LCD: “E2111 SBE log disabled on DIMM A3”

LOG: “Persistent correctable memory error logging disabled for a memory device at location DIMM_A3”

The servers seem to work fine otherwise, but I need to clear the log once a week or so to get a blue LCD screen. Pretty annoying.

gdmersh
Enthusiast

Hi,

I have a few Dell R720 machines.

I see 5 correctable memory errors when I run ipmitool sel list, but it doesn't show which DIMM slot encountered them.

How do you see the log that actually shows the DIMM module?

Where do I see the BIOS logs?

Thanks