Solved: Re: HP DL 385 and 585 G5 - memory woes?, Dell too?

pdrace · ‎07-03-2008

We recently purchased 3 new servers, 2 DL 585 G5s and a 385 G5 with 4GB DIMMs. The HP service guy has practically become a staff member since we started running VMware (3.5 update 1) on them. They have replaced all 12 DIMMs on one and all the DIMMs on the other. They've replaced 2 on the 385. One of the 585s and the 385 have now finally been running for a week straight without problems. The other 585 came up with 2 new DIMM errors this morning after running for 36 hours or so.

They are telling me that other customers are seeing memory issues with VMware as well. The contention is that VMware tolerates fewer errors than other operating systems. I'm wondering if it's just a quality control issue with the memory form factor.

Is anybody else seeing this?

How about on Dell boxes with the equivalent processors and memory?

Is there a server lemon law?

dominic7 · ‎07-26-2008

I have to agree completely with pdrace. Where I work we run a large number of DL585 G1s and G2s, and memory failures are indeed a common occurrence. On a weekly basis I would say that I have HP come in and replace at least one 2GiB DIMM somewhere in my environment. With the G1s HP is almost never able to determine which DIMM is actually causing the problem (except in the rare case that the fault light turns on; it generally results in a PSOD on the G1s) and we force them to replace all 32GiB in the server. In the past we had HP replace a single DIMM at a time, but it's too much to risk having 30+ VMs crash on a host because HP can't determine with any degree of success which DIMM is actually the problem. I've literally replaced probably 300GiB of RAM in total on the G1s. The G2s are much more graceful and report ECC errors in the IML, so we can VMotion all the VMs off and replace the RAM without downtime.

When you finish with the memory errors, watch out for the NMI errors on the G1s when running multiple VMotions.

kjb007 · ‎07-03-2008

I previously ran DL380/5 and DL580/5 and don't agree with that statement. I currently run Dell 6950s and I again disagree with that statement. In my view, you just got a bad bunch of sticks.

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB

pdrace · ‎07-03-2008

What "statement" are you disagreeing with?

The HP service people are saying they have other customers with the same issue.

kjb007 · ‎07-03-2008

The tolerance to memory errors is what I am disagreeing with. I'm not saying that I've had no memory issues, and every system has soft errors now and then, but I haven't seen soft errors cause problems with my ESX.

-KjB

rDale · ‎07-06-2008

I'm running DL585s with a total of 384 x 4GB DIMMs and have only had 1 failure over a period of 2 months.

rminick · ‎07-15-2008

Where are you seeing the memory errors, in the IML?

I have recently added 2 DL385 G5s to our mix of ProLiants. They had been running great for a few weeks until now. I had one hard lock early this morning. Of course it passes HP's diag tests. The only things I see in the kernel logs are some path changes on the fabric, which have never caused this before, and one error I have never seen before: "ALERT: CpuSched: 12575: processor apparently halted for 8017258545704 ms".

Richard

Richard J Minick, VCP
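
For a sense of scale, the duration in that CpuSched alert can be converted into more familiar units. This is only arithmetic on the number quoted in the post; reading the result as a bogus timer value rather than a real stall is an assumption, not something the log itself confirms.

```python
# Convert the "processor apparently halted" duration from the log line into
# human-scale units. The figure is copied from the post; treating it as an
# implausible timer/counter reading is an assumption.
HALTED_MS = 8017258545704

seconds = HALTED_MS / 1000
days = seconds / 86_400          # 86,400 seconds per day
years = days / 365.25            # average year length in days

print(f"{seconds:,.0f} s = {days:,.0f} days = {years:,.1f} years")
```

The result is roughly 254 years, far longer than the server has existed, so the number itself is probably not a literal halt time.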

Texiwill · ‎07-16-2008

Hello,

Actually, VMware is not less tolerant of Memory errors, it is just that it USES all memory in the system unlike other operating systems. So if you had say 8GBs of memory in a Windows box, the chances of you touching all memory in the system are pretty slim unless you are running something very large. So since ESX touches and uses all the memory depending on load, what you see is different. A few things with the DLxx5 series. Make sure your CPU/Memory boards are balanced. I.e. do not have 32GBs on one board and 16 on the other. VMware definitely does not like that!
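
The "ESX uses all the memory" argument can be put into rough numbers. The sketch below is a toy model, not a description of how ESX or Windows actually place pages: it assumes faulty locations are independent and uniformly spread over physical memory, and that the OS keeps a fixed fraction of RAM in use. The function name and the example fractions are made up for illustration.

```python
def p_touch_fault(fraction_in_use: float, faulty_locations: int = 1) -> float:
    """Probability that at least one faulty location lies in memory in use.

    Hypothetical model: faults are independent and uniformly distributed,
    and the OS uses a fixed fraction of physical memory.
    """
    if not 0.0 <= fraction_in_use <= 1.0:
        raise ValueError("fraction_in_use must be between 0 and 1")
    # Each fault is missed with probability (1 - fraction_in_use).
    return 1.0 - (1.0 - fraction_in_use) ** faulty_locations


# Illustrative fractions only, not measurements from this thread.
lightly_used = p_touch_fault(0.25)   # e.g. an OS touching a quarter of RAM
heavily_used = p_touch_fault(0.95)   # e.g. a hypervisor host under load
print(f"lightly used: {lightly_used:.2f}, heavily used: {heavily_used:.2f}")
```

Under these assumptions the same bad DIMM is far more likely to be noticed on a host that keeps nearly all of its memory busy, which fits the argument that ESX exposes errors rather than being less tolerant of them.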

Also you could have bad memory, it happens from time to time. With the density of memory it was discovered that Shielded memory tends to work better than non-shielded. Also make sure the Memory is HP Branded. It tends to work much better than non-HP memory.

I would definitely make sure your memory is shielded, but other than that, it is possible you got several bad sticks of memory.

As for testing, you want to run memtest86 and not the HP diags. Run them for at least 24-48 hours. And do the same for the HP Diags before you even place ESX on the system.

I saw quite a few issues with memory when working at HP, and most of the time it was shielded vs. non-shielded. At least in really dense configurations.

Best regards,

Edward L. Haletky

VMware Communities User Moderator

====

CIO Virtualization Blog: http://www.cio.com/blog/index/topic/168354

As well as the Virtualization Wiki at http://www.astroarch.com/wiki/index.php/Virtualization

--
Edward L. Haletky
vExpert XIV: 2009-2023,
VMTN Community Moderator
vSphere Upgrade Saga: https://www.astroarch.com/blogs
GitHub Repo: https://github.com/Texiwill

pdrace · ‎07-16-2008

Hello,

>Actually, VMware is not less tolerant of Memory errors, it is just that it USES all memory in the system unlike other operating systems. So if you had say 8GBs of memory in a Windows box, the chances of you touching all memory in the system are pretty slim unless you are running something very large. So since ESX touches and uses all the memory depending on load, what you see is different. A few things with the DLxx5 series. Make sure your CPU/Memory boards are balanced. I.e. do not have 32GBs on one board and 16 on the other. VMware definitely does not like that!

Not my idea; this is coming from HP's service. Even their escalation engineer was saying this.

>Also you could have bad memory, it happens from time to time. With the density of memory it was discovered that Shielded memory tends to work better than non-shielded. Also make sure the Memory is HP Branded.

The memory was factory installed by HP. Even though it may be HP branded, the memory may be from any number of OEMs such as Kingston, Micron, Samsung, Hi-Link, etc. Their techs are saying that they are having better luck when replacing DIMMs if they replace them in pairs and use memory from the same OEM. At one point they were suggesting that the problem was with memory from one particular OEM but then later backed off that claim.

This is actually kind of humorous. I had a big dust-up with HP before about obtaining additional memory for 2 DL 385s. I ordered memory from HP to upgrade these boxes from 8 to 16 GB. After receiving the order, the reseller contacted me to tell me that HP couldn't provide these parts anymore and that they would have to obtain the memory second source from Kingston. I objected strenuously to this, to no avail. I almost cancelled the order for the 585 G5s based on this. Maybe I should have? I ended up ordering the upgrade memory myself from Crucial. These two servers have had no new issues after adding this memory and installing 3.5 update 1. I did have to have 2 of the original HP branded DIMMs replaced though. :smileydevil: (There had been internal health warnings on this server before the memory was added.)

>It tends to work much better than non-HP memory.

>I would definitely make sure your memory is shielded, but other than that, it is possible you got several bad sticks of memory.

>As for testing, you want to run memtest86 and not the HP diags. Run them for at least 24-48 hours.

Their escalation engineer ran their own proprietary meatgrinder utility on one of the 585s all weekend after replacing all of its memory. They only ran it for a couple of hours on the other, and we ended up having another failure within 24 hours of booting it back up in ESX.

>And do the same for the HP Diags before you even place ESX on the system.

I ran HP diags on a DL 385 G5 all weekend before putting ESX on it. A DIMM failure occurred after 5 days of running ESX 3.5. They replaced the DIMM and its partner and it has been OK since.

>I saw quite a few issues with memory when working at HP, and most of the time it was shielded vs. non-shielded. At least in really dense configurations.

I'll look into this but I didn't see any option to choose shielded memory when configuring these servers.

Texiwill · ‎07-16-2008

Hello,

>Not my idea; this is coming from HP's service. Even their escalation engineer was saying this.

If they are a hardware engineer, they would say that.... Sigh.

>The memory was factory installed by HP. Even though it may be HP branded, the memory may be from any number of OEMs such as Kingston, Micron, Samsung, Hi-Link, etc. Their techs are saying that they are having better luck when replacing DIMMs if they replace them in pairs and use memory from the same OEM. At one point they were suggesting that the problem was with memory from one particular OEM but then later backed off that claim.

Factory installed does not really mean much these days.

>This is actually kind of humorous. I had a big dust-up with HP before about obtaining additional memory for 2 DL 385s. I ordered memory from HP to upgrade these boxes from 8 to 16 GB. After receiving the order, the reseller contacted me to tell me that HP couldn't provide these parts anymore and that they would have to obtain the memory second source from Kingston. I objected strenuously to this, to no avail. I almost cancelled the order for the 585 G5s based on this. Maybe I should have? I ended up ordering the upgrade memory myself from Crucial. These two servers have had no new issues after adding this memory and installing 3.5 update 1. I did have to have 2 of the original HP branded DIMMs replaced though. :smileydevil: (There had been internal health warnings on this server before the memory was added.)

Crucial makes very good memory. I think most of their high-speed stuff is also shielded.

>Their escalation engineer ran their own proprietary meatgrinder utility on one of the 585s all weekend after replacing all of its memory. They only ran it for a couple of hours on the other, and we ended up having another failure within 24 hours of booting it back up in ESX.

Meatgrinder is pretty good. But I would run memtest86 even so. It's more thorough about memory than anything out there yet.

>I ran HP diags on a DL 385 G5 all weekend before putting ESX on it. A DIMM failure occurred after 5 days of running ESX 3.5. They replaced the DIMM and its partner and it has been OK since.

Very odd that.... But it could be a sustained heat type of thing. Meatgrinder may not have run long enough to see it.

>I'll look into this but I didn't see any option to choose shielded memory when configuring these servers.

Not sure there is. Speak to your sales rep and see what they can do. I worked a case (software, not hardware) where they had DL585s and a full load of memory, which was 128GBs at the time. When we went to shielded memory, all the problems ceased to be an issue.

I would escalate your case up the HP management chain.... Ask them to replace the memory with the appropriate memory for a system running ESX. Sometimes they have different batches, etc. This is one of those things that really depends on your support engineer to help as well as your sales rep.

Best regards,

Edward L. Haletky

pdrace · ‎07-17-2008

Well, one of the DL 585s that had been OK for a couple of weeks has an internal health error again this morning...

Texiwill · ‎07-17-2008

Hello,

The error may not be related to Memory but something else as well..... What was the exact error? I would also check the diagnostic lights on the server itself.

Best regards,

Edward L. Haletky

pdrace · ‎07-17-2008

It is a memory error. Probably not as severe, as it didn't cause errors in the vmkernel log or cause VM crashes.

HP is coming in to replace the DIMM(s) this afternoon.

Texiwill · ‎07-17-2008

Hello,

Are you set up with RAID memory? I really like that feature of HP hardware.

Best regards,

Edward L. Haletky

rDale · ‎07-21-2008

Funny thing, I just got hit by this problem: lost 12 modules in 1 week, all 4GB DIMMs. The odd thing is that none of these modules failed in the 4 months they were running in the BL685s, but now that they're in the DL585 G5 the failures are starting.

pdrace · ‎07-21-2008

Both 585 G5s and the 385 G5 had dimm errors displayed this morning!! :_|

Not bad enough to crash the host or vm but it doesn't instill any confidence. I've now installed the Insight Agents on all after reading this:

Our HP rep is offering to replace them with equivalent Intel based boxes or issue a refund for them. Seems like they have thrown in the towel.

pdrace · ‎07-21-2008

No, we don't have RAID memory. I thought that was only available on Intel based boxes.

Texiwill · ‎07-21-2008

Hello,

I have used both Intel and AMD based HP systems.... As for Raid memory you are correct... My mistake. If you are not set on AMD, going with Intel Quad-Cores may be a better approach... Either way I think something is still wrong with the hardware for it to throw errors like that all the time. Perhaps the CPU/Memory boards....

Best regards,

Edward L. Haletky

ALFi · ‎07-22-2008

Hi,

Back in the DL585 G1 days I saw that really often, and I also saw it on whitebox Tyan mainboards.

It happened only when more than 4 DIMMs per CPU were used or the DIMMs were running too fast.

The DIMMs were failing in random order. If we tested a failed DIMM in another machine, it was fine.

Sometimes the server crashed twice an hour, sometimes it ran straight for a week.

At roughly every 3rd crash, one or two DIMMs were marked as failed.

HP was changing the memory/CPU boards a lot (I think we had about 8 revisions in 6 months) and was downclocking the DIMMs if you had more than 4 per socket.

Today all 15 of our DL585 G1s are running fine. They have up to 96GB of memory.

I thought AMD had fixed this issue with the move from Socket 940 to Socket F.

Our DL585 G2s are running fine (although I think none of them have more than 4 DIMMs per socket).

Sorry, no DL585 G5s here.

One way to test the server was: install ESX, create many 32-bit 3.6 GB instances (up to the server memory size) and run www.memtest.org on every one of those instances. If this ran for 48 hours without a problem, the server was fine.
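
To size that kind of test, here is a rough sketch of how many 3.6 GB instances fit on a host. The function name is made up for illustration, and hypervisor and per-VM overhead are ignored, so in practice the usable count may be a little lower.

```python
import math


def memtest_vm_plan(host_gb: float, vm_gb: float = 3.6) -> tuple[int, float]:
    """Return (number of test VMs, GB of host memory they cover).

    vm_gb defaults to the 3.6 GB per 32-bit instance mentioned in the post.
    Hypervisor and per-VM overhead are ignored, which is a simplification.
    """
    if host_gb <= 0 or vm_gb <= 0:
        raise ValueError("sizes must be positive")
    count = math.floor(host_gb / vm_gb)
    return count, round(count * vm_gb, 1)


# 96 GB is the largest DL585 G1 configuration mentioned in the post.
count, covered = memtest_vm_plan(96)
print(f"{count} VMs cover {covered} GB of 96 GB")
```

For a 96 GB host this gives 26 instances covering 93.6 GB, leaving a few GB that this method does not exercise.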

ngrundy · ‎07-22-2008

> How about on Dell boxes with the equivalent processors and memory?

I'll throw an answer into the ring from a Dell perspective. We run a mix of 2950s, 6850s and 6950s. Across the fleet of 18 VM hosts totaling ~700GB of RAM, I think we've had 4 memory sticks replaced in 3 years.

In all honesty I'd say you just got a bunch of servers with either a bad batch of RAM or a dud mainboard revision.
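
To compare the anecdotes in this thread on a common scale, replacements can be normalized to installed RAM and time. The sketch below uses only figures reported by posters (the Dell fleet above, and rDale's 07-06 report of 384 x 4GB DIMMs with 1 failure in 2 months, before the later run of failures). The samples are small and self-reported, "~700GB" is taken as 700, and the function name is made up, so treat the output as a rough comparison only.

```python
def replacements_per_1000_gb_year(replaced: int, total_gb: float, years: float) -> float:
    """Annualized DIMM replacements per 1000 GB of installed RAM."""
    return replaced / (total_gb * years) * 1000


# Figures as reported by posters in this thread (small, self-reported samples).
dell_fleet = replacements_per_1000_gb_year(replaced=4, total_gb=700, years=3)
rdale_dl585 = replacements_per_1000_gb_year(replaced=1, total_gb=384 * 4, years=2 / 12)

print(f"Dell fleet:   {dell_fleet:.1f} per 1000 GB-year")
print(f"rDale DL585s: {rdale_dl585:.1f} per 1000 GB-year")
```

On these numbers both fleets see only a few replacements per 1000 GB per year, far below the rate described in the original post.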

pdrace · ‎07-22-2008

I have two 585 G1s with 32 GB each. I had 1 DIMM failure in the first week and haven't had a problem since. Same goes for two 385 G1s. We have also had four 585 G2s running RH Linux and Oracle for over a year with no failures. If this is common on the DL 385 and 585 G5s, that's really bad news for AMD, whether it's their fault or not.

It looks like we'll be trading these in for 580 and 380 G5s. I have a boxed 585 G5 that they sent as a replacement if we kept having problems with one of the boxes. At this point I can't even justify spending the time to try it.
