VMware Cloud Community
burdweiser
Enthusiast

Memory upgrade nightmare

I'm having some trouble with a memory upgrade I attempted this weekend. I have two hosts in an HA cluster; both are Dell 2970s. I was upgrading from 32GB to 64GB of RAM, going from 4GB sticks to 8GB sticks with all slots populated. The original RAM was Samsung 4GB / 2Rx4 / PC2 / 5300P / 555. The new RAM is Hynix 8GB / 4Rx4 / PC2 / 5300P / 555. Does it matter that the modules went from dual-rank (2Rx4) to quad-rank (4Rx4)?

The servers are just going down hard, with no PSOD. I get the amber warning on the Dell front-panel display, but the RAM failure seems pretty random. I've run a battery of diagnostics on the RAM and everything passes. Am I missing anything here? Everything is configured correctly in the BIOS, and the host seems perfectly happy to run with the new RAM, but the time to failure is pretty random. Do I need to be looking in the hostd.log file?

0 Kudos
24 Replies
weinstein5
Immortal

I take from your description that the servers do not even boot - is this correct? If they do not boot, then the problem sounds like it is at the hardware level - make sure the BIOS is up to date and confirm with Dell that the memory is supported.

If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful
0 Kudos
burdweiser
Enthusiast

We ordered the RAM directly from Dell. The modules are listed as supported for the 2970s. It's very strange, though, that the diagnostics do not find anything wrong. I'm going to try the BIOS update from 1.5.4 to 3.0.2.

0 Kudos
weinstein5
Immortal

Hopefully that will be it - when I have added memory in the past, I just put the memory in and it was available.

If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful
0 Kudos
burdweiser
Enthusiast

The BIOS update included a "memory initialization code" fix. I will have to watch the host for the rest of the day to see if it goes down again. One very odd thing I noticed, though: before the BIOS update the server reported 64GB of RAM; after the update, VirtualCenter shows 63.99GB. Very strange.

0 Kudos
burdweiser
Enthusiast

It looks like the BIOS upgrade didn't fix the issue. The server failed again. The nice little amber light flashes on the front of the server with messages about DIMMs 5 and 6. We are going to open a case with Dell and see about getting the RAM replaced. Is there a particular log I can look at on the ESX server to see whether any memory errors were written before the server went down? Should that be in hostd.log?

0 Kudos
SuryaVMware
Expert

If it were a PSOD, you would be able to see an NMI message showing which memory bank is receiving the errors. But I guess in your case it is a brutal power-off.

I would check the BIOS event logs for memory errors. In any case, it is worth reviewing /var/log/vmkernel and /var/log/messages.
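
For example, something along these lines from the service console should pull out any memory-related entries; the exact message strings vary by hardware, so treat this as a starting point rather than a definitive check:

grep -iE 'mce|ecc|nmi|memory' /var/log/vmkernel /var/log/messages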

-Surya

0 Kudos
gorto
Enthusiast

Smells like hardware... forget the Dell diags - use Memtest86 and soak-test the new memory for a week without failures.

0 Kudos
burdweiser
Enthusiast

Surya - What would be a good command to run to view the log files? I tried to set up an account with root permissions and use WinSCP to copy the files (vmkernel and messages), but I get a permission denied error. --- Never mind, I forgot I had to connect directly to the server and view the logs under Administration.

Gorto - I cannot get the server to stay up for more than 12 hours (sometimes only 2 hours), so I'm not sure how useful Memtest86 will be for me. It does sound like a good tool that I will invest in, though.

0 Kudos
SuryaVMware
Expert

I guess you are getting the permission denied message from SSH itself. You will have to enable root login via SSH.

Edit /etc/ssh/sshd_config and uncomment the following line, or set it to yes if it is set to no:

PermitRootLogin yes
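
After saving the file, you may also need to restart the SSH daemon for the change to take effect - on the ESX service console that should be something like:

service sshd restart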

Your SSH login should then work.

If you are on the console, you can use tail or less to review the logs - it's just like reviewing any other Unix log; there is no special command.
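
For example, from the service console (same log locations as above):

tail -n 100 /var/log/vmkernel
less /var/log/messages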

As far as Memtest86 is concerned, it is a good program for reporting memory errors, though it can take a lot of time. If I were you, I would try removing some of the memory modules from the server and see how stable it is with non-production VMs.

Let me know if this helps.

-Surya

0 Kudos
burdweiser
Enthusiast

Is there a compatibility guide out there for RAM? I know there is one for just about everything else. I tried searching but couldn't find anything. Dell took some reports from our 2970s and is sending them off to the manufacturer (Hynix) to find out why their RAM is certified on Dell systems but not working. They sent us another batch of RAM, so I am going to attempt another upgrade this weekend, but only on one host.

0 Kudos
Lightbulb
Virtuoso

You may have caught a bad lot of DIMMs. I once had three bad motherboards in a row and later found out that they were all from a bad lot, but boy, does something like that shake your confidence in your skills.

0 Kudos
burdweiser
Enthusiast

Is there a way to capture the logs from the Memtest86 application? I am running it now, but the app seems to just sit there after I boot from the CD.

Just a quick update. I installed the replacement Hynix RAM into the system, and it ran for about 3 or 4 weeks before crashing on two DIMMs. It took Dell about 12 hours (on our Gold support contract) to get someone out to replace the motherboard. They put the same RAM back into the system. Now I need to run a battery of tests.

0 Kudos
Lightbulb
Virtuoso

Can't think of any way you could redirect the output of the tests. Vendor support tends to fluctuate over time. The last time I dealt with Dell, a few years ago, they seemed to be improving after a bad patch - their Gold support was pretty good, though - so maybe they are feeling the pinch again.

Hope things work out for you.

0 Kudos
Rumple
Virtuoso

If you did not get a blue status screen showing the tests in progress, then it's not booting at all on the servers.

Did you download Memtest86 or Memtest86+?

The + version has better hardware support.

0 Kudos
burdweiser
Enthusiast

I was using the regular version. I am going to try the + version now. Thanks.

0 Kudos
burdweiser
Enthusiast

We ran Memtest86+ for a week and didn't find any issues.

We had a Dell tech come out and replace the motherboard, but we only got two weeks' worth of life out of the system before the RAM failed again. They are trying to do goofy stuff, like replacing a processor. The Dell techs are giving me crap, telling me it's a VMware problem, and they want me to get VMware support on the phone before they will do anything further. But we have a TAM and an account team that should be getting us a full system replacement soon. Keeping my fingers crossed.

0 Kudos
aldikan
Hot Shot

Hi Burdweiser,

It might have escaped me, but do you have the Dell OpenManage agent installed on your ESX hosts?

It logs hardware-related events, and that log might help here.

We have it configured on all our hosts and have found it very useful for hardware troubleshooting.
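
If the OMSA command-line tools are installed in the service console, something like the following should dump the hardware (ESM) event log and the per-DIMM status - I'm going from memory on the exact syntax, so check omreport -? on your version:

omreport system esmlog
omreport chassis memory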

Also, do you have a DRAC (Dell Remote Access Card) in your hosts?

If you do, it will be helpful for capturing any on-screen errors with a screen-capture tool.

HTH,

Alex

0 Kudos
burdweiser
Enthusiast

I do have OpenManage installed, and it is capturing the hardware errors. I've also been running the Dell DSET tool so the techs can look at the logs. The DRAC does not function when the system goes down - there is nothing on the screen, it just goes blank, and I have to pull the power from the server just to get it to turn back on. This is really frustrating because the Dell techs want to blame it on a VMware issue, when I know it is not. VMware ran fine for the last 9 months while the server had 32GB of RAM; we went to 64GB and now we only get two weeks of life out of the system before it crashes. I do not see anything wrong in the host logs. We just need a system replacement.

0 Kudos
aldikan
Hot Shot

The fact that even the Dell Remote Access Card (DRAC) stops functioning when the server goes down further points to a hardware issue.

The DRAC is meant for out-of-band management, independent of any OS installation and software errors.

I would try to get your Dell account manager involved. They should be able to push the case forward until it lands with a knowledgeable engineer.

Please keep us posted on further developments,

Alex

0 Kudos