VMware Cloud Community
kiwid
Contributor

ESXi 4.1 U1 PSOD on ProLiant ML370G6.. help?? -screenshot attached

Hi,

I'm having trouble with an ESXi server that I've recently built: it's PSOD'ing every few hours.

I've attached screenshots of this. If someone could take a look and see if they can identify what the issue might be, I would appreciate it immensely!

I ran memtest86 and it came up with the attached error almost instantly. I am running it again now and it's been going for about an hour without any errors (so far).

From looking at PSOD 1, it looks like a CPU problem. Would you agree? Or do I have a faulty RAM stick in there?

I'm using the HP ProLiant specific version of ESXi 4.1 U1. The system is an HP ProLiant ML370 G6 with 10GB RAM and 1.2TB of local SAS disk.

Also, can someone explain the HP NMI drivers for me? From reading their write-up, it looks like these might be triggering the PSOD instead of letting the system keep running and logging the problem. Surely it's better to log and carry on?

Thanks


Accepted Solutions
idle-jam
Immortal

I have a feeling that it's the memory rather than the CPU. I would advise removing all the memory, leaving only the minimum RAM installed, and testing with memtest. Add one module after each test until the error appears; from there you can isolate which module is faulty. If you have VMware support, they would be able to pinpoint the error very quickly.
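
The add-one-module-at-a-time procedure described above can be sketched as a simple loop. This is only an illustration in Python, not a tool that talks to the hardware: `isolate_faulty_module` and the `passes_test` callback are hypothetical names standing in for physically installing modules and running memtest by hand. It also assumes a single faulty module whose fault shows up in every test run; an intermittent fault may need longer or repeated runs before a pass can be trusted.

```python
from typing import Callable, List, Optional, Sequence


def isolate_faulty_module(
    modules: Sequence[str],
    passes_test: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Install modules one at a time and return the first one whose
    addition makes the memory test fail, or None if every test passes.

    Assumption: passes_test(installed) represents one full memtest run
    with exactly the modules in `installed` fitted.
    """
    installed: List[str] = []
    for module in modules:
        installed.append(module)          # add one more module
        if not passes_test(installed):    # re-test after each addition
            return module                 # the newest module is the suspect
    return None


if __name__ == "__main__":
    # Hypothetical example: 8 modules, with DIMM5 pretending to be faulty.
    dimms = [f"DIMM{i}" for i in range(1, 9)]

    def fake_memtest(installed: Sequence[str]) -> bool:
        return "DIMM5" not in installed

    print(isolate_faulty_module(dimms, fake_memtest))  # prints DIMM5
```

In the worst case this needs one test run per module, so the length of each memtest pass dominates the total time.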

7 Replies
kiwid
Contributor

Thanks, I'll give it a go today and report back. Thanks for your reply!

bilalhashmi
Expert

If you think it's a hardware issue, why don't you run some hardware diagnostics on this server and see what they come back with? I think that would be a better way to find the root cause.

Blog: www.Cloud-Buddy.com | Follow me @hashmibilal
JacovT
Contributor

At first glance this looks like faulty RAM to me. Try swapping out the RAM first.

If the problem persists, swap out the board.

---Jaco

kiwid
Contributor

Hi,

I swapped out the RAM and the system has been functioning normally since, so *it appears* it was indeed a faulty RAM module. The system has been up for over 40 hours without a problem now (it was crashing every 4-5 hours). When I can get back out there and take the system offline, I will run some hardware diagnostics to hopefully confirm that this was indeed the problem.

Thankfully my ESX server at home uses the same type of RAM as this one (and hasn't had an issue since install roughly 2 years ago). Now I've just got to test all the modules I've pulled out to find the faulty one! 😕

I know HP have some good tools for performing hardware diagnostics while the system is offline, but are there any tools which can be run while the system is online? The reason is that this system is running SBS 2011 (including Exchange), so it's difficult to take it offline for long.

Thanks everyone for your replies.

idle-jam
Immortal

You could put the RAM into your home system and run the diagnostic test there. 😃

kiwid
Contributor

Sorry, I meant it's still going to take a while to isolate which of the 8 sticks of RAM has the problem. Luckily I've got a spare Dell workstation chassis which uses that RAM to test on 😃

I'd also like to run some (non-RAM related) hardware diagnostics on the ProLiant, just in case some other kind of issue was at play as well. You can't be too careful, since that server runs everything for the business.
