Mathjaz
Contributor

HP ProLiant DL380e Gen8 with ESXi 5.5.0 Build 2068190 installed on internal 8GB SD card: PSOD problems

I have a new server with a fresh installation.

Inside I have 2 EG0600FCSPL disks in RAID 1.

There are currently 3 VMs running: 2 Server 2012 R2 and one Windows 8.1.

I got the first PSOD 14 days ago; it popped up during installation. I updated the firmware directly from the server.

Now that the server is ready to go live (domain is up, computers are joined to the domain, etc.), I got a PSOD again yesterday. After reading some forums, I suspected the HP power saving features, so I set everything to high performance. During the night I got another PSOD. Attached are the PSODs from yesterday and from last night. All were about "No heartbeat".

I have also noticed that when I log in to the server in the morning, it takes about 30 seconds. After the first login it takes about 5 seconds, which is much faster. I have seen this from the start until today. Maybe it is connected somehow.

Any idea what could be wrong?

Thanks

Accepted Solutions
markzz
Enthusiast

Mathjaz.

Regarding power management.

I've never had any success with power management enabled.

Power management means PSOD.

In the BIOS you should set power management to high performance, disable all "C" states and set cooling to maximum.

We have suffered PSODs before due to power management and now ALWAYS disable it.
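
The BIOS settings above are changed in the server setup screens, but the host-side CPU power policy can also be looked at from the ESXi shell. A minimal sketch, assuming the `/Power/CpuPolicy` advanced option is present on your build; it only reads the setting and does not replace the BIOS changes. `show_power_policy` is a hypothetical helper name, not part of ESXi.

```shell
#!/bin/sh
# Sketch only: list the host-side CPU power policy if esxcli is available.
# show_power_policy is a hypothetical helper name used for illustration.
show_power_policy() {
    if command -v esxcli >/dev/null 2>&1; then
        # Read-only: prints the current and default value of the option.
        esxcli system settings advanced list -o /Power/CpuPolicy
    else
        echo "esxcli not found; run this in the ESXi shell"
    fi
}
```

Call `show_power_policy` over SSH on the host to see what ESXi itself reports.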

13 Replies
LeslieBNS9
Enthusiast

Looks like you have the same problem as this other thread: HP ProLiant DL380e Gen8 - ESXi 5.5.0 build 2143827 - PSoD

I recommend following HP's VMware Recipe for building out your server. Since rebuilding according to HP's specs for firmware, drivers and the custom ISO image, we have not had any problems like this with the servers in our environment.

Here's the recipe we used from HP for our own servers: http://vibsdepot.hp.com/hpq/recipes/June2014VMwareRecipe13.0.pdf

Mathjaz
Contributor

Yesterday after posting I noticed that I had installed ESXi in October, and the newest version is the same build but with a November date. So I reinstalled (clean install). So far so good, but it has only been up for 1 day. I hope to come back after a week with a good reply :)

Mathjaz
Contributor

PSOD again :(

Today I tried to upgrade the firmware from Intelligent Provisioning. It found new firmware for a network device. I tried to upgrade it, but when it finished and rescanned the hardware and firmware, the same network device was listed again with the old firmware...

After rebooting, the server went crazy (fans at 100%, iLO showing unknown). I was still connected to iLO and ping was working, but no command worked. I pressed power off and it went off, but I couldn't power it on from iLO. After manually pressing the power button, it went crazy again... We manually held the power button for 5 seconds and cut the power for 1 minute, and after that it powered on normally.

The same thing happened last time, but I thought it was a one-time incident...

And the PSOD was the same as the one I posted in the first post. Does anyone know what it means?

Alistar
Expert

Hi there,

You had a Non-Maskable Interrupt (NMI; see "Non-maskable interrupt" on Wikipedia) interfering with the hardware. The last instruction before the host crashes is related to a spinlock (see "Spinlock" on Wikipedia). It seems the CPU was waiting for instructions but didn't get any (it locked up), and after two ticks a check induced a kernel panic via NMI to maintain data integrity.

I'd suggest checking the iLO for any error or critical events. My suspicion is a faulty CPU or motherboard if the host went crazy as you say; this does not seem to be bound to any other hardware or driver. Get HP support involved and have your hardware checked and exchanged.

Stop by my blog if you'd like 🙂 I dabble in vSphere troubleshooting, PowerCLI scripting and NetApp storage - and I share my journeys at http://vmxp.wordpress.com/
markzz
Enthusiast

Hey Mathjaz.

When you update firmware on these HP servers, I'd suggest you do NOT update individual firmware.

Download the HP SPP 2014.09.00 and use it as your base package.

As mentioned by LeslieBNS9, HP labs also release what they call a recipe. This is a suggested firmware versus software compatibility matrix.

Personally I'd just go with the SPP 2014.09.00.

From an ESXi perspective:

Download the HP custom ESXi 5.5 U2 image from VMware.

Once installed, you need to disable 2 drivers if you want this to be stable:

The SMX driver, which interacts with SIM (which I doubt you are using),

and the AMS driver, which interacts via PCC with power management on the Gen8 servers. You don't need this unless you want PSODs.

PuTTY into your server and run the commands below. Once complete, reboot the host.

# Get current state of CIMvmw_hp-smx-providerProviderEnabled

esxcfg-advcfg -g /UserVars/CIMvmw_hp-smx-providerProviderEnabled

# Set CIMvmw_hp-smx-providerProviderEnabled state to 0 (disabled)

esxcfg-advcfg -s 0 /UserVars/CIMvmw_hp-smx-providerProviderEnabled

# Get current state of CIMvmw_hp-smx-providerProviderEnabled

esxcfg-advcfg -g /UserVars/CIMvmw_hp-smx-providerProviderEnabled

/etc/init.d/sfcbd-watchdog stop

/etc/init.d/sfcbd-watchdog start

/etc/init.d/sfcbd-watchdog status

# Disable AMS driver.

/etc/init.d/hp-ams.sh stop

# Run this command to remove the VIB.

esxcli software vib remove -n hp-ams
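
The steps above can be sketched as one small script with a dry-run switch, so the commands can be reviewed before anything touches the host. This is only a sketch: it uses exactly the commands from this post, while `DRY_RUN`, `run` and `disable_hp_providers` are hypothetical names introduced here, not part of ESXi or the HP tooling. It does not stop on errors, so read the output of each step.

```shell
#!/bin/sh
# Sketch only: wraps the commands from the post in a dry-run helper.
# DRY_RUN, run and disable_hp_providers are hypothetical names.

SMX_OPT="/UserVars/CIMvmw_hp-smx-providerProviderEnabled"

# Print the command when DRY_RUN is 1 (the default); execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

disable_hp_providers() {
    run esxcfg-advcfg -g "$SMX_OPT"       # current state of the SMX provider
    run esxcfg-advcfg -s 0 "$SMX_OPT"     # 0 = disabled
    run esxcfg-advcfg -g "$SMX_OPT"       # read the value back
    run /etc/init.d/sfcbd-watchdog stop
    run /etc/init.d/sfcbd-watchdog start
    run /etc/init.d/sfcbd-watchdog status
    run /etc/init.d/hp-ams.sh stop        # stop the AMS agent
    run esxcli software vib remove -n hp-ams
}
```

Calling `disable_hp_providers` only prints the eight commands; setting `DRY_RUN=0` first executes them, after which the host still needs the reboot mentioned above.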

Mathjaz
Contributor

Hello

We have replaced the motherboard of the server, and still the same PSOD problem.

I tried what markzz suggested, and the server was up for only a few hours, then PSOD again.

I don't have SPP 2014.09, but I was also trying F10 Intelligent Provisioning to update the firmware. Again a problem: now I get "Unable to contact update server" every time I try (I have been trying for over a week now).

I also set the server to maximum performance.

Any more ideas?

I really don't know what to do anymore.

Alistar
Expert

Hi there,

You can get the SPP from here: HP® Servers - Service Pack for ProLiant

The error is still reported on PCPU0, so my best bet would be to exchange the 1st CPU. If you run some stress tests on the whole NUMA node and get it to crash (while the next node does not), you can be almost certain that this is a hardware fault.

markzz
Enthusiast

Alistar may have something of an answer there.

How about you remove CPU1?

Move CPU2 to CPU1's socket and test again.

markzz
Enthusiast


Regarding the SPP:

You can either create a DVD from the ISO or mount the ISO via the iLO.

Once booted, you only need to choose to update firmware from DVD media. Don't try to go online; it all gets messy with proxies etc.

Alistar
Expert

That's a great idea! You will find an example stress test (albeit for RAM) in my article: Stress Testing an ESXi Host with Windows Server VMs | VMXP

Mathjaz
Contributor

Thanks Alistar,

I have downloaded SPP 2014.09.00 and automatically updated the server. Afterwards I also checked F10 Intelligent Provisioning, which is now working OK, and no new updates are available :)

Meanwhile I decided to insert a new SD card and installed vSphere 5.5 U1 Dec 2014, since it has a newer date (release build 1746018).

A stress test is also a good idea in some way :)

But since I think the problem occurs when the server goes to sleep and then PSODs (not every time), I don't know if it is the best test :)

I will stress test it for a few days. Hope it survives :)

Thanks all. I will reply, hopefully in a few days and not tomorrow :)


Mathjaz
Contributor

Today the system is still stable.

Next week the server will go into production.

Thanks all for the help.
