VMware Cloud Community
cykVM
Expert

HP Proliant DL380e Gen8, HP OEM VMWare ESXi 5.5 Update 2 keeps crashing (PSOD)

Hello everyone,

I currently maintain a single VMware host running the HP OEM version of vSphere (ESXi) 5.5 Update 2 for a mid-size charity.

The hardware in use:

HP ProLiant DL380e Gen8 (bought brand new in August 2014), HP Smart Array B320i storage controller, HP H222 host bus adapter (with only an HP Ultrium 4 tape drive connected to it), HP 366i 4-port Intel NIC, 32 GB RAM, 2 quad-core Intel Xeon E5-2407 CPUs

The box was initially installed and configured in August using the HP OEM vSphere 5.5 Update 1 installation CD. vSphere is installed on the RAID array configured on the B320i controller. A VMware Essentials license is also installed and in use.

It's running 3 Windows 2008 R2 VMs (DC, Exchange 2010 and a backup server with Backup Exec 2010 R3 [I know this is not a recommended/supported configuration, but it worked with 5.5 U1 without issues]) besides 2 Debian Linux VMs.

Two weeks ago, during weekend maintenance, I first installed the latest HP SPP (Service Pack for ProLiant) from September 2014, which provided several firmware updates, e.g. for the B320i and the 366i NIC.

After that I performed an upgrade installation of the HP OEM version of vSphere 5.5 Update 2, which HP also released at the beginning of September.

All those setup/update procedures went through without any issues, error messages or crashes.

The host was running fine for 3 days and suddenly crashed with a PSOD stating: PCPU 0: no heartbeat (2/2 IPIs received) [unfortunately I did not take a screenshot]

I reset/rebooted the host through the iLO 4 console and kept an eye on the server over the next few days.

The first PSOD took place during daily (nightly) backup on the connected tape drive.

On the following Friday/Saturday night (about 2 days later) it crashed again with the following PSOD - again with PCPU 0: no heartbeat (2/2 IPIs received):

PSOD1.PNG

So I started investigating this and found some hints here in the VMware communities pointing to recommended BIOS settings for HP ProLiant servers. I checked the actual settings and changed the values to the recommended ones. The server ran fine without glitches for about 16 hours, then crashed again with this PSOD:

PSOD2.PNG

I continued investigating and kept a particular eye on the power management settings in the BIOS, in vSphere and in the Windows VMs.

I also checked the installed firmware versions of the storage controllers and the NIC, and the driver versions in use. All OK there (as recommended in the HP VMware recipe of September 2014).

The server ran fine for about a week after the reboot, then there was another PSOD early this morning at about 3 a.m.:

PSOD3.PNG

The server/VMs were mostly idle at this time, no heavy I/O activity.

The first two PSODs happened during backup but not at a certain time (one at about 10 p.m. the other early in the morning between 2 and 3 a.m.).

I read through tons of hints about faulty NIC drivers/firmware, BIOS configurations etc., but nothing helps; everything is already configured exactly as in the HP recommendations for vSphere 5.x.

For the BIOS settings I followed this list/table: Recommended BIOS Settings on HP ProLiant DL580 G7 for VMware vSphere | Boerlowie's Blog

vSphere is configured to "High Performance Mode" and the Windows VMs, too.

I'm somewhat stuck now, so maybe someone here has a good hint for me?

If you need any further hardware/software/configuration/whatever details, just ask.

Cheers and thanks in advance for any help,

cykVM

122 Replies
cykVM
Expert

P.S. For the Smart Array B320i the latest hpvsa driver is in use. I am starting to think about going back to 5.5 U1, but would generally like to avoid this.

No PSODs during office hours, only at night/early morning, when no one is using the system.

cykVM
Expert

@vlho Thanks for your suggestion, but I'm afraid the hpsa driver is not in use on this server since it's not a SmartArray Pxxx controller (but a Bxxx). Using the hpvsa driver for the SmartArray controller...
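
The distinction drawn here (hpsa for Smart Array Pxxx controllers, hpvsa for Bxxx) can be written down as a small lookup. This is only a sketch of the rule as stated in this post; the function name `smart_array_driver` is made up for illustration, and the mapping makes no claim about other controller families.

```python
def smart_array_driver(model):
    """Return the driver this thread associates with a Smart Array model.

    Assumption: only the P-series -> hpsa and B-series -> hpvsa rule from
    the post above is encoded; anything else yields None.
    """
    series = model.strip().upper()[:1]
    return {"P": "hpsa", "B": "hpvsa"}.get(series)


for model in ("B320i", "P420i", "H222"):
    print(model, "->", smart_array_driver(model))
```

The H222 host bus adapter deliberately returns `None` here, since the post says nothing about its driver.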

cykVM
Expert

OK, thanks again, I will try disabling VT-d later since the office will open soon. I personally suspect the BIOS throttling down the CPU (cores) for power saving after the server has been mostly idle for a while. But this might be only a coincidence...

I just checked the logfiles that were luckily saved after the first PSOD. In vmkernel.log, shortly before the first crash, there are several events logged that point to the CD/DVD drive connected to the onboard AHCI SATA controller. No media was inserted at the point of failure. Several events like this were logged before the crash:

2014-09-22T23:38:16.342Z cpu3:32823)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x1a (0x412e83338540, 0) to dev "mpx.vmhba35:C0:T0:L0" on path "vmhba35:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE

2014-09-22T23:38:16.342Z cpu3:32823)ScsiDeviceIO: 2338: Cmd(0x412e83338540) 0x1a, CmdSN 0x1f05 from world 0 to dev "mpx.vmhba35:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

The same messages were logged for various CPUs (cpu0 through cpu4), not just cpu3 as above.
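
For anyone reading along, the status fields in those two log lines can be decoded with the standard SCSI code tables. The sketch below is a minimal parser, not a VMware tool: the function name `decode_scsi_failure` is hypothetical, and the lookup tables contain only the values that occur in the lines above. Decoded, the lines say the device answered a MODE SENSE(6) command with CHECK CONDITION, sense key ILLEGAL REQUEST, additional sense INVALID COMMAND OPERATION CODE. Whether that has anything to do with the PSOD is not established by this.

```python
import re

# Standard SCSI codes; only the values seen in the log lines above are listed.
OPCODES = {0x1A: "MODE SENSE(6)"}
DEVICE_STATUS = {0x0: "GOOD", 0x2: "CHECK CONDITION"}
SENSE_KEYS = {0x5: "ILLEGAL REQUEST"}
ASC_ASCQ = {(0x20, 0x00): "INVALID COMMAND OPERATION CODE"}

HEX = r"0x[0-9a-fA-F]+"
STATUS_RE = re.compile(
    rf'dev "(?P<dev>[^"]+)".*?'
    rf"H:(?P<host>{HEX}) D:(?P<device>{HEX}) P:(?P<plugin>{HEX}) "
    rf"Valid sense data: (?P<key>{HEX}) (?P<asc>{HEX}) (?P<ascq>{HEX})"
)
# The opcode sits in a different place in the NMP and ScsiDeviceIO lines.
OPCODE_RE = re.compile(rf"Cmd ({HEX}) \(|Cmd\({HEX}\) ({HEX}),")


def decode_scsi_failure(line):
    """Decode one vmkernel.log SCSI failure line; return None if it does not match."""
    status = STATUS_RE.search(line)
    if status is None:
        return None
    op = OPCODE_RE.search(line)
    opcode = int(op.group(1) or op.group(2), 16) if op else None
    device = int(status["device"], 16)
    key = int(status["key"], 16)
    asc, ascq = int(status["asc"], 16), int(status["ascq"], 16)
    return {
        "device": status["dev"],
        "command": OPCODES.get(opcode, "unknown"),
        "device_status": DEVICE_STATUS.get(device, "unknown"),
        "sense_key": SENSE_KEYS.get(key, "unknown"),
        "additional_sense": ASC_ASCQ.get((asc, ascq), "unknown"),
    }


sample = (
    '2014-09-22T23:38:16.342Z cpu3:32823)NMP: nmp_ThrottleLogForDevice:2321: '
    'Cmd 0x1a (0x412e83338540, 0) to dev "mpx.vmhba35:C0:T0:L0" on path '
    '"vmhba35:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE'
)
print(decode_scsi_failure(sample))
```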

vuzzini
Enthusiast

Hello cykVM,

I am afraid the B320i and B366i controllers may not be supported for running ESXi 5.5 U2. I do not see them listed in the VMware HCL, whereas all of the controllers listed below are supported according to the VMware HCL.

  • HP Smart Array P212 Controller
  • HP Smart Array P220i Controller
  • HP Smart Array P222 Controller
  • HP Smart Array P230i Controller
  • HP Smart Array P410 Controller
  • HP Smart Array P410i Controller
  • HP Smart Array P411 Controller
  • HP Smart Array P420 Controller
  • HP Smart Array P420i Controller
  • HP Smart Array P421 Controller
  • HP Smart Array P430 Controller
  • HP Smart Array P431 Controller
  • HP Smart Array P700m Controller
  • HP Smart Array P711m Controller
  • HP Smart Array P712m Controller
  • HP Smart Array P721m Controller
  • HP Smart Array P731m Controller
  • HP Smart Array P812 Controller
  • HP Smart Array P822 Controller
  • HP Smart Array P830 Controller
cykVM
Expert

Thanks, Sandeep. That's correct and I'm aware of that. That's why I used the HP-modified installation image(s) for vSphere 5.5 U1/U2. The included hpvsa driver for the B320i storage controller and the igb driver for the HP 366i 4-port NIC (Intel I350 gigabit NIC) generally work fine.

No glitches during working hours.

I may post something in the HP support forums later.

Wh33ly
Hot Shot

Did you upgrade the iLO firmware already?

Certain versions can cause a PSOD; see the advisory below:

http://h20566.www2.hp.com/portal/site/hpsc/template.PAGE/public/psi/topIssuesDisplay/?sp4ts.oid=5219...

Advisory: (Revision) - HP Integrated Lights-Out 4 - FIRMWARE UPDATE REQUIRED: Intermittent Non-Maskable Interrupt (NMI) Events May Occur on ProLiant Gen8 Servers with HP Integrated Lights-Out 4 Firmware Versions 1.30, 1.32, 1.40 and 1.50


IMPORTANT: The firmware update in this advisory is considered a critical fix and is required to prevent or correct the issue detailed below. HP strongly recommends immediate application of this critical fix. Neglecting to perform the required action could leave the server in an unstable condition, which could potentially result in sub-optimal server performance, including server lock-ups. By disregarding this notification and not performing the recommended resolution, the customer accepts the risk of incurring future related errors.

If this issue occurs, the operating system will indicate that an NMI has happened; however, the specific indication will vary by OS:

  • VMware ESXi operating systems will experience a Purple Screen of Death (PSOD).
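
As a quick way to check a host against the advisory, the four affected versions named in its title can be put into a small lookup. This is only a sketch: the helper name `ilo4_affected` is made up, and it encodes nothing beyond the version list quoted above.

```python
# iLO 4 firmware versions named in the advisory title quoted above.
AFFECTED_ILO4_VERSIONS = {(1, 30), (1, 32), (1, 40), (1, 50)}


def ilo4_affected(version):
    """Return True if a 'major.minor' iLO 4 version string is on the advisory list."""
    major, minor = version.strip().split(".")
    return (int(major), int(minor)) in AFFECTED_ILO4_VERSIONS


for v in ("1.40", "1.50", "1.51", "2.0"):
    print(v, "affected" if ilo4_affected(v) else "not listed")
```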
cykVM
Expert

Wh33ly wrote:

Did you upgrade the iLO firmware already?

Yes, the iLO 4 firmware was upgraded to version 2.0 before the September HP SPP was run.

It already had 1.51 installed before that anyway.

cykVM
Expert

And another PSOD, shortly after the first backup job started. I have now disabled VT-d in the BIOS and will see how it works.

COS
Expert

I'm assuming you have more than one processor.

Make sure you set all your hardware power profiles to maximum. It looks like the system shut off a core/CPU and ESXi panicked.

You'll have to go into the BIOS and set it to maximum performance.

Thanks


cykVM
Expert

After disabling VT-d in the BIOS the server ran fine for about 1.5 to 2 days. Late yesterday evening it crashed again with the same PSOD error as above. I then rebooted and checked the BIOS settings again. Since the error messages from vmkernel.log point to the CD/DVD drive, I have now switched the onboard SATA controller from AHCI mode to legacy SATA. Only the CD/DVD drive is connected to that controller, so there is generally no need for AHCI to be enabled. I will see how this goes.

If that does not help either, I think I will go back to the 5.5 U1 version.

cykVM
Expert

After two further PSODs, now during office hours, I was fed up with testing various BIOS settings and different configurations. I have now gone back to 5.5 Update 1 by using SHIFT+R during boot.

Hopefully the server will now run stable.

The interesting thing about the last 2 PSODs was that they seemed to be triggered by creating a new VM. The initial configuration of that Windows guest VM was successful, but as soon as I powered it on the server/VMware crashed with the "no heartbeat" PSOD.

This was only reproducible in 5.5 Update 2 (build 2068190, HP customized); new VM deployment works fine with the Update 1 version/kernel (build 1746018, HP customized).

The newly created VM disappeared from the inventory after the PSOD; it still existed in the associated datastore, but not in the inventory.

I now suspect the (physical) NIC drivers in conjunction with the virtual E1000 drivers to be the cause of this kernel panic. But this is only an assumption.

And just to mention it, the HP 366i 4-port NIC using the VMware igb driver is in the VMware HCL: VMware Compatibility Guide: I/O Device Search

CyrilH
Contributor

Thank you for the post cykVM,

I have the exact same issue, with a slightly different configuration:

- DL360e Gen8, HP Smart Array B320i, HP Ethernet 1Gb 4-port NIC 366i, 40 GB RAM, 2 quad-core Intel Xeon E5-2407 CPUs.

- I also used the HP-modified installation image(s) for vSphere 5.5 U1/U2, but I upgraded directly from ESXi 5.1.

- I also upgraded the HP firmware before planning the VMware upgrade.

We must have read a lot of the same posts.

The first PDSs / PSODs happened outside office hours as well; they can now happen at any time, but rarely more than once a day.

Twice the PDSs / PSODs happened before the first guest was fully booted.

The last one occurred when booting a Windows Server 2003 guest.

I haven't upgraded the hardware to Version 10.

I run Trend Micro Worry Free Business, but I haven't found any link with the Deep Security / ESXi issues.

I also see that you get the same: Coredump to disk. Slot 1 of 1. DiskDump: FAILED: Timeout.

The configuration of my "CoreDump to File" seems OK though.

I haven't resigned myself to using Shift-R yet, but will have to if nothing comes up.

QUESTION: Is there any point in trying to disable the heartbeat check to see if there is any improvement? Is it even a viable option?

I'm interested in any update.

Thanks.

cykVM
Expert

Hi CyrilH,

as for the current situation: the server is still running the downgraded 5.5 U1 version, so far without any crashes. It's using the updated drivers from the September 5.5 U2 ISO; only the bootbank was switched to the previous VMware version.

I read tons of blog posts, HP and VMware community threads, web pages etc. The most likely "solution" is the one vlho stated above, referring to the IOMMU error. But disabling VT-d did not help; the server just stayed stable a little longer after VT-d was disabled in the BIOS.

I also posted something on the HP communities during the weekend, see Proliant DL380e Gen8 keeps crashing (PSOD) after u... - HP Enterprise Business Community

Only one other user has noted that he ran into the same trouble, but with no further information so far.

For me the SHIFT+R method looks like the only option for now. I'm not sure whether it's a good idea to switch off the heartbeat check in the VMware kernel, or whether it's even possible.

cykVM

cykVM
Expert

In the meantime, an HP employee suggested this:

Quoting Suman from HP:

>In BIOS, under HP Power Regulator, use HP Static High Performance Mode.

>And add VMware boot flag timerEnableTSC = false

>Add VMware boot flag usePCC = false

I already tried "Static High Performance Mode" in the BIOS, but without setting the boot flags in VMware.

However, the usePCC boot option no longer exists as of VMware 5.0 Update 2. It defaults to false/disabled now.

cykVM
Expert

Another user in the HP forums ran into the same problems with random crashes/PSODs.

He installed HP customized 5.5 Update 2 from scratch on an SD card. It looks like his ProLiant crashes even a bit more often, with a maximum uptime of about 1 day.

He tried setting the kernel boot flag timerEnableTSC = false on his installation and rebooted after that. But no joy: the server crashed again during the night and showed the "no heartbeat" PSOD this morning.

So, to sum this up, the only option for now is to go back to the previous version of VMware, e.g. by using the SHIFT+R method.

Alternatively, go back and upgrade to 5.5 Update 1 HP customized, which runs fine for me (or even do an installation of 5.5 U1 from scratch).

See discussion at HP: http://h30499.www3.hp.com/t5/ProLiant-Servers-ML-DL-SL/Proliant-DL380e-Gen8-keeps-crashing-PSOD-afte...

CyrilH
Contributor

Hi,

I ended up using Shift-R as well after my ESXi started crashing during boot with the following error instead of "no heartbeat":

- Can't detect the last level cache.

I'll schedule an upgrade to 5.5 U1 HP later.

Thank you for the updates.

Regards,

Cyril

cykVM
Expert

Thanks for the feedback. For now, going back to the previous version looks like the only viable option.

And just as a warning: in the meantime HP has released a new VMware hpvsa driver, version 5.5.0-90 [2014.09.11], for both VMware 5.5 and 5.1.

On my 5.5 U1 the new driver was no good: it caused a massive drop in performance/throughput, and I had to go back to 5.5.0-86, which was installed before.

See HP forums (link above) for more details.
