VMware Cloud Community
cykVM
Expert

HP ProLiant DL380e Gen8, HP OEM VMware ESXi 5.5 Update 2 keeps crashing (PSOD)

Hello everyone,

I currently maintain a single VMware host for a mid-size charity, running the HP OEM version of vSphere (ESXi) 5.5 Update 2.

The hardware in use:

HP ProLiant DL380e Gen8 (bought brand new in August 2014), HP Smart Array B320i storage controller, HP H222 host bus adapter (only an HP Ultrium4 tape drive is connected to it), HP Intel 4-port NIC 366i, 32 GB RAM, 2 quad-core Intel Xeon E5-2407

The box was initially installed and configured in August using the HP OEM vSphere 5.5 Update 1 installation CD. vSphere is installed on the RAID array configured on the B320i controller. A VMware Essentials license is also installed.

It's running 3 Windows 2008 R2 VMs (DC, Exchange 2010 and a backup server with Backup Exec 2010 R3 [I know this is not a recommended/supported configuration, but it worked with 5.5 U1 without issues]) besides 2 Debian Linux VMs.

2 weeks ago during weekend maintenance I first installed the latest HP SPP (Service Pack for Proliant) Sept. 2014 which provided several firmware updates for e.g. the B320i, the 366i NIC etc.

After that I performed an upgrade installation of the HP OEM version of vSphere 5.5 Update 2, which HP had also released at the beginning of September.

All those setup/update procedures went through without any issues, error messages or crashes.

The host was running fine for 3 days and suddenly crashed with a PSOD stating: PCPU 0: no heartbeat (2/2 IPIs received) [unfortunately I did not take a screenshot]

I reset/rebooted the host through the iLO 4 console and kept an eye on the server over the next few days.

The first PSOD took place during daily (nightly) backup on the connected tape drive.

On the following Friday/Saturday night (about 2 days later) it crashed again with the following PSOD - again with PCPU 0: no heartbeat (2/2 IPIs received):

PSOD1.PNG

So I started investigating this, found some hints here in the VMware communities pointing to recommended BIOS settings for HP ProLiant servers, checked the actual settings and changed the values to the recommended ones. The server then ran fine without glitches for about 16 hours, then crashed again with this PSOD:

PSOD2.PNG

I continued investigating and especially kept an eye on the power management settings in the BIOS, in vSphere and in the Windows VMs.

I also checked the installed firmware versions of the storage controllers and NIC and the driver versions in use. All OK there (as recommended in the HP VMware recipe of Sept. 2014).

The server ran fine for about a week after the reboot, then came another PSOD early this morning at about 3 a.m.:

PSOD3.PNG

The server/VMs were mostly idle at this time, no heavy I/O activity.

The first two PSODs happened during backup but not at a certain time (one at about 10 p.m. the other early in the morning between 2 and 3 a.m.).

I read through tons of hints about faulty NIC drivers/firmware, BIOS configurations etc., but nothing helped, and everything is already configured exactly as in HP's recommendations for vSphere 5.x.

For the BIOS settings I followed this list/table: Recommended BIOS Settings on HP ProLiant DL580 G7 for VMware vSphere | Boerlowie's Blog

vSphere is configured to "High Performance Mode" and the Windows VMs, too.

I'm somehow stuck now, so maybe someone here has a good hint for me?

If you need any further hardware/software/configuration/whatever details, just ask.

Cheers and thanks in advance for any help,

cykVM

122 Replies
menait
Contributor

I have the exact same problem on our production server.  Now, from what I gather on this thread, there is no other solution except to downgrade to 5.5 U1.  My question to you is, is it possible to install (via USB) 5.5 U1 over the existing U2 installation without losing VMs and datastores?  Or do I have to set it up from scratch?

I'm adding another ESXi server in a matter of days and it looks to me I should be using U1 instead of U2.  I wonder if VMWare is aware of this problem and if so, are they doing something about it?

cykVM
Expert

menait:

I have the exact same problem on our production server.  Now, from what I gather on this thread, there is no other solution except to downgrade to 5.5 U1.  My question to you is, is it possible to install (via USB) 5.5 U1 over the existing U2 installation without losing VMs and datastores?  Or do I have to set it up from scratch?

First of all, please also read the above link to the HP communities for further details. I think it would be a good idea to gather as many affected people as possible there so that HP reacts to this issue, so you may want to post there as well.

For your questions: I wouldn't install a lower version over the Update 2/U2 version. As far as I know this is not supported by VMware and could have various side effects (e.g. the VMware installer will recognize the installed, newer drivers and probably complain about downgrading them). You can do a fresh install of 5.5 U1 on an empty/new SD-Card or USB key and mount the datastore afterwards or (if you also upgraded from an older VMware version to 5.5 U2) go back to the former kernel by using the SHIFT+r method on VMware boot (making use of the altbootbank which replaces the 5.5 U2 bootbank then).

menait:

I'm adding another ESXi server in a matter of days and it looks to me I should be using U1 instead of U2.  I wonder if VMWare is aware of this problem and if so, are they doing something about it?

From my experience so far, I think this is a mixed VMware/HP issue. There is evidence that the hpvsa driver (provided by HP) for the B320i / B120i Smart Array controller does not (fully) work together with the kernel of VMware 5.5 U2.

But this is only my guess and might not be the real cause of the issue.

And for your fresh installation of ESXi I would recommend using 5.5 U1 for now until we get a solution to this from either HP or VMware.

I may open a case with HP about this in the next few days, but I first have to discuss it with the representative of the charity that owns the server. This might take a while.

menait
Contributor

cykVM wrote:

You can do a fresh install of 5.5 U1 on an empty/new SD-Card or USB key and mount the datastore afterwards or (if you also upgraded from an older VMware version to 5.5 U2) go back to the former kernel by using the SHIFT+r method on VMware boot (making use of the altbootbank which replaces the 5.5 U2 bootbank then).

My problem is I upgraded from 5.1 so I cannot use Shift+r.  I have already upgraded my VM hardware versions to 10 (in an attempt to fix this problem in the first place).  It looks like I will wait for the new server, set up 5.5 U1 on it, then migrate over all my VMs, then reformat the original server.

Out of curiosity, have you tried setting the server's power management settings to "os-controlled"?  I've heard that this is a potential solution for some people.

I would appreciate it if you can keep us informed of further developments.  I no longer have paid support for both VMware and HP so I cannot contact their support directly.

cykVM
Expert

menait:

My problem is I upgraded from 5.1 so I cannot use Shift+r.  I have already upgraded my VM hardware versions to 10 (in an attempt to fix this problem in the first place).  It looks like I will wait for the new server, set up 5.5 U1 on it, then migrate over all my VMs, then reformat the original server.

Out of curiosity, have you tried setting the server's power management settings to "os-controlled"?  I've heard that this is a potential solution for some people.

I would appreciate it if you can keep us informed of further developments.  I no longer have paid support for both VMware and HP so I cannot contact their support directly.

The server's power management was set to OS CONTROLLED on my upgrade installation from 5.5 U1 to 5.5 U2. VMWare was set to HIGH PERFORMANCE MODE (no power management) at this point. Upgrade went through without issues but after a while the PSODs came up (if I remember correctly the server ran fine for about 2 or 3 days in the first place until the first PSOD appeared). I also tried STATIC HIGH PERFORMANCE during my tests/fixing with no luck.

I then tried various settings in BIOS or VMWare advanced configuration to fix this problem. The server kept crashing randomly, nothing helped. After that I decided to go back to 5.5 U1 kernel with SHIFT+r and since then I got back a working system which runs without issues.

On the HP business community another user did a fresh install from scratch of 5.5 U2 to an SD card and ran into the same problems; his server crashed reproducibly after about 24 hours of runtime. He also tried various BIOS and VMware settings and nothing helped. He is going back to the previous version, too.

But as you mention your new server coming, does this have the B320i/B120i controller or is another array controller built into that?

The problem for me opening a case at HP is that I do not have 5.5 U2 running anymore and I won't upgrade until I know for sure this is fixed. Can't reboot/upgrade/downgrade the server every 2-3 days and I am fine with 5.5 U1 running stable for now.

menait
Contributor

But as you mention your new server coming, does this have the B320i/B120i controller or is another array controller built into that?

It does have the same array controller.  Unfortunately, that is the only model our local vendor/supplier has available.

cykVM
Expert

I think the Pxxx series controllers work fine, but they are far more expensive and only available as an add-on card for the xxxE series servers.

The Pxxx controllers also have the advantage that they are listed in VMWare's HCL, the Bxxx(i) are not and probably will never be.

jbam
Contributor

I am experiencing the exact same issue.

Hardware:

HP Proliant DL380e Gen8

B320i Array Controller

1Gb 4-port 366i Network Adapter

Software:

HP OEM VMWare ESXi 5.5 U2   --> Upgraded from HP OEM 5.1

Immediately after upgrading ESXi the server became unreliable.

My server PSODs on average every 24 hours.  I have had PSOD on reboot a few times, and up to 48 hours of uptime.

I have seen this same PSOD while running ESXi 5.1; it occurred twice in 8 months' time.

10-13-14.PNG

vlho
Hot Shot

Hi,

you could try installing the new driver for the HP Dynamic Smart Array B120i/B320i Controller, version 2014.09.11 (inside scsi-hpvsa-5.5.0-90OEM...):

http://h20565.www2.hp.com/portal/site/hpsc/template.PAGE/public/psi/swdDetails/?sp4ts.oid=5269386&sp...

cykVM
Expert

I wouldn't recommend that. I already tested the 5.5.0-90 hpvsa driver. It makes everything even worse. I had a massive loss of performance on data backups with the LTO4 drive and Backup Exec 2010 R3.

Backup took 2-3 hours with about 4GB/min throughput with 5.5.0-86 hpvsa driver. After update to 5.5.0-90 hpvsa driver throughput went down to about 0.3GB/min (down to 10%) and backup would take something between 15 and 20 hours.
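
As a rough cross-check of those numbers, here is a short Python sketch. The function names are only for illustration, and the rates are the approximate figures quoted above, so treat the result as an order-of-magnitude check only.

```python
# Approximate figures quoted in the post above.
OLD_RATE_GB_PER_MIN = 4.0   # throughput with the 5.5.0-86 hpvsa driver
NEW_RATE_GB_PER_MIN = 0.3   # throughput with the 5.5.0-90 hpvsa driver


def remaining_fraction(old_rate, new_rate):
    """Fraction of the original throughput that is left after the change."""
    return new_rate / old_rate


def slowdown_factor(old_rate, new_rate):
    """How many times longer the same amount of data takes at the new rate."""
    return old_rate / new_rate


if __name__ == "__main__":
    frac = remaining_fraction(OLD_RATE_GB_PER_MIN, NEW_RATE_GB_PER_MIN)
    slow = slowdown_factor(OLD_RATE_GB_PER_MIN, NEW_RATE_GB_PER_MIN)
    print(f"remaining throughput: {frac:.1%}")
    print(f"slowdown factor: {slow:.1f}x")
```

With these figures the remaining throughput works out to 7.5% of the original rate, which is in the same ballpark as the "down to 10%" estimate.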

That was with my SHIFT+r downgraded 5.5 U1 version. So not sure if it makes a difference with U2, but I won't try this.

I posted that already on HP forums.

cykVM
Expert

... but another update was just released: new BIOS/System ROM version 2014.08.02 (13 Oct 2014) for HP ProLiant DL360e Gen8/DL380e Gen8 (P73)

It does not address any VMWare issues in the fixes/revision history section: http://h20565.www2.hp.com/portal/site/hpsc/template.PAGE/public/psi/swdDetails/?sp4ts.oid=5269386&sp...

Anyway, I may try this next weekend.

rubensinfo
Contributor

Hello guys

Recently the company I work for hired two HP servers with VMware, both with the same configuration, and they also showed the same purple screen.

Engineers from VMware and HP recommended an update of the B320i controller driver to version 0.90.

So far the error has not returned, and this seems to have solved the problem.

But I do not know the details of the update, because an outsourced company performed the repairs.

cykVM
Expert

Hi rubensinfo,

thanks for this information. As said before, for me the hpvsa -90 driver does not work well with my installation that was downgraded to 5.5 U1. It does not cause any PSODs, but performance when accessing Windows shares during backup goes down massively.

But maybe another user here tests the -90 driver with his 5.5 U2 version.

cykVM

menait
Contributor

How do I check the version of my hpvsa driver?  And how to install an updated one?

I'm aware of cykVM's warning regarding this driver version.  But if I'm able to rollback the driver if needed, I'd be willing to try it.

cykVM
Expert

General information on finding out the installed version of storage/network drivers is found here: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=102720...

In short, I prefer using an SSH shell on the host (you may need to switch to maintenance mode for that):

listing the storage devices by:

esxcfg-scsidevs -a

and finding out the hpvsa version installed by:

vmkload_mod -s hpvsa | grep Version
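
For comparing driver revisions such as -86, -88 and -90 in a script, the revision number can be pulled out of a version string. Below is a minimal Python sketch, not something taken from the KB article. The function name hpvsa_revision and the assumed string shape (based on the "scsi-hpvsa-5.5.0-90OEM..." package name mentioned in this thread) are illustrative assumptions, and the real command output on a given host may be formatted differently.

```python
import re

# Assumed shape: "<major>.<minor>.<patch>-<revision>OEM", as in the
# "scsi-hpvsa-5.5.0-90OEM..." package name mentioned in this thread.
_VERSION_RE = re.compile(r"(\d+\.\d+\.\d+)-(\d+)OEM")


def hpvsa_revision(text):
    """Return (base_version, revision) found in text, or None if absent."""
    match = _VERSION_RE.search(text)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def is_newer(text_a, text_b):
    """True if the revision found in text_a is higher than the one in text_b."""
    a, b = hpvsa_revision(text_a), hpvsa_revision(text_b)
    if a is None or b is None:
        raise ValueError("no hpvsa-style version string found")
    return a[1] > b[1]
```

Comparing only the revision number is a simplification that is good enough when both strings belong to the same ESXi release (for example 5.5.0-88 versus 5.5.0-90).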

General information on installing drivers: VMware KB: Installing async drivers on VMware ESXi 5.0, 5.1, and 5.5

I did it this way:

  • download the new hpvsa -90 driver
  • extract the zip file and upload the *.vib to the datastore
  • open an SSH shell and copy the *.vib from the datastore to /var/log/vmware
  • install the driver with: esxcli software vib install -v /var/log/vmware/<vibname.vib here>
  • if everything installs fine (no error messages etc.), reboot the host
  • after the host is back up, SSH to it again and check that the driver is installed: esxcli software vib list | grep -i hpvsa


Hopefully I did not miss anything here; double-check against the above link on installing drivers.

Rollback to -86 or a previous version works the same way.
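
The steps above can be sketched as a small Python helper that only assembles the two esxcli command lines and does not run anything. The function name install_commands and the file name passed to it are hypothetical; /var/log/vmware is the staging directory from the steps above.

```python
import posixpath
import shlex

STAGING_DIR = "/var/log/vmware"  # staging directory used in the steps above


def install_commands(vib_filename):
    """Return (install_cmd, verify_cmd) for one VIB copied to STAGING_DIR.

    vib_filename is a placeholder for the real .vib file name.
    esxcli expects an absolute path after -v, so the path is joined here.
    """
    vib_path = posixpath.join(STAGING_DIR, vib_filename)
    install_cmd = "esxcli software vib install -v " + shlex.quote(vib_path)
    verify_cmd = "esxcli software vib list | grep -i hpvsa"
    return install_cmd, verify_cmd


if __name__ == "__main__":
    # "example.vib" is a made-up name, not the real driver file name.
    for command in install_commands("example.vib"):
        print(command)
```

A reboot of the host belongs between the two commands, as described in the list. For a rollback, the same helper would simply be given the file name of the older driver.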



menait
Contributor

Thanks.

It seems I am using -88 version.  Since this requires a host reboot, I might not be able to try this until the weekend.  We'll see...

cykVM
Expert

Are you sure you are using a B320i and looking at the right device? I could not find any -88 version of the hpvsa driver.

Only -86 and new -90.

P.S. Ah, OK, you are on VMWare 5.1.0. So that's OK.

Be sure to download the updated -90 driver for 5.1.0 then.

menait
Contributor

This is what I get

2014-10-15_15-40-51.png

cykVM
Expert

Yes, it's OK. I was wrong in thinking you are also running VMware 5.5. You have 5.1, and for that the version before -90 is -88.

Download the -88 driver right after downloading -90 and put it on your datastore to have it handy in case of trouble.

menait
Contributor

No, actually you are right.  I'm currently running 5.5 U2.  But looking at the driver page on HP's site, I can see -88 is indeed the previous version even for 5.5.  Yes, I will be downloading both versions in case I need to roll back.

cykVM
Expert

Yes, I missed the 88 version then. -86 was included in 5.5 U1 install, I think. Anyway, does not really matter if 86 or 88. 😉
