VMware Cloud Community
cykVM
Expert

HP ProLiant DL380e Gen8, HP OEM VMware ESXi 5.5 Update 2 keeps crashing (PSOD)

Hello everyone,

I currently maintain a single VMware host running the HP OEM version of vSphere 5.5 (ESXi) Update 2 for a mid-size charity.

The hardware in use:

HP ProLiant DL380e Gen8 (bought brand new in August 2014), HP SmartArray B320i storage controller, HP H222 host bus adapter (with only an HP Ultrium 4 tape drive connected to it), HP Intel 4-port NIC 366i, 32 GB RAM, 2 quad-core Intel Xeon E5-2407

The box was initially installed and configured in August using the HP OEM vSphere 5.5 Update 1 installation CD. vSphere is installed on the RAID array configured on the B320i controller. A VMware Essentials license is also installed and in use.

It's running 3 Windows 2008 R2 VMs (DC, Exchange 2010 and a backup server with Backup Exec 2010 R3 [I know this is not a recommended/supported configuration, but it worked with 5.5 U1 without issues]) besides 2 Debian Linux VMs.

Two weeks ago, during weekend maintenance, I first installed the latest HP SPP (Service Pack for ProLiant) from Sept. 2014, which provided several firmware updates, e.g. for the B320i and the 366i NIC.

After that I performed an upgrade installation to the HP OEM version of vSphere 5.5 Update 2, which HP also released at the beginning of Sept.

All those setup/update procedures went through without any issues, error messages or crashes.

The host was running fine for 3 days and suddenly crashed with a PSOD stating: PCPU 0: no heartbeat (2/2 IPIs received) [unfortunately I did not take a screenshot]

I reset/rebooted the host through the iLO 4 console and kept an eye on the server over the following days.

The first PSOD took place during daily (nightly) backup on the connected tape drive.

On the following Friday/Saturday night (about 2 days later) it crashed again with the following PSOD - again with PCPU 0: no heartbeat (2/2 IPIs received):

PSOD1.PNG

So I started investigating this, found some hints here in the VMware communities pointing to recommended BIOS settings for HP ProLiant servers, checked the actual settings and changed them to the recommended values. The server ran fine without glitches for about 16 hours, then crashed again with this PSOD:

PSOD2.PNG

I continued investigating and paid particular attention to the power management settings in the BIOS, in vSphere and in the Windows VMs.

I also checked the installed firmware versions of the storage controllers and the NIC, and the driver versions in use. All OK there (as recommended in the HP VMware recipe from Sept. 2014).

The server ran fine for about a week after the reboot, then there was another PSOD early this morning at about 3 a.m.:

PSOD3.PNG

The server/VMs were mostly idle at this time, no heavy I/O activity.

The first two PSODs happened during backup but not at a certain time (one at about 10 p.m. the other early in the morning between 2 and 3 a.m.).

I read through tons of hints about faulty NIC drivers/firmware, BIOS configurations etc., but nothing helps, or everything is already configured exactly as in the HP recommendations for vSphere 5.x.

For the BIOS settings I followed this list/table: Recommended BIOS Settings on HP ProLiant DL580 G7 for VMware vSphere | Boerlowie's Blog

vSphere is configured to "High Performance Mode" and the Windows VMs, too.

I'm somehow stuck now, so maybe someone here has a good hint for me?

If you need any further hardware/software/configuration/whatever details, just ask.

Cheers and thanks in advance for any help,

cykVM

122 Replies
menait
Contributor

Our server crashed again this morning...then once more upon restart.  Then again this afternoon.

So I have no choice but to try the hpvsa driver update, which I just did. However, I don't know if it affected my throughput.

How did you test this?

cykVM
Expert

menait wrote:

So I have no choice but to try the hpvsa driver update, which I just did. However, I don't know if it affected my throughput.

How did you test this?

See post/answer 28 by me: HP ProLiant DL380e Gen8, HP OEM VMware ESXi 5.5 Update 2 keeps crashing (PSOD)

I first noticed it right after installing the -90 driver: slight hiccups when accessing the fileserver shares, and the speed when copying large files went up and then down and so on. But you have to use very large files, e.g. a 4 or 8 GB ISO file, and copy that from one VM (e.g. the fileserver) to another.

But the main impact was on the nightly backup. I've got a kind of unsupported configuration there. The LTO4 tape drive is connected to an HP H222 SAS host bus adapter (it is the one and only device on that controller) and the tape drive is passed through to a Windows 2008 R2 VM with Backup Exec 2010 R3.

This works perfectly well with 5.5 U1 and the -88 hpvsa driver. Backup of fileserver takes about 3-4 hours with verify.

After I installed the -90 driver the backup during the following night was still running next morning and predicted another 8-10 hours of further runtime.

Backup Exec shows the data throughput when writing to tape, which was at about 4 GB/min with the -88 driver and is now down to 0.3 GB/min (300 MB/min) with the -90 driver.

At first I thought something was wrong in the backed-up fileserver VM, so I rebooted the fileserver and the Backup Exec VM and re-ran the backup job, but the speed was still at about 0.3 GB/min. The backup would have taken something like 18 hours or so.

Cancelled the job again and checked further things, even rebooted the whole host. Nothing worked. So I went back to -88 hpvsa driver and speed was back to 4GB/min.

Did not do any further tests with copying files around etc.

If you don't have Backup Exec you may try a backup-to-disk job with the built-in Windows Backup tool.

menait
Contributor

I use FreeFileSync to do backups and you may be right about the performance hit.

2014-10-17_12-38-06.png

I tried running a backup today and I only get around 8.25 MB/sec. Somehow I remember it should be faster than that. Also, software that we run over the network seems slower. I hope it's only my imagination because so far, the server is actually running pretty stable. I have also noticed CPU usage has dropped considerably. So far, no PSOD (crossing my fingers).

That said, copying big files back and forth using Windows Explorer doesn't seem to be affected.  Speed seems normal to me.

cykVM
Expert

menait wrote:

I use FreeFileSync to do backups and you may be right about the performance hit.

I tried running a backup today and I only get around 8.25 MB/sec. Somehow I remember it should be faster than that. Also, software that we run over the network seems slower. I hope it's only my imagination because so far, the server is actually running pretty stable. I have also noticed CPU usage has dropped considerably. So far, no PSOD (crossing my fingers).

That said, copying big files back and forth using Windows Explorer doesn't seem to be affected.  Speed seems normal to me.

It was only on copying some larger files between the VMs but I did not do intensive testing on that as the server was running stable at that point and none of the users complained.

But the massive speed drop during backups was not acceptable. Your 8.25 MB/sec works out to roughly 500 MB/min, which is in the same range as the 300-400 MB/min I got. But Backup Exec uses agents to access Windows filesystems/shares, so that is not comparable to a simple file copy in Windows Explorer.
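To put the figures quoted in this thread on the same scale, a small unit conversion helps. This is only an illustrative sketch: it assumes decimal units (1 GB = 1000 MB), and the function names are made up for this example.

```python
# Convert between the throughput units quoted in this thread.
# Assumption: decimal units, i.e. 1 GB = 1000 MB.


def mb_per_sec_to_mb_per_min(mb_per_sec):
    """MB/sec (as reported by a file sync tool) to MB/min."""
    return mb_per_sec * 60.0


def gb_per_min_to_mb_per_sec(gb_per_min):
    """GB/min (as reported by the backup software) to MB/sec."""
    return gb_per_min * 1000.0 / 60.0


if __name__ == "__main__":
    print(f"8.25 MB/sec = {mb_per_sec_to_mb_per_min(8.25):.0f} MB/min")
    print(f"4 GB/min    = {gb_per_min_to_mb_per_sec(4.0):.1f} MB/sec")
    print(f"0.3 GB/min  = {gb_per_min_to_mb_per_sec(0.3):.1f} MB/sec")
```

So both slow figures are in the single-digit MB/sec range, while the tape backup with the -88 driver ran at well over 60 MB/sec.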

Doesn't FreeFileSync write any logs for the backup/sync jobs which might contain the throughput rate?

I still think that something relevant changed in the 5.5 U2 kernel which makes the older hpvsa drivers unusable in some way. Another strange thing was that the backup speed for direct (full) VM backups did not drop. So I can only speculate about what's going wrong here. My guess is that the Windows VM goes through the hpvsa driver to access the virtual disk, then it goes down to the kernel, and that at some point crashes because it thinks the datastore/disk volume is not accessible or it runs into timeouts.

jbam
Contributor

I updated to hpvsa-90 and have not had a PSOD/crash in the last 30 hours. If the server doesn't PSOD over the weekend I will be happy.

I have not noticed the file transfer speed being impacted, I can still saturate my links.  I will test more this afternoon.

rubensinfo
Contributor

Hello guys

Sorry for the absence

Caveat, I'm using:

      -HP DL Gen8 360e

      -B320i

      -VMware 5.5.0 build-2068190

      But the error screen is the same ...

The PSOD has not shown up again, and the host has now been online for four days.

But a simple write test with the dd tool inside a Linux VM showed oscillations (switching from fast to slow) when writing large files (4 GB, for example), as in the image below:

test-dd.png

Note: on every other host where I ran the dd test the results showed no oscillation, and I always test outside of production.

I will also test with VMware I/O Analyzer to see the results for random access (reading and writing), as dd only covers sequential writing.
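For anyone who wants to look at the oscillation in numbers rather than by eye, here is a rough Python sketch of a sequential write test. It is not the dd run from the screenshot: the function name, chunk size and file size are assumptions chosen for illustration, and forcing an fsync after every chunk will generally give lower figures than a plain dd.

```python
import os
import tempfile
import time


def timed_sequential_write(path, total_bytes, chunk_bytes=4 * 1024 * 1024):
    """Write total_bytes of zeros to path chunk by chunk.

    Returns a list with the throughput of every chunk in MB/sec
    (decimal MB), so slow phases show up as dips in the list.
    """
    chunk = bytes(chunk_bytes)
    rates = []
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            size = min(chunk_bytes, total_bytes - written)
            start = time.perf_counter()
            f.write(chunk[:size])
            f.flush()
            os.fsync(f.fileno())  # push the chunk down to the virtual disk
            elapsed = time.perf_counter() - start
            rates.append(size / 1e6 / max(elapsed, 1e-9))
            written += size
    return rates


if __name__ == "__main__":
    # Small demo size; use several GB to mimic the large-file test.
    with tempfile.TemporaryDirectory() as tmp:
        rates = timed_sequential_write(os.path.join(tmp, "probe.bin"),
                                       8 * 1024 * 1024)
        print(f"min {min(rates):.1f} MB/sec, max {max(rates):.1f} MB/sec")
```

A large gap between the minimum and maximum on an otherwise idle host would match the fast/slow switching described above.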

cykVM
Expert

Thanks jbam and rubensinfo for your input.

@rubensinfo: it might also only affect continuous streams inside VMs, as used e.g. for data backups (to disk or tape) or dd. So with "normal" usage like a simple file copy you may not even notice the impact.

It does not seem to have an impact on direct datastore access via the storage APIs, which are only available with at least a VMware Essentials license.

I will keep 5.5 U1 until I know for sure this is fixed - not willing to wait for 12 to 18 hours for the basic fileserver backup to finish. Anyway if such a performance loss occurs you are also never sure that data is written correctly to the actual device.

Frederic9500
Contributor

I know about an issue with the smart array driver 5.5.0.58.1, where you need to update the driver to 5.5.60.x

Here

I have also seen another issue with an old Broadcom network driver: a PSOD happened when transferring files bigger than 8 GB, every night when the Veeam backup was running

Here

Maybe this can help you

cykVM
Expert

Frederic9500 wrote:

I know about an issue with the smart array driver 5.5.0.58.1, where you need to update the driver to 5.5.60.x

Here

That's a different controller. That should be a Pxxx series controller using the hpsa VMware driver and not hpvsa (which is for the integrated [embedded] Dynamic Smart Array).

Frederic9500 wrote:

I have also seen another issue with an old Broadcom network driver: a PSOD happened when transferring files bigger than 8 GB, every night when the Veeam backup was running

Here

I'm not using a Broadcom NIC anyway; I have a 4-port Intel NIC using the igb driver, which, as far as I can see, has no issues. I did not get any PSOD when copying large files or running backups; only the performance was massively decreased.

Anyway, thanks for your suggestions, but I'm afraid this doesn't help in this case.

solphonic
Contributor

I came across this exact issue last week with a brand new HP ML350e Gen8 which has a B120i controller - had it at the customer's premises doing a P2V migration overnight with VMware Converter when it just stopped responding (PSOD) - I had just freshly installed the latest HP 5.5 U2 image.

It was very hard to track down - as the error looked like it had to do with the second CPU (which I'd installed myself) - PCPU:6 no heartbeat etc etc.

I tried updating the BIOS, reseating the CPU heatsink, changing all the power options as recommended in various other posts, reinstalling 5.5 to the SD, cloning the SD card to a USB key etc etc.

After the first time it hung it would fail to even load ESXi properly before the PSOD came up again.

Finally I nuked the array and rebooted and voilà! So I deduced it had to be the storage controller - which is what led me to this thread.

I have since installed the updated storage controller driver as suggested and can confirm stability for the last 3 days - and network speeds appear to be normal.

jawad
Contributor

I have had 2 PSODs on my BL 460Gen8 servers with U2a (2143827) and iLO4 2.02. Have also upgraded to latest BIOS, hpsa driver, hpsa firmware (P420i) and so on. Still got a PSOD on two different hosts.

pinkscreen.jpg

pinkscreen2.jpg

In the shadows...
cykVM
Expert

Hi jawad,

this is a completely different server configuration from the hardware perspective. I would recommend putting up a new discussion on that. Also the PSODs are different from the ones we got.

Regards,

cykVM

sgloupi
Contributor

Hello everybody,

I have roughly the same configuration as you: HP ProLiant Gen8 DL360e with a B320i controller.

The PSOD appeared when I transferred a VM to my new HP server (I used SCP): the copy job stopped randomly. I tried many times: sometimes the PSOD happened at 30%, sometimes at 80%...

After a lot of searching, I disabled "Collaborative Power Control" in the BIOS, and I was able to copy/move my VM to my new HP server.

Being cautious, before starting my VM and putting it into production I created a fresh VM on my new HP server and ran many stress tests in order to test the new server. No PSOD after a week of tests. So I decided to put my VM into production.

Ha the great life of an IT technician....

But (yes there is a but) after one day and half of production : bam PSOD!

Ha the great life of an IT technician....

So, after I read this thread, I tried (in spite of cykVM's warnings) the new hpvsa -90 driver, and for the moment it works (for at least 2 days)...

Thanks guys for your information, I will try to keep you up to date with my story.

cykVM
Expert

Thanks for your input. Next weekend I might test the hpvsa-90 driver again, but this time after first updating the mpt2sas driver for the H222 SAS HBA (to which only the tape drive is connected). I will see how this works with my downgraded VMware 5.5 U1 installation.

If everything runs fine and especially backup performance stays OK with that configuration I will try another upgrade to 5.5 U2 the weekend after.

cykVM
Expert

Things are getting stranger and stranger with every test I run.

As written in my initial posting, this server is located at a charity and I do not have a 2nd machine for testing things. So everything is tested on the production machine. It was hard enough to make them buy this box, as we always have to keep an eye on the budget.

I noticed yesterday that reverting to the 5.5 Update 1 kernel with the SHIFT+r recovery method also reverted to the initially installed hpvsa 5.5.0-86 driver. So I decided to first try the -88 driver version. I installed that early yesterday morning and rebooted the system.

Everything was running fine, with no performance drop on tape backups. No glitches during the workday either. But I did not provision a new VM or anything similar yesterday.

This morning the server PSODed again, once more with the no heartbeat message. This must have happened sometime between 11 p.m. yesterday and 3 a.m. this morning UK time.

After a reboot and a quick check of the BIOS settings it booted fine into VMware, but roughly at the point where nscd was loaded it hit the "no heartbeat" PSOD again. This happened twice, and luckily on the third restart VMware came up and I could connect through ssh and revert to the -86 driver.

For whatever reason there must be something utterly wrong in hpvsa version -88 and -90, at least for compatibility with VMWare 5.5.0 Update 1 (build 1746018).

There is one strange thing on hpvsa -86 driver: The *.vib file shows version -86 but after installation system shows:

# vmkload_mod -s hpvsa |grep Version

   Version: Version 5.5.0-84OEM, Build: 1331820, Interface: 9.2 Built on: Apr  3 2014

Anyway this is working for me so far (and was before the upgrade to -88 or -90 on Update 1)
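To make the mismatch between the *.vib name and the loaded module easy to spot, the version line can be parsed and compared. The sketch below works only on the output line pasted above; the function names and the expected revision (86, taken from the VIB file name) are assumptions for illustration, and it does not call any VMware tool itself.

```python
import re

# The line printed by "vmkload_mod -s hpvsa | grep Version" above.
SAMPLE = ("   Version: Version 5.5.0-84OEM, Build: 1331820, "
          "Interface: 9.2 Built on: Apr  3 2014")


def parse_module_version(line):
    """Return (version string, build number) or None if the line does not match."""
    m = re.search(r"Version:\s+Version\s+(\S+?),\s+Build:\s+(\d+)", line)
    if m is None:
        return None
    return m.group(1), int(m.group(2))


def driver_revision(version):
    """Extract the revision after the dash, e.g. 84 from '5.5.0-84OEM'."""
    m = re.match(r"\d+\.\d+\.\d+-(\d+)", version)
    return int(m.group(1)) if m else None


if __name__ == "__main__":
    version, build = parse_module_version(SAMPLE)
    expected = 86  # revision in the VIB file name (assumption for this example)
    loaded = driver_revision(version)
    status = "match" if loaded == expected else "MISMATCH"
    print(f"loaded {version} (build {build}), expected -{expected}: {status}")
```

With the line above this reports a mismatch, since the loaded module identifies itself as revision 84.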

Another piece of evidence that the hpvsa driver is to blame: on a PSOD the system is no longer able to write the kernel debug/trace (coredump) to disk. I configured a file to write dumps to, but this fails.

Qudzie_Mars
Contributor

I had a similar issue, but my issue was that I couldn't upgrade from 5.5 U1 to 5.5 U2. The solution was to download the ISO directly from VMware, not from HP; it seems the customized HP ISOs are not working as expected.

I'm using an ML150 G6 server in my lab environment with 38 GB of memory.

Hope this will assist ....

cykVM
Expert

That won't work with our hardware configuration, as the B120i/B320i controller is not supported in the stock VMware installation images. It needs the hpvsa driver, which is only available in the HP customized images. The upgrade to 5.5 Update 2 and the initial installation of 5.5 Update 1 worked without any issues for my setup.

abelliot
Contributor

Hi,

Same problem on a fresh install.

2 crashes during the night (while Veeam was running).

I have updated the hpvsa driver to -90.

Now waiting to see if it works.

cykVM
Expert

Hi abelliot,

Just out of curiosity: Is Veeam backing up to disk, external disk or tape?

sgloupi
Contributor

So I come to you to let you know the result of my experience.

In short: everything worked fine for a week. Performance seemed good to me (and to my users too).

For the backup I use a USB hard drive with Windows Backup: for me the backup time is fine.

But after a week in production, I had a new PSOD, though not with the same message (Exception 14)!

Today I can not say that this server is reliable ....
