VMware Cloud Community
robertwbrandt
Contributor
Contributor

ESXi Crashes on HP ML350p Gen8 with P420i

I recently ran across a disturbing problem.

We had a server that would crash every 5-15 minutes are booting up.

The server is an HP ML350p Gen8 with a P420i RAID card 16GB of memory and a 4TB RAID6 disk. And we are using the HP version of ESXi 5.1.0, 799733.

The problem was that the server would just stop responding.  Sometimes it would be ping-able, but would not respond to VI commands or the VI Client.  And worst, the VMs on the box would immediately cease!

I eventually found out that the problem was a bad hard drive, but here is the problem.  A single bad hard drive should not KILL a server, especially when using RAID6.  I initially thought the problem might be an HP issue, but the server itself was blinking the bad disk and the Array Manager identified the disk as bad.  I think the problem is either with VMware or the HP device driver for VMware.

Below are syslog extracts from the VMware box just before the system froze:


2014-08-15T13:53:06Z hp-ams[5896]:  --- done with initialize...
2014-08-15T13:53:10Z hpHelper[5896]: WARNING: POST Error: 1778-Slot X Drive Array Resuming Automatic Data Recovery (Rebuild) process
2014-08-15T13:55:01Z crond[4520]: crond: USER root pid 6807 cmd /sbin/hostd-probe
2014-08-15T13:55:01Z syslog[6808]: starting hostd probing.
2014-08-15T13:55:16Z syslog[6808]: hostd probing is done.
2014-08-15T14:00:01Z crond[4520]: crond: USER root pid 7065 cmd /usr/lib/vmware/vmksummary/log-heartbeat.py
2014-08-15T14:00:01Z crond[4520]: crond: USER root pid 7066 cmd /sbin/hostd-probe
2014-08-15T14:00:01Z syslog[7068]: starting hostd probing.
2014-08-15T14:00:16Z syslog[7068]: hostd probing is done.
2014-08-15T14:01:01Z crond[4520]: crond: USER root pid 7122 cmd /sbin/auto-backup.sh
2014-08-15T14:05:01Z crond[4520]: crond: USER root pid 7540 cmd /sbin/hostd-probe
2014-08-15T14:05:01Z syslog[7541]: starting hostd probing.
2014-08-15T14:05:16Z syslog[7541]: hostd probing is done.
2014-08-15T14:10:01Z crond[4520]: crond: USER root pid 7779 cmd /sbin/hostd-probe
2014-08-15T14:10:01Z syslog[7780]: starting hostd probing.
2014-08-15T14:10:16Z syslog[7780]: hostd probing is done.

2014-08-18T09:06:09Z hp-ams[5898]:  --- done with initialize...
2014-08-18T09:06:14Z hpHelper[5898]: WARNING: POST Error: 1778-Slot X Drive Array Resuming Automatic Data Recovery (Rebuild) process
2014-08-18T09:10:01Z crond[4520]: crond: USER root pid 6936 cmd /sbin/hostd-probe
2014-08-18T09:10:01Z syslog[6937]: starting hostd probing.
2014-08-18T09:10:16Z syslog[6937]: hostd probing is done.
2014-08-18T09:15:01Z crond[4520]: crond: USER root pid 7181 cmd /sbin/hostd-probe
2014-08-18T09:15:01Z syslog[7182]: starting hostd probing.
2014-08-18T09:15:16Z syslog[7182]: hostd probing is done.
2014-08-18T09:20:01Z crond[4520]: crond: USER root pid 7449 cmd /sbin/hostd-probe
2014-08-18T09:20:01Z syslog[7450]: starting hostd probing.
2014-08-18T09:20:16Z syslog[7450]: hostd probing is done.
2014-08-18T09:25:01Z crond[4520]: crond: USER root pid 7690 cmd /sbin/hostd-probe
2014-08-18T09:25:01Z syslog[7691]: starting hostd probing.
2014-08-18T09:25:16Z syslog[7691]: hostd probing is done.

2014-08-18T11:55:42Z hp-ams[5904]:  --- done with initialize...
2014-08-18T11:55:46Z hpHelper[5904]: WARNING: POST Error: 1778-Slot X Drive Array Resuming Automatic Data Recovery (Rebuild) process
2014-08-18T12:00:01Z crond[4520]: crond: USER root pid 6975 cmd /usr/lib/vmware/vmksummary/log-heartbeat.py
2014-08-18T12:00:01Z crond[4520]: crond: USER root pid 6976 cmd /sbin/hostd-probe
2014-08-18T12:00:01Z syslog[6978]: starting hostd probing.
2014-08-18T12:00:16Z syslog[6978]: hostd probing is done.
2014-08-18T12:01:01Z crond[4520]: crond: USER root pid 7032 cmd /sbin/auto-backup.sh
2014-08-18T12:05:01Z crond[4520]: crond: USER root pid 7457 cmd /sbin/hostd-probe
2014-08-18T12:05:01Z syslog[7458]: starting hostd probing.
2014-08-18T12:05:16Z syslog[7458]: hostd probing is done.
2014-08-18T12:10:01Z crond[4520]: crond: USER root pid 7692 cmd /sbin/hostd-probe
2014-08-18T12:10:01Z syslog[7693]: starting hostd probing.
2014-08-18T12:10:16Z syslog[7693]: hostd probing is done.

Luckily this system was at a site which I could reach fairly easily, but we have may boxes that are scattered this little country.

Has anyone else seen this problem or have a fix?

Thanks

Bob

Tags (4)
0 Kudos
4 Replies
JPM300
Commander
Commander

Hey Robert,


I would say update to the latest Firmware / Drivers and see if the problem continues.  This deffirently seems Raid Controller related.  What firmware version are you currently running?

robertwbrandt
Contributor
Contributor

Thanks,

The firmware version I'm running on (for the P420i) is 4.6.8. Looking at the HP website there appears to be a 5.42  version. (But the website is as clear as reading tea leaves)

I was thinking of upgrading the version of ESXi on the box. There was a new update released in June 2014. (We were running the Sept 2013 version)

I will upgrade and update this thread, but the problem is that I can't replicate the problem so I have no way of knowing this fix will work until another disk fails.  Not an ideal situation...

Bob

0 Kudos
JPM300
Commander
Commander

Yeah, if you still have a support contract you can download HP's firmware disk, you just boot off that ISO and it scans your system and updates any firmware to the latest that It can.  You can always pick and choose what you want to do, but HP makes this process really easy.

0 Kudos
FritzBrause
Enthusiast
Enthusiast

Since you run the Seb 2013 release, I guess you do not run the hpsa driver 5.x.0.58-1.

Nevertheless, check on your system with esxcli software vib get | grep hpsa.

If you run this version, upgrade.

References:

http://h20564.www2.hp.com/portal/site/hpsc/public/kb/docDisplay/?docId=c04302261

http://kb.vmware.com/kb/2075978


0 Kudos