Datastore / Disk latency problems with HP ProLiant...

digitalnomad · ‎12-02-2015

After updating a mixture of G7, G8 and G9 VMHosts to 5.5 Update 3a and Cookbook release September 2015 (SPP JUN 15), I started having host errors specifically on my G7 hardware. One host went so far as to disconnect from vCenter.

I went for the full update on the OS to bring it fully in line with HP's recipe. This was the first time I had touched the drivers in quite a while, so when the first errors started coming in, I immediately suspected the HPSA v106 driver (Version: 5.5.0.106-1OEM), which had been updated from v50 (Version: 5.5.0.50-1OEM).

vCenter is reporting errors:

Lost access to volume 56424481-7f094eb0-8ee6-80c16e6e15e0 (VMHost_local) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

info 2/2/2015 9:00:45 AM (VMHost_local)

vmkernel.log was reporting some conflicts with claim rules between PowerPath and the NMP for the local disk, but those were cleared. It also shows:

2015-11-29T07:53:04.162Z cpu22:33327)WARNING: LinScsi: SCSILinuxAbortCommands:1843: Failed, Driver hpsa, for vmhba1

Hostd.log

2015-11-30T19:37:48.614Z [248C4B70 info 'Vimsvc.ha-eventmgr'] Event 1820 : Lost access to volume 4ffd89b4-760e9689-81e3-e83935a81a45 (gldpiesx002_local) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

2015-11-30T19:37:48.615Z [248C4B70 info 'Vimsvc.ha-eventmgr'] Event 1821 : Successfully restored access to volume 4ffd89b4-760e9689-81e3-e83935a81a45 (gldpiesx002_local) following connectivity issues.
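For anyone who wants to measure how long each outage lasts, here is a minimal Python sketch that pairs the "Lost access" and "Successfully restored" events from hostd.log lines like the two above. The regular expression and the function name `pair_outages` are assumptions based only on those two sample lines; other hostd.log layouts may need adjusting.

```python
import re
from datetime import datetime

# Pattern inferred from the two hostd.log lines quoted in this post (an assumption).
EVENT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z .*?"
    r"(?P<kind>Lost access to|Successfully restored access to) volume "
    r"(?P<uuid>[0-9a-f-]+) \((?P<name>[^)]+)\)"
)


def pair_outages(lines):
    """Return (datastore name, outage seconds) for each lost/restored pair, in log order."""
    pending = {}   # volume UUID -> time access was lost
    outages = []
    for line in lines:
        m = EVENT_RE.search(line)
        if m is None:
            continue
        ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S.%f")
        if m["kind"].startswith("Lost"):
            pending[m["uuid"]] = ts
        elif m["uuid"] in pending:
            lost_at = pending.pop(m["uuid"])
            outages.append((m["name"], (ts - lost_at).total_seconds()))
    return outages


SAMPLE = [
    "2015-11-30T19:37:48.614Z [248C4B70 info 'Vimsvc.ha-eventmgr'] Event 1820 : "
    "Lost access to volume 4ffd89b4-760e9689-81e3-e83935a81a45 (gldpiesx002_local) "
    "due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.",
    "2015-11-30T19:37:48.615Z [248C4B70 info 'Vimsvc.ha-eventmgr'] Event 1821 : "
    "Successfully restored access to volume 4ffd89b4-760e9689-81e3-e83935a81a45 "
    "(gldpiesx002_local) following connectivity issues.",
]

if __name__ == "__main__":
    for name, seconds in pair_outages(SAMPLE):
        print(f"{name}: access lost for {seconds:.3f} s")
```

On the sample pair the gap between the two events is one millisecond, which matches the "restored almost immediately" behaviour described here.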

I opened tickets with both HP and VMware.

VMware came back with the first fix, which was to upgrade the hpsa driver (hpsa 5.5.0.114-1OEM), which can be downloaded from https://my.vmware.com/web/vmware/details?downloadGroup=DT-ESXI55-HP-HPSA-550114-1OEM&productId=353. However, overnight the errors returned.

HP has suggested back-revving to 5.5.0.74-1 (1 Oct 2014), but from a previous discussion here [ Datastore / Disk latency problems with HP ProLiant DL380 G7 - HP Smart Array P410i controller after ... ] I believe that was also a bad version.

I'm going to wander down the road a bit and see where it leads and, if necessary, go back to the 70 or even 50 release.

Comments welcome

digitalnomad · ‎12-02-2015

Found a version 116 on the HP website, released 11/30/15 [ https://h20565.www2.hpe.com/hpsc/swd/public/detail?sp4ts.oid=4142793&swItemId=MTX_ae7b6b8db7044b5b89... ], and the issue still surfaces.

VirtualCop · ‎12-03-2015

Hi digitalnomad,

1) thx for the info regarding scsi-hpsa driver ver 116!

2) An HP call is open. An HP insider said the fix for this issue was in scope for ver 110, but that version was not released because it was unstable.

3) Firmware for the P410i upgraded to 6.64.. the same story ;(

4) I can confirm that scsi-hpsa ver 114, contained in the HP-customized 5.5 U3a image (VMware-ESXi-5.5.0-Update3a-3116895-HP-550.9.4.26-Nov2015.iso), still produces "Lost access" events periodically (every 15-40 min), ... but only if a spare drive for RAID 5 is configured!

Could someone confirm that unconfiguring the spare drive fixes this problem?

Regards,

Cop

digitalnomad · ‎12-04-2015

I have confirmed that my drives are in a mirror configuration with a dedicated spare.... not a RAID 5 configuration. I'll try running a reconfigure to see if that alleviates the errors; however, due to compliance issues, I can't run in that configuration.

For the record, I tried every release of the driver: 116, 114, 84 and 74 all failed. I had to rev back to 60.

I tied this back to VMware Engineering's case with HP, which is ongoing: 4651641937

Regards DGN

ManiacMark · ‎12-07-2015

I'm running ESXi 6.0 (HP image) on a DL380G7 with P410i 1GB capacitor-backed cache. Array is local storage Raid5 with 1 hot spare.

I have no guests running and I'm getting the "Lost access to volume".... followed by "Successfully restored" within 1 second. Sometimes this happens at exact 1 hour intervals, other times it's about 1.5 hours or so.

I've refrained from loading my production guests because of these warnings.

I installed 6.0 Update 1 and it didn't help, so a few hours ago I installed the latest available patch, 3247720.

relevant vibs installed are:

scsi-hpsa 6.0.0.114-1OEM.600.0.0.2494585

hpssacli 2.30.6.0-6.0.0.2159203

hptestevent 6.0.0.01-00.00.8.2159203

scsi-hpdsa 5.5.0.46-1OEM.550.0.0.1331820

scsi-hpvsa 5.5.0.100-1OEM.550.0.0.1331820

After that patch install I had 3 occurrences of "Lost access to volume", each about 1 hour apart.

At that point I clicked the Rescan storage button just to see if anything would change. Now I haven't had one warning in the last 5 hours. Interesting. I'll keep checking....

ManiacMark · ‎12-07-2015

No luck. Even with 6.0.0.114-1OEM.600.0.0.2494585 I still get the warnings/errors and I see this repeatedly in vmkernel.log:

WARNING: LinScsi: SCSILinuxAbortCommands:1882: Failed, Driver hpsa, for vmhba1
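To get a feel for how often these aborts occur, a small Python sketch like the following could tally them per driver and adapter from a copy of vmkernel.log. The pattern is inferred from the warning lines quoted in this thread, and `count_aborts` is a made-up helper name, so treat it as a starting point rather than a tested tool.

```python
import re
from collections import Counter

# The number after "SCSILinuxAbortCommands:" differs between posts
# (1843, 1882, 1891), so it is matched loosely here.
ABORT_RE = re.compile(r"SCSILinuxAbortCommands:\d+: Failed, Driver (\w+), for (\w+)")


def count_aborts(lines):
    """Count failed abort warnings per (driver, adapter) pair."""
    counts = Counter()
    for line in lines:
        m = ABORT_RE.search(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts


SAMPLE = [
    "2015-11-29T07:53:04.162Z cpu22:33327)WARNING: LinScsi: "
    "SCSILinuxAbortCommands:1843: Failed, Driver hpsa, for vmhba1",
    "WARNING: LinScsi: SCSILinuxAbortCommands:1882: Failed, Driver hpsa, for vmhba1",
    "some unrelated vmkernel line",
]

if __name__ == "__main__":
    for (driver, adapter), n in count_aborts(SAMPLE).items():
        print(f"{driver} on {adapter}: {n} failed aborts")
```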

barcones · ‎12-09-2015

Hi,

Same problem with HP DL380 G7 with VMware-ESXi-6.0.0-Update1-3073146-HP-600.9.4.34-Nov2015.

6 disks in RAID 1+0 + 2 disks for Hot Spare.

Hope HP fixes this issue soon.

Best regards.

digitalnomad · ‎12-09-2015

Darn, I couldn't reconfigure the spare on the fly, even with the advanced feature pack. All my G7s are at remote sites as well.

Can anyone confirm that the trigger is the actual "spare" configuration?

Here's the latest update from HP:

This e-mail is with reference to the case number: xxx logged for DL580 G7.

I could gather that 4651641937 is already elevated and the Level 2 support are working on the issue.

Please confirm if I can go ahead and close the case xxxx or keep the case open.

If the issue is not resolved or you need further assistance please get back to us on chat for further support. We are available 24x7 at www.hp.com/go/hpchat .

Thank you for contacting HP!

Check's in the mail....

DGN

VirtualCop · ‎12-11-2015

Hey DGN,

I can confirm that when the "Lost access-Successfully restored" events appear periodically, they disappear after the spare drive is removed.

----------

Example: spare is active:

logicaldrive 2 (2.7 TB, RAID 5, OK)

      physicaldrive 2C:1:4 (port 2C:box 1:bay 4, SAS, 600 GB, OK)
      physicaldrive 3C:1:5 (port 3C:box 1:bay 5, SAS, 600 GB, OK)
      physicaldrive 3C:1:6 (port 3C:box 1:bay 6, SAS, 600 GB, OK)
      physicaldrive 3C:1:7 (port 3C:box 1:bay 7, SAS, 600 GB, OK)
      physicaldrive 3C:1:8 (port 3C:box 1:bay 8, SAS, 600 GB, OK)
      physicaldrive 4C:2:4 (port 4C:box 2:bay 4, SAS, 600 GB, OK)
      physicaldrive 2C:1:3 (port 2C:box 1:bay 3, SAS, 600 GB, OK, spare)

"Lost access" errors appear roughly every 20 minutes:

Lost access to volume 565db408-5e5976f2-da1d-d485644693cc (v-L002-RAID5TEST) due to connectivity issues.
Recovery attempt is in progress and outcome will be reported shortly. info 11.12.2015 09:16:01

Lost access to volume 565db408-5e5976f2-da1d-d485644693cc (v-L002-RAID5TEST) due to connectivity issues.
Recovery attempt is in progress and outcome will be reported shortly. info 11.12.2015 08:55:50

Lost access to volume 565db408-5e5976f2-da1d-d485644693cc (v-L002-RAID5TEST) due to connectivity issues.
Recovery attempt is in progress and outcome will be reported shortly. info 11.12.2015 08:36:09
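As a quick check of the spacing between these events, here is a minimal Python sketch that pulls the timestamps out of the event text and prints the gaps. It assumes the vCenter client shows dates as day.month.year (so 11.12.2015 is 11 December 2015); the function name `event_intervals` is only illustrative.

```python
import re
from datetime import datetime

# Matches the trailing "info DD.MM.YYYY HH:MM:SS" part of each event
# (date order is an assumption about the client's locale).
STAMP_RE = re.compile(r"info (\d{2}\.\d{2}\.\d{4} \d{2}:\d{2}:\d{2})")


def event_intervals(text):
    """Return the gaps in seconds between consecutive events, oldest first."""
    stamps = sorted(
        datetime.strptime(s, "%d.%m.%Y %H:%M:%S") for s in STAMP_RE.findall(text)
    )
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


SAMPLE = """\
Recovery attempt is in progress and outcome will be reported shortly. info 11.12.2015 09:16:01
Recovery attempt is in progress and outcome will be reported shortly. info 11.12.2015 08:55:50
Recovery attempt is in progress and outcome will be reported shortly. info 11.12.2015 08:36:09
"""

if __name__ == "__main__":
    for gap in event_intervals(SAMPLE):
        print(f"{gap / 60:.1f} minutes")
```

For the three events listed, the gaps come out at 1181 s and 1211 s, i.e. close to 20 minutes each.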

-------------

Spare drive was unconfigured for 2h:

/opt/hp/hpssacli/bin/hpssacli ctrl slot=0 array B remove spares=2C:1:3

/opt/hp/hpssacli/bin/hpssacli ctrl all show config

logicaldrive 2 (2.7 TB, RAID 5, OK)

      physicaldrive 2C:1:4 (port 2C:box 1:bay 4, SAS, 600 GB, OK)
      physicaldrive 3C:1:5 (port 3C:box 1:bay 5, SAS, 600 GB, OK)
      physicaldrive 3C:1:6 (port 3C:box 1:bay 6, SAS, 600 GB, OK)
      physicaldrive 3C:1:7 (port 3C:box 1:bay 7, SAS, 600 GB, OK)
      physicaldrive 3C:1:8 (port 3C:box 1:bay 8, SAS, 600 GB, OK)
      physicaldrive 4C:2:4 (port 4C:box 2:bay 4, SAS, 600 GB, OK)

unassigned

physicaldrive 2C:1:3 (port 2C:box 1:bay 3, SAS, 600 GB, OK)
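If you have many hosts to check, a short Python sketch like this one could flag which physical drives are configured as spares in saved "show config" output. It relies only on the "physicaldrive ... (…, spare)" layout shown in the listings above; `spare_drives` is a hypothetical helper, and other hpssacli versions may format the output differently.

```python
import re

# One physicaldrive entry: the drive ID, then comma-separated details in parentheses.
DRIVE_RE = re.compile(r"physicaldrive\s+(\S+)\s+\(([^)]*)\)")


def spare_drives(config_text):
    """Return the IDs of physical drives whose details include the word 'spare'."""
    spares = []
    for drive_id, details in DRIVE_RE.findall(config_text):
        fields = [field.strip() for field in details.split(",")]
        if "spare" in fields:
            spares.append(drive_id)
    return spares


WITH_SPARE = """\
logicaldrive 2 (2.7 TB, RAID 5, OK)
      physicaldrive 4C:2:4 (port 4C:box 2:bay 4, SAS, 600 GB, OK)
      physicaldrive 2C:1:3 (port 2C:box 1:bay 3, SAS, 600 GB, OK, spare)
"""

WITHOUT_SPARE = """\
logicaldrive 2 (2.7 TB, RAID 5, OK)
      physicaldrive 4C:2:4 (port 4C:box 2:bay 4, SAS, 600 GB, OK)
unassigned
physicaldrive 2C:1:3 (port 2C:box 1:bay 3, SAS, 600 GB, OK)
"""

if __name__ == "__main__":
    print("spares configured:", spare_drives(WITH_SPARE))
    print("spares after removal:", spare_drives(WITHOUT_SPARE))
```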

No "Lost access" errors!

After 2h the spare drive was configured back.... and the errors appeared again!

2015-12-11T11:05:51.423Z cpu3:32967)WARNING: LinScsi: SCSILinuxAbortCommands:1843: Failed, Driver hpsa, for vmhba1

For me it is a clear sign that the scsi-hpsa driver has an issue with the spare-drive heartbeat method.

Why is it so difficult to fix?

I have been waiting nearly 6 months for a solution from HP...

The HP call is still open.

Regards

Cop

digitalnomad · ‎12-14-2015

Thanks Cop.... owe ya a signal 8

I agree, but HP has been having some severe problems with their software builds of late, especially with their Utilities bundle crapping out the hosts. Hopefully they can add a little QA to their software builds.

That basically nails the symptomology....

So, in summary:

The issue occurs with an HPSA driver later than v60 driving an HP Smart Array P410 controller of any firmware vintage with a spare drive configured. The spare drive configuration seems to cause an I/O back-feed into the driver, blowing it up at irregular intervals. This issue will not surface unless a spare drive is configured on the array controller.

The existing HP case # from VMware Engineering is 4651641937. Feel free to jump on the boat; this is a long-standing issue.
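The summary above boils down to a simple rule of thumb, which could be sketched in Python as follows. This only encodes what is reported in this thread (driver build later than 60 plus a configured spare); it is not an HP or VMware diagnostic, and the helper names `hpsa_build` and `matches_reported_pattern` are made up for illustration.

```python
import re


def hpsa_build(version):
    """Extract the driver build (e.g. 114) from a string like '6.0.0.114-1OEM.600.0.0.2494585'."""
    m = re.match(r"\d+\.\d+\.\d+\.(\d+)-", version)
    if m is None:
        raise ValueError(f"unrecognised hpsa version string: {version!r}")
    return int(m.group(1))


def matches_reported_pattern(version, has_spare):
    """True if a host fits the pattern from the summary: build later than 60 and a spare configured."""
    return has_spare and hpsa_build(version) > 60


if __name__ == "__main__":
    checks = [
        ("5.5.0.114-1OEM", True),
        ("5.5.0.114-1OEM", False),
        ("5.5.0.50-1OEM", True),
    ]
    for version, has_spare in checks:
        print(version, "spare" if has_spare else "no spare",
              "->", matches_reported_pattern(version, has_spare))
```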

jcosta · ‎01-04-2016

Manic replied to my other post, but I am having this same issue on an R730: RAID 10 with one spare.

I don't usually configure spares but had an extra drive, and across several installs in the last month this is the first one where I have seen this issue.

I am going to remove the spare tonight and see if it fixes the issue. Fingers crossed!

ManiacMark · ‎01-20-2016

I can also confirm that once I removed the Spare drive from the array configuration, the warnings about lost volume access stopped.

VirtualCop · ‎03-04-2016

Hello all,

good news: HP continues working on this issue.

Could those of you who are still experiencing errors after enabling the spare drive (SCSILinuxAbortCommands:1843) check the following for me:

1) please create an ADU report, extract it and confirm that the following values in ADUReport.htm are NOT zero for the affected spare drive:

Spin Cycles: 0x0000001e

Spin Up Time: 0x005d

2) please replace the spare drive with any other drive with spin-up counts = 0 and report whether you still receive 1843 events hourly

Thank you.

Regards

Cop

Note: ADU report creation changed in ESXi 5.5. Here is an example command for the ADU tool:

C:\Temp\HPSSADU\hpssaduesxi.exe --server=esxi7000.mydomain.com --user=root --password=MyPass C:\Temp\esxi7000-ADU-2016-03-02.zip
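The two counters requested above are shown in hexadecimal, so a small Python sketch like this could convert them and check that they are non-zero. It assumes the fields appear in the extracted report text as "Name: 0x..." exactly as quoted in this post; the real ADUReport.htm is HTML and may need its markup stripped first. `spin_counters` is an illustrative name.

```python
import re

# Field names and hex format taken from the two values quoted in this post.
FIELD_RE = re.compile(r"(Spin Cycles|Spin Up Time):\s*(0x[0-9a-fA-F]+)")


def spin_counters(report_text):
    """Return the spin-related counters found in the text as decimal integers."""
    return {name: int(value, 16) for name, value in FIELD_RE.findall(report_text)}


SAMPLE = """\
Spin Cycles: 0x0000001e
Spin Up Time: 0x005d
"""

if __name__ == "__main__":
    counters = spin_counters(SAMPLE)
    for name, value in counters.items():
        print(f"{name}: {value}")
    print("all non-zero:", all(value != 0 for value in counters.values()))
```

For the example values, 0x0000001e is 30 spin cycles and 0x005d is 93.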

vKopp · ‎05-09-2016

Hello guys,

I've found that a new scsi-hpsa driver version (ver 118) was released!!!

http://h20564.www2.hpe.com/hpsc/swd/public/detail?swItemId=MTX_65ed08c0b1b946e8b92bf1314a#tab-histor...

Version: 5.5.0.118-1(21 Apr 2016)

Operating System(s): VMware vSphere 5.5

File name: scsi-hpsa-5.5.0.118-1OEM.550.0.0.1331820.x86_64.vib (74 KB)

I installed it on a DL370 G6 ESXi host and haven't seen any SCSI aborts for two hours!!!

I will do some more stress tests over the next few days and report back...

Regards,

Cop

de2rfg · ‎06-07-2016

Did version 118 work for you? I installed it a while ago but still see "vmkernel: cpu16:32841)WARNING: LinScsi: SCSILinuxAbortCommands:1891: Failed, Driver hpsa, for vmhba1" warnings.

ManiacMark · ‎07-21-2016

Since I am running 6.0 I found the 6.0 equivalent driver version 118 (hpsa-6.0.0.118-3638475.zip) here:

Drivers & Software - HP Support Center.‌

h20564.www2.hpe.com/hpsc/swd/public/detail?swItemId=MTX_e53ee74822884382a579582751#tab3

For those that don't know how, I installed the single VIB file via SSH connection with WinSCP/Putty:

[root@localhost:~] esxcli software vib install -v /tmp/hpsa118.vib --no-sig-check

Installation Result

Message: The update completed successfully, but the system needs to be rebooted for the changes to be effective.

Reboot Required: true

VIBs Installed: Hewlett-Packard_bootbank_scsi-hpsa_6.0.0.118-1OEM.600.0.0.2494585

VIBs Removed: Hewlett-Packard_bootbank_scsi-hpsa_6.0.0.116-1OEM.600.0.0.2494585

Another neat way to install a VIB, which I wasn't aware of, uses the embedded web host:

http://www.virtuallyghetto.com/2015/11/neat-way-of-installing-or-updating-any-vib-using-just-the-esx...

We'll see how it goes.... according to HP this version 118 fixes the issue.

Dick6502 · ‎08-01-2016

Hi,

Same here with a HP ProLiant DL380 G7 with P410i Smart Array controller (firmware 6.64) and local RAID 10 volume with hot spares.

(all firmware updated with Service Pack for ProLiant (SPP) Version 2016.04.0)

I'm experiencing high latency during "lost access to volume ....." event logs.

Patched this server to driver version "scsi-hpsa_6.0.0.118-1OEM.600.0.0.2494585.vib".

(server was installed with VMware-ESXi-6.0.0-Update2-3620759-HPE-600.9.5.0.48-Apr2016.iso lately)

But still same "lost access to volume ...." errors.

Still waiting/searching for final solution.

JorisBoth · ‎10-13-2016

Did anyone find an answer for this? I am having the same problem.. The only difference is that I have a RAID 50 setup without a hotspare.

I've already updated my ESXi 6 installation to the latest version and replaced the driver with the 118 version. The system runs fine for about a day or two, but then it all starts to slow down...

The strange thing is, I do have the latency errors and the WARNING: LinScsi: SCSILinuxAbortCommands:1843: Failed, Driver hpsa, for vmhba1 errors in my vmkwarning.log, but I don't have the Lost access to volume errors..

yubr · ‎10-18-2016

I have run into this issue too, with an HP DL380 G9 and P440ar, RAID 10 and RAID 6 (two LUNs) and a dedicated spare on both LUNs.

Firmware is at the latest, 4.02, using ESXi 6.0 U2. I tried drivers 114, 116, and 118; all give the same problem of "LinScsi: SCSILinuxAbortCommands:1891: Failed, Driver hpsa" and lost/restored access to the locally attached LUNs. It happens especially during a datastore rescan.

Dick6502 · ‎11-14-2016

Hi,

Well... Today I installed the new version 6.0.0.120-1 (24 Oct 2016). It has been running for 2 hours now and it looks like the "lost access to volume ...." events have disappeared!!!

I don't see any high latency spikes in the disk performance graph anymore either. I hope this issue is finally fixed.

Setup:

- HP ProLiant DL380 G7 with P410i Smart Array controller (firmware 6.64) and local RAID 10 volume with hot spares;

- All firmware updated with Service Pack for ProLiant (SPP) Version 2016.04.0;

- Install base was VMware-ESXi-6.0.0-Update2-3620759-HPE-600.9.5.0.48-Apr2016.iso

Patched this server to driver version "scsi-hpsa_6.0.0.120-1OEM.600.0.0.2494585.vib".

Check out / download / instructions: http://h20564.www2.hpe.com/hpsc/swd/public/detail?swItemId=MTX_85465f7232534baa823e784611#tab5

Good luck!

Please share your experience.
