Hi all
Hoping for some pointers on where to look, as we have been experiencing so many issues.
Environment is:
ESX 5.5 (latest version as of this week)
HP Blade ProLiant BL460c Gen8
Brocade FC switches
VNX5400 storage
Everything is at current patch level (HP Feb SSP, latest drivers, etc., from the latest issue of the VMware/HP Recipe Book)
On a nightly basis we see messages like the following:
Lost access to volume 52e00703-7702b882-d845-0017a4779402 (DataStore) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
07/05/2014 07:50:16
Successfully restored access to volume 52e0072b-0adb69f0-f97b-0017a4779402 (DataStore) following connectivity issues.
From the vmkernel.log I can see these messages:
2014-05-07T02:44:14.860Z cpu0:33743)lpfc: lpfc_scsi_cmd_iocb_cmpl:2157: 0:(0):3271: FCP cmd x2a failed <2/4> sid x0d0505, did x0d0100, oxid xd4 Abort Requested Host Abort Req
2014-05-07T02:44:14.860Z cpu0:32813)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x2a (0x412e80450900, 32805) to dev "naa.60060160fd503600ea66bd8acb82e311" on path "vmhba0:C0:T2:L4" Failed: H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL
2014-05-07T02:44:14.860Z cpu0:32813)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.60060160fd503600ea66bd8acb82e311" state in doubt; requested fast path state update...
2014-05-07T02:44:14.860Z cpu0:32813)ScsiDeviceIO: 2337: Cmd(0x412e80450900) 0x2a, CmdSN 0x299 from world 32805 to dev "naa.60060160fd503600ea66bd8acb82e311" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-05-07T02:44:15.859Z cpu0:32881)lpfc: lpfc_scsi_cmd_iocb_cmpl:2157: 0:(0):3271: FCP cmd x2a failed <2/4> sid x0d0505, did x0d0100, oxid xe7 Abort Requested Host Abort Req
2014-05-07T02:44:15.860Z cpu0:32813)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.60060160fd503600ea66bd8acb82e311" state in doubt; requested fast path state update...
2014-05-07T02:44:15.860Z cpu0:32813)ScsiDeviceIO: 2337: Cmd(0x413682f6e580) 0x2a, CmdSN 0x29a from world 32805 to dev "naa.60060160fd503600ea66bd8acb82e311" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-05-07T02:44:20.025Z cpu4:36428)World: 14296: VC opID 4CAEF736-0000009D-0-fb maps to vmkernel opID 4f50e1ac
2014-05-07T02:44:20.397Z cpu1:34426)Fil3: 15408: Max retries (10) exceeded for caller Fil3_FileIO (status 'IO was aborted by VMFS via a virt-reset on the device')
2014-05-07T02:44:20.397Z cpu1:34426)BC: 2288: Failed to write (uncached) object '.iormstats.sf': Maximum kernel-level retries exceeded
2014-05-07T02:44:23.013Z cpu8:34236)HBX: 2692: Waiting for timed out [HB state abcdef02 offset 4161536 gen 179 stampUS 1629072438 uuid 536997d8-04a1ea9d-6c9a-0017a4779402 jrnl <FB 2656233> drv 14.60] on vol 'uk1-san01:Production DataStore 3'
2014-05-07T02:44:29.645Z cpu4:33106)lpfc: lpfc_scsi_cmd_iocb_cmpl:2157: 0:(0):3271: FCP cmd xa3 failed <3/4> sid x0d0505, did x0d0000, oxid xeb Abort Requested Host Abort Req
2014-05-07T02:44:29.645Z cpu20:33044)VMW_SATP_ALUA: satp_alua_issueCommandOnPath:651: Path "vmhba0:C0:T3:L4" (UP) command 0xa3 failed with status Timeout. H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
2014-05-07T02:44:29.755Z cpu7:32857)HBX: 255: Reclaimed heartbeat for volume 52e0072b-0adb69f0-f97b-0017a4779402 (uk1-san01:Production DataStore 3): [Timeout] Offset 4161536
2014-05-07T02:44:29.755Z cpu7:32857)[HB state abcdef02 offset 4161536 gen 179 stampUS 1632947371 uuid 536997d8-04a1ea9d-6c9a-0017a4779402 jrnl <FB 2656233> drv 14.60]
2014-05-07T02:44:31.808Z cpu0:32807)lpfc: lpfc_scsi_cmd_iocb_cmpl:2157: 0:(0):3271: FCP cmd x2a failed <2/4> sid x0d0505, did x0d0100, oxid x112 Abort Requested Host Abort Req
2014-05-07T02:44:31.808Z cpu0:32813)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.60060160fd503600ea66bd8acb82e311" state in doubt; requested fast path state update...
2014-05-07T02:44:31.808Z cpu0:32813)ScsiDeviceIO: 2337: Cmd(0x412e8349b000) 0x2a, CmdSN 0x2ab from world 32805 to dev "naa.60060160fd503600ea66bd8acb82e311" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-05-07T02:44:32.407Z cpu1:34426)Fil3: 15408: Max retries (10) exceeded for caller Fil3_FileIO (status 'IO was aborted by VMFS via a virt-reset on the device')
2014-05-07T02:44:32.407Z cpu1:34426)BC: 2288: Failed to write (uncached) object '.iormstats.sf': Maximum kernel-level retries exceeded
2014-05-07T02:44:33.541Z cpu16:32855)HBX: 255: Reclaimed heartbeat for volume 52e0072b-0adb69f0-f97b-0017a4779402 (uk1-san01:Production DataStore 3): [Timeout] Offset 4161536
2014-05-07T02:44:33.541Z cpu16:32855)[HB state abcdef02 offset 4161536 gen 179 stampUS 1646119268 uuid 536997d8-04a1ea9d-6c9a-0017a4779402 jrnl <FB 2656233> drv 14.60]
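For what it's worth, in those lines H:0x5 is the host status (the HBA/driver aborted the command) while D:0x0 means the device itself returned no error, which tends to point at the path/fabric/array front end rather than the LUN media. A quick way to summarise which devices are failing, and with which host status, is plain grep/sort/uniq over a saved copy of the log (the ./vmkernel.log path is just illustrative):

```shell
# Summarise ScsiDeviceIO failures per device and host status.
# Assumes a saved copy of the log at ./vmkernel.log (path is illustrative).
grep 'ScsiDeviceIO' vmkernel.log \
  | grep -o 'dev "naa\.[0-9a-f]\+" failed H:0x[0-9a-f]\+' \
  | sort | uniq -c | sort -rn
```

If one device dominates the output, that's the LUN to chase on the array side.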
Any help much appreciated.
Thanks
Note: we quite frequently get hosts "disconnecting" from vCenter, VMs getting orphaned, VMs becoming unresponsive, etc.
Thanks
Hello,
Is the storage side working properly? Is the zoning correct? Are the switch ports healthy? Does anything appear in the logs of the SAN switches or the storage array?
I think it's looking fine - I don't think we see any errors of any kind on the fabric.
Note:
Have a look at these KBs:
VMware KB: Troubleshooting fibre channel storage connectivity
VMware KB: Host Connectivity Degraded in ESX/ESXi
VMware KB: Troubleshooting LUN connectivity issues on ESXi/ESX hosts
Hi Matzilla,
What was the cause (and fix) of your issue?
The problem was a combination of factors. Specifically, thin LUNs on an EMC VNX with Windows 2012 R2. It was down to the way 2012 R2 issues TRIM and UNMAP commands to the array. It basically blows up the SP and has a knock-on effect across all LUNs.
The thing in our case was that we have a whole HP chassis of blades, mostly running ESX, with one running Windows 2012 R2... and we essentially had a single physical 2012 R2 machine taking out the entire array.
Defrag on 2012 R2 exacerbated the issue and was a way to spot the offending LUN (seeing massive IO on the page-file LUN during a defrag operation, for example).
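For anyone hitting the same thing: the usual stopgap on the Windows side is to stop 2012 R2 from sending automatic TRIM/UNMAP to the array (the DisableDeleteNotify behaviour flag) until the array firmware copes with it better. These are the standard fsutil commands, run from an elevated prompt on the Windows box itself; treat this as a sketch and test the impact in your own environment first:

```shell
:: Check whether TRIM/UNMAP delete notifications are currently on
:: (DisableDeleteNotify = 0 means TRIM is enabled, 1 means disabled).
fsutil behavior query DisableDeleteNotify

:: Stop Windows from sending TRIM/UNMAP to the array as a stopgap.
fsutil behavior set DisableDeleteNotify 1
```

Remember this trades UNMAP's space reclamation on thin LUNs for stability, so revisit it once the array side is fixed.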