Hi all
Hoping for some pointers on where to look, as we have been experiencing so many issues.
Environment is:
ESX 5.5 (latest version as of this week)
HP Blade ProLiant BL460c Gen8
Brocade FC switches
VNX5400 storage
Everything is at current patch level (HP Feb SSP, latest drivers, etc., from the latest issue of the VMware/HP Recipe Book)
On a nightly basis we see messages like the following:
Lost access to volume 52e00703-7702b882-d845-0017a4779402 (DataStore) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
07/05/2014 07:50:16
Successfully restored access to volume 52e0072b-0adb69f0-f97b-0017a4779402 (DataStore) following connectivity issues.
From the vmkernel.log I can see these messages:
2014-05-07T02:44:14.860Z cpu0:33743)lpfc: lpfc_scsi_cmd_iocb_cmpl:2157: 0:(0):3271: FCP cmd x2a failed <2/4> sid x0d0505, did x0d0100, oxid xd4 Abort Requested Host Abort Req
2014-05-07T02:44:14.860Z cpu0:32813)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x2a (0x412e80450900, 32805) to dev "naa.60060160fd503600ea66bd8acb82e311" on path "vmhba0:C0:T2:L4" Failed: H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL
2014-05-07T02:44:14.860Z cpu0:32813)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.60060160fd503600ea66bd8acb82e311" state in doubt; requested fast path state update...
2014-05-07T02:44:14.860Z cpu0:32813)ScsiDeviceIO: 2337: Cmd(0x412e80450900) 0x2a, CmdSN 0x299 from world 32805 to dev "naa.60060160fd503600ea66bd8acb82e311" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-05-07T02:44:15.859Z cpu0:32881)lpfc: lpfc_scsi_cmd_iocb_cmpl:2157: 0:(0):3271: FCP cmd x2a failed <2/4> sid x0d0505, did x0d0100, oxid xe7 Abort Requested Host Abort Req
2014-05-07T02:44:15.860Z cpu0:32813)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.60060160fd503600ea66bd8acb82e311" state in doubt; requested fast path state update...
2014-05-07T02:44:15.860Z cpu0:32813)ScsiDeviceIO: 2337: Cmd(0x413682f6e580) 0x2a, CmdSN 0x29a from world 32805 to dev "naa.60060160fd503600ea66bd8acb82e311" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-05-07T02:44:20.025Z cpu4:36428)World: 14296: VC opID 4CAEF736-0000009D-0-fb maps to vmkernel opID 4f50e1ac
2014-05-07T02:44:20.397Z cpu1:34426)Fil3: 15408: Max retries (10) exceeded for caller Fil3_FileIO (status 'IO was aborted by VMFS via a virt-reset on the device')
2014-05-07T02:44:20.397Z cpu1:34426)BC: 2288: Failed to write (uncached) object '.iormstats.sf': Maximum kernel-level retries exceeded
2014-05-07T02:44:23.013Z cpu8:34236)HBX: 2692: Waiting for timed out [HB state abcdef02 offset 4161536 gen 179 stampUS 1629072438 uuid 536997d8-04a1ea9d-6c9a-0017a4779402 jrnl <FB 2656233> drv 14.60] on vol 'uk1-san01:Production DataStore 3'
2014-05-07T02:44:29.645Z cpu4:33106)lpfc: lpfc_scsi_cmd_iocb_cmpl:2157: 0:(0):3271: FCP cmd xa3 failed <3/4> sid x0d0505, did x0d0000, oxid xeb Abort Requested Host Abort Req
2014-05-07T02:44:29.645Z cpu20:33044)VMW_SATP_ALUA: satp_alua_issueCommandOnPath:651: Path "vmhba0:C0:T3:L4" (UP) command 0xa3 failed with status Timeout. H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
2014-05-07T02:44:29.755Z cpu7:32857)HBX: 255: Reclaimed heartbeat for volume 52e0072b-0adb69f0-f97b-0017a4779402 (uk1-san01:Production DataStore 3): [Timeout] Offset 4161536
2014-05-07T02:44:29.755Z cpu7:32857)[HB state abcdef02 offset 4161536 gen 179 stampUS 1632947371 uuid 536997d8-04a1ea9d-6c9a-0017a4779402 jrnl <FB 2656233> drv 14.60]
2014-05-07T02:44:31.808Z cpu0:32807)lpfc: lpfc_scsi_cmd_iocb_cmpl:2157: 0:(0):3271: FCP cmd x2a failed <2/4> sid x0d0505, did x0d0100, oxid x112 Abort Requested Host Abort Req
2014-05-07T02:44:31.808Z cpu0:32813)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.60060160fd503600ea66bd8acb82e311" state in doubt; requested fast path state update...
2014-05-07T02:44:31.808Z cpu0:32813)ScsiDeviceIO: 2337: Cmd(0x412e8349b000) 0x2a, CmdSN 0x2ab from world 32805 to dev "naa.60060160fd503600ea66bd8acb82e311" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-05-07T02:44:32.407Z cpu1:34426)Fil3: 15408: Max retries (10) exceeded for caller Fil3_FileIO (status 'IO was aborted by VMFS via a virt-reset on the device')
2014-05-07T02:44:32.407Z cpu1:34426)BC: 2288: Failed to write (uncached) object '.iormstats.sf': Maximum kernel-level retries exceeded
2014-05-07T02:44:33.541Z cpu16:32855)HBX: 255: Reclaimed heartbeat for volume 52e0072b-0adb69f0-f97b-0017a4779402 (uk1-san01:Production DataStore 3): [Timeout] Offset 4161536
2014-05-07T02:44:33.541Z cpu16:32855)[HB state abcdef02 offset 4161536 gen 179 stampUS 1646119268 uuid 536997d8-04a1ea9d-6c9a-0017a4779402 jrnl <FB 2656233> drv 14.60]
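For what it's worth, in those lines H:0x5 is the host status (the HBA/driver aborted the command) while D:0x0 means the device itself returned no error, which tends to point at the path/fabric/array front end rather than the LUN media. A quick way to summarise which devices are failing, and with which host status, is plain grep/sort/uniq over a saved copy of the log (the ./vmkernel.log path is just illustrative):

```shell
# Summarise ScsiDeviceIO failures per device and host status.
# Assumes a saved copy of the log at ./vmkernel.log (path is illustrative).
grep 'ScsiDeviceIO' vmkernel.log \
  | grep -o 'dev "naa\.[0-9a-f]\+" failed H:0x[0-9a-f]\+' \
  | sort | uniq -c | sort -rn
```

If one device dominates the output, that's the LUN to chase on the array side.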
Any help much appreciated.
Thanks
Note: we quite frequently get hosts "disconnecting" from vCenter, VMs getting orphaned, VMs becoming unresponsive, etc.
Thanks
Hello,
Is the storage side working properly? Is the zoning correct? Are the switch ports healthy? Does anything appear in the logs of the SAN switches or the storage array?
I think it's looking fine - I don't think we see any errors of any kind on the fabric.
Note:
Have a look at these KBs:
VMware KB: Troubleshooting fibre channel storage connectivity
VMware KB: Host Connectivity Degraded in ESX/ESXi
VMware KB: Troubleshooting LUN connectivity issues on ESXi/ESX hosts
Hi Matzilla,
What was the cause (and fix) of your issue?
The problem was a combination of factors. Specifically, thin LUNs on an EMC VNX with Windows 2012 R2. It was down to the way 2012 R2 issues TRIM and UNMAP commands to the array. It basically blows up the SP and has a knock-on effect across all LUNs.
The thing in our case was that we have a whole HP chassis of blades, mostly running ESX, with one running Windows 2012 R2... and we essentially had a single physical 2012 R2 machine taking out the entire array.
Defrag on 2012 R2 exacerbated the issue and was a way to spot the offending LUN (seeing massive IO on the page-file LUN during a defrag operation, for example).
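For anyone hitting the same thing: the usual stopgap on the Windows side is to stop 2012 R2 from sending automatic TRIM/UNMAP to the array (the DisableDeleteNotify behaviour flag) until the array firmware copes with it better. These are the standard fsutil commands, run from an elevated prompt on the Windows box itself; treat this as a sketch and test the impact in your own environment first:

```shell
:: Check whether TRIM/UNMAP delete notifications are currently on
:: (DisableDeleteNotify = 0 means TRIM is enabled, 1 means disabled).
fsutil behavior query DisableDeleteNotify

:: Stop Windows from sending TRIM/UNMAP to the array as a stopgap.
fsutil behavior set DisableDeleteNotify 1
```

Remember this trades UNMAP's space reclamation on thin LUNs for stability, so revisit it once the array side is fixed.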