VMware Cloud Community
szszsz1
Contributor
Contributor

iSCSI connection goes down for long time while simulating failure

Hello Vmware experts,

I have trouble in our Vcenter environment that cannot be solved by me or by external consultant. Few words about physical layout: Our storage is 3PAR 8400 - 2 controller model - each controller connected by 10Gbit fiber, iSCSI network is done by 2 Cisco Nexus 3524 switches that run in virtual-port-channel mode and there are 3 ESX servers (only one is on the picture though). These servers have Emulex OCe 14000 iSCSI CNA network card to access this iSCSI network. This card has 2 ports and these 2 ports are shown as vmhba2 and vmhba64 in vSphere, respectively. Since each port (i.e. each hba adapter) has only 2 of 4 addresses of 3PAR reachable (remember, through black cables don't flow any iSCSI related data, it's only for switch syncing), there are 2 iSCSI targets per HBA adapter. There are no orphan ports in the network (i.e. traffic does not cross vPC peer-link - black lines between switches). The ports in one Emulex card, both have different IQN and 3PAR maps 2 physical ports to 1 IQN and 2 physical ports to the other IQN (shown below the topology picture). The whole situation is depicted below.

Forum - vmware.png

ports.PNG

When I simulate a failure of a controller by shutting down ports on the Nexus switches in a way that one controller stays alive (e.g. I shut down port 1/2 on both switches), the virtual machine writing to the storage cannot write for like 40 seconds - the addresses of controller that didn't go down are pingable without ping loss regardless of the other controller that went offline. When I do the same on 3PAR (i.e. shutting 3PAR's ports in GUI), the virtual machine continunes to write flawlessly to the storage. This statement has been said based on 3PARs utilization report and observation of traffic load on the ports that connect the surviving controller. The application used to test it on virtual machine's side was diskspd.exe. Below this post, there are 2 excerpts from log of the ESX host - the longer is when ports were shut down on the Nexus (i.e. 40 seconds of outage, VM could not write to the storage) and the other (shorter) when ports were shut down on the 3PAR (no outage or delay in writing the data to the 3PAR from the VM).

In VMWare, I have tried multiple configurations all of them behaving basically in the same way (when shutting down the Nexus ports connecting 1 controller, VM cannot write for like 40 seconds). I tried to have IP addresses configured only on vmhba adapters without any vSwitch or vmKernel port, I tried to have 2 vmKernels and 2 vmSwitches  - e.g. vSwitch4 and vSwitch5 and vmk6 and vmk7. Each vSwitch was related to one vmnic and vmk port. None of that changed anything.

The symptom of the problem is that VMware detects that paths affected by Nexus ports outage are down(dead) after long time - roughly the same 40 seconds as is the duration of the outage during which VM cannot write to storage.

And now the promised logs from ESX host during the outages:

1st one - the one 40 seconds long - ports being shut down on Nexus:

2018-06-07T14:35:48.395Z cpu35:78273)HBX: 2958: '3PAR_DS_FC_2': HB at offset 3735552 - Waiting for timed out HB:

2018-06-07T14:35:48.395Z cpu35:78273)  [HB state abcdef02 offset 3735552 gen 277 stampUS 17971405120 uuid 5b18fccc-cd0a770c-0ff5-f4e9d4facd70 jrnl <FB 3> drv 24.81 lockImpl 4 ip 10.17.26.2]

2018-06-07T14:35:55.281Z cpu43:83809)WARNING: [1-0:1:00.3] elxiscsi_WaitforCompletion:2770 abort: cmd timed out

2018-06-07T14:35:55.285Z cpu48:66608)NMP: nmp_ThrottleLogForDevice:3647: Cmd 0x88 (0x439d0db16900, 78275) to dev "naa.60002ac0000000000000000300020583" on path "vmhba64:C0:T3:L0" Failed: H:0x1 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:FAILOVER

2018-06-07T14:35:55.285Z cpu48:66608)WARNING: NMP: nmp_DeviceRetryCommand:133: Device "naa.60002ac0000000000000000300020583": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.

2018-06-07T14:35:55.285Z cpu48:66608)NMP: nmp_ThrottleLogForDevice:3576: last error status from device naa.60002ac0000000000000000300020583 repeated 10 times

2018-06-07T14:35:55.288Z cpu50:66222)[1-0:1:00.3] elxiscsi_ReopenSession:3368 sess 0x43066f251540:0x43066e191010 handle 0 target 3, rescan

2018-06-07T14:35:56.279Z cpu10:65678)WARNING: [0-0:1:00.2] elxiscsi_WaitforCompletion:2770 abort: cmd timed out

2018-06-07T14:35:56.279Z cpu10:65678)ScsiDeviceIO: 2918: Cmd(0x439500f62d40) 0x89, CmdSN 0x22d4 from world 65599 to dev "naa.60002ac0000000000000000300020583" failed H:0x8 D:0x0 P:0x0

2018-06-07T14:35:56.285Z cpu7:66218)[0-0:1:00.2] elxiscsi_ReopenSession:3368 sess 0x43066f316710:0x43066e190490 handle 2 target 0, rescan

2018-06-07T14:35:58.400Z cpu35:78273)HBX: 2958: '3PAR_DS_FC_2': HB at offset 3735552 - Waiting for timed out HB:

2018-06-07T14:35:58.400Z cpu35:78273)  [HB state abcdef02 offset 3735552 gen 277 stampUS 17968405121 uuid 5b18fccc-cd0a770c-0ff5-f4e9d4facd70 jrnl <FB 3> drv 24.81 lockImpl 4 ip 10.17.26.2]

2018-06-07T14:35:58.774Z cpu22:83796)ScsiDeviceIO: 2968: Cmd(0x439d0a0d59c0) 0x2a, CmdSN 0x22d7 from world 65599 to dev "naa.60002ac0000000000000000300020583" failed H:0x5 D:0x0 P:0x0 Invalid sense data: 0x0 0x43 0x0.

2018-06-07T14:36:01.377Z cpu10:65678)ScsiDeviceIO: 2918: Cmd(0x439d0db16900) 0x88, CmdSN 0x8000002f from world 78275 to dev "naa.60002ac0000000000000000300020583" failed H:0x8 D:0x0 P:0x0

2018-06-07T14:36:01.377Z cpu10:65678)WARNING: NMP: nmp_DeviceStartLoop:728: NMP Device "naa.60002ac0000000000000000300020583" is blocked. Not starting I/O from device.

2018-06-07T14:36:01.377Z cpu10:65678)ScsiDeviceIO: 2918: Cmd(0x439d009944c0) 0x8a, CmdSN 0x80000017 from world 78275 to dev "naa.60002ac0000000000000000300020583" failed H:0x8 D:0x0 P:0x0

2018-06-07T14:36:01.377Z cpu10:65678)WARNING: NMP: nmp_DeviceStartLoop:728: NMP Device "naa.60002ac0000000000000000300020583" is blocked. Not starting I/O from device.

2018-06-07T14:36:01.377Z cpu10:65678)ScsiDeviceIO: 2918: Cmd(0x439d0a98d140) 0x88, CmdSN 0x80000018 from world 78275 to dev "naa.60002ac0000000000000000300020583" failed H:0x8 D:0x0 P:0x0

2018-06-07T14:36:01.377Z cpu10:65678)WARNING: NMP: nmp_DeviceStartLoop:728: NMP Device "naa.60002ac0000000000000000300020583" is blocked. Not starting I/O from device.

2018-06-07T14:36:01.377Z cpu10:65678)ScsiDeviceIO: 2918: Cmd(0x439d0a81cec0) 0x88, CmdSN 0x80000079 from world 78275 to dev "naa.60002ac0000000000000000300020583" failed H:0x8 D:0x0 P:0x0

2018-06-07T14:36:01.377Z cpu10:65678)WARNING: NMP: nmp_DeviceStartLoop:728: NMP Device "naa.60002ac0000000000000000300020583" is blocked. Not starting I/O from device.

2018-06-07T14:36:01.377Z cpu10:65678)ScsiDeviceIO: 2918: Cmd(0x439d00848ac0) 0x88, CmdSN 0x8000004a from world 78275 to dev "naa.60002ac0000000000000000300020583" failed H:0x8 D:0x0 P:0x0

2018-06-07T14:36:01.377Z cpu10:65678)WARNING: NMP: nmp_DeviceStartLoop:728: NMP Device "naa.60002ac0000000000000000300020583" is blocked. Not starting I/O from device.

2018-06-07T14:36:01.377Z cpu10:65678)WARNING: NMP: nmp_DeviceStartLoop:728: NMP Device "naa.60002ac0000000000000000300020583" is blocked. Not starting I/O from device.

2018-06-07T14:36:01.377Z cpu10:65678)WARNING: NMP: nmp_DeviceStartLoop:728: NMP Device "naa.60002ac0000000000000000300020583" is blocked. Not starting I/O from device.

2018-06-07T14:36:01.377Z cpu10:65678)WARNING: NMP: nmp_DeviceStartLoop:728: NMP Device "naa.60002ac0000000000000000300020583" is blocked. Not starting I/O from device.

2018-06-07T14:36:01.377Z cpu10:65678)WARNING: NMP: nmp_DeviceStartLoop:728: NMP Device "naa.60002ac0000000000000000300020583" is blocked. Not starting I/O from device.

2018-06-07T14:36:01.377Z cpu10:65678)WARNING: NMP: nmp_DeviceStartLoop:728: NMP Device "naa.60002ac0000000000000000300020583" is blocked. Not starting I/O from device.

2018-06-07T14:36:01.377Z cpu10:65678)WARNING: NMP: nmp_DeviceStartLoop:728: NMP Device "naa.60002ac0000000000000000300020583" is blocked. Not starting I/O from device.

2018-06-07T14:36:01.377Z cpu10:65678)WARNING: NMP: nmp_DeviceStartLoop:728: NMP Device "naa.60002ac0000000000000000300020583" is blocked. Not starting I/O from device.

2018-06-07T14:36:01.377Z cpu10:65678)WARNING: NMP: nmp_DeviceStartLoop:728: NMP Device "naa.60002ac0000000000000000300020583" is blocked. Not starting I/O from device.

2018-06-07T14:36:01.377Z cpu10:65678)WARNING: NMP: nmp_DeviceStartLoop:728: NMP Device "naa.60002ac0000000000000000300020583" is blocked. Not starting I/O from device.

2018-06-07T14:36:01.377Z cpu10:65678)WARNING: NMP: nmp_DeviceStartLoop:728: NMP Device "naa.60002ac0000000000000000300020583" is blocked. Not starting I/O from device.

2018-06-07T14:36:01.377Z cpu10:65678)WARNING: NMP: nmp_DeviceStartLoop:728: NMP Device "naa.60002ac0000000000000000300020583" is blocked. Not starting I/O from device.

2018-06-07T14:36:01.377Z cpu10:65678)WARNING: NMP: nmp_DeviceStartLoop:728: NMP Device "naa.60002ac0000000000000000300020583" is blocked. Not starting I/O from device.

2018-06-07T14:36:01.377Z cpu10:65678)WARNING: NMP: nmp_DeviceStartLoop:728: NMP Device "naa.60002ac0000000000000000300020583" is blocked. Not starting I/O from device.

2018-06-07T14:36:01.377Z cpu10:65678)WARNING: NMP: nmp_DeviceStartLoop:728: NMP Device "naa.60002ac0000000000000000300020583" is blocked. Not starting I/O from device.

2018-06-07T14:36:01.377Z cpu10:65678)WARNING: NMP: nmp_DeviceStartLoop:728: NMP Device "naa.60002ac0000000000000000300020583" is blocked. Not starting I/O from device.

2018-06-07T14:36:01.377Z cpu10:65678)WARNING: NMP: nmp_DeviceStartLoop:728: NMP Device "naa.60002ac0000000000000000300020583" is blocked. Not starting I/O from device.

2018-06-07T14:36:01.377Z cpu10:65678)WARNING: NMP: nmp_DeviceStartLoop:728: NMP Device "naa.60002ac0000000000000000300020583" is blocked. Not starting I/O from device.

2018-06-07T14:36:01.377Z cpu10:65678)WARNING: NMP: nmp_DeviceStartLoop:728: NMP Device "naa.60002ac0000000000000000300020583" is blocked. Not starting I/O from device.

2018-06-07T14:36:01.377Z cpu10:65678)WARNING: NMP: nmp_DeviceStartLoop:728: NMP Device "naa.60002ac0000000000000000300020583" is blocked. Not starting I/O from device.

2018-06-07T14:36:01.377Z cpu10:65678)WARNING: NMP: nmp_DeviceStartLoop:728: NMP Device "naa.60002ac0000000000000000300020583" is blocked. Not starting I/O from device.

2018-06-07T14:36:01.377Z cpu10:65678)WARNING: NMP: nmp_DeviceStartLoop:728: NMP Device "naa.60002ac0000000000000000300020583" is blocked. Not starting I/O from device.

2018-06-07T14:36:01.377Z cpu10:65678)WARNING: NMP: nmp_DeviceStartLoop:728: NMP Device "naa.60002ac0000000000000000300020583" is blocked. Not starting I/O from device.

2018-06-07T14:36:01.378Z cpu10:65678)WARNING: NMP: nmp_DeviceStartLoop:728: NMP Device "naa.60002ac0000000000000000300020583" is blocked. Not starting I/O from device.

2018-06-07T14:36:05.875Z cpu39:65864)HBX: 2958: '3PAR_DS_FC_2': HB at offset 3735552 - Waiting for timed out HB:

2018-06-07T14:36:05.875Z cpu39:65864)  [HB state abcdef02 offset 3735552 gen 277 stampUS 17968405121 uuid 5b18fccc-cd0a770c-0ff5-f4e9d4facd70 jrnl <FB 3> drv 24.81 lockImpl 4 ip 10.17.26.2]

2018-06-07T14:36:08.405Z cpu35:78273)HBX: 2958: '3PAR_DS_FC_2': HB at offset 3735552 - Waiting for timed out HB:

2018-06-07T14:36:08.405Z cpu35:78273)  [HB state abcdef02 offset 3735552 gen 277 stampUS 17968405121 uuid 5b18fccc-cd0a770c-0ff5-f4e9d4facd70 jrnl <FB 3> drv 24.81 lockImpl 4 ip 10.17.26.2]

2018-06-07T14:36:15.293Z cpu48:83807)WARNING: [0-0:1:00.2] elxiscsi_WaitforCompletion:2770 abort: cmd timed out

2018-06-07T14:36:15.294Z cpu48:83807)WARNING: ScsiPath: 8221: Set retry timeout for failed TaskMgmt abort for CmdSN  0x0, status Failure, path vmhba2:C0:T2:L50

2018-06-07T14:36:15.296Z cpu29:66394)WARNING: NMP: nmpDeviceAttemptFailover:603: Retry world restore device "naa.60002ac0000000000000000300020583" - no more commands to retry

2018-06-07T14:36:15.298Z cpu7:66218)[0-0:1:00.2] elxiscsi_ReopenSession:3368 sess 0x43066f305380:0x43066e196ac0 handle 4 target 2, rescan

2018-06-07T14:36:15.304Z cpu10:65678)HBX: 283: '3PAR_DS_FC_2': HB at offset 3735552 - Reclaimed heartbeat [Timeout]:

2018-06-07T14:36:15.304Z cpu10:65678)  [HB state abcdef02 offset 3735552 gen 277 stampUS 18009431835 uuid 5b18fccc-cd0a770c-0ff5-f4e9d4facd70 jrnl <FB 3> drv 24.81 lockImpl 4 ip 10.17.26.2]

2018-06-07T14:36:15.526Z cpu8:66607)NMP: nmp_ThrottleLogForDevice:3593: last error status from device naa.60002ac0000000000000000300020583 repeated 18 times

2018-06-07T14:36:15.526Z cpu8:66607)NMP: nmp_ThrottleLogForDevice:3647: Cmd 0x89 (0x439512a33700, 84406) to dev "naa.60002ac0000000000000000300020583" on path "vmhba64:C0:T4:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0. Act:NONE

2018-06-07T14:36:15.526Z cpu8:66607)ScsiDeviceIO: 2968: Cmd(0x439512a33700) 0x89, CmdSN 0x22df from world 84406 to dev "naa.60002ac0000000000000000300020583" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0.

2018-06-07T14:36:15.526Z cpu48:66608)ScsiDeviceIO: 2968: Cmd(0x439d00840240) 0x89, CmdSN 0x22e0 from world 84407 to dev "naa.60002ac0000000000000000300020583" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0.

2018-06-07T14:36:15.527Z cpu8:66607)ScsiDeviceIO: 2968: Cmd(0x439500ee0e40) 0x89, CmdSN 0x22e1 from world 84408 to dev "naa.60002ac0000000000000000300020583" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0.

2018-06-07T14:36:15.528Z cpu6:66605)ScsiDeviceIO: 2968: Cmd(0x439512a38080) 0x89, CmdSN 0x22e3 from world 84408 to dev "naa.60002ac0000000000000000300020583" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0.

2018-06-07T14:36:15.528Z cpu6:66605)ScsiDeviceIO: 2968: Cmd(0x439512abb480) 0x89, CmdSN 0x22e4 from world 84406 to dev "naa.60002ac0000000000000000300020583" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0.

2018-06-07T14:36:15.529Z cpu8:66607)ScsiDeviceIO: 2968: Cmd(0x439500f51540) 0x89, CmdSN 0x22e6 from world 84406 to dev "naa.60002ac0000000000000000300020583" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0.

2018-06-07T14:37:05.871Z cpu41:66393)NMP: nmp_ResetDeviceLogThrottling:3445: last error status from device naa.60002ac0000000000000000300020583 repeated 5 times

2nd one - the short one when VM does not observe anything:

2018-06-07T14:46:45.284Z cpu47:66606)NMP: nmp_ThrottleLogForDevice:3647: Cmd 0x88 (0x439d0a011640, 78275) to dev "naa.60002ac0000000000000000300020583" on path "vmhba2:C0:T2:L0" Failed: H:0x1 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:FAILOVER

2018-06-07T14:46:45.284Z cpu47:66606)WARNING: NMP: nmp_DeviceRetryCommand:133: Device "naa.60002ac0000000000000000300020583": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.

2018-06-07T14:46:45.288Z cpu7:66218)[0-0:1:00.2] elxiscsi_ReopenSession:3368 sess 0x43066f305380:0x43066e196ac0 handle 4 target 2, rescan

2018-06-07T14:46:45.292Z cpu47:66606)NMP: nmp_ThrottleLogForDevice:3576: last error status from device naa.60002ac0000000000000000300020583 repeated 10 times

2018-06-07T14:46:45.295Z cpu7:66218)[0-0:1:00.2] elxiscsi_ReopenSession:3368 sess 0x43066f316710:0x43066e190490 handle 2 target 0, rescan

2018-06-07T14:46:45.876Z cpu39:66394)WARNING: NMP: nmpDeviceAttemptFailover:640: Retry world failover device "naa.60002ac0000000000000000300020583" - issuing command 0x439d0a011640

2018-06-07T14:46:45.876Z cpu47:66606)NMP: nmpCompleteRetryForPath:327: Retry world recovered device "naa.60002ac0000000000000000300020583"

2018-06-07T14:46:53.096Z cpu50:66608)WARNING: NMP: nmp_DeviceRetryCommand:133: Device "naa.60002ac0000000000000000300020583": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.

2018-06-07T14:46:53.096Z cpu50:66608)NMP: nmp_ThrottleLogForDevice:3576: last error status from device naa.60002ac0000000000000000300020583 repeated 20 times

2018-06-07T14:46:53.101Z cpu50:66222)[1-0:1:00.3] elxiscsi_ReopenSession:3368 sess 0x43066f251540:0x43066e191010 handle 0 target 3, rescan

2018-06-07T14:46:53.876Z cpu39:66394)WARNING: NMP: nmpDeviceAttemptFailover:640: Retry world failover device "naa.60002ac0000000000000000300020583" - issuing command 0x439d0a928ac0

2018-06-07T14:46:53.876Z cpu28:66606)NMP: nmpCompleteRetryForPath:327: Retry world recovered device "naa.60002ac0000000000000000300020583"

Thank you very much for the effort of helping me in advance, take care.

S.

0 Kudos
0 Replies