VMware Cloud Community
mgolas
Contributor

High latency with CEPH iSCSI - Hitting vmhba timeouts

Hi,

Does anyone here use Ceph iSCSI with VMware ESXi? We seem to be hitting the 5-second timeout limit on the software iSCSI HBA in ESXi. It shows up whenever there is increased load on the cluster, such as a deep scrub or a rebalance. Is this normal behaviour in production, or is there something specific we need to tune?

We are on the latest Nautilus release, with 12 x 10 TB OSDs across 4 servers, 25 Gbit/s Ethernet, and an erasure-coded RBD pool with 128 PGs, around 200 PGs per OSD in total.

ESXi Log:

2020-10-04T01:57:04.314Z cpu34:2098959)WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic:517: vmhba64:CH:1 T:0 CN:0: Failed to receive data: Connection closed by peer

2020-10-04T01:57:04.314Z cpu34:2098959)iscsi_vmk: iscsivmk_ConnRxNotifyFailure:1235: vmhba64:CH:1 T:0 CN:0: Connection rx notifying failure: Failed to Receive. State=Bound

2020-10-04T01:57:04.566Z cpu19:2098979)WARNING: iscsi_vmk: iscsivmk_StopConnection:741: vmhba64:CH:1 T:0 CN:0: iSCSI connection is being marked "OFFLINE" (Event:4)

2020-10-04T01:57:04.654Z cpu7:2097866)WARNING: VMW_SATP_ALUA: satp_alua_issueCommandOnPath:788: Probe cmd 0xa3 failed for path "vmhba64:C2:T0:L0" (0x5/0x20/0x0). Check if failover mode is still ALUA.

OSD Log:

[303088.450088] Did not receive response to NOPIN on CID: 0, failing connection for I_T Nexus iqn.1994-05.com.redhat:esxi1,i,0x00023d000002,iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw,t,0x01

[324926.694077] Did not receive response to NOPIN on CID: 0, failing connection for I_T Nexus iqn.1994-05.com.redhat:esxi2,i,0x00023d000001,iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw,t,0x01

[407067.404538] ABORT_TASK: Found referenced iSCSI task_tag: 5891

[407076.077175] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 5891

[411677.887690] ABORT_TASK: Found referenced iSCSI task_tag: 6722

[411683.297425] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 6722

[481459.755876] ABORT_TASK: Found referenced iSCSI task_tag: 7930

[481460.787968] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 7930
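To see how long each abort took on the gateway side, the kernel timestamps can be paired up per task tag. Below is a minimal Python sketch, assuming the log format shown above. The function name `abort_durations` and the `SAMPLE` constant are made up for illustration and are not part of any Ceph or ESXi tooling.

```python
import re
from typing import Dict

# Matches kernel log lines of the two ABORT_TASK forms seen in the log above.
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+ABORT_TASK:\s+"
    r"(?P<kind>Found referenced iSCSI task_tag"
    r"|Sending TMR_FUNCTION_COMPLETE for ref_tag):\s+"
    r"(?P<tag>\d+)"
)


def abort_durations(log_text: str) -> Dict[int, float]:
    """Return seconds between 'Found referenced' and 'TMR_FUNCTION_COMPLETE' per tag.

    Hypothetical helper: it only pairs lines by task tag and ignores
    everything else in the log.
    """
    started: Dict[int, float] = {}
    durations: Dict[int, float] = {}
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue
        ts = float(match.group("ts"))
        tag = int(match.group("tag"))
        if match.group("kind").startswith("Found"):
            started[tag] = ts
        elif tag in started:
            durations[tag] = ts - started.pop(tag)
    return durations


# The ABORT_TASK lines copied from the log above.
SAMPLE = """\
[407067.404538] ABORT_TASK: Found referenced iSCSI task_tag: 5891
[407076.077175] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 5891
[411677.887690] ABORT_TASK: Found referenced iSCSI task_tag: 6722
[411683.297425] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 6722
[481459.755876] ABORT_TASK: Found referenced iSCSI task_tag: 7930
[481460.787968] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 7930
"""

if __name__ == "__main__":
    for tag, seconds in abort_durations(SAMPLE).items():
        print(f"task_tag {tag}: abort completed after {seconds:.3f} s")
```

For the lines above, two of the three aborts take more than 5 seconds to complete (roughly 8.7 s and 5.4 s), which is at least consistent with the timeout mentioned earlier. Note that this measures only the time until the gateway reports the abort as complete, not the latency of the original I/O.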

Thanks a lot!

1 Reply
slfh
Contributor

Hello,

Have you found a solution?

I have seen this problem in an environment with an IP address conflict: two Ethernet adapters had the same IP.

Best Regards,

Frank
