does anyone here use CEPH iSCSI with VMware ESXi? It seems that we are hitting the 5 second timeout limit on software HBA in ESXi. It appears whenever there is increased load on the cluster, like deep scrub or rebalance. Is it normal behaviour in production? Or is there something special we need to tune?
We are on latest Nautilus, 12 x 10 TB OSDs (4 servers), 25 Gbit/s Ethernet, erasure coded rbd pool with 128 PGs, aroun 200 PGs per OSD total.
2020-10-04T01:57:04.314Z cpu34:2098959)WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic:517: vmhba64:CH:1 T:0 CN:0: Failed to receive data: Connection closed by peer
2020-10-04T01:57:04.314Z cpu34:2098959)iscsi_vmk: iscsivmk_ConnRxNotifyFailure:1235: vmhba64:CH:1 T:0 CN:0: Connection rx notifying failure: Failed to Receive. State=Bound
2020-10-04T01:57:04.566Z cpu19:2098979)WARNING: iscsi_vmk: iscsivmk_StopConnection:741: vmhba64:CH:1 T:0 CN:0: iSCSI connection is being marked "OFFLINE" (Event:4)
2020-10-04T01:57:04.654Z cpu7:2097866)WARNING: VMW_SATP_ALUA: satp_alua_issueCommandOnPath:788: Probe cmd 0xa3 failed for path "vmhba64:C2:T0:L0" (0x5/0x20/0x0). Check if failover mode is still ALUA.
[324926.694077] Did not receive response to NOPIN on CID: 0, failing connection for I_T Nexus iqn.1994-05.com.redhat:esxi2,i,0x00023d000001,iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw,t,0x01
[407067.404538] ABORT_TASK: Found referenced iSCSI task_tag: 5891
[407076.077175] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 5891
[411677.887690] ABORT_TASK: Found referenced iSCSI task_tag: 6722
[411683.297425] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 6722
[481459.755876] ABORT_TASK: Found referenced iSCSI task_tag: 7930
[481460.787968] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 7930
Thanks a lot!
do you have found a solution?
I have seen this problem in an environment with an IP-Address conflict. Two Ethernet adapters had the same IP.