VMware Cloud Community
lzoli
Contributor

Mellanox ConnectX-3 Pro strange random network disruption with vSAN

Hello there dear community,

I'm seeing odd behaviour with the card in the title, and I'd like to ask whether anyone has had the same experience or has an idea what might cause it.

Description of the setup:

The card is a Mellanox ConnectX-3 Pro 40G card, jumbo frames enabled. The server is an Intel R2208WFTZS.

The card's firmware is the latest stable version from Mellanox.

The vSAN vmk interface is connected to one of the ports of the 40G card, with jumbo frames enabled.

I'm using a non-customised ESXi 6.7 image with the nmlx4_en native driver, with all the latest patches installed.
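For reference, this is roughly how I check the driver, firmware and MTU on the uplink and the vSAN vmkernel interface (vmnic2 and vmk2 here are just examples, substitute your own names):

esxcli network nic get -n vmnic2      # driver name, driver version and firmware version of the uplink
esxcli network nic list               # link state, speed and MTU of every uplink
esxcli network ip interface list      # vmk interfaces with their MTU and port group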

Problem:

The setup works well; however, occasionally and seemingly at random, the vmk interface can't reach the other vSAN nodes in the same cluster and can't ping anyone.
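When it happens, I test reachability from the vSAN vmkernel interface itself, with and without a full-size jumbo frame (vmk2 and 10.0.0.12 below are placeholders for the local vSAN vmk and a neighbour's vSAN IP):

vmkping -I vmk2 10.0.0.12               # basic reachability over the vSAN vmk
vmkping -I vmk2 -d -s 8972 10.0.0.12    # don't-fragment, 8972-byte payload, verifies the jumbo frame path end to end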

When I ran a packet capture on the vmk interface, I could see the ARP requests from the other nodes arriving, and the interface responding to them, but the responses never made it out of the server. It looked similar to a unidirectional link.
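The capture can be repeated with pktcap-uw (the standard ESXi capture tool); filtering to ARP on both the vmk and the uplink makes the one-way behaviour obvious (interface names are examples):

pktcap-uw --vmk vmk2 --ethtype 0x0806 -o /tmp/vmk2-arp.pcap                      # ARP as seen by the vmk interface
pktcap-uw --uplink vmnic2 --dir 1 --ethtype 0x0806 -o /tmp/vmnic2-tx-arp.pcap    # ARP actually transmitted on the uplink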

If I reload the vmnic the vSAN traffic is meant to go through, everything returns to normal: traffic from the vSAN vmk starts to flow and the ARP responses go through.
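For anyone wondering, the uplink can be bounced from the CLI like this (vmnic2 is just an example name); that is enough to restore traffic until the next occurrence:

esxcli network nic down -n vmnic2
esxcli network nic up -n vmnic2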

Obviously the logs don't say much other than this interesting part:

2018-08-13T20:26:23.187Z cpu25:2097436)<NMLX_INF> nmlx4_en: vmnic2: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2294) MAC RX filter (class 1) at index 0 is removed from
2018-08-13T20:26:23.187Z cpu25:2097436)<NMLX_INF> nmlx4_en: vmnic2: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2301) RX ring 1, QP[0x49], Mac address 00:50:56:6c:23:xx
2018-08-13T20:26:23.187Z cpu25:2097436)<NMLX_INF> nmlx4_en: vmnic2: nmlx4_en_RxQFree - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:789) RX queue 1 is freed
2018-08-13T20:26:39.594Z cpu43:2098358)VLANMTUCheck: NMVCCheckUplinkResultDetail:880: Vlanmtuchk EVENT: DV port 408 uplink vmnic2MTU check RESULT_CHANGED
2018-08-13T20:26:39.594Z cpu43:2098358)VLANMTUCheck: NMVCCheckUplinkResultDetail:898: Vlanmtuchk EVENT: DV port 408 uplink vmnic2VLAN check RESULT_CHANGED

It seems like the vSAN RX queue is being destroyed? I'm not sure why that is happening, and right after that you can see the DVS MTU/VLAN check alarms come on.

Other than these lines, there are no other driver- or firmware-related log entries or crashes.

It has happened on two identical servers already, so I'm afraid it isn't specific to one server. I don't really know where to start or how to dig deeper.

Any ideas or suggestions?

Thank you very much!

Zoltan

13 Replies
lzoli
Contributor

The exact same thing happened on a Dell R640 as well with the same card.

Same error:

2018-08-15T08:06:35.593Z cpu56:2097436)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2294) MAC RX filter (class 1) at index 0 is removed from

2018-08-15T08:06:35.593Z cpu56:2097436)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2301) RX ring 1, QP[0x49], Mac address 00:50:56:6d:c9:xx

2018-08-15T08:06:35.593Z cpu56:2097436)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_RxQFree - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:789) RX queue 1 is freed

Any suggestions welcome!

JamesBOC
Contributor

I'm having the same problem: same messages in the logs, same card, same ESXi build, but I do not have the latest CVE patch applied yet.

I have 6 identical servers, all exhibiting the same problem. I installed the Mellanox tools to get the exact firmware info and to make sure it was up to date; it is.

vSAN is enabled.

However, I can add that for me it is not JUST vSAN traffic that stops; it is ALL traffic on a given port (yes, this includes VM traffic). Fortunately, so far it has only been one port at a time.

I am using a VDS that I previously had configured on all port groups to use link status as the failure detection method. I changed it to beacon probing and put one of my 1 Gb Intel onboard NICs in the standby uplinks section. This has masked the problem, but obviously not fixed it.

I am struggling with this; I'm opening a case with VMware tomorrow. Otherwise I'll throw some Intel X710s in there and call it a day. I purchased Mellanox for the RDMA features, as these nodes are compatible with either S2D or VMware.

The switch is a Juniper EX4600 with no DCBX enabled; stock CoS flow queues, with no additional CoS configured.

The media type is DAC cable. The card is a Mellanox ConnectX-3 Pro dual-port QSFP 40GbE.

flint output:

[root@vmserver3:/opt/mellanox/bin] ./flint_ext -d mt4103_pci_cr0 hw query

HW Info:

  HwDevId               503

  HwRevId               0x0

Flash Info:

  Type                  M25PXxx

  TotalSize             0x200000

  Banks                 0x1

  SectorSize            0x1000

  WriteBlockSize        0x10

  CmdSet                0x80

[root@vmserver3:/opt/mellanox/bin] ./flint_ext -d mt4103_pci_cr0 query

Image type:            FS2

FW Version:            2.42.5000

FW Release Date:       5.9.2017

Product Version:       02.42.50.00

Rom Info:              type=PXE version=3.4.752

Device ID:             4103

Description:           Node             Port1            Port2            Sys image

GUIDs:                 ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff

MACs:                                       ec0d9abXXXXX     ec0d9aXXXXXX

VSD:

PSID:                  MT_1060111023

Switch output from the affected NIC; notice the zero bps input rate, with no flaps and no errors.

Physical interface: et-3/2/1, Enabled, Physical link is Up

  Interface index: 786, SNMP ifIndex: 800, Generation: 280

  Description: S2D6-2

  Link-level type: Ethernet, MTU: 9216, MRU: 0, Speed: 40Gbps, BPDU Error: None, MAC-REWRITE Error: None, Loopback: Disabled, Source filtering: Disabled, Flow control: Disabled, Media type: Fiber

  Device flags   : Present Running

  Interface flags: SNMP-Traps Internal: 0x4000

  Link flags     : None

  CoS queues     : 12 supported, 12 maximum usable queues

  Hold-times     : Up 0 ms, Down 0 ms

  Current address: 44:aa:50XXXXX, Hardware address: 44:aa:50XXXXX

  Last flapped   : 2018-08-21 09:08:10 UTC (00:50:24 ago)

  Statistics last cleared: 2018-08-15 22:25:32 UTC (5d 11:33 ago)

  Traffic statistics:

   Input  bytes  :        6183356033749                    0 bps

   Output bytes  :        7207567747153               386784 bps

   Input  packets:           3029566835                    0 pps

   Output packets:           4256486601                  260 pps

   IPv6 transit statistics:

    Input  bytes  :                   0

    Output bytes  :                   0

    Input  packets:                   0

    Output packets:                   0

  Input errors:

    Errors: 0, Drops: 0, Framing errors: 0, Runts: 0, Bucket drops: 0, Policed discards: 0, L3 incompletes: 0, L2 channel errors: 0, L2 mismatch timeouts: 0, FIFO errors: 0, Resource errors: 0

  Output errors:

    Carrier transitions: 2, Errors: 0, Drops: 0, Collisions: 0, Aged packets: 0, FIFO errors: 0, HS link CRC errors: 0, MTU errors: 0, Resource errors: 0, Bucket drops: 0

  Egress queues: 12 supported, 5 in use

  Queue counters:       Queued packets  Transmitted packets      Dropped packets

    0                                0           4137036295                    0

    3                                0                    0                    0

    4                                0                    0                    0

    7                                0                17102                    0

    8                                0            103713995                    0

  Queue number:         Mapped forwarding classes

    0                   best-effort

    3                   fcoe

    4                   no-loss

    7                   network-control

    8                   mcast

  Active alarms  : None

  Active defects : None

  MAC statistics:                      Receive         Transmit

    Total octets                 6183356033749    7207567747153

    Total packets                   3029566835       4256486601

    Unicast packets                 3027634433       4144924773

    Broadcast packets                  1395256         83221337

    Multicast packets                   537146         28340491

    CRC/Align errors                         0                0

    FIFO errors                              0                0

    MAC control frames                       0                0

    MAC pause frames                         0                0

    Oversized frames                         0

    Jabber frames                            0

    Fragment frames                          0

    VLAN tagged frames               220965439

    Code violations                          0

  MAC Priority Flow Control Statistics:

    Priority :  0                             0                0

    Priority :  1                             0                0

    Priority :  2                             0                0

    Priority :  3                             0                0

    Priority :  4                             0                0

    Priority :  5                             0                0

    Priority :  6                             0                0

    Priority :  7                             0                0

  Filter statistics:

    Input packet count                       0

    Input packet rejects                     0

    Input DA rejects                         0

    Input SA rejects                         0

    Output packet count                                       0

    Output packet pad count                                   0

    Output packet error count                                 0

    CAM destination filters: 1, CAM source filters: 0

  Packet Forwarding Engine configuration:

    Destination slot: 0 (0x00)

  CoS information:

    Direction : Output

    CoS transmit queue               Bandwidth               Buffer Priority   Limit

                              %            bps     %           usec

    0 best-effort             5     2000000000     5              0      low    none

    3 fcoe                   35    14000000000    35              0      low    none

    4 no-loss                35    14000000000    35              0      low    none

    7 network-control         5     2000000000     5              0      low    none

    8 mcast                  20     8000000000    20              0      low    none

  Interface transmit statistics: Disabled

  Logical interface et-3/2/1.0 (Index 700) (SNMP ifIndex 817) (Generation 318)

    Flags: Up SNMP-Traps 0x24024000 Encapsulation: Ethernet-Bridge

    Traffic statistics:

     Input  bytes  :             73042751

     Output bytes  :              7514303

     Input  packets:              1115405

     Output packets:                37017

    Local statistics:

     Input  bytes  :             73042751

     Output bytes  :              7514303

     Input  packets:              1115405

     Output packets:                37017

    Transit statistics:

     Input  bytes  :                    0                    0 bps

     Output bytes  :                    0                    0 bps

     Input  packets:                    0                    0 pps

     Output packets:                    0                    0 pps

    Protocol eth-switch, MTU: 9216, Generation: 342, Route table: 3

      Flags: Trunk-Mode

No MACs learned on the interface:

show ethernet-switching table interface et-3/2/1

MAC database for interface et-3/2/1

MAC database for interface et-3/2/1.0

{master:0}

No LLDP, nothing.

vmkernel.log (this same bit is spamming it; the log has rolled over 6 times in 36 hours):

2018-08-20T02:08:14.446Z cpu68:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2121) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a

2018-08-20T02:08:54.446Z cpu48:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2294) MAC RX filter (class 1) at index 0 is removed from

2018-08-20T02:08:54.446Z cpu48:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2301) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a

2018-08-20T02:08:54.446Z cpu48:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQFree - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:789) RX queue 1 is freed

2018-08-20T02:08:59.439Z cpu59:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQAlloc - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:628) RX queue 1 is allocated

2018-08-20T02:08:59.446Z cpu59:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2114) MAC RX filter (class 1) at index 0 is applied on

2018-08-20T02:08:59.446Z cpu59:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2121) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a

2018-08-20T02:09:14.446Z cpu59:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2294) MAC RX filter (class 1) at index 0 is removed from

2018-08-20T02:09:14.446Z cpu59:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2301) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a

2018-08-20T02:09:14.446Z cpu59:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQFree - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:789) RX queue 1 is freed

2018-08-20T02:10:04.439Z cpu60:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQAlloc - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:628) RX queue 1 is allocated

2018-08-20T02:10:04.446Z cpu60:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2114) MAC RX filter (class 1) at index 0 is applied on

2018-08-20T02:10:04.446Z cpu60:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2121) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a

2018-08-20T02:10:24.446Z cpu48:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2294) MAC RX filter (class 1) at index 0 is removed from

2018-08-20T02:10:24.446Z cpu48:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2301) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a

2018-08-20T02:10:24.446Z cpu48:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQFree - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:789) RX queue 1 is freed

2018-08-20T02:11:14.439Z cpu71:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQAlloc - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:628) RX queue 1 is allocated

2018-08-20T02:11:14.446Z cpu71:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2114) MAC RX filter (class 1) at index 0 is applied on

2018-08-20T02:11:14.446Z cpu71:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2121) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a

2018-08-20T02:11:29.446Z cpu49:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2294) MAC RX filter (class 1) at index 0 is removed from

2018-08-20T02:11:29.446Z cpu49:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2301) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a

2018-08-20T02:11:29.446Z cpu49:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQFree - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:789) RX queue 1 is freed

2018-08-20T02:12:24.439Z cpu87:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQAlloc - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:628) RX queue 1 is allocated

2018-08-20T02:12:24.446Z cpu87:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2114) MAC RX filter (class 1) at index 0 is applied on

2018-08-20T02:12:24.446Z cpu87:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2121) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a

2018-08-20T02:12:49.446Z cpu68:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2294) MAC RX filter (class 1) at index 0 is removed from

2018-08-20T02:12:49.446Z cpu68:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2301) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a

2018-08-20T02:12:49.446Z cpu68:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQFree - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:789) RX queue 1 is freed

2018-08-20T02:13:04.439Z cpu49:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQAlloc - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:628) RX queue 1 is allocated

2018-08-20T02:13:04.446Z cpu49:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2114) MAC RX filter (class 1) at index 0 is applied on

2018-08-20T02:13:04.446Z cpu49:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2121) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a

2018-08-20T02:14:04.446Z cpu92:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2294) MAC RX filter (class 1) at index 0 is removed from

2018-08-20T02:14:04.446Z cpu92:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2301) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a

2018-08-20T02:14:04.446Z cpu92:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQFree - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:789) RX queue 1 is freed

2018-08-20T02:14:49.440Z cpu69:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQAlloc - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:628) RX queue 1 is allocated

2018-08-20T02:14:49.446Z cpu69:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2114) MAC RX filter (class 1) at index 0 is applied on

2018-08-20T02:14:49.446Z cpu69:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2121) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a

2018-08-20T02:14:54.446Z cpu69:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2294) MAC RX filter (class 1) at index 0 is removed from

2018-08-20T02:14:54.446Z cpu69:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2301) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a

2018-08-20T02:14:54.446Z cpu69:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQFree - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:789) RX queue 1 is freed

2018-08-20T02:15:04.440Z cpu69:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQAlloc - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:628) RX queue 1 is allocated

2018-08-20T02:15:04.446Z cpu69:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2114) MAC RX filter (class 1) at index 0 is applied on

2018-08-20T02:15:04.446Z cpu69:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2121) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a

2018-08-20T02:16:34.446Z cpu49:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2294) MAC RX filter (class 1) at index 0 is removed from

2018-08-20T02:16:34.446Z cpu49:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2301) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a

2018-08-20T02:16:34.446Z cpu49:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQFree - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:789) RX queue 1 is freed

2018-08-20T02:16:59.439Z cpu51:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQAlloc - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:628) RX queue 1 is allocated

2018-08-20T02:16:59.446Z cpu51:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2114) MAC RX filter (class 1) at index 0 is applied on

2018-08-20T02:16:59.446Z cpu51:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2121) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a
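A quick way to see how often the driver is tearing down and recreating the RX queue on a given host is to count the alloc/free lines in the current vmkernel.log, for example:

grep -c nmlx4_en_RxQFree /var/log/vmkernel.log           # how many times RX queue 1 was freed in the current log
grep nmlx4_en_RxQAlloc /var/log/vmkernel.log | tail -n 5  # timestamps of the most recent re-allocations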

lzoli
Contributor

It's good to see it's not just me having these issues. My fix was to swap the cards out for Mellanox ConnectX-4s, because this problem crippled production at random times every day.

Touch wood, the ConnectX-4 has worked so far with the latest firmware.

I have literally the same switch as you, and we use DAC cables too. I can confirm it's not just vSAN traffic that stops, even though when I first saw the problem I linked it to traffic volume. A few days after I opened this topic, even management traffic going through the 10 Gbit/s variants of the ConnectX-3 stopped, so it's not data volume that triggers it.

This problem exists on both the 10G and 40G variants of the Mellanox ConnectX-3 cards.

I think we're facing a device driver or firmware-related bug. I don't have a support contract with VMware, so I don't know who to report this to.

If you open a case, please let me know how you get on with it, because we're in the same boat.

skaWll
Contributor

Hi,

Same problem here, with a VM losing all network connectivity on ESXi 6.5 U2, using two Mellanox ConnectX-3 Pro 10 Gbit/s cards in an LACP link aggregation.

When the VM loses its network connectivity, I find these messages in the vmkernel.log:

2018-09-07T13:38:09.868Z cpu11:65725)<NMLX_INF> nmlx4_en: vmnic5: nmlx4_en_QueueApplyFilter - (partners/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2114) MAC RX filter (class 1) at index 0 is applied on

2018-09-07T13:38:09.868Z cpu11:65725)<NMLX_INF> nmlx4_en: vmnic5: nmlx4_en_QueueApplyFilter - (partners/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2121) RX ring 1, QP[0x49], Mac address 00:50:56:82:00:2d

Where 00:50:56:82:00:2d is the MAC address of the impacted VM.

On the Mellanox ConnectX-3 Pro, we're using :

  • the firmware version 2.42.5000
  • the driver version 3.16.11.6

I opened a case with the VMware support, I'll keep you posted.

crgit
Contributor

We have been having essentially the same issue:

We have 8x Dell PowerEdge R730XD servers, each with two Mellanox ConnectX-3 cards. They connect to two Dell S6100 switches using several different brands of 40 Gbps twinax cable (Dell, Cable Rack and LeGrand) in 1 and 3 meter lengths.

The ESXi version is the DellEMC-provided ESXi-6.5U1-7967591-A10.

The Mellanox ConnectX-3 driver is 3.16.11.6, firmware is 2.42.5000.

Our servers would accumulate a large number of FIFO errors and then eventually (initially every 3-4 weeks, now every 1-2 weeks) reach a point where the ESXi management services (hostd, etc.) would be receiving too many logs and become unresponsive. We could restart hostd and some of the management services, but vSAN would never come back online until we force-reset the server (a normal restart would never go through).
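For anyone wanting to correlate this, the per-NIC counters (including receive and FIFO errors) can be pulled on the host itself on 6.5 and later, for example (the vmnic name is an example):

esxcli network nic stats get -n vmnic4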

Once the server came back online, everything seemed fine for a while. We first saw this issue back on ESXi 6.0 U2 and have progressed through updates to ESXi, NIC firmware, BIOS, controller firmware, etc., to try to resolve it.

VMware has exhausted their troubleshooting and says they are seeing the cache of the ConnectX-3 card being overrun, which then leads to the log overrun and eventually the system crash. Dell has not been able to identify any particular issue and has at this point pretty much stopped responding to me.

Due to the above, we have started replacing the Mellanox ConnectX-3 cards with Intel XL710QDA2 cards and have not seen the issue return on those hosts.

crgit
Contributor

Were you ever able to resolve the issue?

chuckado1
Enthusiast

Did VMware support ever find a solution for this issue? I am seeing the same errors in my vmkernel.log and am trying to find a solution.

lzoli
Contributor

I don't have a support contract, so I didn't raise a case with VMware. Some people here said they would, but I haven't heard anything back from them.

The problem is still there and I don't have any solution for it.

lzoli
Contributor

Has anyone got any update on this? It's been going on for a while. Anything is welcome!

yoeri
Contributor

We have been experiencing the same issue running the Dell VMware ESXi image, both on version 6.0 and on version 6.5. The issue does not seem to occur when we use the stock Mellanox driver that comes with ESXi instead of the newer driver included in the Dell images. For now we have downgraded the Mellanox driver (and the firmware).

On a server running these versions of the Mellanox driver and firmware, we have been experiencing occasional network issues:

NIC firmware: 02.42.50.00

NIC driver: 3.16.11.6-1OEM.650.0.0.4598673

We are now running these versions on all our other (6.5) hosts:

NIC firmware: 02.36.50.80

NIC driver: 3.16.0.0-1vmw.650.0.0.4564106
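For anyone comparing hosts, this is roughly how to check which driver VIB and firmware a host is actually running (vmnic0 is just an example uplink):

esxcli software vib list | grep nmlx4     # installed nmlx4 driver VIB versions
esxcli network nic get -n vmnic0          # driver and firmware version as loaded on the uplink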

On a host that is experiencing issues, you can reset the network driver by executing the following commands (via iDRAC):

1) Unload the driver:

esxcfg-module -u nmlx4_en
esxcfg-module -u nmlx4_core

2) Load the driver:

/etc/init.d/sfcbd-watchdog stop

esxcfg-module nmlx4_core

esxcfg-module nmlx4_en
/etc/init.d/sfcbd-watchdog start

kill -POLL $(cat /var/run/vmware/vmkdevmgr.pid)

So far I have not been able to reproduce the issue on a host that is not experiencing the issue. On a host that is having the issue, it is very easy to reproduce it by just generating (a fairly small amount of) network traffic.
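One way to generate that kind of light traffic from the host itself is a sustained vmkping from the affected vmk interface (vmk1 and 10.0.0.12 are placeholders for your vmk and a peer IP):

vmkping -I vmk1 -c 200 -s 8972 -d 10.0.0.12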

You can downgrade the driver with the following commands:

esxcli software vib remove -n nmlx4-core
esxcli software vib remove -n nmlx4-en
esxcli software vib remove -n nmlx4-en-rdma
esxcli software vib install -v http://repo/vmware/ESXi650-10719125/vib20/nmlx4-core/VMW_bootbank_nmlx4-core_3.16.0.0-1vmw.650.0.0.4...
esxcli software vib install -v http://repo/vmware/ESXi650-10719125/vib20/nmlx4-en/VMW_bootbank_nmlx4-en_3.16.0.0-1vmw.650.0.0.45641...
esxcli software vib install -v http://repo/vmware/ESXi650-10719125/vib20/nmlx4-rdma/VMW_bootbank_nmlx4-rdma_3.16.0.0-1vmw.650.0.0.4...
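After the install and a reboot, the downgrade can be verified with:

esxcli software vib list | grep nmlx4
esxcli system module get -m nmlx4_en | grep -i version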

Earlier this year we troubleshot this issue with Mellanox and VMware, but since it was difficult to reproduce, we could not solve it. I currently have a ticket open with Dell support.

yoeri
Contributor

Are you using Zerto?

kovitking
Contributor

Has this issue been solved?

browney595
Contributor

There is now a KB article relating to this issue:

VMware Knowledge Base

At the moment there is no fix, but Mellanox is working on it.

The workaround is to downgrade the driver to 3.15.5.5.
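For reference, the driver downgrade can be done from an offline bundle on a datastore, roughly like this (the zip name below is only a placeholder for whichever 3.15.5.5 bundle applies to your release; you may need to remove the newer nmlx4 VIBs first, as shown earlier in the thread, and reboot afterwards):

esxcli software vib install -d /vmfs/volumes/datastore1/MLNX-NATIVE-ESX-3.15.5.5-offline_bundle.zip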
