Hello dear community,
I'm seeing odd behaviour with the card in the title and would like to ask if anyone has had the same experience or has an idea.
Description of the setup:
The card is a Mellanox ConnectX-3 Pro 40G, with jumbo frames enabled. The server is an Intel R2208WFTZS.
The card's firmware is the latest stable version from Mellanox.
The vSAN vmkernel (vmk) interface is connected to one of the ports of the 40G card, with jumbo frames enabled.
I'm using a non-customised ESXi 6.7 image with the nmlx4_en native driver and all the latest patches installed.
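For anyone comparing setups: the loaded driver and firmware for an uplink can be checked from the ESXi shell. This is only a sketch; vmnic2 is a placeholder for your Mellanox uplink, and the guard makes the script a no-op on anything that isn't an ESXi host.

```shell
#!/bin/sh
# Sketch: confirm which nmlx4 driver and firmware an uplink is using.
# vmnic2 is a placeholder; adjust to your environment.
if command -v esxcli >/dev/null 2>&1; then
  status="esxi"
  esxcli network nic get -n vmnic2            # driver name/version, firmware version
  esxcli software vib list | grep -i nmlx4    # installed nmlx4 VIB versions
else
  status="skipped"
  echo "esxcli not found; run this on the ESXi host"
fi
```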
Problem:
The setup works well; however, occasionally and completely at random, the vmk interface can't reach the other vSAN nodes in the same cluster and can't ping anyone.
When I run a packet capture on the vmk interface, I can see the ARP requests arriving from the other nodes; the interface responds to them, but the responses never make it out of the server. It looks similar to a unidirectional link.
If I reload the vmnic that the vSAN traffic is meant to go through, everything returns to normal: traffic from the vSAN vmk starts to flow and the ARP responses go through.
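The two steps described above (capturing ARP on the vmk, then reloading the uplink) can be sketched roughly as follows. This is illustrative only: vmk1 and vmnic2 are placeholders for the vSAN vmkernel interface and the affected uplink, and the guard keeps it a no-op off an ESXi host.

```shell
#!/bin/sh
# Sketch of the two checks described above; vmk1/vmnic2 are placeholders.
if command -v esxcli >/dev/null 2>&1; then
  status="esxi"
  # Capture ARP (EtherType 0x0806) on the vSAN vmkernel interface for 10s
  # to confirm replies are actually being generated:
  pktcap-uw --vmk vmk1 --ethtype 0x0806 -o /tmp/vmk1-arp.pcap &
  CAP=$!
  sleep 10
  kill "$CAP"
  # "Reloading" the uplink by bouncing its link state, which restores
  # traffic in the scenario described here:
  esxcli network nic down -n vmnic2
  esxcli network nic up -n vmnic2
else
  status="skipped"
  echo "esxcli not found; run this on the ESXi host"
fi
```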
Obviously the logs don't say much other than this interesting part:
2018-08-13T20:26:23.187Z cpu25:2097436)<NMLX_INF> nmlx4_en: vmnic2: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2294) MAC RX filter (class 1) at index 0 is removed from
2018-08-13T20:26:23.187Z cpu25:2097436)<NMLX_INF> nmlx4_en: vmnic2: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2301) RX ring 1, QP[0x49], Mac address 00:50:56:6c:23:xx
2018-08-13T20:26:23.187Z cpu25:2097436)<NMLX_INF> nmlx4_en: vmnic2: nmlx4_en_RxQFree - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:789) RX queue 1 is freed
2018-08-13T20:26:39.594Z cpu43:2098358)VLANMTUCheck: NMVCCheckUplinkResultDetail:880: Vlanmtuchk EVENT: DV port 408 uplink vmnic2MTU check RESULT_CHANGED
2018-08-13T20:26:39.594Z cpu43:2098358)VLANMTUCheck: NMVCCheckUplinkResultDetail:898: Vlanmtuchk EVENT: DV port 408 uplink vmnic2VLAN check RESULT_CHANGED
It looks like the vSAN RX queue is being destroyed? I'm not sure why that is happening. Right after that, the DVS alarms come on.
Other than these lines, there are no signs of driver- or firmware-related errors or crashes in the logs.
It has already happened on 2 identical servers, so I'm afraid it isn't specific to one machine. I don't really know where to start or how to dig deeper.
Any idea or suggestion?
Thank you very much!
Zoltan
The exact same thing happened on a Dell R640 as well, with the same card.
Same error:
2018-08-15T08:06:35.593Z cpu56:2097436)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2294) MAC RX filter (class 1) at index 0 is removed from
2018-08-15T08:06:35.593Z cpu56:2097436)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2301) RX ring 1, QP[0x49], Mac address 00:50:56:6d:c9:xx
2018-08-15T08:06:35.593Z cpu56:2097436)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_RxQFree - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:789) RX queue 1 is freed
Any suggestions welcomed!
I'm having the same problem: same messages in the logs, same card, same ESXi build, but I do not have the latest CVE patch applied yet.
I have 6 identical servers, all exhibiting the same problem. I installed the Mellanox tools to get the exact firmware info and to make sure it was up to date; it is.
vSAN is enabled.
I can add that, for me, it is not JUST vSAN traffic that stops; it is ALL traffic on the given port, including VM traffic. Fortunately, so far only 1 port fails at a time.
I am using a VDS. All port groups were previously configured to use link state for failure detection; I changed it to beacon probing and put one of my onboard 1 Gb Intel NICs in the standby uplinks section. This has masked the problem, but obviously not fixed it.
I am struggling with this and will open a case with VMware tomorrow. Otherwise I'll put some Intel X710s in there and call it a day. I purchased Mellanox for the RDMA features, as these nodes are compatible with either S2D or VMware.
The switch is a Juniper EX4600 with no DCBX enabled: stock CoS flow queues, no additional CoS configured.
Media is DAC cable. The card is a Mellanox ConnectX-3 Pro dual-port QSFP 40GbE.
flint output:
[root@vmserver3:/opt/mellanox/bin] ./flint_ext -d mt4103_pci_cr0 hw query
HW Info:
HwDevId 503
HwRevId 0x0
Flash Info:
Type M25PXxx
TotalSize 0x200000
Banks 0x1
SectorSize 0x1000
WriteBlockSize 0x10
CmdSet 0x80
[root@vmserver3:/opt/mellanox/bin] ./flint_ext -d mt4103_pci_cr0 query
Image type: FS2
FW Version: 2.42.5000
FW Release Date: 5.9.2017
Product Version: 02.42.50.00
Rom Info: type=PXE version=3.4.752
Device ID: 4103
Description: Node Port1 Port2 Sys image
GUIDs: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
MACs: ec0d9abXXXXX ec0d9aXXXXXX
VSD:
PSID: MT_1060111023
Switch output from the affected NIC. Notice the zero bps input rate, yet no flaps and no errors:
Physical interface: et-3/2/1, Enabled, Physical link is Up
Interface index: 786, SNMP ifIndex: 800, Generation: 280
Description: S2D6-2
Link-level type: Ethernet, MTU: 9216, MRU: 0, Speed: 40Gbps, BPDU Error: None, MAC-REWRITE Error: None, Loopback: Disabled, Source filtering: Disabled, Flow control: Disabled, Media type: Fiber
Device flags : Present Running
Interface flags: SNMP-Traps Internal: 0x4000
Link flags : None
CoS queues : 12 supported, 12 maximum usable queues
Hold-times : Up 0 ms, Down 0 ms
Current address: 44:aa:50XXXXX, Hardware address: 44:aa:50XXXXX
Last flapped : 2018-08-21 09:08:10 UTC (00:50:24 ago)
Statistics last cleared: 2018-08-15 22:25:32 UTC (5d 11:33 ago)
Traffic statistics:
Input bytes : 6183356033749 0 bps
Output bytes : 7207567747153 386784 bps
Input packets: 3029566835 0 pps
Output packets: 4256486601 260 pps
IPv6 transit statistics:
Input bytes : 0
Output bytes : 0
Input packets: 0
Output packets: 0
Input errors:
Errors: 0, Drops: 0, Framing errors: 0, Runts: 0, Bucket drops: 0, Policed discards: 0, L3 incompletes: 0, L2 channel errors: 0, L2 mismatch timeouts: 0, FIFO errors: 0, Resource errors: 0
Output errors:
Carrier transitions: 2, Errors: 0, Drops: 0, Collisions: 0, Aged packets: 0, FIFO errors: 0, HS link CRC errors: 0, MTU errors: 0, Resource errors: 0, Bucket drops: 0
Egress queues: 12 supported, 5 in use
Queue counters: Queued packets Transmitted packets Dropped packets
0 0 4137036295 0
3 0 0 0
4 0 0 0
7 0 17102 0
8 0 103713995 0
Queue number: Mapped forwarding classes
0 best-effort
3 fcoe
4 no-loss
7 network-control
8 mcast
Active alarms : None
Active defects : None
MAC statistics: Receive Transmit
Total octets 6183356033749 7207567747153
Total packets 3029566835 4256486601
Unicast packets 3027634433 4144924773
Broadcast packets 1395256 83221337
Multicast packets 537146 28340491
CRC/Align errors 0 0
FIFO errors 0 0
MAC control frames 0 0
MAC pause frames 0 0
Oversized frames 0
Jabber frames 0
Fragment frames 0
VLAN tagged frames 220965439
Code violations 0
MAC Priority Flow Control Statistics:
Priority : 0 0 0
Priority : 1 0 0
Priority : 2 0 0
Priority : 3 0 0
Priority : 4 0 0
Priority : 5 0 0
Priority : 6 0 0
Priority : 7 0 0
Filter statistics:
Input packet count 0
Input packet rejects 0
Input DA rejects 0
Input SA rejects 0
Output packet count 0
Output packet pad count 0
Output packet error count 0
CAM destination filters: 1, CAM source filters: 0
Packet Forwarding Engine configuration:
Destination slot: 0 (0x00)
CoS information:
Direction : Output
CoS transmit queue Bandwidth Buffer Priority Limit
% bps % usec
0 best-effort 5 2000000000 5 0 low none
3 fcoe 35 14000000000 35 0 low none
4 no-loss 35 14000000000 35 0 low none
7 network-control 5 2000000000 5 0 low none
8 mcast 20 8000000000 20 0 low none
Interface transmit statistics: Disabled
Logical interface et-3/2/1.0 (Index 700) (SNMP ifIndex 817) (Generation 318)
Flags: Up SNMP-Traps 0x24024000 Encapsulation: Ethernet-Bridge
Traffic statistics:
Input bytes : 73042751
Output bytes : 7514303
Input packets: 1115405
Output packets: 37017
Local statistics:
Input bytes : 73042751
Output bytes : 7514303
Input packets: 1115405
Output packets: 37017
Transit statistics:
Input bytes : 0 0 bps
Output bytes : 0 0 bps
Input packets: 0 0 pps
Output packets: 0 0 pps
Protocol eth-switch, MTU: 9216, Generation: 342, Route table: 3
Flags: Trunk-Mode
No MACs learned:
show ethernet-switching table interface et-3/2/1
MAC database for interface et-3/2/1
MAC database for interface et-3/2/1.0
{master:0}
No LLDP, nothing.
vmkernel.log (this same block is spamming it; the log has rolled over 6 times in 36 hours):
2018-08-20T02:08:14.446Z cpu68:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2121) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a
2018-08-20T02:08:54.446Z cpu48:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2294) MAC RX filter (class 1) at index 0 is removed from
2018-08-20T02:08:54.446Z cpu48:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2301) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a
2018-08-20T02:08:54.446Z cpu48:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQFree - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:789) RX queue 1 is freed
2018-08-20T02:08:59.439Z cpu59:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQAlloc - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:628) RX queue 1 is allocated
2018-08-20T02:08:59.446Z cpu59:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2114) MAC RX filter (class 1) at index 0 is applied on
2018-08-20T02:08:59.446Z cpu59:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2121) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a
2018-08-20T02:09:14.446Z cpu59:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2294) MAC RX filter (class 1) at index 0 is removed from
2018-08-20T02:09:14.446Z cpu59:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2301) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a
2018-08-20T02:09:14.446Z cpu59:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQFree - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:789) RX queue 1 is freed
2018-08-20T02:10:04.439Z cpu60:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQAlloc - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:628) RX queue 1 is allocated
2018-08-20T02:10:04.446Z cpu60:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2114) MAC RX filter (class 1) at index 0 is applied on
2018-08-20T02:10:04.446Z cpu60:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2121) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a
2018-08-20T02:10:24.446Z cpu48:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2294) MAC RX filter (class 1) at index 0 is removed from
2018-08-20T02:10:24.446Z cpu48:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2301) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a
2018-08-20T02:10:24.446Z cpu48:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQFree - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:789) RX queue 1 is freed
2018-08-20T02:11:14.439Z cpu71:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQAlloc - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:628) RX queue 1 is allocated
2018-08-20T02:11:14.446Z cpu71:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2114) MAC RX filter (class 1) at index 0 is applied on
2018-08-20T02:11:14.446Z cpu71:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2121) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a
2018-08-20T02:11:29.446Z cpu49:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2294) MAC RX filter (class 1) at index 0 is removed from
2018-08-20T02:11:29.446Z cpu49:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2301) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a
2018-08-20T02:11:29.446Z cpu49:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQFree - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:789) RX queue 1 is freed
2018-08-20T02:12:24.439Z cpu87:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQAlloc - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:628) RX queue 1 is allocated
2018-08-20T02:12:24.446Z cpu87:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2114) MAC RX filter (class 1) at index 0 is applied on
2018-08-20T02:12:24.446Z cpu87:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2121) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a
2018-08-20T02:12:49.446Z cpu68:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2294) MAC RX filter (class 1) at index 0 is removed from
2018-08-20T02:12:49.446Z cpu68:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2301) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a
2018-08-20T02:12:49.446Z cpu68:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQFree - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:789) RX queue 1 is freed
2018-08-20T02:13:04.439Z cpu49:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQAlloc - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:628) RX queue 1 is allocated
2018-08-20T02:13:04.446Z cpu49:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2114) MAC RX filter (class 1) at index 0 is applied on
2018-08-20T02:13:04.446Z cpu49:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2121) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a
2018-08-20T02:14:04.446Z cpu92:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2294) MAC RX filter (class 1) at index 0 is removed from
2018-08-20T02:14:04.446Z cpu92:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2301) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a
2018-08-20T02:14:04.446Z cpu92:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQFree - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:789) RX queue 1 is freed
2018-08-20T02:14:49.440Z cpu69:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQAlloc - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:628) RX queue 1 is allocated
2018-08-20T02:14:49.446Z cpu69:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2114) MAC RX filter (class 1) at index 0 is applied on
2018-08-20T02:14:49.446Z cpu69:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2121) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a
2018-08-20T02:14:54.446Z cpu69:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2294) MAC RX filter (class 1) at index 0 is removed from
2018-08-20T02:14:54.446Z cpu69:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2301) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a
2018-08-20T02:14:54.446Z cpu69:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQFree - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:789) RX queue 1 is freed
2018-08-20T02:15:04.440Z cpu69:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQAlloc - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:628) RX queue 1 is allocated
2018-08-20T02:15:04.446Z cpu69:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2114) MAC RX filter (class 1) at index 0 is applied on
2018-08-20T02:15:04.446Z cpu69:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2121) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a
2018-08-20T02:16:34.446Z cpu49:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2294) MAC RX filter (class 1) at index 0 is removed from
2018-08-20T02:16:34.446Z cpu49:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2301) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a
2018-08-20T02:16:34.446Z cpu49:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQFree - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:789) RX queue 1 is freed
2018-08-20T02:16:59.439Z cpu51:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_RxQAlloc - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:628) RX queue 1 is allocated
2018-08-20T02:16:59.446Z cpu51:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2114) MAC RX filter (class 1) at index 0 is applied on
2018-08-20T02:16:59.446Z cpu51:2097484)<NMLX_INF> nmlx4_en: vmnic0: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2121) RX ring 1, QP[0x49], Mac address 00:50:56:6e:58:1a
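To put a number on how fast the log above is churning, the teardown/rebuild events can simply be counted with grep. This is a small sketch; on the host you would point it at /var/log/vmkernel.log.

```shell
#!/bin/sh
# Sketch: count RX queue teardown/rebuild events in a vmkernel log.
count_events() {
  # $1 = log file; prints "<free-count> <alloc-count>"
  free=$(grep -c 'nmlx4_en_RxQFree' "$1")
  alloc=$(grep -c 'nmlx4_en_RxQAlloc' "$1")
  echo "$free $alloc"
}
# On the host: count_events /var/log/vmkernel.log
```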
It's good to see it's not just me having these issues. My fix was to swap the cards out for Mellanox ConnectX-4s, because this problem was literally crippling production at random times of day.
Touch wood, the X4 has worked so far with the latest firmware.
I have literally the same switch as you, and we use DAC cables too. I can confirm it's not just vSAN traffic that stops; when I first saw the problem I linked it to traffic volume, but a few days after I opened this topic, even management traffic going through the 10 Gbit/s variants of the X3 stopped. It's not data volume that triggers it.
This problem exists on both the 10G and 40G variants of the Mellanox X3 cards.
I think we're facing a device driver or firmware bug. I don't have a support contract with VMware, so I don't know who to report this to.
If you open a case, please let me know where you get with it, because we're literally in the same boat.
Hi,
Same problem here, with a VM losing all network connectivity on ESXi 6.5 U2, using 2 Mellanox ConnectX-3 Pro 10 Gbit cards in an LACP link aggregation.
When the VM loses its network connectivity, I find these messages in vmkernel.log:
2018-09-07T13:38:09.868Z cpu11:65725)<NMLX_INF> nmlx4_en: vmnic5: nmlx4_en_QueueApplyFilter - (partners/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2114) MAC RX filter (class 1) at index 0 is applied on
2018-09-07T13:38:09.868Z cpu11:65725)<NMLX_INF> nmlx4_en: vmnic5: nmlx4_en_QueueApplyFilter - (partners/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2121) RX ring 1, QP[0x49], Mac address 00:50:56:82:00:2d
Here 00:50:56:82:00:2d is the MAC address of the impacted VM.
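For anyone else trying to match the MAC address in these log lines to a VM, the ESXi network namespace can list VM ports with their MACs. A hedged sketch (the world ID 12345 is a placeholder, and the guard makes this a no-op off-host):

```shell
#!/bin/sh
# Sketch: find which VM owns a MAC address seen in vmkernel.log.
if command -v esxcli >/dev/null 2>&1; then
  status="esxi"
  # List running VMs with their world IDs:
  esxcli network vm list
  # Then list the ports (including MAC addresses) of one VM by world ID;
  # 12345 is a placeholder:
  esxcli network vm port list -w 12345
else
  status="skipped"
  echo "esxcli not found; run this on the ESXi host"
fi
```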
On the Mellanox ConnectX-3 Pro, we're using:
I opened a case with the VMware support, I'll keep you posted.
We have been having essentially the same issue:
We have 8 Dell PowerEdge R730XD servers, each with two Mellanox ConnectX-3 cards. They connect to two Dell S6100 switches using several different brands of 40 Gbps twinax cable (Dell, Cable Rack and LeGrand) in 1 and 3 metre lengths.
The ESXi version is DellEMC (provided) ESXi-6.5U1-7967591-A10.
The Mellanox ConnectX-3 driver is 3.16.11.6, firmware is 2.42.5000.
Our servers would accumulate a large number of FIFO errors and then eventually (initially every 3-4 weeks, now every 1-2 weeks) reach a point where the ESXi management services (hostd, etc.) would receive too many logs and become unresponsive. We could restart hostd and some of the management services, but vSAN would never come back online until we force-reset the server (a normal restart would never complete).
Once the server came back online, everything seemed fine for a while. We first hit this issue back on ESXi 6.0 U2 and have progressed through updates to ESXi, firmware, BIOS, controller firmware, etc., trying to resolve it.
VMware has exhausted their troubleshooting and says they are seeing the cache of the ConnectX-3 card being overrun, which then starts the logging overrun and eventually the system crash. Dell has not been able to identify any particular issue and has at this point pretty much stopped responding to me.
Because of the above, we have started replacing the Mellanox ConnectX-3 cards with Intel XL710-QDA2 cards and have not seen the issue return on those hosts.
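The FIFO error build-up described above can be watched from the host side, since ESXi exposes per-NIC statistics. Another hedged sketch (vmnic0 is a placeholder for the affected uplink; the guard keeps it a no-op off an ESXi host):

```shell
#!/bin/sh
# Sketch: dump per-NIC statistics (including receive error counters)
# to watch for error build-up over time.
if command -v esxcli >/dev/null 2>&1; then
  status="esxi"
  esxcli network nic stats get -n vmnic0
else
  status="skipped"
  echo "esxcli not found; run this on the ESXi host"
fi
```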
Were you ever able to resolve the issue?
Did VMware support ever find a solution for this issue? I am seeing the same errors in my vmkernel.log and am trying to find a solution.
I don't have a support contract, so I didn't raise a case with VMware. Some people here said they would, but I haven't heard anything back from them.
The problem is still there and I don't have any solution for it.
Has anyone got an update on this? It's been going on for a while. Anything is welcome!
We have been experiencing the same issue running the Dell VMware ESXi image, on both 6.0 and 6.5. The issue does not seem to occur when we use the stock Mellanox driver that ships with ESXi instead of the newer drivers included in the Dell images. For now we have downgraded the Mellanox driver (and the firmware).
On a server running these versions of the Mellanox driver and firmware, we have been experiencing occasional network issues:
NIC firmware: 02.42.50.00
NIC driver: 3.16.11.6-1OEM.650.0.0.4598673
We are now running these versions on all our other (6.5) hosts:
NIC firmware: 02.36.50.80
NIC driver: 3.16.0.0-1vmw.650.0.0.4564106
On a host that is experiencing issues, you can reset the network driver by executing the following commands (via iDRAC):
1) Unload the driver:
esxcfg-module -u nmlx4_en
esxcfg-module -u nmlx4_core
2) Load the driver:
/etc/init.d/sfcbd-watchdog stop
esxcfg-module nmlx4_core
esxcfg-module nmlx4_en
/etc/init.d/sfcbd-watchdog start
kill -POLL $(cat /var/run/vmware/vmkdevmgr.pid)
So far I have not been able to reproduce the issue on a healthy host. On a host that is having the issue, it is very easy to reproduce by just generating a fairly small amount of network traffic.
You can downgrade the driver with the following commands:
esxcli software vib remove -n nmlx4-core
esxcli software vib remove -n nmlx4-en
esxcli software vib remove -n nmlx4-en-rdma
esxcli software vib install -v http://repo/vmware/ESXi650-10719125/vib20/nmlx4-core/VMW_bootbank_nmlx4-core_3.16.0.0-1vmw.650.0.0.4...
esxcli software vib install -v http://repo/vmware/ESXi650-10719125/vib20/nmlx4-en/VMW_bootbank_nmlx4-en_3.16.0.0-1vmw.650.0.0.45641...
esxcli software vib install -v http://repo/vmware/ESXi650-10719125/vib20/nmlx4-rdma/VMW_bootbank_nmlx4-rdma_3.16.0.0-1vmw.650.0.0.4...
Earlier this year we were troubleshooting this issue with Mellanox and VMware, but since it was difficult to reproduce, we could not solve it. I currently have a ticket open with Dell support.
Are you using Zerto?
Has the issue been solved?
There is now a KB relating to this issue.
The resolution at the moment: there is no fix, but Mellanox is working on it.
The workaround is to downgrade the driver to 3.15.5.5.