Solved: Re: 10GB network port locks after Veeam backup

veldthui · ‎12-24-2019

I have a 1GbE network connection on my Lenovo X3650 M5 server and was originally backing up a couple of VMs using the Community Edition of Veeam 9.5 Update 4b. This works perfectly. I have exactly the same setup on an HPE DL360 Gen9 and it also works perfectly.

I upgraded the network on both, put in a Mellanox 10GbE card, and set up a management port on the 10GbE connection. All fine so far.

When I run the Veeam backup over the 10GbE management network on the HPE, it works fine and backs up the VMs. When I do the same on the Lenovo, the backup completes and then the IP stops responding. Can't ping it, can't get into the UI, nothing. Connecting over the 1GbE management network and checking the network settings, everything looks okay, but to actually get it working again I have to shut down and restart ESXi.

The ESXi version is the Lenovo custom image, 6.7 Update 3, but it has done this with Update 1 and 2 as well.

If I use the 1GbE management network for the backup, everything is fine.

I deleted the 10GbE network configuration from ESXi and recreated it, and the issue is the same.

Any ideas on what may be happening would be appreciated.

ashishsingh1508 · ‎12-29-2019

This requires logs to be checked.

This is not a NIC capacity issue.

Could you please check the NIC stats

esxcli network nic stats get -n vmnicX

Ashish Singh VCP-6.5, VCP-NV 6, VCIX-6,VCIX-6.5, vCAP-DCV, vCAP-DCD

T180985 · ‎12-29-2019

As above, you will need to check the vmkernel and hostd logs to see if anything stands out when this occurs. You may also want to check that you have the correct VIB installed for the card. I assume you have checked the HCL?


blazilla · ‎12-30-2019

Can you move the Management VMkernel Port to a 1 GbE NIC and test this again? Maybe it's a driver or firmware-related issue.

Best regards Patrick https://www.vcloudnine.de

veldthui · ‎12-30-2019

Can you move the Management VMkernel Port to a 1 GbE NIC and test this again? Maybe it's a driver or firmware-related issue.

I have both a 1GbE and a 10GbE management network. The 1GbE works perfectly but is slow when copying the backups, which is why I want the 10GbE connection working. It is only a small number of VMs, but I still don't like it taking so long because I start the backup manually. The 10GbE connection locks up. I had two 10GbE connections set up for failover, but it did not fail over. Removing the failover allowed the network to run again until the next backup, and then the remaining connection locked up.

veldthui · ‎12-30-2019

As above, you will need to check the vmkernel & hostd logs to see if anything stands out when this occurs. You may also want to check you have the correct VIB installed for the card, i assume you have checked the HCL?

I am a Linux noob, and new to ESXi too, but I will see if I can track them down.

blazilla · ‎12-30-2019

Check if you're using the latest driver and firmware for your NICs. This sounds like a driver/firmware issue.


veldthui · ‎12-30-2019

Could you please check the NIC stats
esxcli network nic stats get -n vmnicX

Here it is. This was just after a backup and the port was not responding.

NIC statistics for vmnic4
   Packets received: 28963399
   Packets sent: 87516039
   Bytes received: 29253117297
   Bytes sent: 121131004446
   Receive packets dropped: 0
   Transmit packets dropped: 0
   Multicast packets received: 811274
   Broadcast packets received: 0
   Multicast packets sent: 0
   Broadcast packets sent: 0
   Total receive errors: 0
   Receive length errors: 0
   Receive over errors: 0
   Receive CRC errors: 0
   Receive frame errors: 0
   Receive FIFO errors: 0
   Receive missed errors: 0
   Total transmit errors: 0
   Transmit aborted errors: 0
   Transmit carrier errors: 0
   Transmit FIFO errors: 0
   Transmit heartbeat errors: 0
   Transmit window errors: 0
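
Every drop and error counter in that output is zero, so the counters alone do not explain the lockup. For repeating the check after later backups, here is a minimal sketch that prints only the non-zero drop and error counters. It assumes a POSIX shell with awk is available on the host, and the function name nonzero_errors is made up for illustration.

```shell
#!/bin/sh
# Read `esxcli network nic stats get -n <vmnic>` output on stdin and print
# only the "dropped" / "errors" counters whose value is greater than zero.
# (nonzero_errors is a hypothetical helper name, not an ESXi command.)
nonzero_errors() {
    awk -F': *' '/dropped|errors/ && ($2 + 0) > 0 {
        sub(/^ +/, "", $1)      # strip the leading indentation
        print $1 ": " $2
    }'
}

# Example use on the host (vmnic4 is the NIC from this thread):
#   esxcli network nic stats get -n vmnic4 | nonzero_errors
```

No output means every drop and error counter is still zero, as in the capture above.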

veldthui · ‎12-30-2019

As above, you will need to check the vmkernel & hostd logs to see if anything stands out when this occurs. You may also want to check you have the correct VIB installed for the card, i assume you have checked the HCL?

I just used the Lenovo-specific ISO to install, so I would assume it has the correct driver for its own network card. Not sure what you mean about the HCL?

I ran the backup and, after the lockup, checked the vmkernel log. The only reference to vmnic4 (the connection) is

2019-12-30T19:44:29.367Z cpu4:2097256)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_RxQAlloc - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:628) RX queue 1 is allocated
2019-12-30T19:44:29.388Z cpu4:2097256)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2114) MAC RX filter (class 1) at index 0 is applied on
2019-12-30T19:44:29.388Z cpu4:2097256)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2121) RX ring 1, QP[0x49], Mac address 00:50:56:68:f0:d1
and a bit further down
2019-12-30T20:00:29.401Z cpu0:2097256)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2294) MAC RX filter (class 1) at index 0 is removed from
2019-12-30T20:00:29.401Z cpu0:2097256)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2301) RX ring 1, QP[0x49], Mac address 00:50:56:68:f0:d1
2019-12-30T20:00:29.401Z cpu0:2097256)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_RxQFree - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:789) RX queue 1 is freed

Only thing in hostd that looks unusual is

2019-12-30T20:14:15.688Z info hostd[2099523] [Originator@6876 sub=Libs opID=5c5015c6] NetstackInstanceImpl: congestion control algorithm: newreno
2019-12-30T20:14:17.566Z info hostd[2098897] [Originator@6876 sub=Vimsvc.TaskManager opID=sps-Main-65991-131-73-3a-15dc user=vpxuser:JVNET.LOCAL\vpxd-extension-fb0a6774-4559-4040-a1e8-ed7b2a05ec83] Task Created : haTask--vim.vslm.host.CatalogSyncManager.queryCatalogChange-539196836
2019-12-30T20:14:17.567Z info hostd[2099521] [Originator@6876 sub=Default opID=sps-Main-65991-131-73-3a-15dc user=vpxuser:JVNET.LOCAL\vpxd-extension-fb0a6774-4559-4040-a1e8-ed7b2a05ec83] Transfer to exception eraro code: 403, message:
2019-12-30T20:14:17.568Z info hostd[2099521] [Originator@6876 sub=Default opID=sps-Main-65991-131-73-3a-15dc user=vpxuser:JVNET.LOCAL\vpxd-extension-fb0a6774-4559-4040-a1e8-ed7b2a05ec83] AdapterServer caught exception: N3Vim5Fault8NotFound9ExceptionE(Fault cause: vim.fault.NotFound
--> )

veldthui · ‎12-30-2019

I just ran esxcli network nic get -n vmnic4 and the output is below. It says Pause RX: true and Pause TX: true.

Does this mean that something has paused the NIC and if so how do I unpause it?

[root@esxi67:~] esxcli network nic get -n vmnic4
   Advertised Auto Negotiation: true
   Advertised Link Modes: 1000None/Half, 1000None/Full, 10000None/Half, 10000None/Full, 40000None/Half, 40000None/Full, Auto
   Auto Negotiation: false
   Cable Type:
   Current Message Level: -1
   Driver Info:
         Bus Info: 0000:06:00:0
         Driver: nmlx4_en
         Firmware Version: 2.11.500
         Version: 3.17.13.1
   Link Detected: true
   Link Status: Up by explicit linkSet
   Name: vmnic4
   PHYAddress: 0
   Pause Autonegotiate: false
   Pause RX: true
   Pause TX: true
   Supported Ports:
   Supports Auto Negotiation: true
   Supports Pause: true
   Supports Wakeon: false
   Transceiver: external
   Virtual Address: 00:50:56:5b:ad:25
   Wakeon: None
[root@esxi67:~]
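
For what it is worth, Pause RX and Pause TX in this output describe Ethernet flow control (whether the port accepts and sends pause frames). They do not indicate that the NIC has been halted. A minimal sketch for pulling just those settings out of the output, assuming a POSIX shell with awk on the host; the function name pause_settings is hypothetical.

```shell
#!/bin/sh
# Read `esxcli network nic get -n <vmnic>` output on stdin and print the
# flow-control (pause frame) settings as name=value pairs.
# (pause_settings is a hypothetical helper name, not an ESXi command.)
pause_settings() {
    awk -F': *' '/Pause (RX|TX|Autonegotiate)/ {
        sub(/^ +/, "", $1)      # strip the leading indentation
        print $1 "=" $2
    }'
}

# Example use on the host (vmnic4 is the NIC from this thread):
#   esxcli network nic get -n vmnic4 | pause_settings
# Assumption: `esxcli network nic pauseParams list` is also available on
# this build and shows the same settings for every NIC.
```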

veldthui · ‎12-30-2019

Okay, it looks like it is a driver issue. After much more searching and reading I found a knowledge base article detailing the issue, last updated 21 October. The article says:

Symptoms

An ESXi host is experiencing full traffic loss
All Virtual Machine traffic using a Mellanox adapter stops
Mellanox adapter driver is in use nmlx4_en 3.15.11.6 and 3.16.11.6 and 3.17.13.1
Traffic is not passing over a Mellanox adapter but the link status shows as active
Both the vmkernel and VMs go unresponsive on the network.
Network Card MT27500 Family [ConnectX-3 and ConnectX-3 Pro Devices]

Cause

This is a driver related issue.

Impact / Risks

All network traffic can be lost when using this adapter and driver combination.

Resolution

This issue is resolved in later versions of the driver.
nmlx4_en 3.15.11.10 (6.0 driver)
nmlx4_en 3.16.11.10 (6.5 driver) or new releases (6.7 driver)

My current version is 3.17.13.1, which is on the affected list. The resolution says to use a driver later than 3.16.11.10, which 3.17.13.1 already is, so the article does not make sense to me.

One workaround suggested was to downgrade the driver to 3.15.5.5. I have a BIOS/firmware update to do on my server, so if it is still having issues after that I may try the downgrade.

Link to the knowledge base article is https://kb.vmware.com/s/article/60421?lang=en_US
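
The version comparison above can be sketched as a small script: read the driver version from the esxcli network nic get output and compare it with the three versions the article lists under Symptoms. This assumes a POSIX shell with awk on the host. The names AFFECTED, driver_version, and is_affected are made up for illustration, and the version list is copied from the article text quoted in this post, not from any other source.

```shell
#!/bin/sh
# nmlx4_en versions listed under "Symptoms" in the KB text quoted above.
AFFECTED="3.15.11.6 3.16.11.6 3.17.13.1"

# Read `esxcli network nic get -n <vmnic>` output on stdin and print the
# driver version. Only the line whose first field is exactly "Version:"
# matches, so "Firmware Version:" is skipped.
driver_version() {
    awk '$1 == "Version:" { print $2; exit }'
}

# Return 0 if the given version string is on the affected list, 1 otherwise.
is_affected() {
    for v in $AFFECTED; do
        [ "$1" = "$v" ] && return 0
    done
    return 1
}

# Example use on the host (vmnic4 is the NIC from this thread):
#   ver=$(esxcli network nic get -n vmnic4 | driver_version)
#   if is_affected "$ver"; then
#       echo "nmlx4_en $ver is on the KB's affected list"
#   fi
```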

veldthui · ‎12-30-2019

Given the issue, and since it looks like it has been going on for a while with no fix, I am going to swap the network card for one that uses an Intel chipset.

My HPE is using a 10GbE card with an Intel chipset and is not having any issues.