veldthui
Enthusiast

10GbE network port locks up after Veeam backup


I have a 1GbE network connection on my Lenovo X3650 M5 server and was originally backing up a couple of VMs using the community version of Veeam 9.5 Update 4b. This works perfectly with no issues. I have exactly the same setup on an HPE DL360 Gen 9 and it also works perfectly.

I upgraded the network on both servers, putting in a Mellanox 10GbE card and setting it up with a management port on the 10GbE connection. All fine so far.

When I run the backup over the HPE's 10GbE management network it works fine and backs up the VMs. However, when I do the same on the Lenovo, the backup completes and then the IP stops responding. I can't ping it, can't get into the UI, nothing. Going in over the 1GbE management network, the network configuration all looks okay, but to actually get the 10GbE port working again I have to shut down and restart ESXi.

The ESXi version is the Lenovo build of 6.7 Update 3, but it did this with Update 1 and 2 as well.

If I use the 1GbE management network for the backup, everything is fine.

I deleted the 10GbE network configuration from ESXi and recreated it, but the issue remains.

Any ideas on what may be happening are appreciated.

0 Kudos
1 Solution

Accepted Solutions
veldthui
Enthusiast

Okay, it looks like it is a driver issue. After much more searching and reading I found a knowledge base article describing the issue, last updated 21 October. It says:

Symptoms

  • An ESXi host is experiencing full traffic loss
  • All Virtual Machine traffic using a Mellanox adapter stops
  • Mellanox adapter driver nmlx4_en 3.15.11.6, 3.16.11.6, or 3.17.13.1 is in use
  • Traffic is not passing over a Mellanox adapter but the link status shows as active
  • Both the vmkernel and VMs go unresponsive on the network.
  • Network Card MT27500 Family [ConnectX-3 and ConnectX-3 Pro Devices]

Cause

This is a driver related issue.

Impact / Risks

All network traffic can be lost when using this adapter and driver combination.

Resolution

This issue is resolved in later versions of the driver.
nmlx4_en 3.15.11.10 (6.0 driver)
nmlx4_en 3.16.11.10 (6.5 driver) or new releases (6.7 driver)

My current version is 3.17.13.1, so it is clearly affected by this. Yet the resolution says to use a driver later than 3.16.11.10, which mine already is, so it does not make sense.

One workaround suggested was to downgrade the driver to 3.15.5.5. I have a BIOS/firmware update to do on my server, so if it is still having issues after that I may try the downgrade.
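For anyone following along, here is a sketch of checking which Mellanox driver VIB is installed before attempting a downgrade. The VIB name and datastore path below are made-up examples, and the guard just prints a note if it is run anywhere other than the ESXi host:

```shell
# Check the installed Mellanox driver VIB before deciding on a downgrade.
# NOTE: the install path below is a hypothetical example, not a real file.
if command -v esxcli >/dev/null 2>&1; then
  esxcli software vib list | grep -i nmlx
  # A downgrade would then look something like this (after copying the
  # older driver VIB to a datastore), followed by a host reboot:
  # esxcli software vib install -v /vmfs/volumes/datastore1/nmlx4_en-3.15.5.5.vib
else
  echo "esxcli not found: run this on the ESXi host"
fi
```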

Link to the knowledge base article: https://kb.vmware.com/s/article/60421?lang=en_US


0 Kudos
11 Replies
ashishsingh1508
Enthusiast

This requires the logs to be checked.

This is not a NIC capacity issue.

Could you please check the NIC stats:

esxcli network nic stats get -n vmnicX

Ashish Singh VCP-6.5, VCP-NV 6, VCIX-6,VCIX-6.5, vCAP-DCV, vCAP-DCD
0 Kudos
T180985
Expert

As above, you will need to check the vmkernel and hostd logs to see if anything stands out when this occurs. You may also want to check that you have the correct VIB installed for the card; I assume you have checked the HCL?
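For reference, the logs live under /var/log on the host (vmkernel.log and hostd.log), and a quick way to narrow them down is to grep for the driver name and vmnic. The sketch below runs against a sample file, since the vmnic number and log contents will differ on your host:

```shell
# Sketch: on the ESXi host you would point grep at /var/log/vmkernel.log
# and /var/log/hostd.log; this demo greps a sample file instead.
cat <<'EOF' > /tmp/vmkernel.sample
2019-12-30T19:44:29.367Z cpu4:2097256)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_RxQAlloc - RX queue 1 is allocated
2019-12-30T19:44:30.000Z cpu1:2097001) unrelated subsystem message
EOF
# Pull out only the Mellanox driver messages for the affected uplink:
grep 'nmlx4_en: vmnic4' /tmp/vmkernel.sample
```

On the host itself the equivalent would be `grep 'nmlx4_en' /var/log/vmkernel.log`.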

Please mark helpful or correct if my answer resolved your issue. How to post effectively on VMTN https://communities.vmware.com/people/daphnissov/blog/2018/12/05/how-to-ask-for-help-on-tech-forums
0 Kudos
blazilla
Enthusiast

Can you move the Management VMkernel Port to a 1 GbE NIC and test this again? Maybe it's a driver or firmware-related issue.

Best regards Patrick https://www.vcloudnine.de
0 Kudos
veldthui
Enthusiast

Can you move the Management VMkernel Port to a 1 GbE NIC and test this again? Maybe it's a driver or firmware-related issue.

I have both a 1GbE and a 10GbE management connection. The 1GbE works perfectly but is slow when copying the backups, which is why I want the 10GbE connection working. It is only a small number of VMs, but I still don't like it taking so long because I start the backup manually. The 10GbE connection locks up. I had two 10GbE connections configured for failover, but it did not fail over. Removing the failover allowed the network to run again until the next backup, when the remaining connection locked up.
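For reference, the teaming/failover policy can be checked from the shell like this (a sketch: "vSwitch1" is just a placeholder for whichever vSwitch carries the 10GbE uplinks, and the guard skips it when run off-host):

```shell
# Inspect the NIC teaming / failover policy on a standard vSwitch.
# "vSwitch1" is a placeholder; adjust to the vSwitch with the 10GbE uplinks.
if command -v esxcli >/dev/null 2>&1; then
  esxcli network vswitch standard list
  esxcli network vswitch standard policy failover get -v vSwitch1
else
  echo "esxcli not found: run this on the ESXi host"
fi
```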

0 Kudos
veldthui
Enthusiast

As above, you will need to check the vmkernel & hostd logs to see if anything stands out when this occurs. You may also want to check you have the correct VIB installed for the card, i assume you have checked the HCL?

I am a Linux noob, and an ESXi one as well, but I will see if I can track them down.

0 Kudos
blazilla
Enthusiast

Check whether you're using the latest driver and firmware for your NICs. This sounds like a driver/firmware issue.

Best regards Patrick https://www.vcloudnine.de
0 Kudos
veldthui
Enthusiast

Could you please check the NIC stats

esxcli network nic stats get -n vmnicX

Here it is. This was taken just after a backup, while the port was not responding.

NIC statistics for vmnic4
   Packets received: 28963399
   Packets sent: 87516039
   Bytes received: 29253117297
   Bytes sent: 121131004446
   Receive packets dropped: 0
   Transmit packets dropped: 0
   Multicast packets received: 811274
   Broadcast packets received: 0
   Multicast packets sent: 0
   Broadcast packets sent: 0
   Total receive errors: 0
   Receive length errors: 0
   Receive over errors: 0
   Receive CRC errors: 0
   Receive frame errors: 0
   Receive FIFO errors: 0
   Receive missed errors: 0
   Total transmit errors: 0
   Transmit aborted errors: 0
   Transmit carrier errors: 0
   Transmit FIFO errors: 0
   Transmit heartbeat errors: 0
   Transmit window errors: 0

0 Kudos
veldthui
Enthusiast

As above, you will need to check the vmkernel & hostd logs to see if anything stands out when this occurs. You may also want to check you have the correct VIB installed for the card, i assume you have checked the HCL?

I have just used the Lenovo-specific ISO to install, so I would assume it has the correct driver for its own network card. Not sure what you mean by the HCL?

I ran the backup, and after the lockup checked the vmkernel log; the only reference to vmnic4 (the connection) is

2019-12-30T19:44:29.367Z cpu4:2097256)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_RxQAlloc - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:628) RX queue 1 is allocated

2019-12-30T19:44:29.388Z cpu4:2097256)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2114) MAC RX filter (class 1) at index 0 is applied on

2019-12-30T19:44:29.388Z cpu4:2097256)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2121) RX ring 1, QP[0x49], Mac address 00:50:56:68:f0:d1

and a bit further down

2019-12-30T20:00:29.401Z cpu0:2097256)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2294) MAC RX filter (class 1) at index 0 is removed from

2019-12-30T20:00:29.401Z cpu0:2097256)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2301) RX ring 1, QP[0x49], Mac address 00:50:56:68:f0:d1

2019-12-30T20:00:29.401Z cpu0:2097256)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_RxQFree - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:789) RX queue 1 is freed

The only thing in hostd that looks unusual is

2019-12-30T20:14:15.688Z info hostd[2099523] [Originator@6876 sub=Libs opID=5c5015c6] NetstackInstanceImpl: congestion control algorithm: newreno

2019-12-30T20:14:17.566Z info hostd[2098897] [Originator@6876 sub=Vimsvc.TaskManager opID=sps-Main-65991-131-73-3a-15dc user=vpxuser:JVNET.LOCAL\vpxd-extension-fb0a6774-4559-4040-a1e8-ed7b2a05ec83] Task Created : haTask--vim.vslm.host.CatalogSyncManager.queryCatalogChange-539196836

2019-12-30T20:14:17.567Z info hostd[2099521] [Originator@6876 sub=Default opID=sps-Main-65991-131-73-3a-15dc user=vpxuser:JVNET.LOCAL\vpxd-extension-fb0a6774-4559-4040-a1e8-ed7b2a05ec83] Transfer to exception eraro code: 403, message:

2019-12-30T20:14:17.568Z info hostd[2099521] [Originator@6876 sub=Default opID=sps-Main-65991-131-73-3a-15dc user=vpxuser:JVNET.LOCAL\vpxd-extension-fb0a6774-4559-4040-a1e8-ed7b2a05ec83] AdapterServer caught exception: N3Vim5Fault8NotFound9ExceptionE(Fault cause: vim.fault.NotFound

--> )

0 Kudos
veldthui
Enthusiast

I just did an esxcli network nic get -n vmnic4 and the output is below. It says Pause RX: true and Pause TX: true.

Does this mean that something has paused the NIC, and if so, how do I unpause it?

[root@esxi67:~] esxcli network nic get -n vmnic4
   Advertised Auto Negotiation: true
   Advertised Link Modes: 1000None/Half, 1000None/Full, 10000None/Half, 10000None/Full, 40000None/Half, 40000None/Full, Auto
   Auto Negotiation: false
   Cable Type:
   Current Message Level: -1
   Driver Info:
         Bus Info: 0000:06:00:0
         Driver: nmlx4_en
         Firmware Version: 2.11.500
         Version: 3.17.13.1
   Link Detected: true
   Link Status: Up by explicit linkSet
   Name: vmnic4
   PHYAddress: 0
   Pause Autonegotiate: false
   Pause RX: true
   Pause TX: true
   Supported Ports:
   Supports Auto Negotiation: true
   Supports Pause: true
   Supports Wakeon: false
   Transceiver: external
   Virtual Address: 00:50:56:5b:ad:25
   Wakeon: None
[root@esxi67:~]
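From what I can find, the pause (flow control) settings can be listed, and apparently changed, per NIC with esxcli. I have not tried whether turning them off helps, and the exact flags are an assumption (check esxcli network nic pauseParams set --help on the host):

```shell
# List current flow-control (pause) settings for all NICs, plus the
# (assumed) syntax to turn pause off on one uplink. Run on the ESXi host.
if command -v esxcli >/dev/null 2>&1; then
  esxcli network nic pauseParams list
  # Assumed syntax to disable RX/TX pause on vmnic4 (untested here):
  # esxcli network nic pauseParams set -n vmnic4 -r false -t false
else
  echo "esxcli not found: run this on the ESXi host"
fi
```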

0 Kudos
veldthui
Enthusiast

In light of the issue, which looks like it has been going on for a while with no fix, I am going to swap the network card for one that uses an Intel chipset.

My HPE uses a 10GbE Intel chipset and is not having any issues, so I will swap the card out.

0 Kudos