11 Replies Latest reply on Dec 30, 2019 8:55 PM by veldthui

    10GB network port locks after Veeam backup

    veldthui Novice

      I have a 1GbE network connection on my Lenovo X3650 M5 server and was originally backing up a couple of VMs using the Community Edition of Veeam 9.5 Update 4b. This works perfectly. I have exactly the same setup on an HPE DL360 Gen9, and it also works perfectly.

       

      I upgraded the network on both and put in a Mellanox 10GbE card, then set up a management port on the 10GbE connection. All fine so far.

       

      When I run the backup using Veeam over the 10GbE management network on the HPE, it works fine and backs up the VMs. However, when I do the same on the Lenovo, it completes the backup and then the IP stops responding. Can't ping it, can't get into the UI, nothing. Connecting over the 1GbE management network and checking the network settings, everything looks okay, but to actually get it working again I have to shut down and restart ESXi.

       

      The ESXi build is the Lenovo custom image, 6.7 Update 3, but it has done this with Update 1 and Update 2 as well.

      If I use the 1GbE management network for the backup, everything is fine.

       

      I deleted the 10GbE network configuration from ESXi and recreated it, and the issue is the same.

       

      Any ideas on what may be happening would be appreciated.

        • 1. Re: 10GB network port locks after Veeam backup
          ashishsingh1508 Enthusiast
          vExpert

          This requires logs to be checked.

           

          This is not a NIC capacity issue.

           

          Could you please check the NIC stats:

           

          esxcli network nic stats get -n vmnicX

          • 2. Re: 10GB network port locks after Veeam backup
            T180985 Expert

            As above, you will need to check the vmkernel and hostd logs to see if anything stands out when this occurs. You may also want to check that you have the correct VIB installed for the card. I assume you have checked the HCL?

            • 3. Re: 10GB network port locks after Veeam backup
              blazilla Enthusiast
              vExpert

              Can you move the Management VMkernel Port to a 1 GbE NIC and test this again? Maybe it's a driver or firmware-related issue.

              Best regards
              Patrick

              https://www.vcloudnine.de
              • 4. Re: 10GB network port locks after Veeam backup
                veldthui Novice

                blazilla wrote: Can you move the Management VMkernel Port to a 1 GbE NIC and test this again? Maybe it's a driver or firmware-related issue.

                I have both a 1GbE and a 10GbE management network. The 1GbE one works perfectly but is slow when copying the backups, which is why I want the 10GbE connection working. It is only a small number of VMs, but I still don't like it taking so long because I start the backup manually. The 10GbE connection locks up. I had two 10GbE connections set up for failover, but it did not fail over. Removing the failover configuration allowed the network to run again until the next backup, and then the remaining connection locked up.

                • 5. Re: 10GB network port locks after Veeam backup
                  veldthui Novice

                  T180985 wrote: As above, you will need to check the vmkernel & hostd logs to see if anything stands out when this occurs. You may also want to check you have the correct VIB installed for the card, I assume you have checked the HCL?

                  I am a noob with both Linux and ESXi, but I will see if I can track them down.

                  • 6. Re: 10GB network port locks after Veeam backup
                    blazilla Enthusiast
                    vExpert

                    Check if you're using the latest driver and firmware for your NICs. This sounds like a driver/firmware issue.

                    Best regards
                    Patrick

                    https://www.vcloudnine.de
                    • 7. Re: 10GB network port locks after Veeam backup
                      veldthui Novice

                      ashishsingh1508 wrote: Could you please check the NIC stats

                       

                      esxcli network nic stats get -n vmnicX

                      Here it is. This was just after a backup and the port was not responding.

                       

                      NIC statistics for vmnic4

                         Packets received: 28963399

                         Packets sent: 87516039

                         Bytes received: 29253117297

                         Bytes sent: 121131004446

                         Receive packets dropped: 0

                         Transmit packets dropped: 0

                         Multicast packets received: 811274

                         Broadcast packets received: 0

                         Multicast packets sent: 0

                         Broadcast packets sent: 0

                         Total receive errors: 0

                         Receive length errors: 0

                         Receive over errors: 0

                         Receive CRC errors: 0

                         Receive frame errors: 0

                         Receive FIFO errors: 0

                         Receive missed errors: 0

                         Total transmit errors: 0

                         Transmit aborted errors: 0

                         Transmit carrier errors: 0

                         Transmit FIFO errors: 0

                         Transmit heartbeat errors: 0

                         Transmit window errors: 0
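
A minimal sketch for reading this kind of output, assuming the text of `esxcli network nic stats get` has been saved or pasted into a string. The helper names `parse_nic_stats` and `nonzero_problem_counters` are made up for illustration. It collects the counters and reports any error or drop counter that is not zero; on the figures above nothing is flagged, which suggests the lockup is not showing up as link-level errors or drops.

```python
import re


def parse_nic_stats(text):
    """Parse 'Name: value' counter lines from esxcli NIC stats output into a dict."""
    stats = {}
    for line in text.splitlines():
        # Only lines of the form "<name>: <integer>" are counters;
        # the "NIC statistics for vmnic4" header has no colon and is skipped.
        m = re.match(r"\s*([^:]+):\s*(\d+)\s*$", line)
        if m:
            stats[m.group(1).strip()] = int(m.group(2))
    return stats


def nonzero_problem_counters(stats):
    """Return counters whose name mentions errors or drops and whose value is not zero."""
    return {
        name: value
        for name, value in stats.items()
        if ("error" in name.lower() or "dropped" in name.lower()) and value != 0
    }


# A subset of the output posted above (hypothetical variable name).
sample = """NIC statistics for vmnic4
   Packets received: 28963399
   Packets sent: 87516039
   Receive packets dropped: 0
   Transmit packets dropped: 0
   Total receive errors: 0
   Total transmit errors: 0
"""

stats = parse_nic_stats(sample)
print(nonzero_problem_counters(stats))  # empty dict: no error or drop counter is set
```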

                      • 8. Re: 10GB network port locks after Veeam backup
                        veldthui Novice

                        T180985 wrote: As above, you will need to check the vmkernel & hostd logs to see if anything stands out when this occurs. You may also want to check you have the correct VIB installed for the card, I assume you have checked the HCL?

                        I used the Lenovo-specific ISO to install, so I would assume it has the correct driver for its own network card. Not sure what you mean about the HCL?

                         

                        I ran the backup and, after the lockup, checked the vmkernel log. The only reference to vmnic4 (the 10GbE connection) is

                         

                        2019-12-30T19:44:29.367Z cpu4:2097256)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_RxQAlloc - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:628) RX queue 1 is allocated

                        2019-12-30T19:44:29.388Z cpu4:2097256)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2114) MAC RX filter (class 1) at index 0 is applied on

                        2019-12-30T19:44:29.388Z cpu4:2097256)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_QueueApplyFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2121) RX ring 1, QP[0x49], Mac address 00:50:56:68:f0:d1

                         

                        and a bit further down

                         

                        2019-12-30T20:00:29.401Z cpu0:2097256)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2294) MAC RX filter (class 1) at index 0 is removed from

                        2019-12-30T20:00:29.401Z cpu0:2097256)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2301) RX ring 1, QP[0x49], Mac address 00:50:56:68:f0:d1

                        2019-12-30T20:00:29.401Z cpu0:2097256)<NMLX_INF> nmlx4_en: vmnic4: nmlx4_en_RxQFree - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:789) RX queue 1 is freed

                         

                        Only thing in hostd that looks unusual is

                         

                        2019-12-30T20:14:15.688Z info hostd[2099523] [Originator@6876 sub=Libs opID=5c5015c6] NetstackInstanceImpl: congestion control algorithm: newreno

                        2019-12-30T20:14:17.566Z info hostd[2098897] [Originator@6876 sub=Vimsvc.TaskManager opID=sps-Main-65991-131-73-3a-15dc user=vpxuser:JVNET.LOCAL\vpxd-extension-fb0a6774-4559-4040-a1e8-ed7b2a05ec83] Task Created : haTask--vim.vslm.host.CatalogSyncManager.queryCatalogChange-539196836

                        2019-12-30T20:14:17.567Z info hostd[2099521] [Originator@6876 sub=Default opID=sps-Main-65991-131-73-3a-15dc user=vpxuser:JVNET.LOCAL\vpxd-extension-fb0a6774-4559-4040-a1e8-ed7b2a05ec83] Transfer to exception eraro code: 403, message:

                        2019-12-30T20:14:17.568Z info hostd[2099521] [Originator@6876 sub=Default opID=sps-Main-65991-131-73-3a-15dc user=vpxuser:JVNET.LOCAL\vpxd-extension-fb0a6774-4559-4040-a1e8-ed7b2a05ec83] AdapterServer caught exception: N3Vim5Fault8NotFound9ExceptionE(Fault cause: vim.fault.NotFound

                        --> )

                        • 9. Re: 10GB network port locks after Veeam backup
                          veldthui Novice

                          I just ran esxcli network nic get -n vmnic4 and the output is below. It says Pause RX: true and Pause TX: true.

                           

                          Does this mean that something has paused the NIC and if so how do I unpause it?

                           

                          [root@esxi67:~] esxcli network nic get -n vmnic4

                             Advertised Auto Negotiation: true

                             Advertised Link Modes: 1000None/Half, 1000None/Full, 10000None/Half, 10000None/Full, 40000None/Half, 40000None/Full, Auto

                             Auto Negotiation: false

                             Cable Type:

                             Current Message Level: -1

                             Driver Info:

                                   Bus Info: 0000:06:00:0

                                   Driver: nmlx4_en

                                   Firmware Version: 2.11.500

                                   Version: 3.17.13.1

                             Link Detected: true

                             Link Status: Up by explicit linkSet

                             Name: vmnic4

                             PHYAddress: 0

                             Pause Autonegotiate: false

                             Pause RX: true

                             Pause TX: true

                             Supported Ports:

                             Supports Auto Negotiation: true

                             Supports Pause: true

                             Supports Wakeon: false

                             Transceiver: external

                             Virtual Address: 00:50:56:5b:ad:25

                             Wakeon: None

                          [root@esxi67:~]
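
On the Pause RX / Pause TX question: as far as I know, these fields report the port's Ethernet flow-control (pause frame) settings rather than an administrative pause of the NIC, so on their own they are not evidence of the lockup. The more useful fields here are the driver name, driver version and firmware version. Below is a minimal sketch, assuming the `esxcli network nic get` output has been captured as text, that pulls those fields out. The function name `parse_nic_get` and the `Driver Info.<field>` key convention are assumptions for illustration.

```python
def parse_nic_get(text):
    """Flatten 'Key: value' lines into a dict.

    Fields indented under a header with no value (for example 'Driver Info:')
    are stored as '<header>.<field>'.
    """
    info = {}
    section, section_indent = None, -1
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, shell prompts and anything without a colon.
        if not line or line.startswith("[") or ":" not in line:
            continue
        indent = len(raw) - len(raw.lstrip())
        key, _, value = line.partition(":")  # split on the first colon only
        key, value = key.strip(), value.strip()
        if section is not None and indent <= section_indent:
            section = None  # back at the outer level, so the nested block ended
        if value == "":
            # Either a section header or an empty field; remember it either way.
            section, section_indent = key, indent
            info[key] = ""
            continue
        info[f"{section}.{key}" if section else key] = value
    return info


# A subset of the output posted above (hypothetical variable name).
sample = """\
   Cable Type:
   Current Message Level: -1
   Driver Info:
         Bus Info: 0000:06:00:0
         Driver: nmlx4_en
         Firmware Version: 2.11.500
         Version: 3.17.13.1
   Link Detected: true
   Pause RX: true
   Pause TX: true
"""

nic = parse_nic_get(sample)
print(nic["Driver Info.Driver"], nic["Driver Info.Version"], nic["Driver Info.Firmware Version"])
print("Flow control RX/TX:", nic["Pause RX"], nic["Pause TX"])
```

The driver name and version are what need to be matched against the compatibility list and any vendor advisories.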

                          • 10. Re: 10GB network port locks after Veeam backup
                            veldthui Novice

                            Okay, it looks like it is a driver issue. After much more searching and reading I found a knowledge base article describing the issue, last updated 21 October. It says:

                            Symptoms

                             

                             

                            • An ESXi host is experiencing full traffic loss
                            • All Virtual Machine traffic using a Mellanox adapter stops
                            • Mellanox adapter driver is in use nmlx4_en 3.15.11.6 and 3.16.11.6 and 3.17.13.1
                            • Traffic is not passing over a Mellanox adapter but the link status shows as active
                            • Both the vmkernel and VMs go unresponsive on the network.
                            • Network Card MT27500 Family [ConnectX-3 and ConnectX-3 Pro Devices]

                             

                            Cause

                            This is a driver related issue.

                             

                            Impact / Risks

                            All network traffic can be lost when using this adapter and driver combination.

                             

                            Resolution

                             

                            This issue is resolved in later versions of the driver.
                            nmlx4_en 3.15.11.10 (6.0 driver)
                            nmlx4_en 3.16.11.10 (6.5 driver) or new releases (6.7 driver)

                             

                            My current version is 3.17.13.1, which is on the affected list. However, the resolution says to use a driver later than 3.16.11.10, which 3.17.13.1 already is, so it does not make sense.
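
The mismatch can be checked mechanically. This is a minimal sketch, assuming dotted driver versions are compared field by field as integers; the names `parse_version`, `AFFECTED` and `FIXED` are made up for illustration, and the version numbers are the ones quoted from the KB article. One possible reading, which is only an assumption, is that each ESXi release has its own driver branch (3.15.x for 6.0, 3.16.x for 6.5, 3.17.x for 6.7), in which case being numerically newer than the 6.5 fix would not by itself mean the 6.7 driver contains it.

```python
def parse_version(version):
    """Turn '3.17.13.1' into (3, 17, 13, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


# Values quoted from the KB article: "Symptoms" and "Resolution" sections.
AFFECTED = ["3.15.11.6", "3.16.11.6", "3.17.13.1"]
FIXED = {"6.0": "3.15.11.10", "6.5": "3.16.11.10"}

installed = "3.17.13.1"  # driver version reported by esxcli on this host

listed_as_affected = installed in AFFECTED
newer_than_6_5_fix = parse_version(installed) > parse_version(FIXED["6.5"])

# Both are True at once, which is exactly the contradiction in the article.
print(listed_as_affected, newer_than_6_5_fix)

# Comparing the raw strings would be misleading: "10" sorts before "6" as text.
print("3.16.11.10" > "3.16.11.6", parse_version("3.16.11.10") > parse_version("3.16.11.6"))
```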

                             

                            One workaround suggested was to downgrade the driver to 3.15.5.5. I have a BIOS/firmware update to do on my server, so if it is still having issues after that I may try the downgrade.

                             

                            The link to the knowledge base article is https://kb.vmware.com/s/article/60421?lang=en_US

                            • 11. Re: 10GB network port locks after Veeam backup
                              veldthui Novice

                              In light of the issue, which looks like it has been going on for a while with no fix, I am going to replace the network card with one that uses an Intel chipset.

                              My HPE uses a 10GbE Intel chipset and is not having any issues, so that is the route I will take.