ESXi 6.5 connectivity issue on PowerEdge R430

Kennylcf · 02-27-2017

Dear,

We are encountering a connectivity issue on a newly installed ESXi 6.5 host on a PE R430.

Whenever we perform a file copy (30-40 GB in size) from one of the VM guests to another machine over the network (100 Mbps), it ends with a connectivity issue.

Both the VM guest and the host become inaccessible over the network.

The ESX console can still be accessed. We tried "Restart Management Network" and ran "Test Management Network", which reported OK for the default gateway and other machines.

But we are unable to access either the guest or the host over the network.

The connection returns to normal after a shutdown / restart of the ESX host.

The VM guest OS event log indicates that the guest was still running during the issue and followed the auto shutdown/startup setting of the ESX host.

Looking into the log, we found the following warning repeated while the copy task was running.

cpu## :67701)WARNING: ntg3-throttled: Ntg3XmitPktList:372: vmnic0:TX ring full (0)

I wonder if anyone has a suggestion on this issue? Thanks.

ganeshgv · 02-28-2017

Hi Kennylcf,

I found some information about this issue on VMware vSphere 6.5 Release Notes.

Network becomes unavailable with full passthrough devices
If a native ntg3 driver is used on a passthrough Broadcom Gigabit Ethernet Adapter, the network connection will become unavailable.

Workaround:

Run the ntg3 driver in legacy mode:
1. Run the esxcli system module parameters set -m ntg3 -p intrMode=0 command.
2. Reboot the host.
Use the tg3 vmklinux driver as the default driver, instead of the native ntg3 driver.

Please find the VMware vSphere 6.5 Release Notes link below for your reference.

VMware vSphere 6.5 Release Notes

Regards,

Ganesh GV

Kennylcf · 02-28-2017

Thanks Ganesh,

It is likely that the situation described in the notes does not fully match ours, as we haven't configured passthrough on the Ethernet adapter.

I will try it anyway in the next few days and let you know the result. Thanks a lot.

Kenny

chnb · 03-01-2017

Hi Kennylcf,

I am an engineer involved in the ntg3 NIC driver development, and I'd like to understand and help you resolve this issue.

The message "vmnic0:TX ring full" means the NIC was asked to send traffic faster than the hardware can handle. This is sometimes observed during high TX (send) load situations, such as running network benchmarks, or if the link speed is low (100M/10M) relative to the load, and does not necessarily indicate a problem.
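To see how often the warning appears and on which uplink, the kernel log can be tallied offline. The following is a minimal sketch, assuming the log lines look like the warning quoted in this thread; the function name is illustrative, and because the message is tagged "ntg3-throttled" the count is probably a lower bound rather than an exact number of events.

```python
import re
from collections import Counter

# Matches the driver warning quoted in this thread, e.g.
# "... Ntg3XmitPktList:372: vmnic0:TX ring full (0)"
TX_RING_FULL = re.compile(r"(vmnic\d+):TX ring full")


def count_tx_ring_full(lines):
    """Return a Counter mapping vmnic name -> number of 'TX ring full' warnings."""
    counts = Counter()
    for line in lines:
        match = TX_RING_FULL.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Small demo using the two warning lines that appear in this thread.
demo = [
    "cpu## :67701)WARNING: ntg3-throttled: Ntg3XmitPktList:372: vmnic0:TX ring full (0)",
    "2017-03-23T09:56:01.485Z cpu17:70563)WARNING: ntg3-throttled: "
    "Ntg3XmitPktList:372: vmnic3:TX ring full (0)",
]
for nic, hits in sorted(count_tx_ring_full(demo).items()):
    print(f"{nic}: {hits}")
```

To run it against a real log, pass an open file object (for example a copy of /scratch/log/vmkernel.log) instead of the demo list.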

You reported a loss of network connectivity to guest and host. Can you describe the host's network configuration or share the results of the following commands in the ESXi console: "esxcfg-nics -l" and "esxcfg-vswitch -l"? What method did you use to copy the 30GB-40GB files (e.g. scp or NFS or some other protocols)? And what commands / steps did you use to determine the loss of connectivity?

Are you willing to share the kernel log (/scratch/log/vmkernel.log) around the time connectivity was lost? And if the issue is easily reproducible, could you increase the driver's log level, with command "vsish -e set /system/modules/ntg3/loglevels/ntg3 1", reproduce the problem, then share the kernel log?

Feel free to PM me regarding this issue.

Thanks,

Bo

Kennylcf · 03-01-2017

Thanks Chnb,

Please find the vmkernel log attached.

#### esxcfg-nics -l

Name    PCI           Driver  Link  Speed    Duplex  MAC Address        MTU   Description
vmnic0  0000:02:00.0  ntg3    Up    100Mbps  Full    18:66:da:8a:a5:49  1500  Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic1  0000:02:00.1  ntg3    Up    100Mbps  Full    18:66:da:8a:a5:4a  1500  Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic2  0000:03:00.0  ntg3    Down  0Mbps    Half    18:66:da:8a:a5:4b  1500  Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic3  0000:03:00.1  ntg3    Down  0Mbps    Half    18:66:da:8a:a5:4c  1500  Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet

#### esxcfg-vswitch -l

Switch Name  Num Ports  Used Ports  Configured Ports  MTU   Uplinks
vSwitch0     3072       5           128               1500  vmnic0

  PortGroup Name      VLAN ID  Used Ports  Uplinks
  VM Network          0        1           vmnic0
  Management Network  0        1           vmnic0

Switch Name  Num Ports  Used Ports  Configured Ports  MTU   Uplinks
vSwitch1     3072       4           1024              1500  vmnic1

  PortGroup Name        VLAN ID  Used Ports  Uplinks
  Management Network 2  0        1           vmnic1

### What method did you use to copy the 30GB-40GB files?

In our guest OS (Windows 2008), there is a shared folder that another Windows 2008 machine (physical) accesses daily.

Files are copied from the VM guest to the physical machine daily, starting at 2 am (HK time).

### what commands / steps did you use to determine the loss of connectivity?

ping commands from windows 2008 / windows 7 physical machine

1) ping ip address of vm guest os and esx console: failed

2) ping ip address of default gateway: OK

3) ping ip address of desktop: OK

4) ping ip address of vm guest os and esx console, after moving the esx LAN cable to another port on the switch: failed

5) ping ip address of vm guest os and esx console, with the esx LAN cable plugged directly into a laptop: failed

chnb · 03-02-2017

Thanks for the detailed info.

According to the kernel log, from 02-16 to 02-15, every day shortly after 18:00Z (i.e. 2am HK time, as you mentioned) there's a period of heavy network traffic. Two reboots occurred in the period: around 2017-02-22T04Z and 2017-02-25T03Z. Did the losses of connectivity occur sometime before each of the two reboots, or did they occur every day during or after the file copy? From the log I would assume the former, but if not, how was connectivity restored on those days without a reboot?
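Since vmkernel.log timestamps are in UTC (the trailing "Z"), it can help to convert them to local time when matching log entries against the 2am copy job. A minimal sketch, assuming the timestamp format seen in the log excerpts in this thread; the function name and the sample timestamp are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Hong Kong time is UTC+8 with no daylight saving, so a fixed offset suffices.
HKT = timezone(timedelta(hours=8), name="HKT")


def vmkernel_utc_to_hkt(stamp):
    """Convert a vmkernel.log timestamp such as '2017-03-23T09:45:50.596Z' to HKT."""
    utc = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    return utc.astimezone(HKT)


# 18:00Z falls on 02:00 of the following day in Hong Kong.
print(vmkernel_utc_to_hkt("2017-02-21T18:00:00.000Z").isoformat())
# prints 2017-02-22T02:00:00+08:00
```

Note that the calendar date shifts by one day for anything logged at or after 16:00Z, which is easy to overlook when lining up log dates with local dates.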

Also, although probably not the cause of your issue, it's worth noting that by default the NIC will auto-negotiate for link speed/duplex, and in this case 100M/Full was the negotiated speed rather than the forced / manually set speed. Was the speed on the switch port set to forced 100M/Full or auto-negotiation? If the switch port speed is forced 100M/Full and the NIC is set to auto-negotiate, there could be a duplex mismatch (see the Wikipedia article "Duplex mismatch"), which could severely impact Ethernet performance. To avoid it, one should use either auto-negotiation or the same forced speed on both sides.
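One quick check for this is to look at the Duplex column of "esxcfg-nics -l": an auto-negotiating port facing a forced peer typically ends up at half duplex. A minimal sketch, assuming the column order shown in the output posted in this thread (Name, PCI, Driver, Link, Speed, Duplex, ...); the helper names are illustrative.

```python
from typing import List, NamedTuple


class Nic(NamedTuple):
    name: str
    driver: str
    link: str
    speed: str
    duplex: str


def parse_esxcfg_nics(text: str) -> List[Nic]:
    """Parse `esxcfg-nics -l` rows, assuming the column order posted in this thread."""
    nics = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and blank lines; data rows start with "vmnic".
        if len(fields) < 6 or not fields[0].startswith("vmnic"):
            continue
        nics.append(Nic(fields[0], fields[2], fields[3], fields[4], fields[5]))
    return nics


def half_duplex_links(nics):
    """Return names of NICs whose link is up but running at half duplex."""
    return [n.name for n in nics if n.link == "Up" and n.duplex == "Half"]


SAMPLE = """\
Name PCI Driver Link Speed Duplex MAC Address MTU Description
vmnic0 0000:02:00.0 ntg3 Up 100Mbps Full 18:66:da:8a:a5:49 1500 Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic1 0000:02:00.1 ntg3 Up 100Mbps Full 18:66:da:8a:a5:4a 1500 Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic2 0000:03:00.0 ntg3 Down 0Mbps Half 18:66:da:8a:a5:4b 1500 Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic3 0000:03:00.1 ntg3 Down 0Mbps Half 18:66:da:8a:a5:4c 1500 Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
"""

print(half_duplex_links(parse_esxcfg_nics(SAMPLE)))  # prints []
```

For the output posted above, nothing is flagged: both active uplinks report Full, and the two Half entries belong to links that are down.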

Thank you,

Bo

Kennylcf · 03-02-2017

Hi, chnb

Right, there were two reboots, done after we found the connectivity losses. A system reboot seems to be our only choice to bring it back to normal.

The copy task began on 16 Feb, and the first issue was encountered on 22 Feb and then again on 25 Feb.

The last issue was encountered on 26 Feb, and we rebooted on the morning of 27 Feb with the copy task disabled.

The system has kept running without issue since then.

It seems the issue came more frequently, and the system won't return to normal until a reboot, even after the copy task from the other side has ended.

The copy task on 26 Feb should have ended after the connectivity loss.

From 26 Feb to 27 Feb, the system / network should have been idle, but the connection did not recover until the reboot.

We are using an unmanaged 10/100 switch, so port speed / mode should be set by auto-negotiation.

Thanks,

Kenny

chnb · 03-02-2017

Hi Kenny,

Would you mind running the command "vsish -e set /system/modules/ntg3/loglevels/ntg3 1"? This would increase logging of the driver so we could know a lot more the next time the issue occurs.

And when it occurs again, could you try the following:

(1) Run command "vsish -e get /net/pNics/vmnic0/stats", and save the result.

(2) Try pinging the host, and also try pinging another machine from the host, then run the above command again and save the result.

After you have gathered the above info, instead of doing a reboot, you might try the following command and see if it restores connectivity:

(3) vmkload_mod -u ntg3; vmkload_mod ntg3; kill -HUP $(pidof vmkdevmgr)

The results of the commands and the kernel logs (starting from sometime before the loss of connectivity) should greatly help us diagnose the issue, and then we should be able to provide you with a resolution or at least a workaround.
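Once the two statistics snapshots from steps (1) and (2) are saved, comparing them shows which counters moved while the pings were attempted. A minimal sketch, assuming the saved output contains lines of the form "name:number"; the exact layout of the vsish stats node is not shown in this thread, and the counter names in the sample below are made up for illustration.

```python
import re

# Assumed line shape: "<counter name>:<integer>", with optional surrounding spaces.
COUNTER = re.compile(r"^\s*(.+?)\s*:\s*(\d+)\s*$")


def parse_counters(text):
    """Collect every 'name:number' line into a dict; other lines are ignored."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def counter_deltas(before_text, after_text):
    """Return {name: after - before} for counters present in both snapshots."""
    before = parse_counters(before_text)
    after = parse_counters(after_text)
    return {name: after[name] - before[name] for name in before if name in after}


# Hypothetical snapshots; real counter names may differ.
first = "Tx Packets:1000\nRx Packets:2000\nTx Errors:0"
second = "Tx Packets:1000\nRx Packets:2050\nTx Errors:0"

for name, delta in counter_deltas(first, second).items():
    print(f"{name}: {delta:+d}")
```

In a comparison like this, a transmit counter that stays flat while pings are being sent from the host would be consistent with a stalled TX path, though that reading is an interpretation and not something the thread confirms.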

Thank you very much for your patience.

Bo

regnor · 03-17-2017

Any updates on this topic? We're experiencing similar problems with BCM5719 NICs and I'll log a (HPE) case in the next hours.

chnb · 03-19-2017

Hi Regnor,

Could you describe your problem in more detail, and provide the kernel log? We recently diagnosed a potentially related issue with the ntg3 driver, and its fix is currently under testing. With more information we could tell if the issue you experienced is the same one and suggest a solution.

Thanks,

Bo

regnor · 03-20-2017

Hi Bo,

We've migrated a host from 6.0 to 6.5 on a DL380 Gen9 with BCM5719.

Everything works fine, except that if we move a single VM to the host, two NICs become unusable.

vmnic1:TX ring full (0)

vmnic6:TX ring full (0)

The NICs still have a network link but the connection is broken. If we move the VM back to a 6.0 host the NICs get online again.

We were able to reproduce this problem on two different hosts, and sometimes it's also caused by other VMs connected to the same vSwitch; until now only vmnic1 and vmnic6 were affected.

Can you give me an email address where I can send the VMkernel log?

chnb · 03-20-2017

I've made my email address public in my profile. Thanks.

CostinB1 · 03-23-2017

Hello Bo,

We are experiencing the same issues with 3 PowerEdge R730 servers and Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet network cards.

When VMs become unavailable we go in and remove the affected vmnic from the vSwitch to restore connectivity, vMotion all VMs to a separate host and reboot the host to restore the network card. Is the recommendation to "Run the ntg3 driver in legacy mode" still viable and if so what are the limitations that come with this change?

Also, looking through our kernel log I found the following:

2017-03-23T09:45:50.596Z cpu3:67943 opID=c273bc60)Uplink: 9893: enabled port 0x4000026 with mac **removed**

2017-03-23T09:45:50.596Z cpu16:66176)Net: 2524: connected Shadow of vmnic3 to null config, portID 0x4000027

2017-03-23T09:45:50.596Z cpu16:66176)NetPort: 1660: enabled port 0x4000027 with mac **removed**

2017-03-23T09:45:50.780Z cpu11:65645)NetPort: 1879: disabled port 0x4000026

2017-03-23T09:45:50.780Z cpu15:337363)NetSched: 628: vmnic3-0-tx: worldID = 337363 exits

2017-03-23T09:45:50.780Z cpu11:65645)NetLB: 2246: Driver claims supporting 0 RX queues, and 0 queues are accepted.

2017-03-23T09:45:50.780Z cpu11:65645)NetLB: 2242: Driver claims supporting 0 TX queues, and 0 queues are accepted.

2017-03-23T09:45:50.780Z cpu11:65645)Uplink: 9893: enabled port 0x4000026 with mac **removed**

2017-03-23T09:45:50.780Z cpu11:65645)NetPort: 1879: disabled port 0x4000024

2017-03-23T09:45:50.780Z cpu18:337362)NetSched: 628: vmnic2-0-tx: worldID = 337362 exits

2017-03-23T09:45:50.780Z cpu11:65645)NetLB: 2246: Driver claims supporting 0 RX queues, and 0 queues are accepted.

2017-03-23T09:45:50.780Z cpu11:65645)NetLB: 2242: Driver claims supporting 0 TX queues, and 0 queues are accepted.

2017-03-23T09:45:50.780Z cpu11:65645)Uplink: 9893: enabled port 0x4000024 with mac **removed**

2017-03-23T09:47:47.796Z cpu28:70823)Vmxnet3: 17265: Disable Rx queuing; queue size 512 is larger than Vmxnet3RxQueueLimit limit of 64.

2017-03-23T09:47:47.796Z cpu28:70823)Vmxnet3: 17623: Using default queue delivery for vmxnet3 for port 0x400000f

2017-03-23T09:47:47.796Z cpu28:70823)NetPort: 1660: enabled port 0x400000f with mac **removed**

2017-03-23T09:56:01.485Z cpu17:70563)WARNING: ntg3-throttled: Ntg3XmitPktList:372: vmnic3:TX ring full (0)

In this case vmnic3 lost connectivity.

Thank you

Costin

csmith201 · 03-24-2017

I hope there is a solution for this soon since I too am affected.

chnb · 03-28-2017

Hi Costin,

The problem and workaround mentioned in the release notes apply only to ntg3 NICs used in PCI pass-through mode, so they likely don't apply to you or most other users.

We have observed two "types" of loss of connectivity issue with ntg3 NICs. One of them has been fixed (though the fix has not been released yet), while the other is still being worked on. If you could provide the following information, we would be able to identify the type of the problem, and suggest a suitable workaround until the release of the fix. A KB article regarding the issue will likely be created soon.

1. How do you reproduce the problem? In particular, what type of network workload was running before the loss of connectivity? Is the issue reliably reproducible?

2. If you could reproduce the loss of connectivity on a particular ntg3 NIC vmnicX, try the following:

2.1 Run "vsish -e set /system/modules/ntg3/loglevels/ntg3 1" (this will increase logging by the driver)

2.2 Run "esxcli network nic register dump -n vmnicX" (save the output, which is the NIC's HW register dump before the issue occurs)

2.3 Run the workload until connectivity is lost on vmnicX

[If you can run vm-support and provide the support bundle, do that and skip the rest of the steps.]

2.4 Run "vsish -e get /net/pNics/vmnicX/stats" (save the output, which is the NIC statistics)

2.5 Wait for a minute or so, then run 2.2 and 2.4 again.

2.6 Save the vmkernel logs (vmkernel.log, and possibly vmkernel.Y.gz files if vmkernel.log does not contain the logs from the time of the connectivity loss).

In other words, collect the NIC's register dumps before and after the loss of connectivity, the NIC's statistics after the loss of connectivity and again some time later, and the kernel logs. You may share them here or send them to me (see my profile). If you have collected the support bundle, I can provide instructions to send the bundle to VMware (see KB #2070100) if you have not yet filed support with VMware and received an SR number.
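The order of steps 2.1 to 2.6 matters, so it may help to print them as a checklist with the NIC name filled in before starting. A minimal sketch that only builds the command strings quoted above and does not execute anything; the function name and the step labels are illustrative.

```python
def ntg3_collection_plan(nic):
    """Return (label, command) pairs for steps 2.1-2.6; command is None for manual steps."""
    register_dump = f"esxcli network nic register dump -n {nic}"
    nic_stats = f"vsish -e get /net/pNics/{nic}/stats"
    return [
        ("2.1 raise the ntg3 log level",
         "vsish -e set /system/modules/ntg3/loglevels/ntg3 1"),
        ("2.2 register dump before the issue", register_dump),
        ("2.3 run the workload until connectivity is lost", None),
        ("2.4 NIC statistics after the loss", nic_stats),
        ("2.5 wait a minute or so, then register dump again", register_dump),
        ("2.5 NIC statistics again", nic_stats),
        ("2.6 save vmkernel.log and any needed vmkernel.Y.gz files", None),
    ]


# Print the checklist for one uplink; replace "vmnic3" with the affected NIC.
for label, command in ntg3_collection_plan("vmnic3"):
    print(label)
    if command:
        print(f"    {command}")
```

If vm-support can be run after step 2.3, the bracketed note above says to provide the support bundle and skip the remaining steps, so the checklist past that point applies only when a bundle is not collected.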

Thank you very much,

Bo

t40t40 · 03-30-2017

How close are we to a solution?

Would installing an Intel network card help?

chnb · 03-30-2017

Hi t40t40,

The problem is associated with the ntg3 driver (used for BCM5717, 5718, 5719 and 5720 NICs), so those NICs using other drivers will not be impacted. It is also affecting only specific types of workload under particular situations - the vast majority of hosts using the driver would likely never experience the issue. The problem is being actively worked on.

If you believe you have encountered this problem, we'd highly appreciate it if you could provide more information per post #14 of this thread. It will help us solve this issue soon.

Thanks,

Bo

CostinB1 · 04-03-2017

Hello Bo,

Thank you for the steps provided.

I was finally able to reproduce the issue and collected the logs. All the information and logs (with vm-support) have been added to a support ticket opened with VMware: 17424623804.

What I can say regarding the steps to reproduce this is:

1. Disable all but one network card on the physical server.

2. Leave just 2 VMs running on the ESXi host.

3. Using JMeter, create 1000 connections x 100 cycles targeting a website running in IIS on one of the VMs.

4. Almost immediately both servers become unavailable over the network and the ESXi logs show the "watchdog" alerts for the only active vmnicX.

Please let me know if I can provide more details.

Kind Regards,

Costin

fvdw · 04-09-2017

Same issue here with a HP Proliant DL360 Gen8. It seems to only happen when using Youtube from an Apple TV 4. It's hard to reproduce, but just happens from time to time.

chnb · 04-13-2017

Hi fvdw,

Sorry for the delay in response. Do you happen to also run F5 BIG IP Virtual Edition VM on the affected host?

To fvdw and other people who encountered this problem,

We are still working on this issue in collaboration with Broadcom. There has been some progress but we have no ETA of a fix yet. If you have encountered this issue, please contact VMware support and provide a vm-support pack that contains a kernel coredump after your host experienced loss of connectivity. This is done by the following:

(1) Ensure at least one of the ntg3 (BCM5719/BCM5720) NICs is experiencing loss of connectivity.

(2) In ESXi command line, run "vsish -e set /system/debugging/performLiveCoreDump 1".

(3) Run "vm-support", then upload the support bundle.

If the host is not currently experiencing loss of connectivity, but you could reproduce the issue, I suggest you also run "vsish -e set /system/modules/ntg3/loglevels/ntg3 1" which would increase ntg3's loglevel in the kernel logs for the collected support bundle.

-

Note that the steps above are different from those in my earlier posts. Please also indicate to VMware support (and perhaps here) if you are using F5 BIG IP Virtual Edition VM on the affected ESXi host.

If the issue is affecting your production system and you urgently need a workaround, please indicate this to VMware support (or email me, see my profile), and we can (most likely) provide you with a workaround. Please understand that we do not wish to publish the workaround here until we fully understand the scope of the issue. By then we will provide the workaround in a KB article, and it will describe the specific conditions to which the workaround is applicable.

Thank you,

Bo