Dear,
Strange issue on my system, a DL380 Gen8 (not on the VMware HCL). It loses all external connectivity. When I log on through the out-of-band interface (iLO), I can still ping all VMs and they are running.
However, the external interfaces lose all connectivity (kernel + VMware guest uplinks).
Restarting the management agents doesn't help.
When I look at my kernel log I see my NVMe card and ntg3 (network driver) complaining:
2017-02-01T01:23:39.574Z cpu9:68121)User: 3089: sfcb-smx: wantCoreDump:sfcb-smx signal:6 exitCode:0 coredump:enabled
2017-02-01T01:23:39.703Z cpu9:68121)UserDump: 3024: sfcb-smx: Dumping cartel 68117 (from world 68121) to file /var/core/sfcb-smx-zdump.002 ...
2017-02-01T01:23:41.992Z cpu9:68121)UserDump: 3172: sfcb-smx: Userworld(sfcb-smx) coredump complete.
2017-02-01T10:20:26.125Z cpu2:69084)nvme:nvmeCoreLogError:370:command failed: 0x43077bd885f0.
2017-02-01T10:22:27.081Z cpu2:68970)nvme:nvmeCoreLogError:370:command failed: 0x43077bd70bf0.
2017-02-01T10:24:28.580Z cpu2:68970)nvme:nvmeCoreLogError:370:command failed: 0x43077bd71370.
2017-02-01T10:26:31.329Z cpu2:69175)nvme:nvmeCoreLogError:370:command failed: 0x43077bd71970.
2017-02-01T10:28:32.559Z cpu2:69175)nvme:nvmeCoreLogError:370:command failed: 0x43077bd71f70.
2017-02-01T10:30:49.130Z cpu2:68998)nvme:nvmeCoreLogError:370:command failed: 0x43077bd72570.
2017-02-01T10:32:50.089Z cpu2:69195)nvme:nvmeCoreLogError:370:command failed: 0x43077bd72b70.
2017-02-01T10:34:53.349Z cpu2:69134)nvme:nvmeCoreLogError:370:command failed: 0x43077bd73170.
2017-02-01T10:36:54.443Z cpu2:69040)nvme:nvmeCoreLogError:370:command failed: 0x43077bd73770
2017-02-01T16:18:36.497Z cpu1:68999)WARNING: ntg3-throttled: Ntg3XmitPktList:372: vmnic0:TX ring full (0)
2017-02-01T16:18:45.193Z cpu22:65645)ntg3:vmnic0:Ntg3UplinkReset:665:Ntg3UplinkReset
2017-02-01T16:18:45.193Z cpu22:65645)ntg3:vmnic0:Ntg3UplinkQuiesceIO:647:Ntg3UplinkQuiesceIO
2017-02-01T16:18:45.193Z cpu22:65645)ntg3:vmnic0:Ntg3UplinkStartIO:623:Ntg3UplinkStartIO
2017-02-01T16:18:55.193Z cpu21:65645)ntg3:vmnic0:Ntg3UplinkReset:665:Ntg3UplinkReset
2017-02-01T16:18:55.193Z cpu21:65645)ntg3:vmnic0:Ntg3UplinkQuiesceIO:647:Ntg3UplinkQuiesceIO
2017-02-01T16:18:55.193Z cpu21:65645)ntg3:vmnic0:Ntg3UplinkStartIO:623:Ntg3UplinkStartIO
2017-02-01T16:19:05.195Z cpu21:65645)ntg3:vmnic0:Ntg3UplinkReset:665:Ntg3UplinkReset
2017-02-01T16:19:05.195Z cpu21:65645)ntg3:vmnic0:Ntg3UplinkQuiesceIO:647:Ntg3UplinkQuiesceIO
2017-02-01T16:19:05.195Z cpu21:65645)ntg3:vmnic0:Ntg3UplinkStartIO:623:Ntg3UplinkStartIO
2017-02-01T15:50:50.684Z cpu9:68980)WARNING: NetPort: 1932: failed to disable port 0x2000005 on vSwitch0: Busy
2017-02-01T15:50:50.684Z cpu9:68980)NetSched: 701: 0x2000002: received a force quiesce for port 0x2000005, dropped 727 pkts
2017-02-01T15:50:50.685Z cpu9:68980)NetPort: 1879: disabled port 0x2000005
2017-02-01T15:50:50.688Z cpu9:68980)Vmxnet3: 17265: Disable Rx queuing; queue size 256 is larger than Vmxnet3RxQueueLimit limit of 64.
2017-02-01T15:50:50.688Z cpu9:68980)Vmxnet3: 17623: Using default queue delivery for vmxnet3 for port 0x2000005
2017-02-01T15:50:50.688Z cpu9:68980)NetPort: 1660: enabled port 0x2000005 with mac 00:50:56:a4:3e:25
2017-02-01T15:50:50.699Z cpu9:68980)NetPort: 1879: disabled port 0x2000005
2017-02-01T15:50:50.701Z cpu9:68980)Vmxnet3: 17265: Disable Rx queuing; queue size 256 is larger than Vmxnet3RxQueueLimit limit of 64.
2017-02-01T15:50:50.701Z cpu9:68980)Vmxnet3: 17623: Using default queue delivery for vmxnet3 for port 0x2000005
2017-02-01T15:50:50.701Z cpu9:68980)NetPort: 1660: enabled port 0x2000005 with mac 00:50:56:a4:3e:25
2017-02-01T15:50:56.216Z cpu0:68971)WARNING: ntg3-throttled: Ntg3XmitPktList:372: vmnic0:TX ring full (0)
When I restart the box, everything works fine again for a while, sometimes 1 day, sometimes 1 week... it's unclear. Does anybody have an idea?
See the following in the ESXi 6.5 release notes (VMware vSphere 6.5 Release Notes):
- Network becomes unavailable with full passthrough devices
If a native ntg3 driver is used on a passthrough Broadcom Gigabit Ethernet Adapter, the network connection will become unavailable.
Workaround:
- Run the ntg3 driver in legacy mode:
  - Run the esxcli system module parameters set -m ntg3 -p intrMode=0 command.
  - Reboot the host.
- Use the tg3 vmklinux driver as the default driver, instead of the native ntg3 driver.
Hi tbraes,
What kind of network workload was running when the issue occurred? For example, was there vMotion, or file transfer with NFS or scp, etc? And was the NIC running at 1000M speed or 10M/100M?
Has the issue occurred again? If it does occur, could you try the following?
In the ESXi shell, type: vsish -e get /net/pNics/vmnic0/stats. Wait for a minute or so, type that again, and provide the output of both. Having the kernel log around the first "vmnic0:TX ring full" message would also be very helpful.
Feel free to PM me if you like. Thanks.
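The two-snapshot comparison requested above can be sketched in Python. This is a minimal sketch, not an official tool: it assumes each counter in the saved vsish output appears on its own line as `name:value`, and the counter names and values in the sample are made up for illustration, not taken from a real host.

```python
import re

def parse_counters(text):
    """Parse lines of the form 'name:value' into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        match = re.match(r"^\s*(.+?):\s*(\d+)\s*$", line)
        if match:
            counters[match.group(1).strip()] = int(match.group(2))
    return counters

def diff_counters(before, after):
    """Return only the counters that changed, mapped to their delta."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }

# Hypothetical snapshots taken about a minute apart (values are invented).
FIRST = """device {
   Rx Packets:1000
   Tx Packets:2000
   Tx Errors:0
}"""
SECOND = """device {
   Rx Packets:1500
   Tx Packets:2000
   Tx Errors:12
}"""

changes = diff_counters(parse_counters(FIRST), parse_counters(SECOND))
print(changes)
```

Counters that do not move between the two snapshots are omitted from the result, so a transmit counter that is missing while receive counters keep growing stands out quickly. Whether that pattern matches the "TX ring full" condition on a given host is an assumption to verify against the real output.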
Same here, with a ProLiant DL360 Gen8. It happens only when streaming YouTube videos. It's hard to reproduce; it just happens. Sometimes it works for days and then all network connectivity comes to a halt. I can't ping any VM or the host itself anymore. A reboot of the host solves the issue.
Frank.
Hi Frank,
Can you share the vmkernel log around the time of the loss of connectivity?
Thanks,
Bo
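When collecting the log excerpt requested here, note that the lines pasted in the first post are not in chronological order, so the first "TX ring full" message in the paste is not the earliest one. A minimal Python sketch for pulling the lines around the earliest occurrence is shown here; the function names and the 60-second window are illustrative choices, and the sample lines are copied from the first post.

```python
from datetime import datetime

def parse_timestamp(line):
    """Return the leading ISO timestamp of a vmkernel log line, or None."""
    stamp = line.split(" ", 1)[0]
    try:
        return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    except ValueError:
        return None

def context_around_first(lines, needle, window_seconds=60):
    """Return log lines within window_seconds of the earliest line containing needle."""
    stamped = [(parse_timestamp(line), line) for line in lines]
    stamped = [(ts, line) for ts, line in stamped if ts is not None]
    stamped.sort(key=lambda pair: pair[0])  # paste order may not be time order
    hits = [ts for ts, line in stamped if needle in line]
    if not hits:
        return []
    first = hits[0]
    return [
        line for ts, line in stamped
        if abs((ts - first).total_seconds()) <= window_seconds
    ]

SAMPLE = [
    "2017-02-01T10:20:26.125Z cpu2:69084)nvme:nvmeCoreLogError:370:command failed: 0x43077bd885f0.",
    "2017-02-01T16:18:36.497Z cpu1:68999)WARNING: ntg3-throttled: Ntg3XmitPktList:372: vmnic0:TX ring full (0)",
    "2017-02-01T15:50:50.684Z cpu9:68980)WARNING: NetPort: 1932: failed to disable port 0x2000005 on vSwitch0: Busy",
    "2017-02-01T15:50:50.684Z cpu9:68980)NetSched: 701: 0x2000002: received a force quiesce for port 0x2000005, dropped 727 pkts",
    "2017-02-01T15:50:56.216Z cpu0:68971)WARNING: ntg3-throttled: Ntg3XmitPktList:372: vmnic0:TX ring full (0)",
]

for line in context_around_first(SAMPLE, "TX ring full"):
    print(line)
```

With the sample above, the earliest match is the 15:50:56 warning, and the port disable and force quiesce messages a few seconds before it fall inside the window.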
I have the same problem on a Dell R620 with ESXi 6.5.
Maybe the Broadcom NIC causes it.
Who can solve it?
Hi All,
I am also facing the same issue with a Dell PowerEdge R720 and ESXi 6.5 with Broadcom 5720 NICs and the ntg3 driver.
Any fast resolution will be much appreciated.
I was going through some of the articles, and I think there is an issue with ESXi 6.5 and Broadcom, as many VMware links say to disable the native driver and use the tg3 driver.
Kindly confirm whether the problem is with the native driver.
How can we confirm whether the Broadcom card is running in pass-through mode?
Hi prashant_s,
Have you tried updating ntg3 to version 4.1.2.0? It's downloadable at https://my.vmware.com/group/vmware/details?productId=614&downloadGroup=DT-ESX65-BROADCOM-NTG3-4120 . Regarding pass-through, if you didn't explicitly configure the NIC for pass-through, it's not.
Thanks,
Bo
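Before updating, it can help to confirm which uplinks are actually bound to the native ntg3 driver rather than the legacy tg3 one. The following is a minimal Python sketch that parses saved `esxcfg-nics -l` output; it assumes the nine-column layout seen in the output posted elsewhere in this thread, and the sample lines are copied from that output. The function names are illustrative.

```python
def parse_nics(output):
    """Parse `esxcfg-nics -l` text into a list of dicts, skipping the header."""
    fields = ["name", "pci", "driver", "link", "speed", "duplex", "mac", "mtu"]
    nics = []
    for line in output.strip().splitlines():
        # The description is free text, so split only the first eight columns.
        parts = line.split(None, 8)
        if len(parts) < 9 or parts[0] == "Name":
            continue
        nic = dict(zip(fields, parts[:8]))
        nic["description"] = parts[8]
        nics.append(nic)
    return nics

def nics_using(nics, driver):
    """Return the names of the NICs bound to the given driver."""
    return [nic["name"] for nic in nics if nic["driver"] == driver]

SAMPLE = """Name PCI Driver Link Speed Duplex MAC Address MTU Description
vmnic0 0000:03:00.0 tg3 Up 1000Mbps Full a0:b3:cc:df:1c:9f 1500 Broadcom Corporation NetXtreme BCM5723 Gigabit Ethernet
vmnic1 0000:02:00.0 r8168 Up 1000Mbps Full 00:e0:4c:80:1a:50 9000 Realtek Semiconductor Co., Ltd. RTL8111/8168 PCI Express Gigabit Ethernet controller
"""

nics = parse_nics(SAMPLE)
print("ntg3 uplinks:", nics_using(nics, "ntg3"))
print("tg3 uplinks:", nics_using(nics, "tg3"))
```

In this sample no uplink uses ntg3, so the ntg3 update would not apply to that host. The driver version itself is not part of this listing; on the host, `esxcli network nic get -n vmnic0` reports driver details for a single uplink.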
I am going to update now, as I am about to start a remote session with the customer.
https://communities.vmware.com/thread/553243
https://communities.vmware.com/thread/563731
https://communities.vmware.com/message/2672256#2672256
https://communities.vmware.com/thread/557471?start=15&tstart=0
But is there a problem with ESXi 6.5 and the Broadcom ntg3 driver, or with the Broadcom 5720 or 5719 cards?
Yes, there were problems as the ntg3 driver is brand new & written from scratch using the new native driver model, but the 4.1.2.0 update should fix the ones most users encounter.
Hi chnb,
Can you let me know what the issue would be in such a scenario?
Hi,
I have a similar problem.
ESXi 6.5 on a Gen8 with two NICs:
I use this image: (Updated) HPE-ESXi-6.5.0-OS-Release-iso-650.9.6.0.28 (Hewlett Packard Enterprise)
[root@ESXI2:~] esxcfg-nics -l
Name PCI Driver Link Speed Duplex MAC Address MTU Description
vmnic0 0000:03:00.0 tg3 Up 1000Mbps Full a0:b3:cc:df:1c:9f 1500 Broadcom Corporation NetXtreme BCM5723 Gigabit Ethernet
vmnic1 0000:02:00.0 r8168 Up 1000Mbps Full 00:e0:4c:80:1a:50 9000 Realtek Semiconductor Co., Ltd. RTL8111/8168 PCI Express Gigabit Ethernet controller
When I set MTU 9000 on the network interface in the guest, the guest reboots...
I analysed the ESXi logs, but I don't see any errors.
I created the vSwitch and port group afterwards, at MTU 9000.
Any ideas?
Hi Snamidro,
Today we got confirmation from the customer that there was a VLAN issue in the network the servers were connected to; after untagging the VLAN there are no more retransmissions.
I even replicated the setup in my lab, with 3 Dell servers which had BCM5720 and Intel i350 NICs connected to a plain 1G switch, and I didn't notice a single retransmission.
I had even updated the NIC driver and the network firmware on the customer's setup, but the issue remained.
My next plan of action (POA) for the customer was to bypass his core switch and connect a plain switch between the servers. So in the end it turned out to be a configuration issue.
But VMware has to think about a proper solution, as the customer claims there were no issues with ESXi 6 and the issue appeared only when they updated to ESXi 6.5.
I hope my POA helps you.
I am trying with another NIC for the VMkernel, because I see the MTU changed... And one of my NICs doesn't support jumbo frames...
So your advice is to downgrade to 6.0? Can I downgrade easily, or must I do a clean install?
EDIT: I tried disabling IPv6 too.
Results in a few minutes.
But are you losing connectivity? Mine was packet drops; there was no connectivity issue.
If you are losing connectivity, is it lost when you change to jumbo frames?
If you don't change the MTU, are you able to connect properly?
We should be able to change the MTU on the network card, but what is the config on the switch port? Does it support the changed MTU?
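One common way to check whether every hop supports a given MTU is a ping with the don't-fragment bit set and the largest payload that still fits in one frame. A minimal Python sketch of the size arithmetic follows; it assumes IPv4 without IP options, and the function name is illustrative.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header

def max_ping_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES

for mtu in (1500, 9000):
    print(f"MTU {mtu}: ping payload {max_ping_payload(mtu)} bytes")
# MTU 1500: ping payload 1472 bytes
# MTU 9000: ping payload 8972 bytes
```

On the ESXi side, `vmkping -d -s 8972 <target>` sends such a packet from a VMkernel interface; if it fails while a 1472-byte payload succeeds, some device on the path (NIC, vSwitch, or physical switch port) is likely not passing jumbo frames.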
To summarize:
ESXi 6.5
I have a guest with two network cards:
One E1000 on vSwitch1 with nic0 (MTU 1500)
One VMXNET3 on vSwitch2 with nic1 (MTU 9000)
The MTU is OK on each NIC and vSwitch.
[root@ESXI2:~] esxcfg-vswitch -l
Switch Name Num Ports Used Ports Configured Ports MTU Uplinks
vSwitch0 1536 5 128 1500 vmnic0
PortGroup Name VLAN ID Used Ports Uplinks
VM Network 0 1 vmnic0
Management Network 0 1 vmnic0
Switch Name Num Ports Used Ports Configured Ports MTU Uplinks
HA 1536 4 1024 9000 vmnic1
PortGroup Name VLAN ID Used Ports Uplinks
HA 0 1 vmnic1
[root@ESXI2:~] esxcfg-nics -l
Name PCI Driver Link Speed Duplex MAC Address MTU Description
vmnic0 0000:03:00.0 tg3 Up 1000Mbps Full a0:b3:cc:df:1c:9f 1500 Broadcom Corporation NetXtreme BCM5723 Gigabit Ethernet
vmnic1 0000:02:00.0 r8168 Up 1000Mbps Full 00:e0:4c:80:1a:50 9000 Realtek Semiconductor Co., Ltd. RTL8111/8168 PCI Express Gigabit Ethernet controller
[root@ESXI2:~]
My guest works with these settings, but jumbo frames are not enabled. Ping works fine on each IP.
So I want to enable them on the second card, because MTU 9000 is set.
When I set MTU 9000 on this card in the guest, the guest reboots and the setting is not kept.
I lose ping because the guest reboots, and ping is OK again after the reboot has finished...
So I would like to know why the guest reboots when I change the MTU. Is it a bug?
Am I not doing it the right way?
My Broadcom doesn't support MTU 9000, I know. But the Realtek is OK. I have the same card in another machine without ESXi and have no problem.
PS: I disabled IPv6 too on the vSwitch.
I have some doubts about my VMkernel interface:
[root@ESXI2:~] esxcfg-vmknic -l
Interface Port Group/DVPort/Opaque Network IP Family IP Address Netmask Broadcast MAC Address MTU TSO MSS Enabled Type NetStack
vmk0 Management Network IPv4 192.168.0.151 255.255.255.0 192.168.0.255 00:e0:4c:80:1a:50 1500 65535 true STATIC defaultTcpipStack
[root@ESXI2:~]
Logs from ESXi when I change the MTU in the guest:
2017-06-23T20:35:15.284Z cpu1:88725)Vmxnet3: 17265: Disable Rx queuing; queue size 256 is larger than Vmxnet3RxQueueLimit limit of 64.
2017-06-23T20:35:15.284Z cpu1:88725)Vmxnet3: 17623: Using default queue delivery for vmxnet3 for port 0x3000007
2017-06-23T20:35:15.284Z cpu1:88725)NetPort: 1660: enabled port 0x3000007 with mac 00:0c:29:44:93:21
2017-06-23T20:35:18.218Z cpu0:66070)Uplink: 4622: vmnic0: Non TSO L2 payload size exceeds uplink MTU. FrameLen: 9014, L3 header offset: 14
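The last log line carries enough numbers to see what went wrong with that frame. A minimal Python sketch that extracts them is given here; it assumes the "L3 header offset" is the length of the Ethernet header, so that frame length minus offset gives the L2 payload, and the uplink MTU of 1500 is taken from the esxcfg-nics output above. The function name and the returned keys are illustrative.

```python
import re

PATTERN = re.compile(
    r"Uplink: \d+: (?P<nic>\w+): Non TSO L2 payload size exceeds uplink MTU\. "
    r"FrameLen: (?P<frame_len>\d+), L3 header offset: (?P<l3_offset>\d+)"
)

def oversize_frame(line, uplink_mtu):
    """Return details of an oversize frame reported in a log line, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    # Assumption: payload = frame length minus the L2 (Ethernet) header.
    payload = int(match.group("frame_len")) - int(match.group("l3_offset"))
    return {
        "nic": match.group("nic"),
        "l2_payload": payload,
        "uplink_mtu": uplink_mtu,
        "excess": payload - uplink_mtu,
    }

LINE = (
    "2017-06-23T20:35:18.218Z cpu0:66070)Uplink: 4622: vmnic0: Non TSO L2 payload "
    "size exceeds uplink MTU. FrameLen: 9014, L3 header offset: 14"
)

print(oversize_frame(LINE, uplink_mtu=1500))
```

Under that assumption the frame carries a 9000-byte payload, 7500 bytes more than vmnic0 allows. The uplink named in the message is vmnic0 (MTU 1500), not vmnic1 (MTU 9000), which suggests the jumbo frame was sent toward the vSwitch backed by the Broadcom NIC.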