VMware Cloud Community
tbraes
Contributor

vSphere ESXi 6.5 external network connectivity lost on DL380 Gen8

Dear,

I have a strange issue on my DL380 Gen8 (not on the VMware HCL): it loses all external connectivity. When I log on through the out-of-band interface (iLO), I can still ping all VMs and they are running.

However, the external interfaces lose all connectivity (VMkernel and VM guest uplinks).

Restarting the management agents doesn't help.

When I look at my kernel log, I see the NVMe card and ntg3 (the network driver) complaining:

2017-02-01T01:23:39.574Z cpu9:68121)User: 3089: sfcb-smx: wantCoreDump:sfcb-smx signal:6 exitCode:0 coredump:enabled

2017-02-01T01:23:39.703Z cpu9:68121)UserDump: 3024: sfcb-smx: Dumping cartel 68117 (from world 68121) to file /var/core/sfcb-smx-zdump.002 ...

2017-02-01T01:23:41.992Z cpu9:68121)UserDump: 3172: sfcb-smx: Userworld(sfcb-smx) coredump complete.

2017-02-01T10:20:26.125Z cpu2:69084)nvme:nvmeCoreLogError:370:command failed: 0x43077bd885f0.

2017-02-01T10:22:27.081Z cpu2:68970)nvme:nvmeCoreLogError:370:command failed: 0x43077bd70bf0.

2017-02-01T10:24:28.580Z cpu2:68970)nvme:nvmeCoreLogError:370:command failed: 0x43077bd71370.

2017-02-01T10:26:31.329Z cpu2:69175)nvme:nvmeCoreLogError:370:command failed: 0x43077bd71970.

2017-02-01T10:28:32.559Z cpu2:69175)nvme:nvmeCoreLogError:370:command failed: 0x43077bd71f70.

2017-02-01T10:30:49.130Z cpu2:68998)nvme:nvmeCoreLogError:370:command failed: 0x43077bd72570.

2017-02-01T10:32:50.089Z cpu2:69195)nvme:nvmeCoreLogError:370:command failed: 0x43077bd72b70.

2017-02-01T10:34:53.349Z cpu2:69134)nvme:nvmeCoreLogError:370:command failed: 0x43077bd73170.

2017-02-01T10:36:54.443Z cpu2:69040)nvme:nvmeCoreLogError:370:command failed: 0x43077bd73770

2017-02-01T16:18:36.497Z cpu1:68999)WARNING: ntg3-throttled: Ntg3XmitPktList:372: vmnic0:TX ring full (0)

2017-02-01T16:18:45.193Z cpu22:65645)ntg3:vmnic0:Ntg3UplinkReset:665:Ntg3UplinkReset

2017-02-01T16:18:45.193Z cpu22:65645)ntg3:vmnic0:Ntg3UplinkQuiesceIO:647:Ntg3UplinkQuiesceIO

2017-02-01T16:18:45.193Z cpu22:65645)ntg3:vmnic0:Ntg3UplinkStartIO:623:Ntg3UplinkStartIO

2017-02-01T16:18:55.193Z cpu21:65645)ntg3:vmnic0:Ntg3UplinkReset:665:Ntg3UplinkReset

2017-02-01T16:18:55.193Z cpu21:65645)ntg3:vmnic0:Ntg3UplinkQuiesceIO:647:Ntg3UplinkQuiesceIO

2017-02-01T16:18:55.193Z cpu21:65645)ntg3:vmnic0:Ntg3UplinkStartIO:623:Ntg3UplinkStartIO

2017-02-01T16:19:05.195Z cpu21:65645)ntg3:vmnic0:Ntg3UplinkReset:665:Ntg3UplinkReset

2017-02-01T16:19:05.195Z cpu21:65645)ntg3:vmnic0:Ntg3UplinkQuiesceIO:647:Ntg3UplinkQuiesceIO

2017-02-01T16:19:05.195Z cpu21:65645)ntg3:vmnic0:Ntg3UplinkStartIO:623:Ntg3UplinkStartIO

2017-02-01T15:50:50.684Z cpu9:68980)WARNING: NetPort: 1932: failed to disable port 0x2000005 on vSwitch0: Busy

2017-02-01T15:50:50.684Z cpu9:68980)NetSched: 701: 0x2000002: received a force quiesce for port 0x2000005, dropped 727 pkts

2017-02-01T15:50:50.685Z cpu9:68980)NetPort: 1879: disabled port 0x2000005

2017-02-01T15:50:50.688Z cpu9:68980)Vmxnet3: 17265: Disable Rx queuing; queue size 256 is larger than Vmxnet3RxQueueLimit limit of 64.

2017-02-01T15:50:50.688Z cpu9:68980)Vmxnet3: 17623: Using default queue delivery for vmxnet3 for port 0x2000005

2017-02-01T15:50:50.688Z cpu9:68980)NetPort: 1660: enabled port 0x2000005 with mac 00:50:56:a4:3e:25

2017-02-01T15:50:50.699Z cpu9:68980)NetPort: 1879: disabled port 0x2000005

2017-02-01T15:50:50.701Z cpu9:68980)Vmxnet3: 17265: Disable Rx queuing; queue size 256 is larger than Vmxnet3RxQueueLimit limit of 64.

2017-02-01T15:50:50.701Z cpu9:68980)Vmxnet3: 17623: Using default queue delivery for vmxnet3 for port 0x2000005

2017-02-01T15:50:50.701Z cpu9:68980)NetPort: 1660: enabled port 0x2000005 with mac 00:50:56:a4:3e:25

2017-02-01T15:50:56.216Z cpu0:68971)WARNING: ntg3-throttled: Ntg3XmitPktList:372: vmnic0:TX ring full (0)

When I restart the box, everything works again, sometimes for a day, sometimes for a week; the interval is unpredictable. Does anybody have an idea?
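
To see how the NVMe errors and the ntg3 resets are spread over time, the events can be tallied from a saved copy of the log. The following is only a sketch: it assumes lines shaped like the excerpt above (ISO timestamp, then `cpuN:world)`, then the message). The function name `tally_events` and the three substring patterns are illustrative choices, not anything the drivers define.

```python
import re
from collections import Counter

# Assumed line shape, based on the excerpt above:
#   2017-02-01T16:18:36.497Z cpu1:68999)WARNING: ntg3-throttled: ...
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}):\d{2}:\d{2}\.\d+Z\s+cpu\d+:\d+\)(.*)$"
)

# Substrings picked for illustration; adjust to whatever you are chasing.
PATTERNS = {
    "tx_ring_full": "TX ring full",
    "uplink_reset": "Ntg3UplinkReset",
    "nvme_cmd_failed": "nvmeCoreLogError",
}

def tally_events(lines):
    """Count matching log lines per (hour, event name)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip blank or unrecognised lines
        hour, message = m.groups()
        for name, needle in PATTERNS.items():
            if needle in message:
                counts[(hour, name)] += 1
    return counts

sample = [
    "2017-02-01T10:20:26.125Z cpu2:69084)nvme:nvmeCoreLogError:370:command failed: 0x43077bd885f0.",
    "2017-02-01T16:18:36.497Z cpu1:68999)WARNING: ntg3-throttled: Ntg3XmitPktList:372: vmnic0:TX ring full (0)",
    "2017-02-01T16:18:45.193Z cpu22:65645)ntg3:vmnic0:Ntg3UplinkReset:665:Ntg3UplinkReset",
    "2017-02-01T16:18:55.193Z cpu21:65645)ntg3:vmnic0:Ntg3UplinkReset:665:Ntg3UplinkReset",
]
for (hour, name), n in sorted(tally_events(sample).items()):
    print(hour, name, n)
```

Grouping by hour makes it easier to see whether the NVMe failures and the TX ring stalls happen in the same window or are unrelated.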

16 Replies
rcporto
Leadership

See the following in the ESXi 6.5 release notes (VMware vSphere 6.5 Release Notes):

  • Network becomes unavailable with full passthrough devices
    If a native ntg3 driver is used on a passthrough Broadcom Gigabit Ethernet Adapter, the network connection will become unavailable.
    Workaround:
    • Run the ntg3 driver in legacy mode:
      1. Run the esxcli system module parameters set -m ntg3 -p intrMode=0 command.
      2. Reboot the host.
    • Use the tg3 vmklinux driver as the default driver, instead of the native ntg3 driver.
---

Richardson Porto
Senior Infrastructure Specialist
LinkedIn: http://linkedin.com/in/richardsonporto
chnb
VMware Employee

Hi tbraes,

What kind of network workload was running when the issue occurred? For example, was there vMotion, or file transfer with NFS or scp, etc? And was the NIC running at 1000M speed or 10M/100M?

Has the issue occurred again? If it does occur, could you try the following?

In the ESXi shell, type: vsish -e get /net/pNics/vmnic0/stats. Wait a minute or so, run it again, and post the output of both runs. The kernel log around the first "vmnic0:TX ring full" message would also be very helpful.
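
Comparing the two snapshots by hand is tedious, so a small script can print only the counters that moved. This is a sketch that assumes the saved output consists of `name:value` lines with integer values; the exact vsish output format and the counter names in the sample are assumptions, so adjust the parsing to what your host actually prints. `parse_stats` and `diff_stats` are hypothetical helper names.

```python
import re

# Assumption: each counter appears as "<name>:<integer>" on its own line.
STAT_RE = re.compile(r"^\s*(.+?)\s*:\s*(-?\d+)\s*$")

def parse_stats(text):
    """Return {counter name: integer value} for lines that look like counters."""
    stats = {}
    for line in text.splitlines():
        m = STAT_RE.match(line)
        if m:
            stats[m.group(1)] = int(m.group(2))
    return stats

def diff_stats(before, after):
    """Return {name: delta} for counters present in both snapshots that changed."""
    a, b = parse_stats(before), parse_stats(after)
    return {k: b[k] - a[k] for k in a if k in b and b[k] != a[k]}

# Made-up sample values, only to show the shape of the result.
first = """
   Rx Packets:1000
   Tx Packets:2000
   Tx Errors:0
"""
second = """
   Rx Packets:1500
   Tx Packets:2000
   Tx Errors:7
"""
print(diff_stats(first, second))
```

A transmit counter that stays flat while an error counter climbs would be consistent with a stuck TX ring, though that reading is a guess until the real counters are available.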

Feel free to PM me if you like. Thanks.

fvdwestelaken
Contributor

Same here, with a ProLiant DL360 Gen8. It happens only when streaming YouTube videos. It's hard to reproduce; it just happens. Sometimes it works for days and then all network connectivity comes to a halt. I can't ping any VM or the host itself anymore. A reboot of the host solves the issue.

Frank.

chnb
VMware Employee

Hi Frank,

Can you share the vmkernel log around the time of the loss of connectivity?

Thanks,

Bo

myofficer
Contributor

I have the same problem on a Dell R620 with ESXi 6.5.

Maybe the Broadcom NIC causes it.

Does anyone know how to solve it?

prashant_s
Contributor

Hi All,

I am also facing the same issue on a Dell PowerEdge R720 running ESXi 6.5 with Broadcom 5720 NICs and the ntg3 driver.

A quick resolution would be much appreciated.

I went through some articles, and I think there is an issue between ESXi 6.5 and Broadcom NICs, as many VMware links say to disable the native driver and use the tg3 driver.

Could you confirm whether the problem is with the native driver?

How can we confirm whether the Broadcom card is running in passthrough mode?

chnb
VMware Employee

Hi prashant_s,

Have you tried updating ntg3 to version 4.1.2.0? It's downloadable at https://my.vmware.com/group/vmware/details?productId=614&downloadGroup=DT-ESX65-BROADCOM-NTG3-4120 . Regarding pass-through: if you didn't explicitly configure the NIC for pass-through, it isn't in pass-through mode.

Thanks,

Bo

prashant_s
Contributor

I am going to update now, as I am about to start a remote session with the customer.

https://communities.vmware.com/thread/553243

https://communities.vmware.com/thread/563731

https://communities.vmware.com/message/2672256#2672256

https://communities.vmware.com/thread/557471?start=15&tstart=0

But is there a known problem with ESXi 6.5 and the Broadcom ntg3 driver, or with the Broadcom 5720 or 5719 cards?

chnb
VMware Employee

Yes, there were problems as the ntg3 driver is brand new & written from scratch using the new native driver model, but the 4.1.2.0 update should fix the ones most users encounter.

prashant_s
Contributor

Hi, I have updated the NIC driver to 4.1.2 and the network firmware as well.

I have attached the document.

prashant_s
Contributor

Hi chnb,

Can you tell me what the issue could be in this scenario?

snamidro
Contributor

Hi,

I have a similar problem with ESXi 6.5 on a Gen8 with two NICs.

I use this image: (Updated) HPE-ESXi-6.5.0-OS-Release-iso-650.9.6.0.28 (Hewlett Packard Enterprise)

[root@ESXI2:~] esxcfg-nics -l

Name    PCI          Driver      Link Speed      Duplex MAC Address       MTU    Description

vmnic0  0000:03:00.0 tg3         Up   1000Mbps   Full   a0:b3:cc:df:1c:9f 1500   Broadcom Corporation NetXtreme BCM5723 Gigabit Ethernet

vmnic1  0000:02:00.0 r8168       Up   1000Mbps   Full   00:e0:4c:80:1a:50 9000   Realtek Semiconductor Co., Ltd. RTL8111/8168 PCI Express Gigabit Ethernet controller

When I set MTU 9000 on the network interface in the guest, the guest reboots.

I analysed the ESXi logs but don't see any error.
I created the vSwitch and port group afterwards, with MTU 9000.

Any ideas?

prashant_s
Contributor

Hi snamidro,

Today the customer confirmed that there was a VLAN issue in the network the servers were connected to; after untagging the VLAN, there are no more retransmissions.

I had also replicated the setup in my lab with 3 Dell servers, which had BCM5720 and Intel i350 NICs, connected to a plain 1G switch, and I didn't notice a single retransmission.

I had also updated the NIC driver and network firmware on the customer's setup, but the issue remained.

My next plan of action (POA) for the customer was to bypass his core switch and connect a plain switch between the servers. So in the end it turned out to be a configuration issue.

Still, VMware should think about a proper solution, as the customer reports no issues with ESXi 6 and only saw the issue after updating to ESXi 6.5.

I hope my POA helps you.

snamidro
Contributor

I am trying with another VMkernel NIC, because I see the MTU changed... and one of my NICs doesn't support jumbo frames.

So your advice is to downgrade to 6.0? Can I downgrade easily, or must I do a clean install?

EDIT: I tried disabling IPv6 too.
Results in a few minutes.

prashant_s
Contributor

But are you losing connectivity? In my case it was packet drops, not a connectivity issue.

If you are losing connectivity, does it happen when you change the jumbo frame setting?

If you don't change the MTU, are you able to connect properly?

We should be able to change the MTU on the network card, but what is the configuration on the switch port? Does it support the changed MTU?
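
One common way to check whether every hop really passes the larger MTU is a ping with the "don't fragment" bit set and the largest payload that still fits in one packet. The arithmetic is sketched below, assuming IPv4 without options; `max_icmp_payload` is a hypothetical helper name.

```python
IPV4_HEADER = 20   # bytes, IPv4 header without options
ICMP_HEADER = 8    # bytes, ICMP echo header

def max_icmp_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER

for mtu in (1500, 9000):
    print(mtu, max_icmp_payload(mtu))
```

For MTU 9000 this gives 8972, which is the size usually passed to a test such as `vmkping -d -s 8972 <target>` from the ESXi shell. If that ping fails while a 1472-byte one succeeds, some device on the path is probably not configured for jumbo frames.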

snamidro
Contributor

To summarize:

ESXi 6.5

I have a guest with two network cards:
One E1000 on vSwitch1 with nic0 (MTU 1500)

One VMXNET3 on vSwitch2 with nic1 (MTU 9000)

The MTU is OK on each NIC and vSwitch.

[root@ESXI2:~] esxcfg-vswitch  -l

Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks

vSwitch0         1536        5           128               1500    vmnic0

  PortGroup Name        VLAN ID  Used Ports  Uplinks

  VM Network            0        1           vmnic0

  Management Network    0        1           vmnic0

Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks

HA               1536        4           1024              9000    vmnic1

  PortGroup Name        VLAN ID  Used Ports  Uplinks

  HA                    0        1           vmnic1

[root@ESXI2:~] esxcfg-nics -l

Name    PCI          Driver      Link Speed      Duplex MAC Address       MTU    Description

vmnic0  0000:03:00.0 tg3         Up   1000Mbps   Full   a0:b3:cc:df:1c:9f 1500   Broadcom Corporation NetXtreme BCM5723 Gigabit Ethernet

vmnic1  0000:02:00.0 r8168       Up   1000Mbps   Full   00:e0:4c:80:1a:50 9000   Realtek Semiconductor Co., Ltd. RTL8111/8168 PCI Express Gigabit Ethernet controller

[root@ESXI2:~]

My guest works with these settings, but jumbo frames are not enabled. Ping works fine on each IP.
So I want to enable jumbo frames on the second card, because MTU 9000 is set.

When I set MTU 9000 on this card in the guest, the guest reboots and the setting is not kept.

I lose ping because the guest reboots, and ping is OK again once the reboot has finished.

So I would like to know why the guest reboots when I change the MTU. Is it a bug?

Am I doing it the wrong way?

My Broadcom doesn't support MTU 9000, I know. But the Realtek is OK: I have the same card in another machine without ESXi and it has no problem.

PS: I disabled IPv6 on the vSwitch too.

I have some doubts about my VMkernel interface:

[root@ESXI2:~] esxcfg-vmknic -l

Interface  Port Group/DVPort/Opaque Network        IP Family IP Address                              Netmask         Broadcast       MAC Address       MTU     TSO MSS   Enabled Type                NetStack

vmk0       Management Network                      IPv4      192.168.0.151                           255.255.255.0   192.168.0.255   00:e0:4c:80:1a:50 1500    65535     true    STATIC              defaultTcpipStack

[root@ESXI2:~]

Logs from ESXi when I change the MTU on the guest:

2017-06-23T20:35:15.284Z cpu1:88725)Vmxnet3: 17265: Disable Rx queuing; queue size 256 is larger than Vmxnet3RxQueueLimit limit of 64.

2017-06-23T20:35:15.284Z cpu1:88725)Vmxnet3: 17623: Using default queue delivery for vmxnet3 for port 0x3000007

2017-06-23T20:35:15.284Z cpu1:88725)NetPort: 1660: enabled port 0x3000007 with mac 00:0c:29:44:93:21

2017-06-23T20:35:18.218Z cpu0:66070)Uplink: 4622: vmnic0: Non TSO L2 payload size exceeds uplink MTU. FrameLen: 9014, L3 header offset: 14
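
The last log line can be read as simple arithmetic: a frame length of 9014 bytes minus the 14-byte L3 header offset is a 9000-byte packet, and it was handed to vmnic0, whose MTU is 1500. The sketch below extracts those numbers from the line; the regular expression is written against this one message and `explain` is a hypothetical helper name.

```python
import re

LOG_RE = re.compile(
    r"Uplink: \d+: (?P<nic>vmnic\d+): Non TSO L2 payload size exceeds uplink MTU\. "
    r"FrameLen: (?P<frame>\d+), L3 header offset: (?P<l3>\d+)"
)

def explain(line, nic_mtu):
    """Return (nic, L3 packet size, bytes over the NIC MTU), or None if no match."""
    m = LOG_RE.search(line)
    if not m:
        return None
    # Frame length minus the L2 header gives the size compared against the MTU.
    l3_size = int(m["frame"]) - int(m["l3"])
    return m["nic"], l3_size, l3_size - nic_mtu[m["nic"]]

line = ("2017-06-23T20:35:18.218Z cpu0:66070)Uplink: 4622: vmnic0: Non TSO L2 payload "
        "size exceeds uplink MTU. FrameLen: 9014, L3 header offset: 14")
# MTU values copied from the esxcfg-nics -l output above.
nic_mtu = {"vmnic0": 1500, "vmnic1": 9000}
print(explain(line, nic_mtu))
```

If this reading is right, jumbo frames from the guest are leaving through vmnic0 rather than vmnic1, so it may be worth checking which port group the VMXNET3 adapter is attached to. That is an inference from a single log line, not something the output confirms.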
