VMware Cloud Community
Kennylcf
Contributor

ESXi 6.5 connectivity issue on PowerEdge R430

Dear,

We are encountering a connectivity issue on a newly installed ESXi 6.5 host on a PowerEdge R430.

Whenever we perform a file copy (30-40 GB files) from one of the VM guests to another machine through the network (100 Mbps), it ends up with a connectivity issue.

Both the VM guest and the host become inaccessible through the network.

The ESXi console can still be accessed. We tried "Restart Management Network" and ran "Test Management Network", which reported OK for the default gateway and other machines.

But we are still unable to access either the guest or the host through the network.

Connectivity returns to normal after shutting down / restarting the ESXi host.

We checked the VM guest OS event log, which indicates that the guest was still running during the issue and followed the auto shutdown/startup setting of the ESXi host.

Looking into the log, we found the following warning repeated while the copy task was running:

cpu## :67701)WARNING: ntg3-throttled: Ntg3XmitPktList:372: vmnic0:TX ring full (0)
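For anyone comparing notes, a rough way to see which uplink is affected is to count these warnings per vmnic. The sketch below is only an illustration: the function name count_tx_ring_full is made up here, and it assumes the message format is exactly the one quoted above ("vmnicN:TX ring full").

```shell
# Count "TX ring full" warnings per vmnic in a vmkernel log read from stdin.
# ASSUMPTION: the message looks like the line quoted in this post,
# i.e. "... vmnicN:TX ring full (...)". The function name is hypothetical.
count_tx_ring_full() {
    # Keep only the vmnic name from matching lines, then tally per name.
    sed -n 's/.*\(vmnic[0-9][0-9]*\):TX ring full.*/\1/p' \
        | sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage on the host (usual ESXi log location; adjust if yours differs):
#   count_tx_ring_full < /var/log/vmkernel.log
```

Each output line is a vmnic name followed by the number of warnings seen for it.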

Does anyone have a suggestion for this issue? Thanks.

50 Replies
atbt
Contributor

We are seeing the same issue with our Broadcom BCM5720 using ntg3.  A case has been logged as well.

FrancWest
Contributor

Hi,

I'm not using an F5 VM on the host. Today it happened again while watching YouTube videos on my Apple TV, but since I also lose internet connectivity during the issue (the firewall is also a VM), I was unable to execute the debug steps you mention. I've now saved them locally and will execute them once the issue occurs again.

Frank.

FrancWest
Contributor

Hi,

Just curious: why is an F5 BIG-IP VM related to this issue?

I'm running a Sophos UTM VM, and it looks like the traffic passing through it causes the ESXi 6.5 host's NIC to hang/crash. Since the issue only happens when watching YouTube on my Apple TV 4, the only thing related to ESXi is that the traffic is passing through a VM running on it.

Frank.

vecnar
Contributor

Hi Bo,

We have a Dell R430 with four BCM5720 NICs.

I reproduced the issue twice with Server 2012 R2 with a vmxnet3 NIC by copying an 89.2 GB file using robocopy to an SMB share on a NAS over Fast Ethernet switches. The server is a domain controller and file server with shares used for folder redirection by around 15 users. It took 1-2 hours to reproduce the issue.

1) I reproduced the issue twice, on vmnic3 and vmnic2.

2) I created a port group, assigned it vmnic1 (which had no problems accessing the LAN), and connected the VM to it.

3) Enabled logging with "vsish -e set /system/modules/ntg3/loglevels/ntg3 1".

4) Using robocopy, started copying the 89.2 GB file to the SMB share on the NAS at 19:10 on 27/04/2017.

5) Confirmed that the VM was using vmnic1 by running esxtop and pressing "n".

6) The issue occurred between 8 and 9 PM.

7) Ran "vm-support" and generated the file.

As I don't have access to an account with a support subscription, I can't create a service request.

Where would you like me to upload the files?

How can I disable the logging that I enabled at step 3?

Thanks in advance

chnb
VMware Employee

Hi all,

Just a quick update as I'm out of office. We have root-caused the issue, and the fix has been under internal testing for two weeks. We'll release the fixed driver soon. I'll provide more information about it next week.

Thanks,

Bo

atbt
Contributor

Thanks for the update, Bo.  I was just provided with ntg3  4.1.2.0 a few days ago from the technician assigned to my ticket.  Is the fixed version planned for release the same version?

FreddyFredFred
Hot Shot

Is this issue just with the ntg3 driver, or is it possible this happens with bnx2x as well? I've got a colleague seeing the network stop working under continuous load after a few hours of testing, so I was wondering if this issue might extend to other Broadcom/QLogic drivers.

chnb
VMware Employee

atbt:

Yes, the fixed driver planned for release will be the same as the ntg3 4.1.2 provided to you.

FreddyFredFred:

This issue exists only in ntg3. bnx2x is very different. There appears to be an issue of bnx2x losing connectivity intermittently, which is currently under very active investigation by my colleagues. It has not yet been confirmed as a driver issue. Please file a ticket with VMware support so we can understand and help you with your specific issue.

myofficer
Contributor

I have the same problem on a Dell R620 with ESXi 6.5. The NIC is a BCM5720.

   Driver Info:
         Bus Info: 0000:01:00:0
         Driver: ntg3
         Firmware Version: bc 1.39 ncsi 1.3.16.0
         Version: 4.1.0.0

When will the fixed driver be released?

Thank you.

robertquast
Contributor

Seeing almost the same issues. Trying to track down the "ntg3 4.1.2" driver and not having much luck with either vendor. Any chance it is available publicly? Thanks.

myofficer
Contributor

Where can I get ntg3 4.1.2?

Can you send it to my email myofficer@126.com?

Thank you

chnb
VMware Employee

Hi,

The 4.1.2 driver has not yet been released for general use. We are getting the fixed driver included in the next update release of vSphere 6.5. We are also trying to release the fixed driver by itself as soon as possible, but there are some processes to go through and I can't say when it will be available. I will update here once we have released it.

Thanks,

Bo

fvdwestelaken
Contributor

Why not release it at least to those of us here in the forum who are experiencing the issue? We can test the driver before it's released to the general public; this way we can tell if it's really solved. We have been struggling with this issue since the release of 6.5, six months already.

myofficer
Contributor

We cannot wait and hope it is released as soon as possible.

It would be better if you could give it to us for testing before the final release.

Scissor
Virtuoso

If you want to get the driver early you should open a support request with VMware so they can keep track of which customers are testing it.  Otherwise waiting for the official release is the correct thing to do.

If you are running into serious problems while you wait then you could always install a different brand of NIC in your server as a workaround.  I have had good experience with Intel NICs.

chnb
VMware Employee

I fully understand the frustration, and please know that we are doing everything to expedite the official release of the fix. We have verified the fix with several customers who contacted VMware support.

We found two separate issues which may cause loss of connection in ntg3, and both are fixed in the upcoming driver. One of them is related to the handling of malformed TSO packets, which I believe is the cause of trouble for most people in this thread. For this issue a workaround is to turn off TSO, which may cause a small increase in CPU load in the ESXi host. To turn off TSO, run the ESXi console command "esxcli network nic tso set -e 0 -n vmnicX" for every vmnicX using the ntg3 driver. This setting is not persistent after host reboot. If you used this workaround, and either find it not working or encounter further issues, please let me know.
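As a convenience, the TSO workaround above can be looped over every ntg3 uplink instead of typing it per NIC. This is only a sketch under stated assumptions: the helper names ntg3_nics and disable_tso_on_ntg3 are made up for illustration, and it assumes "esxcli network nic list" prints the NIC name in column 1 and the driver name in column 3. The tso set command itself is the one quoted above.

```shell
# Print the names of all NICs bound to the ntg3 driver.
# Reads `esxcli network nic list` output on stdin.
# ASSUMPTION: NIC name is column 1 and driver name is column 3 of that table.
ntg3_nics() {
    awk '$3 == "ntg3" { print $1 }'
}

# Turn TSO off on every ntg3 NIC, using the command quoted in this post.
# Not persistent: it has to be re-run after each host reboot.
disable_tso_on_ntg3() {
    for nic in $(esxcli network nic list | ntg3_nics); do
        echo "Disabling TSO on $nic"
        esxcli network nic tso set -e 0 -n "$nic"
    done
}

# Usage on the ESXi host:
#   disable_tso_on_ntg3
```

Check the NIC list by eye first ("esxcli network nic list") to confirm the column assumption holds on your build before running the loop.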

Thank you.

Bo

bitwolf
Contributor

Bo,

We have a couple of R430s on 4.1.0.0 with the same issue. We tried the suggested workaround of disabling TSO, but it didn't make any difference: according to vmkernel.log, the card keeps going through cycles of Ntg3UplinkReset/Ntg3UplinkQuiesceIO/Ntg3UplinkStartIO. What is the other issue that the upcoming driver is supposed to fix, and do you have updates on the release date?

chnb
VMware Employee

Hi,

ntg3 4.1.2.0 is now available for download at https://my.vmware.com/group/vmware/details?downloadGroup=DT-ESX65-BROADCOM-NTG3-4120&productId=614 . It should fix the loss of connectivity issues reported to us. Please file a support request or report here if it does not fix the issue for you, or if you encounter other problems related to the driver.

Thank you,

Bo

myofficer
Contributor

Thank you so much. We will try it.

fvdw
Contributor

Unfortunately, the new driver didn't help. I just lost network connectivity again, once more while watching YouTube on my Apple TV going through a virtual Sophos UTM. I was able to collect the logs using the commands posted earlier in this thread, so if you're interested I can upload them. The vmkernel.log reports this at the moment network connectivity is lost:

2017-05-27T01:23:14.379Z cpu6:142894)NetLB: 2242: Driver claims supporting 0 TX queues, and 0 queues are accepted.
2017-05-27T01:23:14.379Z cpu6:142894)NetLB: 2246: Driver claims supporting 0 RX queues, and 0 queues are accepted.
2017-05-27T01:23:14.380Z cpu6:142894)NetLB: 2242: Driver claims supporting 0 TX queues, and 0 queues are accepted.
2017-05-27T01:23:14.380Z cpu6:142894)NetLB: 2246: Driver claims supporting 0 RX queues, and 0 queues are accepted.
2017-05-27T01:23:14.380Z cpu6:142894)NetLB: 2242: Driver claims supporting 0 TX queues, and 0 queues are accepted.
2017-05-27T01:23:14.380Z cpu6:142894)NetLB: 2246: Driver claims supporting 0 RX queues, and 0 queues are accepted.
2017-05-27T01:23:14.381Z cpu6:142894)NetLB: 2242: Driver claims supporting 0 TX queues, and 0 queues are accepted.
2017-05-27T01:23:14.381Z cpu6:142894)NetLB: 2246: Driver claims supporting 0 RX queues, and 0 queues are accepted.
2017-05-27T01:23:14.382Z cpu6:142894)WARNING: Tcpip_Vmk: 781: vmk_get_gateway failed with error = 0x2d, status = 0xbad0105
2017-05-27T01:23:14.382Z cpu6:142894)WARNING: Tcpip_Vmk: 781: vmk_get_gateway failed with error = 0x2d, status = 0xbad0105
2017-05-27T01:23:14.383Z cpu6:142894)Tcpip_Vmk: 129: get connection pkt trace failed with error code 195887136
2017-05-27T01:23:14.383Z cpu6:142894)Tcpip_Vmk: 129: get connection pkt trace failed with error code 195887136
2017-05-27T01:23:14.383Z cpu6:142894)Tcpip_Vmk: 96: get connection stats failed with error code 195887136
2017-05-27T01:23:14.383Z cpu6:142894)Tcpip_Vmk: 129: get connection pkt trace failed with error code 195887136
2017-05-27T01:23:14.383Z cpu6:142894)Tcpip_Vmk: 129: get connection pkt trace failed with error code 195887136
2017-05-27T01:23:14.383Z cpu6:142894)Tcpip_Vmk: 96: get connection stats failed with error code 195887136
2017-05-27T01:23:14.383Z cpu6:142894)Tcpip_Vmk: 129: get connection pkt trace failed with error code 195887136
2017-05-27T01:23:14.383Z cpu6:142894)Tcpip_Vmk: 129: get connection pkt trace failed with error code 195887136
2017-05-27T01:23:14.383Z cpu6:142894)Tcpip_Vmk: 96: get connection stats failed with error code 195887136
2017-05-27T01:23:14.383Z cpu6:142894)Tcpip_Vmk: 129: get connection pkt trace failed with error code 195887136
2017-05-27T01:23:14.383Z cpu6:142894)Tcpip_Vmk: 129: get connection pkt trace failed with error code 195887136
2017-05-27T01:23:14.383Z cpu6:142894)Tcpip_Vmk: 96: get connection stats failed with error code 195887136
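Since an excerpt like this is mostly the same few messages repeated, it can be easier to read as a per-message tally. The sketch below is one possible way to do that; the function name summarize_log is made up here, and it assumes each line starts with a timestamp followed by a single space, as in the excerpt above.

```shell
# Collapse repeated vmkernel log messages read from stdin into
# "<count>x <message>" lines, in order of first appearance.
# ASSUMPTION: every line begins with a timestamp field and one space.
# The function name is hypothetical.
summarize_log() {
    awk '{
        sub(/^[^ ]+ /, "")                  # drop the leading timestamp field
        if (!($0 in count)) order[++n] = $0 # remember first-seen order
        count[$0]++
    }
    END {
        for (i = 1; i <= n; i++) print count[order[i]] "x " order[i]
    }'
}

# Usage on the host, for example on the last 200 lines of the log:
#   tail -n 200 /var/log/vmkernel.log | summarize_log
```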

I have the correct driver installed:

[root@esxi:~] vmkload_mod -s ntg3 | grep Version
Version: 4.1.2.0-1vmw.650.0.0.4598673
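For others who want to confirm the same thing on several hosts, the check can be scripted. The following is a sketch, not an official tool: the function names driver_version and version_at_least are made up, and it assumes the "Version:" line has the form shown above, with the driver version before the first "-".

```shell
# Extract the driver version (the text before the first "-") from
# `vmkload_mod -s ntg3` output read on stdin.
# ASSUMPTION: the line looks like "Version: 4.1.2.0-1vmw...." as shown above.
driver_version() {
    awk '$1 == "Version:" { split($2, parts, "-"); print parts[1]; exit }'
}

# Succeed if dotted version $1 is >= dotted version $2,
# comparing numerically field by field (missing fields count as 0).
version_at_least() {
    awk -v have="$1" -v want="$2" 'BEGIN {
        n = split(have, h, "."); m = split(want, w, ".")
        max = (n > m) ? n : m
        for (i = 1; i <= max; i++) {
            a = (i <= n) ? h[i] + 0 : 0
            b = (i <= m) ? w[i] + 0 : 0
            if (a > b) exit 0
            if (a < b) exit 1
        }
        exit 0
    }'
}

# Usage on the ESXi host:
#   v=$(vmkload_mod -s ntg3 | driver_version)
#   if version_at_least "$v" 4.1.2.0; then echo "fixed driver ($v)"; else echo "old driver ($v)"; fi
```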

I'm running vSphere 6.5 on an HP ProLiant DL360p Gen8.

Franc.
