VMware Cloud Community
kbonnel
Contributor

Strange Ethernet error that I can't find any info on.

I am using ESXi 6.5 with a supported Intel CT PCIe adapter (among some other supported adapters).  I started noticing that some of my VMs were not accessible, and I tracked it down to them using this NIC via vmnic0 (which is the CT adapter).  I also noticed that ESXi would show the adapter as offline, even though I can see it connected to my Cisco switch (both the link lights and the switch console).  If I disconnect and reconnect the cable, it will reappear as functional within ESXi, but given some time the same issue returns.  I have swapped cables and tested different switch ports, all with the same results.

I checked the dmesg output via the console, and I keep getting the following repeated errors.  I cannot find any info on them:

2017-02-22T16:25:09.994Z cpu2:65908)INFO (ne1000): false RX hang detected on vmnic0

2017-02-22T16:25:30.094Z cpu2:65908)INFO (ne1000): false RX hang detected on vmnic0

2017-02-22T16:27:10.195Z cpu2:65908)INFO (ne1000): false RX hang detected on vmnic0

2017-02-22T16:27:30.296Z cpu1:65908)INFO (ne1000): false RX hang detected on vmnic0

2017-02-22T16:32:10.399Z cpu0:65908)INFO (ne1000): false RX hang detected on vmnic0

2017-02-22T16:34:15.600Z cpu1:65908)INFO (ne1000): false RX hang detected on vmnic0

2017-02-22T16:39:31.204Z cpu2:65908)INFO (ne1000): false RX hang detected on vmnic0

Any ideas?  Maybe the adapter is going bad?

Thanks for any input/comments.
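For anyone trying to reproduce this: the same messages land in /var/log/vmkernel.log, so you can watch for them live from the ESXi Shell with something like the sketch below.

```shell
# Follow the vmkernel log and surface the ne1000 hang messages as they occur
tail -f /var/log/vmkernel.log | grep -i "hang detected"
```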

1 Solution

Accepted Solutions
MattiasN81
Hot Shot

I had the same problem in one of my labs. It's not the exact same NIC, but when I shifted from the ne1000 driver to the e1000e driver I got a huge improvement.

Try disabling ne1000 and using e1000e instead with this command, "esxcli system module set --enabled=false --module=ne1000", then reboot the host.


VMware Certified Professional 6 - DCV | VMware VTSP Software Defined Storage | Dell Blade Server Solutions - EMEA Certified | Dell PowerEdge Server Solutions - EMEA Certified | Dell Certified Storage Deployment Professional | Dell EMC Proven Professional. If you found my answers useful, please consider marking them as Helpful or Correct.


15 Replies
SterlingHealthc
Contributor

My limited understanding, from reading a Linux driver thread with a similar issue (https://sourceforge.net/p/e1000/bugs/416/), is that the "false RX hang" means it's a software flow-control issue rather than hardware.  It could be a driver bug, or the NIC getting overwhelmed, or the CPU overloaded and unable to respond in time.  You may want to try disabling flow control and see if that helps, though that may just mean packets get dropped when the CPU is too busy.

This card uses the e1000e chip, which is desktop class.  If you are pushing a lot of data through it while fully utilizing the host CPUs, in theory it could simply be overtaxed, with the CPU not responding to the interrupts quickly enough.  You may want to consider a NIC with more hardware offloading.
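If you want to try the flow-control experiment, recent ESXi builds expose pause-frame settings through esxcli. The sub-command and flag names below are from memory and may differ slightly by build, so verify with `esxcli network nic pauseParams set --help` on your host first:

```shell
# Show current flow-control (pause frame) settings for all NICs
esxcli network nic pauseParams list

# Disable RX/TX pause frames on the affected NIC (vmnic0 here;
# flag names are an assumption -- confirm with --help before running)
esxcli network nic pauseParams set --nic-name=vmnic0 --auto=false --rx=false --tx=false
```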

kbonnel
Contributor

Thank you for the info, it is appreciated. Unfortunately this particular adapter still gives off these errors even when it is unused. It is very strange indeed.

mcmurm
Contributor

I have this same NIC and started experiencing the same problem after upgrading to ESXi 6.5. It is provoked by pushing traffic over the NIC: if you let the server idle without much traffic you won't experience the issue, but as soon as you start stressing the NIC, you'll see this occur.

Unfortunately I've not found a fix other than maybe purchasing another NIC.

kbonnel
Contributor

I wanted to give an update in case it might help somebody else.  I no longer seem to be getting the errors, and the adapter has not dropped its connection to the switch, so it seems to be working fine.  I did three things to my system, so I can't say which one actually fixed the problem.

This is what I did, in this order:

1.  I downloaded the latest firmware update from Intel (bootutil) and flashed my two Intel NIC adapters to the latest level.  The adapter that was having the problem was somewhere on a 1.3.x version, and the update brought all of my adapters up to the 1.5.x level, I believe.  (I can't remember the exact numbers, it just seemed like a big jump.)

2.  I moved the problem NIC to a different PCIe slot.  It was originally in a PCIe x8/x16 slot, and I moved it to a PCIe x1 slot.

3.  I did a fresh install of ESXi 6.5 onto a new USB drive and re-imported my configuration.

Since then, I have not gotten the error.  I want to say that I did 1 and 2 at roughly the same time, and that I didn't get any errors in my original ESXi 6.5 upgrade installation (from 5.5 to 6.0 and then to 6.5), but I can't recall.

I check every day to see if it pops back up, but so far it has been about a week and no issues.

mcmurm
Contributor

I also updated the firmware on mine from 1.3.x to 1.5.8. Unfortunately it did not solve my problem. As soon as I stress the NIC, the machine goes offline and I start seeing those same errors in my kernel log. This goes on until I reboot the server.

2017-03-02T04:08:02.168Z cpu0:65926)INFO (ne1000): false RX hang detected on vmnic1

2017-03-02T04:08:07.168Z cpu0:65926)INFO (ne1000): false TX hang detected on vmnic1

2017-03-02T04:08:07.168Z cpu0:65926)INFO (ne1000): false RX hang detected on vmnic1

2017-03-02T04:08:12.168Z cpu0:65926)INFO (ne1000): false TX hang detected on vmnic1

2017-03-02T04:08:12.168Z cpu0:65926)INFO (ne1000): false RX hang detected on vmnic1

2017-03-02T04:08:17.168Z cpu0:65926)INFO (ne1000): false TX hang detected on vmnic1

2017-03-02T04:08:17.168Z cpu0:65926)INFO (ne1000): false RX hang detected on vmnic1

2017-03-02T04:08:22.168Z cpu0:65926)INFO (ne1000): false TX hang detected on vmnic1

2017-03-02T04:08:22.168Z cpu0:65926)INFO (ne1000): false RX hang detected on vmnic1

2017-03-02T04:08:27.168Z cpu0:65926)INFO (ne1000): false TX hang detected on vmnic1

2017-03-02T04:08:27.168Z cpu0:65926)INFO (ne1000): false RX hang detected on vmnic1

2017-03-02T04:08:32.168Z cpu0:65926)INFO (ne1000): false TX hang detected on vmnic1

2017-03-02T04:08:32.168Z cpu0:65926)INFO (ne1000): false RX hang detected on vmnic1

2017-03-02T04:08:37.168Z cpu0:65926)INFO (ne1000): false TX hang detected on vmnic1

2017-03-02T04:08:37.168Z cpu0:65926)INFO (ne1000): false RX hang detected on vmnic1

2017-03-02T04:08:42.169Z cpu0:65926)INFO (ne1000): false TX hang detected on vmnic1

2017-03-02T04:08:42.169Z cpu0:65926)INFO (ne1000): false RX hang detected on vmnic1

2017-03-02T04:08:47.169Z cpu0:65926)INFO (ne1000): false TX hang detected on vmnic1

2017-03-02T04:08:47.169Z cpu0:65926)INFO (ne1000): false RX hang detected on vmnic1

2017-03-02T04:08:52.169Z cpu0:65926)INFO (ne1000): false TX hang detected on vmnic1

2017-03-02T04:08:52.169Z cpu0:65926)INFO (ne1000): false RX hang detected on vmnic1

2017-03-02T04:08:57.169Z cpu0:65926)INFO (ne1000): false TX hang detected on vmnic1

2017-03-02T04:08:57.169Z cpu0:65926)INFO (ne1000): false RX hang detected on vmnic1

2017-03-02T04:09:00.024Z cpu2:67908)WARNING: NetPort: 1932: failed to disable port 0x2000006 on vSwitch0: Busy

2017-03-02T04:09:00.024Z cpu2:67908)NetSched: 701: 0x2000002: received a force quiesce for port 0x2000006, dropped 482 pkts

2017-03-02T04:09:00.024Z cpu2:67908)NetPort: 1879: disabled port 0x2000006

2017-03-02T04:09:00.026Z cpu2:67908)Vmxnet3: 17265: Disable Rx queuing; queue size 256 is larger than Vmxnet3RxQueueLimit limit of 64.

2017-03-02T04:09:00.026Z cpu2:67908)Vmxnet3: 17623: Using default queue delivery for vmxnet3 for port 0x2000006

2017-03-02T04:09:00.027Z cpu2:67908)NetPort: 1660: enabled port 0x2000006 with mac 00:0c:29:db:36:c9

2017-03-02T04:09:02.169Z cpu0:65926)INFO (ne1000): false TX hang detected on vmnic1

2017-03-02T04:09:02.169Z cpu0:65926)INFO (ne1000): false RX hang detected on vmnic1

2017-03-02T04:09:07.169Z cpu0:65926)INFO (ne1000): false TX hang detected on vmnic1

2017-03-02T04:09:07.169Z cpu0:65926)INFO (ne1000): false RX hang detected on vmnic1

2017-03-02T04:09:12.169Z cpu0:65926)INFO (ne1000): false TX hang detected on vmnic1

2017-03-02T04:09:12.169Z cpu0:65926)INFO (ne1000): false RX hang detected on vmnic1

MattiasN81
Hot Shot

I had the same problem in one of my labs. It's not the exact same NIC, but when I shifted from the ne1000 driver to the e1000e driver I got a huge improvement.

Try disabling ne1000 and using e1000e instead with this command, "esxcli system module set --enabled=false --module=ne1000", then reboot the host.
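Spelled out as a sketch (run over SSH on the ESXi host; the module change only takes effect after the reboot):

```shell
# Disable the ne1000 module so the NIC is claimed by e1000e on next boot
esxcli system module set --enabled=false --module=ne1000

# Optional: confirm the module is now marked disabled
esxcli system module list | grep ne1000

# Reboot so the driver change takes effect
reboot
```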


kbonnel
Contributor

(hopefully this doesn't end up as a double post, as my reply via email didn't seem to come through.)

I will try in the next day or two to push all my traffic out of the affected adapter to see if the same thing happens to me. Will report back.

UPDATE:  Since I was already up very early, I decided to disable my two other adapters and run only on the Intel Gigabit CT PCIe adapter I was having issues with.  I have been pushing all of my data through it for the past 30 minutes and have not gotten the error messages yet.  I usually push 4-10 MB/s through my adapters on a continuous basis with certain streaming devices, so I will keep it running throughout the morning and report back.

kbonnel
Contributor

Thanks for this.  I started googling a little more and realized that many others were doing the same thing in order to make the Intel NUC's I219-V adapter work.

I have 3 adapters in my system, and my I217-LM and the Gigabit CT PCI-E adapter are using the ne1000, while my Intel 82541PI (PCI adapter) is using the e1000 driver.

I wonder which driver I was using in ESXi 6.0 for the ne1000 adapters (e1000e, maybe), or whether ne1000 got an update in 6.5 that is resulting in these errors.

I may try disabling ne1000, but I wanted to see what the main differences between the two are, and which one is supposed to be "better".
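In case it helps anyone compare, you can check which driver each NIC is currently bound to before and after the switch:

```shell
# The "Driver" column shows which module claimed each vmnic
esxcli network nic list

# More detail (driver in use, firmware version, etc.) for a single NIC
esxcli network nic get -n vmnic0
```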

MattiasN81
Hot Shot

The ne1000 driver came with 6.5.

I don't think ne1000 is a better driver; quite the opposite if you have a NIC designed for desktops.

It does, however, support more NICs than e1000e, but most of them are server-branded cards.

mcmurm
Contributor

Lots of good information guys. I appreciate the community cooperation.

Mattias - I'll try your suggestion later tonight to use the e1000e driver and will report back my results.

kbonnel
Contributor

Thank you for the info.

I disabled the ne1000 and am now running on the e1000e.  I will keep an eye on it and report back any findings.

As a side note, with the upgrade to ESXi 6.5 I also had an issue with disk latency, slow speeds, etc.  I found a couple of sites reporting the same issues when using built-in SATA adapters, like the Lynx Point AHCI controller on my Lenovo TS140.  I was getting many, many errors in my dmesg output about disk slowness, response times, etc.  Apparently the newly updated vmw_ahci driver, which is supposed to support many new features, has some bugs (like the ne1000 driver).  I disabled that driver and went back to the original sata-ahci driver, and so far that appears to have fixed all of my disk latency and speed problems.

It is just funny: I wasn't even going down the path of driver issues on my SATA side, but after reading about disabling ne1000 to use the other driver, I started searching on the SATA side and found the exact same command to disable the newer AHCI driver.

Here is more info:

http://anthonyspiteri.net/homelab-supermicro-5020d-tnt4-storage-driver-performance-issues-and-fix/
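For anyone landing here with the same SATA symptoms, the fix described in that link is the same module-disable trick, just aimed at the AHCI driver (sketch; reboot required for it to take effect):

```shell
# Disable the native vmw_ahci driver so ESXi falls back to the
# legacy sata-ahci driver on next boot
esxcli system module set --enabled=false --module=vmw_ahci
reboot
```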

mcmurm
Contributor

Interesting, I've been getting the same disk latency errors too, kbonnel. ESXi 6.5 seems to hate us both. Good idea on reverting to the older SATA controller driver. I'll give that a try too after I adjust the NIC driver and test per Mattias' suggestion.

mcmurm
Contributor

After switching to the e1000e driver I've not had any further issues. Great suggestion, Mattias. 😉

MattiasN81
Hot Shot

Damn, never thought of doing the same with AHCI. Reverting to the old AHCI driver worked like a charm, no more crappy errors on the drives.

Thanks for the enlightenment! 😀

marcog86
Contributor

I tried to apply the change via SSH, but now I can't access ESXi because I lost the driver.

Do I have to reinstall ESXi now? 🙂
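(Editor's note, hedged: if you still have physical console access to the host, you can usually re-enable the module from the ESXi Shell on the DCUI instead of reinstalling, then reboot:)

```shell
# Re-enable the previously disabled module so the NIC comes back on next boot
esxcli system module set --enabled=true --module=ne1000
reboot
```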
