BaumMeister
Contributor

dead I/O on igb-nic (ESXi 6.7)

Hi,

I'm running a homelab with ESXi 6.7 (13006603). I have three NICs in my host: two are onboard and one is an Intel ET 82576 dual-port PCIe card. All NICs are assigned to the same vSwitch; at the moment only one is connected to the (physical) switch.

When I'm using one of the 82576 NICs and put heavy load on it (like backing up VMs via Nakivo B&R), the NIC stops working after a while and no longer responds. Only a reboot of the host or (much easier) physically reconnecting the NIC (cable out, cable in) solves the problem.

I suspected a driver issue, so I updated to the latest driver from Intel:

[root@esxi:~] /usr/sbin/esxcfg-nics -l
Name    PCI          Driver      Link Speed      Duplex MAC Address       MTU    Description
vmnic0  0000:04:00.0 ne1000      Down 0Mbps      Half   00:25:90:a7:65:dc 1500   Intel Corporation 82574L Gigabit Network Connection
vmnic1  0000:00:19.0 ne1000      Up   1000Mbps   Full   00:25:90:a7:65:dd 1500   Intel Corporation 82579LM Gigabit Network Connection
vmnic2  0000:01:00.0 igb         Down 0Mbps      Half   90:e2:ba:1e:4d:c6 1500   Intel Corporation 82576 Gigabit Network Connection
vmnic3  0000:01:00.1 igb         Down 0Mbps      Half   90:e2:ba:1e:4d:c7 1500   Intel Corporation 82576 Gigabit Network Connection

[root@esxi:~] esxcli software vib list|grep igb
net-igb                        5.2.5-1OEM.550.0.0.1331820            Intel   VMwareCertified   2019-06-16
igbn                           0.1.1.0-4vmw.670.2.48.13006603        VMW     VMwareCertified   2019-06-07

Unfortunately this didn't solve the problem.

However, this behaviour doesn't occur when I'm using one of the NICs that use the ne1000 driver.

Any idea how to solve the issue?

(... or at least dig down to its root?)

Thanks a lot in advance.

Regards

Chris

PS: I found another thread which might be connected to my problem: "Stopping I/O on vmnic0". Same system behaviour, same driver.
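
For anyone comparing their own host against the listing in this post, here is a minimal Python sketch that groups NICs by driver from `esxcfg-nics -l` output. It assumes the column layout is exactly the one shown above (nine whitespace-separated columns, with the description last); other ESXi builds may print something different. The function names `parse_nics` and `group_by_driver` are made up for illustration.

```python
from collections import defaultdict

# Sample text copied from the `esxcfg-nics -l` listing in the original post.
SAMPLE = """\
Name    PCI          Driver      Link Speed      Duplex MAC Address       MTU    Description
vmnic0  0000:04:00.0 ne1000      Down 0Mbps      Half   00:25:90:a7:65:dc 1500   Intel Corporation 82574L Gigabit Network Connection
vmnic1  0000:00:19.0 ne1000      Up   1000Mbps   Full   00:25:90:a7:65:dd 1500   Intel Corporation 82579LM Gigabit Network Connection
vmnic2  0000:01:00.0 igb         Down 0Mbps      Half   90:e2:ba:1e:4d:c6 1500   Intel Corporation 82576 Gigabit Network Connection
vmnic3  0000:01:00.1 igb         Down 0Mbps      Half   90:e2:ba:1e:4d:c7 1500   Intel Corporation 82576 Gigabit Network Connection
"""


def parse_nics(listing):
    """Return one dict per NIC row (assumes the 9-column layout shown above)."""
    nics = []
    for line in listing.splitlines():
        # Split into at most 9 fields so the description keeps its spaces.
        fields = line.split(None, 8)
        # Skip blank lines and the header row; data rows start with "vmnic".
        if len(fields) < 9 or not fields[0].startswith("vmnic"):
            continue
        name, pci, driver, link, speed, duplex, mac, mtu, description = fields
        nics.append({
            "name": name,
            "pci": pci,
            "driver": driver,
            "link": link,
            "speed": speed,
            "duplex": duplex,
            "mac": mac,
            "mtu": int(mtu),
            "description": description,
        })
    return nics


def group_by_driver(nics):
    """Map each driver name to the list of NIC names that use it."""
    groups = defaultdict(list)
    for nic in nics:
        groups[nic["driver"]].append(nic["name"])
    return dict(groups)


for driver, names in group_by_driver(parse_nics(SAMPLE)).items():
    print(driver, names)
```

This only tells you which uplinks sit on the igb driver; it says nothing about why they stall.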

27 Replies
Madmax01
Expert

I have now run into this lovely issue as well.

igb 5.3.3.

The host had 12 days of uptime after the upgrade from the ancient version 6.0, and then the issue started.

I have now downgraded the driver to 5.3.2; so far it is fine. We'll see how the next 12 days go.

How is it working for everyone else?

Intel itself seems to have a 5.3.6, but I don't see it ported to ESXi as a VIB.

thx

max

VirtualSlam
Contributor

I've still been good on 5.3.2 as of today. I've been running constant pings across multiple paths and larger transfers of 10GB downloads and 30GB uploads each day without issue.
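
If you log your constant pings, a small script can pull out the stretches where the NIC went quiet. The following Python sketch assumes you already have a time-ordered list of `(timestamp, ok)` samples from whatever ping tool you use; the function name `find_outages` and the threshold of three consecutive failures are illustrative choices, not anything taken from ESXi or the igb driver.

```python
def find_outages(samples, min_failures=3):
    """Return (first_failed_ts, last_failed_ts, count) for each run of
    at least `min_failures` consecutive failed pings.

    `samples` is an iterable of (timestamp, ok) pairs in time order.
    """
    outages = []
    run = []  # timestamps of the current streak of failed pings
    for timestamp, ok in samples:
        if ok:
            # A successful ping ends the streak; keep it only if long enough.
            if len(run) >= min_failures:
                outages.append((run[0], run[-1], len(run)))
            run = []
        else:
            run.append(timestamp)
    # The log may end while the link is still down.
    if len(run) >= min_failures:
        outages.append((run[0], run[-1], len(run)))
    return outages


# Made-up example: one dropped ping, then two longer gaps.
example = [(0, True), (1, False), (2, True), (3, False), (4, False),
           (5, False), (6, True), (7, False), (8, False), (9, False),
           (10, False)]
print(find_outages(example))
```

A single lost ping is ignored here on purpose, since a dead igb uplink shows up as a long run of failures rather than a stray drop.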

Madmax01
Expert

;( It happened again with the 5.3.2 driver on my side. My god, it is so impressive that a working setup just breaks with newer versions. I need to replace the card, which is senseless.

@virtualslam: Which FW are you on?

I have 1.2, and the issue seems to be with it.

Best regards

Max

VirtualSlam
Contributor

I have a 4 port Dell version with FW 1.77 and a 2 port Supermicro version with FW 1.13.1. I am running the tests across each card though and with the same port configuration I had before switching to 5.3.2. Sorry to hear it didn't help you. I'm still skeptical of the reliability of these NICs in ESXi. But it is just for a lab and I will need newer NICs one day when I switch to ESXi 7 since they aren't supported anymore anyway.

VirtualSlam
Contributor

Well there we go. Pushed it just that much harder with a template deployment and the nic crashed. So 5.3.2 does not make it stable enough. I guess it's time to shop for some new nics.

Smitty0001
Contributor

I wanted to thank everyone for this thread. I am having the same issue. I upgraded one host to 6.5U3 (15256549) and thought everything was fine. I ran it for a couple of weeks and didn't notice any issue. Then I upgraded three more hosts to the same build and the VMs just started dropping off the network left and right. After banging my head against the wall I finally found this thread. Sure enough, I moved my traffic to an onboard Broadcom and the problem stopped. I have eight Intel Gigabit ET Quad Port cards that show up as Intel Corporation 82576 Gigabit Network Connection, two quad-port cards per machine. We have a four-port port channel for production traffic that was having fits. The driver was igb 5.3.3. We were thinking about downgrading it to 5.3.2, but after reading all the comments here, it seems like pretty much all the versions stop working once you go to 6.5U3.

We just purchased eight new Intel I350-T4 cards to replace the Intel 82576s. We have replaced them on one host and so far it seems to have fixed the issue. We were able to recreate the issue by copying a 50GB file off one of our VMs; it would pretty much take down the network each time with the 82576s. We tested...

Distributed vSwitch - 4 Port Port Channel using Intel 82576's - Copy File...high packet loss and eventually the VMs would lose connectivity.

Distributed vSwitch - 1 Port Trunk using Intel 82576 - Copy File...same problem

Standard vSwitch - 1 Port Trunk using Intel 82576 - Copy File...same problem...lost connectivity

Distributed vSwitch - 1 Port Trunk using Broadcom - Copy File - Works

Standard vSwitch - 1 Port Trunk using Broadcom - Copy File - Works.

Finally

Distributed vSwitch - 4 Port Port Channel using Intel I350-T4's - Copy File - Works.

So...it definitely seems to be related to 6.5U3 or the igb driver. Those NICs were working fine before the upgrade. I hated wasting the money on eight new quad-port NICs, but my VMware support case so far has gone nowhere and I had production equipment down.

Thank you all for the info you posted. I still have a case open with VMware, but now that I am swapping the NICs to resolve the issue, they will probably end up just closing it.
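
The six file-copy tests listed in this post can be written down as a small table to check which factor actually lines up with the failures. This is only a Python sketch: the rows are transcribed from the list above, and the helper names `outcomes_by` and `consistent_predictors` are invented for illustration.

```python
from collections import defaultdict

# (vSwitch type, uplink layout, NIC, copy succeeded) as reported in the post.
TESTS = [
    ("Distributed", "4-port port channel", "Intel 82576", False),
    ("Distributed", "1-port trunk", "Intel 82576", False),
    ("Standard", "1-port trunk", "Intel 82576", False),
    ("Distributed", "1-port trunk", "Broadcom", True),
    ("Standard", "1-port trunk", "Broadcom", True),
    ("Distributed", "4-port port channel", "Intel I350-T4", True),
]

COLUMNS = {0: "vswitch", 1: "uplink", 2: "nic"}


def outcomes_by(tests, column):
    """Map each value found in `column` to the set of outcomes seen with it."""
    seen = defaultdict(set)
    for row in tests:
        seen[row[column]].add(row[3])
    return dict(seen)


def consistent_predictors(tests):
    """Return the columns where every value always gave the same outcome."""
    return [
        name
        for column, name in COLUMNS.items()
        if all(len(results) == 1 for results in outcomes_by(tests, column).values())
    ]


print(consistent_predictors(TESTS))
print(outcomes_by(TESTS, 2))
```

In these six runs, only the NIC column separates the failing copies from the working ones; vSwitch type and uplink layout each appear on both sides. It cannot separate the 82576 hardware from the igb driver, since every 82576 test ran on that driver.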

VirtualSlam
Contributor

Just as a follow-up to my situation: I got Intel I350 NICs and they have worked well in every metric that I tested before. pfSense can now use vmxnet3. Storage vMotions are using both of the vNICs that I have given them and are not losing connection like before. And lastly, backups that used to make the NIC lose connection are working without issue as well.

dz0077
Contributor

Thank you, it's been a long time coming, and following this method, 2 Dell C2100 servers, VMware ESXi, 6.5.0, 10719125 have been operating normally for 19 days.
