VMware Cloud Community
shuguet
Enthusiast

NIC failing during traffic

Hello there,

I recently received what I hope will be my new lab rig, a Supermicro 1U Twin server: http://www.supermicro.nl/products/system/1U/5016/SYS-5016TI-TF.cfm

It's not on the HCL, but it's still server-grade hardware:

- Xeon X3440

- ECC RAM

- Dual onboard Intel 82574L NICs

Most if not all of these components are on the HCL, and truth is, ESXi installs without complaining and works out of the box.

The problem is that when the NICs carry traffic (for example, iSCSI traffic to my SAN), the NIC "crashes" while ESXi itself keeps running.

I've attached an extract of the vmkernel log captured during the following sequence:

- Boot a VM

- Start a Debian installation inside the VM

At some point during the Debian installation, the NIC used for iSCSI traffic simply crashes, and the following lines appear in the log:

2012-02-27T20:15:31.595Z cpu2:2050)WARNING: NetSched: 1713: Scheduler [0x4100045c5b80] lock up [stopped=0] for vmnic3:
2012-02-27T20:15:31.595Z cpu2:2050)WARNING: NetSched: 1723: detected at 866999 while last xmit at 861775 and 2/37092 packets/bytes in flight [window full 1] and binary heap size 1 [stress 0]
2012-02-27T20:15:31.595Z cpu2:2050)WARNING: NetSched: 1732: Packets completion seem stuck, issuing reset on vmnic3 [stress 0]

The lines that follow these in the log are the iSCSI software initiator complaining that it cannot write to the LUN.
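To pull the key numbers out of warnings like these, here is a minimal Python sketch. It assumes the exact message layout quoted above; the function name `parse_netsched_lockup` and the dictionary keys are illustrative, not ESXi terminology. The unit of the "detected at" and "last xmit" counters is not stated in the log, so the difference is reported only as raw ticks.

```python
import re

# Patterns follow the three NetSched warning lines quoted in this thread.
# Names and dictionary keys are illustrative assumptions, not ESXi terms.
LOCKUP_RE = re.compile(
    r"NetSched: \d+: Scheduler \[0x[0-9a-f]+\] lock up .* for (?P<nic>vmnic\d+):"
)
DETAIL_RE = re.compile(
    r"NetSched: \d+: detected at (?P<detected>\d+) while last xmit at (?P<last_xmit>\d+) "
    r"and (?P<packets>\d+)/(?P<bytes>\d+) packets/bytes in flight"
)
RESET_RE = re.compile(r"issuing reset on (?P<nic>vmnic\d+)")


def parse_netsched_lockup(lines):
    """Collect the NIC name, stall length and in-flight counters of one lock-up."""
    event = {}
    for line in lines:
        m = LOCKUP_RE.search(line)
        if m:
            event["nic"] = m.group("nic")
            continue
        m = DETAIL_RE.search(line)
        if m:
            # Unit of these counters is unknown, so keep the raw difference.
            event["stall_ticks"] = int(m.group("detected")) - int(m.group("last_xmit"))
            event["packets_in_flight"] = int(m.group("packets"))
            event["bytes_in_flight"] = int(m.group("bytes"))
            continue
        if RESET_RE.search(line):
            event["reset_issued"] = True
    return event


sample = [
    "2012-02-27T20:15:31.595Z cpu2:2050)WARNING: NetSched: 1713: Scheduler [0x4100045c5b80] lock up [stopped=0] for vmnic3:",
    "2012-02-27T20:15:31.595Z cpu2:2050)WARNING: NetSched: 1723: detected at 866999 while last xmit at 861775 and 2/37092 packets/bytes in flight [window full 1] and binary heap size 1 [stress 0]",
    "2012-02-27T20:15:31.595Z cpu2:2050)WARNING: NetSched: 1732: Packets completion seem stuck, issuing reset on vmnic3 [stress 0]",
]
event = parse_netsched_lockup(sample)
print(event)
```

For the lines above, this reports vmnic3 with a gap of 5224 ticks between the last transmit and the moment the lock-up was detected, with 2 packets (37092 bytes) still in flight.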

I've also run the same test on the other onboard NIC without iSCSI traffic (i.e. with VM Network traffic only), with the same result.

Any ideas on how to fix the problem?

I have another dual-port card in the server that I intend to use in the meantime, but losing two onboard NICs that are on the HCL makes me sick...

Sylvain.

Sylvain Huguet vExpert 2014, 2013, 2012 & 2011 VCP4&5/VTSP4/VSP4&5 Nutanix NPP/NPSE/NPSR

marcelo_soares
Champion

Can you try upgrading the motherboard BIOS and/or the NIC firmware? A fix for this issue may already be available...

Marcelo Soares

shuguet
Enthusiast

Hello,

In fact I already checked, and I'm at the latest available revision for the BIOS, NIC, and IPMI firmware.

The output of ethtool -i vmnic2 indicates that I'm running the latest version of the Intel driver listed on the HCL.
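For comparing driver and firmware versions across hosts, the `key: value` output of `ethtool -i` is easy to turn into a dictionary. This is a minimal Python sketch; the helper name `parse_ethtool_info` is made up for illustration, and the sample text uses placeholder version strings, not the values from this server.

```python
def parse_ethtool_info(text):
    """Parse the 'key: value' lines printed by `ethtool -i <nic>` into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, since values such as bus-info contain colons.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


# Hypothetical sample: the version strings are placeholders, not real readings.
sample_output = """driver: e1000e
version: X.Y.Z
firmware-version: A.B-C
bus-info: 0000:02:00.0
"""
info = parse_ethtool_info(sample_output)
print(info["driver"], info["version"], info["firmware-version"])
```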

Sylvain.

shuguet
Enthusiast

No ideas? Anyone?

I can give any additional details that may be necessary.

Sylvain.

TobiaszJason
Contributor

We are experiencing the same issue on hardware we just purchased.  I have just started digging into it...

I saw your other post at: http://communities.vmware.com/message/1997632#1997632.  FYI, we are using a Dell PowerEdge R710 (http://www.dell.com/downloads/global/products/pedge/en/server-poweredge-r710-specs-en.pdf) with two Broadcom 5709 Dual Port 1GbE NIC w/TOE iSCSI, PCIe-4 (430-3260).

If you find out anything about your issue, it would be helpful if you could update this thread. I'll do the same. BTW, I saw this: http://www.vmug.nl/phpbb/viewtopic.php?t=5680. It's in Dutch, but I think they are saying there is some incompatibility with the NICs he was using.

shuguet
Enthusiast

Still no luck getting this to work.

I just applied a bunch of patches, at least three of them related to the e1000e driver, and the problem is still not fixed.

2012-04-01T13:03:33.645Z cpu6:9136)WARNING: NetSched: 1817: Scheduler [0x4100188b79c0] lock up [stopped=0] for vmnic3:
2012-04-01T13:03:33.645Z cpu6:9136)WARNING: NetSched: 1827: detected at 4035995 while last xmit at 4030121 and 1/60 packets/bytes in flight [window full 0] and binary heap size 0 [stress 0]
2012-04-01T13:03:33.645Z cpu6:9136)WARNING: NetSched: 1836: Packets completion seem stuck, issuing reset on vmnic3 [stress 0]

Any idea is welcome...

Sylvain.

Datto
Expert

shuguet
Enthusiast

I don't think it's related, because those links apply to a similar problem at the virtual machine level (vNIC), whereas my problem is at the physical NIC level (vmnic).

Thanks for the info anyway, it was worth looking into it.

Sylvain.

Datto
Expert

You might also look at this thread and see if it relates to what you're experiencing:

http://communities.vmware.com/message/1430032

Datto

shuguet
Enthusiast

I tried all three IntMode values (0, 1, 2) for the e1000e driver, with no luck.

With IntMode=0,0 (there are two NICs in this box), the problem was worse: the NIC failed just from loading a simple web page on the only VM using this vmnic.

The other two modes didn't help much either; the NIC kept failing after somewhere between 100Mb and 400Mb of traffic had passed over the vmnic.
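For anyone repeating this test, here is a small Python sketch that builds the per-port IntMode option string and rejects values outside 0-2. The mapping of values to legacy/MSI/MSI-X follows Intel's e1000e driver documentation as far as I know; the helper name `build_intmode_option` is hypothetical, and the `esxcfg-module -s` command line it prints is an assumption about how the option would be applied on ESXi, so check it against the documentation for your ESXi version before using it.

```python
# IntMode values as described for Intel's e1000e driver (assumed mapping).
VALID_INT_MODES = {0: "legacy", 1: "MSI", 2: "MSI-X"}


def build_intmode_option(modes):
    """Return an option string such as 'IntMode=0,0', one value per NIC port."""
    for mode in modes:
        if mode not in VALID_INT_MODES:
            raise ValueError(f"unsupported IntMode value: {mode!r}")
    return "IntMode=" + ",".join(str(mode) for mode in modes)


option = build_intmode_option([0, 0])  # two onboard ports, both in legacy mode
print(option)
# Assumed way to apply it on the host; a reboot is normally needed afterwards.
print(f'esxcfg-module -s "{option}" e1000e')
```

One value is given per port, which is why the two-NIC box in this thread uses `IntMode=0,0` and not a single `IntMode=0`.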

At least I have things to try!

Keep them coming if you have any other thoughts on the matter :)

Sylvain.

Datto
Expert

You might run a network cable tester or swap the network cable to see if there's any positive effect.

Datto

shuguet
Enthusiast

I get the very same problem on two boards, and on all onboard NICs on those two boards.

No problem whatsoever with the additional, non-onboard NICs.

Besides, changing and checking the cables was one of my first steps in troubleshooting the problem :)

I'll see if I can get someone from Supermicro to look into it, but as they do not officially support this board with ESXi, I suspect it's a lost battle anyway.

Sylvain.

Datto
Expert

I once had some Dell servers whose NICs wouldn't work unless I disabled the USB ports in the system BIOS; they exhibited a symptom similar to yours. If disabling the USB ports does help, you may be able to reassign the interrupts to get an arrangement that also works with USB enabled.

Datto

thomasq
Contributor

Hi Sylvain,

Did you get any further with this issue? I'm tearing my hair out over this exact problem too, in my case on a Supermicro X8S16-F, which I also bought specifically because everything was supported in ESXi...

I've purchased another Intel NIC and it works perfectly, but having two onboard NICs just sitting there is very frustrating (especially as I had plans for them). (Rant over)

In any case, does anyone have any further suggestions?

shuguet
Enthusiast

Hi thomasq,

As a matter of fact, yes I have it working fine now.

After trying to fix the problem myself, I opened a case with Supermicro sometime in June.

They were able to reproduce the problem and got in touch with Intel. They then had me try a couple of firmware versions, and the last one they sent me, at the end of July, fixed my problem.

I didn't get the chance to try it until 3 days ago, but I can report that it is working fine so far, even under a lot of stress (I switched my iSCSI network over to it and stressed it with a steady 90+Mb/s throughput for 12 hours straight).
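To put that stress test in perspective, here is a rough Python calculation of how much data 12 hours at that rate amounts to. The "Mb/s" figure above could be read as megabytes or megabits per second, so the sketch computes both; the function name `total_bytes` is just for illustration.

```python
def total_bytes(rate_mega_per_s, hours, unit_is_bits=False):
    """Total bytes moved at a steady rate, using decimal mega (10**6) units."""
    seconds = hours * 3600
    total = rate_mega_per_s * 1_000_000 * seconds
    # Convert to bytes when the rate was given in megabits per second.
    return total // 8 if unit_is_bits else total


as_megabytes = total_bytes(90, 12)                     # reading "90 Mb/s" as 90 MB/s
as_megabits = total_bytes(90, 12, unit_is_bits=True)   # reading it as 90 Mbit/s
print(f"{as_megabytes / 1e12:.2f} TB if the rate is in megabytes per second")
print(f"{as_megabits / 1e9:.0f} GB if the rate is in megabits per second")
```

Either way, the total is far beyond the 100Mb to 400Mb of traffic after which the NIC used to fail with the old firmware.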

I recommend that you check with Supermicro, as their support staff has really been helpful on this one.

I hope you'll get your NICs working!

Sylvain.

thomasq
Contributor

Sylvain,

Thank you so much for the reply - I had given up hope!

regards,

Thomas.

-- Thomas B. Quillinan

zhangyh2013
Contributor

Hi Sylvain,

Do you mean that Supermicro sent you an updated BIOS version and that solved the problem? Do you have any further details? I ran into a similar problem several months ago, with the following entries in vmkwarning.log, and it really drove me mad. Any suggestions? Thanks in advance.

2016-04-15T09:43:22.985Z cpu26:33758)WARNING: LinNet: netdev_watchdog:3474: NETDEV WATCHDOG: vmnic9: transmit timed out

2016-04-15T09:43:27.036Z cpu48:1887628)WARNING: netsched: NetSchedMClkWatchdogSysWorld:3921: vmnic8: failed to push the coalescing settings cranking upinflight window to infinite: Failure

2016-04-15T09:43:28.037Z cpu14:1887636)WARNING: netsched: NetSchedMClkWatchdogSysWorld:3921: vmnic9: failed to push the coalescing settings cranking upinflight window to infinite: Failure

2016-04-15T09:43:34.000Z cpu24:33758)WARNING: LinNet: netdev_watchdog:3474: NETDEV WATCHDOG: vmnic9: transmit timed out

2016-04-15T09:43:38.037Z cpu22:1887666)WARNING: netsched: NetSchedMClkWatchdogSysWorld:3921: vmnic8: failed to push the coalescing settings cranking upinflight window to infinite: Failure

2016-04-15T09:43:38.041Z cpu80:1887668)WARNING: netsched: NetSchedMClkWatchdogSysWorld:3921: vmnic9: failed to push the coalescing settings cranking upinflight window to infinite: Failure

2016-04-15T09:43:44.012Z cpu58:33751)WARNING: LinNet: netdev_watchdog:3474: NETDEV WATCHDOG: vmnic9: transmit timed out

2016-04-15T09:43:44.046Z cpu72:1887668)WARNING: netsched: NetSchedMClkWatchdogSysWorld:3863: vmnic9 : scheduler(0x410c9034c2f0)/device(0x410b5da37940) 0/1 lock up [stopped=0]:

2016-04-15T09:43:44.046Z cpu72:1887668)WARNING: netsched: NetSchedMClkWatchdogSysWorld:3874: detected at 682017011 while last xmit at 682011996 and 4401 bytes in flight [window 1500000 bytes] and last enqueued/dequeued at 682016995/682011996 [st$

2016-04-15T09:43:44.046Z cpu72:1887668)WARNING: netsched: NetSchedMClkWatchdogSysWorld:3890: vmnic9: packets completion seems stuck, issuing reset
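To see at a glance which NIC is affected in a log like this, a short Python sketch can count the watchdog timeouts and resets per vmnic. It assumes the message wording shown above; the function name `summarize_watchdog` is illustrative, and the sample reuses four of the lines quoted in this post.

```python
import re
from collections import Counter

# Patterns match the wording of the vmkwarning.log lines quoted above.
TIMEOUT_RE = re.compile(r"NETDEV WATCHDOG: (vmnic\d+): transmit timed out")
RESET_RE = re.compile(r"(vmnic\d+): packets completion seems stuck, issuing reset")


def summarize_watchdog(lines):
    """Count transmit timeouts and scheduler-issued resets per NIC."""
    timeouts, resets = Counter(), Counter()
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            timeouts[m.group(1)] += 1
            continue
        m = RESET_RE.search(line)
        if m:
            resets[m.group(1)] += 1
    return timeouts, resets


sample = [
    "2016-04-15T09:43:22.985Z cpu26:33758)WARNING: LinNet: netdev_watchdog:3474: NETDEV WATCHDOG: vmnic9: transmit timed out",
    "2016-04-15T09:43:34.000Z cpu24:33758)WARNING: LinNet: netdev_watchdog:3474: NETDEV WATCHDOG: vmnic9: transmit timed out",
    "2016-04-15T09:43:44.012Z cpu58:33751)WARNING: LinNet: netdev_watchdog:3474: NETDEV WATCHDOG: vmnic9: transmit timed out",
    "2016-04-15T09:43:44.046Z cpu72:1887668)WARNING: netsched: NetSchedMClkWatchdogSysWorld:3890: vmnic9: packets completion seems stuck, issuing reset",
]
timeouts, resets = summarize_watchdog(sample)
print(dict(timeouts), dict(resets))
```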

Regards,

Felix
