Re: NIC Crash

mkohlmann · ‎02-23-2010

I'm going nuts trying to figure out why I can't get these Intel NICs to be stable. Server setup:

Supermicro X8SIL-F
Xeon X3440
8GB DDR3-1333 ECC
4GB CF Card for ESXi Install
Dual Intel 82574L Gigabit NICs

After transferring a small amount of data to local storage via the datastore browser or via Veeam's FastSCP, the NIC will crash after ~30MB to 1GB. This is a supported NIC, which is why I purchased this particular board. It's similar to a problem I had with a Realtek NIC on an Atom board I was playing around with, but that card wasn't supported and a driver tweak solved the issue. Does anyone have any ideas on how to make this work? I may have to go to XenServer, but ESXi's setup is better for me.

Thanks

Mark

geddam · ‎02-23-2010

Couple of questions...

What is the driver version used for this NIC? It should be e1000e version 0.4.1.7. See the I/O Compatibility Guide for more info.

What build of ESX4i are you using? Are you using a server-vendor-specific image or the image downloaded from VMware? Was ESXi bundled with the server, or did you install it on a local drive?

Ramesh. Geddam

VCP 3&4, MCTS (Hyper-V)


mkohlmann · ‎02-24-2010

The 82574L is listed as being compatible in the I/O portion of the HCL.

ethtool -i vmnic2

driver: e1000e

version: 0.4.1.7-NAPI

firmware-version: 1.9-0

bus-info: 0000:04:00.0

ethtool -i vmnic3

driver: e1000e

version: 0.4.1.7-NAPI

firmware-version: 1.9-0

bus-info: 0000:05:00.0
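For anyone comparing several hosts against the driver version asked about above, here is a minimal Python sketch that parses `ethtool -i` style output into a dictionary and checks the driver name and version prefix. The function names and the `0.4.1.7` prefix check are illustrative assumptions, not part of any VMware tool, and the sample text is simply the vmnic2 output quoted in this post.

```python
def parse_ethtool_info(text):
    """Turn `ethtool -i` style 'key: value' lines into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as the PCI
        # address 0000:04:00.0 keep their own colons.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info


def is_expected_driver(info, driver="e1000e", version_prefix="0.4.1.7"):
    """Hypothetical check: driver name matches and version starts with the prefix."""
    return (
        info.get("driver") == driver
        and info.get("version", "").startswith(version_prefix)
    )


# Sample copied from the vmnic2 output in this post (blank lines are skipped).
SAMPLE = """\
driver: e1000e

version: 0.4.1.7-NAPI

firmware-version: 1.9-0

bus-info: 0000:04:00.0
"""

if __name__ == "__main__":
    info = parse_ethtool_info(SAMPLE)
    print(info["driver"], info["version"], is_expected_driver(info))
```

Matching on a prefix rather than the full string is a deliberate choice here, since the reported version carries a `-NAPI` suffix that the compatibility listing does not mention.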

This is a downloaded and installed copy of ESXi. I've installed it on both a 4GB CF card and a 100GB hard drive, and both have the same problem.

/sbin # uname -a

VMkernel esxi.local 4.0.0 #1 SMP Release build-219382 Dec 22 2009 19:18:55 x86_64 unknown

I'm not even running any VMs yet. I'm just trying to copy some ISOs to the datastore so I can create some VMs. Sometimes the NIC will crash and the management console will still operate; other times the management console is completely locked and a reboot is required.

ESX4 lists a driver of 0.4.1.7.2vmx but that driver doesn't appear to be available for ESXi4.

I am totally stumped. I see others have reported problems with these NICs, but this is ridiculous. I tried the IntMode=0,0,0,0 setting as mentioned in the VMware KB, but it hasn't made a difference.

This is a server motherboard with an Intel 3420 chipset.

Any help would be appreciated. I can't imagine I'm the only one having issues; the 82574L NIC is used on a number of server motherboards from a variety of vendors.

geddam · ‎02-24-2010

How many NICs do you have on board? Are they quad-port?

Thanks, Ramesh. Geddam

geddam · ‎02-24-2010

Also, have you tried the older driver version for this NIC in ESX4i?

http://downloads.vmware.com/d/details/esx_35u5_intel_82575_82576_dt/dGViZGolaGJkZXBo#version_history

Thanks

Ramesh. Geddam

VCP 3&4, MCTS (Hyper-V)


mkohlmann · ‎02-24-2010

It's two 82574L NICs. I changed the Plug and Play settings in the BIOS, which reassigned the IRQs for the NICs; ESXi then recognized them as two new cards and no longer recognized the old IRQ assignments.

Intel and VMware list the 82575/6 driver as being different from the 82574L driver, but I'll install it anyway.

Mark

mkohlmann · ‎02-26-2010

Well, I installed CentOS5 and experienced some NIC stability issues. I downloaded the Intel driver v1.1.21a and no longer had any issues. I installed Windows Server 2008 R2 x64 and haven't seen any issues. I guess I'll have to run VMware Server 2 until VMware decides it prudent to fix the driver for one of the most common NICs on the planet.

RootWyrm · ‎02-26-2010

Mark;

The problem you are seeing is specifically caused by a bug in the interrupt handling of the e1000e driver. To be overly technical about it, the problem appears to be in the RSS interrupt masking routines, which results in loss or corruption of the MSI-X IVAR table. You don't need to confirm the bug yourself, since the X8SIL-F is exactly the board I found it on, but if you want to, go to the unsupported console and type "ethtool -t vmnic1 offline", where vmnic1 is NOT an active management interface. You should immediately get a PSoD:

PCPU 0 locked up. Failed to ack TLB invalidate (0 others locked up).

cr2=0x0 cr3=0x400ed000 cr4=0x16c

*0:8353/ethtool 1:6231/sfcbd 2:4109/helper1-0 3:5287/sfcbd

The bug is known to VMware at this point, and they are actively investigating and working on the problem. Unfortunately, I don't have any sort of ETA on when a fix will be available.

Other operating systems will be unstable until you perform a reset of the IPMI, and set the IPMI Network to DEDICATED. This is an absolute requirement; if the IPMI is operating in Shared mode, the BMC will cause problems continuously. There are issues with other OS drivers due to poor programming practices and poor quality code, but I have confirmed FreeBSD, Windows Server 2003 and Windows Server 2008 as having absolutely no problems.

Hatclub · ‎03-09-2010

I too am experiencing the same issue with the X8SIL-F board.

I am running the latest system BIOS and the latest VMware ESXi 4 (complete with both of the patches released since 4.0.0u1).

I've set the IPMI to dedicated and this has helped but has not completely resolved the issue: we are still seeing the management interface cease to function after a day or two of uptime. When this happens, the VM I currently have deployed on the system is still accessible (it runs through the second NIC present on these boards).

I am a complete newbie at troubleshooting ESXi because, to be honest, when I have picked hardware from the HCL I've never had an issue. I don't have access to a shell, by the looks of things, but I can see over the IPMI that there is a VTY with the vmkernel log on it, which contains:

1:06:20:18.631 cpu0:4302)WARNING: LinNet: netdev_watchdog NETDEV_WATCHDOG: vmnic0: transmit timed out

1:06:20:18.632 cpu0:4302)BUG: warning at vmkdrivers/src26/vmklinux26/vmware/linux_net.c:3255/netdev_watchdog() (inside vmklinux)

1d6h uptime seems about right for the time that the interface ceased functioning.
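To make the uptime reading above reproducible, here is a small Python sketch that pulls transmit-timeout events out of vmkernel log text and converts the leading stamp into a `timedelta`. It assumes the stamp layout is days:hours:minutes:seconds.milliseconds, which matches the 1d6h interpretation above but is not taken from VMware documentation; the function names are hypothetical.

```python
import re
from datetime import timedelta

# Assumed stamp layout: days:hours:minutes:seconds.milliseconds
STAMP = re.compile(r"^(\d+):(\d{2}):(\d{2}):(\d{2})\.(\d{3})")
# Watchdog message as it appears in the log lines quoted above.
TIMEOUT = re.compile(r"NETDEV_WATCHDOG: (vmnic\d+): transmit timed out")


def parse_uptime(line):
    """Return the uptime stamp at the start of a log line, or None."""
    m = STAMP.match(line.strip())
    if m is None:
        return None
    days, hours, minutes, seconds, millis = (int(g) for g in m.groups())
    return timedelta(days=days, hours=hours, minutes=minutes,
                     seconds=seconds, milliseconds=millis)


def tx_timeouts(log_text):
    """Return (uptime, nic) pairs for every transmit-timeout line."""
    events = []
    for line in log_text.splitlines():
        hit = TIMEOUT.search(line)
        uptime = parse_uptime(line)
        if hit and uptime is not None:
            events.append((uptime, hit.group(1)))
    return events


SAMPLE = """\
1:06:20:18.631 cpu0:4302)WARNING: LinNet: netdev_watchdog NETDEV_WATCHDOG: vmnic0: transmit timed out
1:06:20:18.632 cpu0:4302)BUG: warning at vmkdrivers/src26/vmklinux26/vmware/linux_net.c:3255/netdev_watchdog() (inside vmklinux)
"""

if __name__ == "__main__":
    for uptime, nic in tx_timeouts(SAMPLE):
        print(nic, "timed out at uptime", uptime)
```

Only the first sample line counts as an event; the second is the accompanying BUG warning and carries no NIC name in the watchdog format.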

Scrolling back shows that the system is using e1000e driver v0.4.1.7-NAPI, and the interfaces are mapped on PCI-E (to save me some typing I've cut the timestamps and process ID info):

Loading module e1000e ...

Elf: 2320: (e1000e) symbols tagged as (GPL)

module heap : Initial heap size : 102400, max heap size: 4194304

module heap e1000e: creation succeeded. id = 0x41000f400000

(snip module skb heap info)

(6)e1000e: Intel(R) PRO/1000 Network Driver - 0.4.1.7-NAPI

PCI: driver e1000e is looking for devices

(snip lots of unsuccessful probes)

VMK_PCI: 1103: device 004:00.0 allocated 2 vectors (intrType 3)

VMK_PCI: 739: device 004:00.0 capType 16 capIndex 224

(6)000:04:00.0: vmnic0: (PCI Express:2.5GB/s:Width x1) 00:25:90:00:58:a8

(6)000:04:00.0: vmnic0: Intel(R) PRO/1000 Network Connection

(6)000:04:00.0: vmnic0: MAC: 4, PHY: 8, PBA No: 0101ff-0ff

PCI: driver e1000e claimed device 0000:04:00.0

PCI: Registering network device 0000:04:00.0

VMK_PCI: 638: Device 004:00.00 name: vmnic0

LinPCI: LinuxPCI_DeviceClaimed: Device 4:0 claimed.

(snip vmnic1 which is identical but on device 5:0)

(snip some lines for other module loads, entropy devices etc)

(6)0000:04:00.0: vmnic0: Link is up 100 Mbps Full Duplex, Flow Control: None

(6)0000:04:00.0: vmnic0: 10/100 speed: disabling TSO

(snip storage driver loading - arcmsr, but I do not believe this to be related as I was getting interface crashes without this driver just installing to a flash drive)

(6)0000:05:00.0: vmnic1: Link is Up 1000 Mbps Full Duplex, Flow Control: None

I later see entries where the vswitches are loaded, and then the following, which I assume relates to bringing up the management interface, as the MAC address it mentions is the one belonging to vmnic0:

Tcpip_Interface: 824: NIC supports Tso

Tcpip_Interface: 831: Stack supports TSO. MSS (minus TCP options) = 65535

Tcpip_Interface: 839: NIC support TX checksum offloading

Tcpip_Interface: 845: NIC supports Scatter-Gather transmits

Tcpip_Vmk: 186: vmk0: Ethernet address: 00:25:90:00:58:a8

Tcpip_Interface: 902: ether attach complete

Now, I noticed the remark about vmnic0 having TSO disabled because its link was not up at 1Gbps, but when I originally had this system on my workbench I had vmnic0 attached to a gigabit port, so the problem occurs regardless of link speed (and therefore TSO state).

When I look in 'configure management network' and select a network adapter, I can see that vmnic0 is claiming to be disconnected even though the physical link is up and the link light on the server's port is lit. Flapping the port has no effect, and restarting the management network on the console has no effect.
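As a rough way to check the link-speed point across several boots, the Python sketch below extracts the negotiated speed per NIC from "Link is up" lines like the ones quoted earlier and lists the NICs running below gigabit. The function names are hypothetical, and the regular expression only covers the message wording seen in this log; matching is case-insensitive because the log shows both "up" and "Up".

```python
import re

# Matches e.g. "vmnic0: Link is up 100 Mbps Full Duplex"
LINK = re.compile(
    r"(vmnic\d+): Link is up (\d+) Mbps (?:Full|Half) Duplex",
    re.IGNORECASE,
)


def link_speeds(log_text):
    """Map each NIC name to the last link speed (Mbps) seen in the log."""
    speeds = {}
    for m in LINK.finditer(log_text):
        speeds[m.group(1)] = int(m.group(2))
    return speeds


def below_gigabit(speeds):
    """Return the NICs that negotiated less than 1000 Mbps, sorted by name."""
    return sorted(nic for nic, mbps in speeds.items() if mbps < 1000)


SAMPLE = """\
(6)0000:04:00.0: vmnic0: Link is up 100 Mbps Full Duplex, Flow Control: None
(6)0000:04:00.0: vmnic0: 10/100 speed: disabling TSO
(6)0000:05:00.0: vmnic1: Link is Up 1000 Mbps Full Duplex, Flow Control: None
"""

if __name__ == "__main__":
    speeds = link_speeds(SAMPLE)
    print(speeds, below_gigabit(speeds))
```

If a NIC renegotiates during one boot, the dictionary keeps only the most recent speed, which is usually what matters when correlating against a later crash.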

Does anyone have any ideas or any further light to shed on this issue yet? I have left the system in this state so any further debugging that is required can be performed.

TIA

Phil

Hatclub · ‎03-09-2010

...and now my VM load has vanished off the network as well (though vmnic1 is still showing as connected, unlike the management NIC).

Is it worth trying to roll in a newer e1000e OEM module that adds support for other cards, on the assumption that it's newer and may work?

Hatclub · ‎07-27-2010

FYI, this issue appears to be resolved with the drivers present in ESX 4.1.0. I presume that ESXi has the same driver bundle and should therefore also work as expected.

shuguet · ‎02-28-2012

Sorry to dig up such an old thread, but I'm curious to know if you used ESXi 5 on this hardware, as I'm having exactly the same problem with an X8SIT-F Supermicro motherboard and the onboard 82574L NICs...

Sylvain.

Sylvain Huguet vExpert 2014, 2013, 2012 & 2011 VCP4&5/VTSP4/VSP4&5 Nutanix NPP/NPSE/NPSR
