VMware Cloud Community
F128
Contributor

ESXi 6.0u2 and Intel X710-DA2 NIC - motherboard nonmaskable interrupt on system reboot with HP ProLiant DL380p Gen8

Hi all,

We have recently purchased two new HP ProLiant DL380p Gen8 servers, and since these come with only quad 1GbE on-board NICs, I added an Intel X710-DA2 NIC to each to benefit from faster speeds to our storage. Genuine Intel 10GBase-SR SFP+ modules have also been installed.

First, since ESXi 6.0u2 (build 3620759) does not ship with an i40e driver, I had to install one to make ESXi recognize the cards. I downloaded the latest i40e driver, version 1.4.26, from the VMware Compatibility Guide - I/O Device Search‌‌ and installed the VIB. A reboot was required; after rebooting, the adapters were assigned vmnic numbers and were available to claim for use.

Just to make sure, I also downloaded the latest NVM update utility from Intel's website to verify I was running the most recent firmware. I downloaded the updater for firmware version 5.02 from Download NVM Update Utility for Intel® Ethernet Converged Network Adapter XL710 & X710 Series‌ and, using the ESX updater, verified that both cards were already on the latest firmware.

Eventually I continued with my deployment, but when I needed to reboot one of the hosts I hit a purple screen with the error: "LINT1/NMI (motherboard nonmaskable interrupt), undiagnosed. This may be a hardware problem; please contact your hardware vendor." (see attached file esxi-crash.JPG)

To troubleshoot further, I verified that I am running the latest BIOS version (P70) on the DL380p Gen8 servers and the latest iLO4 firmware (2.30). I have also tried the following:

1) Adjusting the power options for the server to full performance

2) Using different Intel drivers for VMware (I tried all available versions: 1.4.26, 1.3.45, 1.3.38, 1.2.48, 1.2.22, 1.1.1)

Neither of these helped; each time I rebooted either host I would hit the same purple screen with the same error.

Eventually I noticed in the BIOS that even though the X710 NIC is recognized as an Ethernet adapter in the IRQ settings, it is shown as an unknown PCI device in the PCI enable/disable list. (see attached files: dl380p-gen8-01.PNG & dl380p-gen8-02.PNG)

I am also attaching the output of /usr/lib/vmware/vm-support/bin/swfw.sh (attached file output-swfw.txt). That file clearly shows I am running driver 1.4.26 and firmware version 5.02.

It is worth mentioning that network traffic is just fine; the only time I get the purple screen is when I reboot either host. Also, the NIC ports are not teamed.

Any ideas or suggestions are highly appreciated!

Best,

KG

14 Replies
cesprov
Enthusiast

What firmware is on your X710-DA2 NICs?  The early versions of the X710-series firmware had some pretty major issues that I personally experienced for almost a year while I waited for Dell to release updated firmware, despite Intel releasing several iterations of the firmware and driver over that same period.

My issues were addressed somewhere around firmware 4.37 or 4.38 (thanks, Intel developer forums!).  I see from the compatibility list you posted that the current firmware is now north of 5.02.  My current firmware is still v4.53.  You can find your version by running:

esxcli network nic list  (to get a vmnicX number that corresponds to an X710 NIC)

ethtool -i vmnic0  (to get the details of vmnicX, including firmware and driver versions)
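If you want to pull those fields for several ports at once, the two commands above can be combined. The sketch below just parses `ethtool -i`-style text (using the output shape quoted later in this thread as a sample); the `parse_nic_info` helper and the sample string are my own illustration, not something from ESXi. On a live host you would pipe `ethtool -i vmnicX` into the helper instead of the sample.

```shell
#!/bin/sh
# Sketch: extract the driver name, driver version, and firmware version
# from `ethtool -i` output. Assumes the standard "key: value" line format.
parse_nic_info() {
  awk -F': ' '
    $1 == "driver"           { printf "driver=%s\n", $2 }
    $1 == "version"          { printf "driver_version=%s\n", $2 }
    $1 == "firmware-version" { printf "firmware=%s\n", $2 }
  '
}

# Sample text mirroring the output shape shown in this thread.
sample='driver: i40e
version: 1.4.26
firmware-version: 5.04 0x800024c6 0.0.0
bus-info: 0000:81:00.1'

echo "$sample" | parse_nic_info
```

On a host, something like `for n in vmnic4 vmnic5; do ethtool -i "$n" | parse_nic_info; done` would cover both X710 ports.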


One thing I should also mention.  According to the VMware tech I spoke to way back when, that VMware compatibility list shows both a driver version and a firmware version.  You should use the driver that matches the firmware, since that is the combination VMware tested, and driver changes sometimes rely on firmware changes.  So if you're running, say, the 1.4.26 driver with the 4.42 firmware, you might see issues.

Also, which ISO did you install from?  I don't think the HP-customized version of the 6.0U2 ISO is out yet.  That would probably have contained the i40e driver, but you can't count on the stock VMware ISO to contain all drivers.

Edit:  Doh!  Sorry, I didn't see that you already posted your NIC firmware/driver versions and that you're on the current ones.  The issues I experienced with older firmware were with TSO/LRO.  I disabled both at the time to work around my issues and never re-enabled them after I upgraded my firmware, so maybe it's still related?  You can try disabling both using the following commands to see if it makes a difference:

esxcli system settings advanced set -o /Net/UseHwTSO -i 0

esxcli system settings advanced set -o /Net/TcpipDefLROEnabled -i 0

Reboot afterwards.  Use -i 1 to re-enable, followed by another reboot.
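If you'd rather generate both commands from one place (handy when toggling back and forth between 0 and 1), here is a minimal sketch; the `offload_cmds` helper is my own naming and it only prints the esxcli commands so you can review them before running:

```shell
#!/bin/sh
# Sketch: print the esxcli commands that set both offload options
# to the given state (0 = disable, 1 = re-enable). A reboot is
# required after applying either state.
offload_cmds() {
  state=$1
  for opt in /Net/UseHwTSO /Net/TcpipDefLROEnabled; do
    echo "esxcli system settings advanced set -o $opt -i $state"
  done
}

offload_cmds 0
```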

mkramerbs
Enthusiast

Did you ever get this fixed? We are still having an issue where the system "falls off the network": the vmk IP and the VM IPs hosted on the X710 card stop responding to ping.  It usually happens fairly quickly after a reboot, but can take anywhere from minutes to hours.  No PSOD occurs, just the NIC dying.  We can access the console via an out-of-band lights-out connection and run commands such as esxtop, dmesg, etc., but cannot ping out or in.  The only fix is to reboot the host.

We have already confirmed TSO is disabled, and it still occurs.

https://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=2126909&s...

On boot you can still see the i40e driver issue a PF reset.

We are on the latest driver and firmware...

ethtool -i vmnic3
driver: i40e
version: 1.4.26
firmware-version: 5.04 0x800024c6 0.0.0
bus-info: 0000:81:00.1

grep -i lro /etc/vmware/esx.conf
/adv/Net/TcpipDefLROEnabled = "0"

grep -i tso /etc/vmware/esx.conf
/adv/Net/UseHwTSO = "0"
/adv/Net/UseHwTSO6 = "0"
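To sanity-check those esx.conf lines in one shot, here is a minimal sketch; the `offloads_disabled` helper is hypothetical (my own naming), and on a live host you would pipe in `grep -iE 'tso|lro' /etc/vmware/esx.conf` instead of the sample string:

```shell
#!/bin/sh
# Sketch: succeed only if every TSO/LRO advanced setting found on
# stdin is set to "0" (i.e. all the offloads are disabled).
offloads_disabled() {
  ! grep -E '/adv/Net/(UseHwTSO6?|TcpipDefLROEnabled)' | grep -qv '= "0"'
}

# Sample lines mirroring the esx.conf output shown above.
conf='/adv/Net/TcpipDefLROEnabled = "0"
/adv/Net/UseHwTSO = "0"
/adv/Net/UseHwTSO6 = "0"'

if echo "$conf" | offloads_disabled; then
  echo "all offloads disabled"
fi
```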

cesprov
Enthusiast

mkramerbs,

Where'd you get your Intel X710 cards from?  Are they actual Intel NICs or are they re-branded ones that came with a particular brand of server?

The reason I ask is that we continued to have issues with PF resets on our X710s after I posted the reply above.  Sometimes the i40e driver issues the TX PF reset, the reset fixes the issue quickly, and the outage is unnoticeable.  But sometimes the PF reset doesn't work: the driver seems to get stuck in a loop and keeps issuing PF resets without ever succeeding, in which case we have a host crash that won't HA the VMs until we power off the host.  After doing a lot of research, I was reasonably confident this was an undiscovered bug in the i40e driver.  Long story short, after convincing my server vendor to raise the issue with Intel, there *may* be a beta i40e driver that addresses PF reset issues.  Whether the PF reset issues it addresses are the same ones you are experiencing, I can't say for sure.  I don't have the beta driver, as I haven't signed the NDA, so I can't even confirm it works; I just know it exists, and why it exists, because I instigated the fix.  I have no idea when it goes GA, but it will eventually show up on vmware.com when it does.


Circling back: how you would go about getting this beta driver, if you wanted it, depends on where you got the X710 cards.  If they are genuine Intel cards (meaning you can flash their firmware using the firmware from Intel's site), you need to open a case with Intel.  If they are re-branded ones that came with a server, contact the server vendor, as Intel won't really talk to you.

mkramerbs
Enthusiast

Thanks for the response.  These cards are direct from Intel and installed in a SuperMicro chassis.  I have opened a ticket with VMware and with Intel.  Hopefully one of them will supply me with a newer driver to resolve the issue for good.  It is rather painful at the moment.

cesprov
Enthusiast

*cough* *cough* driver v1.66 *cough* *cough*

mkramerbs
Enthusiast

No love from Intel or VMware on a driver v1.66. Both of their support techs say 1.4.26 is the latest.

hostasaurus
Enthusiast

Found this thread while debugging less-than-ideal network performance on a Dell R530 with the dual-port, Dell-branded version of the X710.  I was able to get the firmware up to 5.02, though not easily: the servers run vSphere and Dell only packages its version of the firmware in Windows and RHEL formats, so I had to boot a CentOS Live DVD just to flash it.  In any case, thanks to this thread, I found that disabling LRO and TSO improved the iSCSI performance I was primarily interested in; I gained about 500 Mbit/s in each direction.  I'm still seeing better numbers out of our Cisco UCS servers, but those weren't suitable for the project I'm working on, so I'm just going to live with it and hope for continued improvements.

cesprov
Enthusiast

v1.4.28 has been released and supposedly contains the fix from the beta 1.6.6 driver I am using.  You can find it here:  VMware Compatibility Guide - I/O Device Search.  I have not installed 1.4.28 yet, but the 1.6.6 beta driver seemed to clear up a lot of my issues.  My recommendation is to make sure you're on at least firmware 5.02, or better yet 5.04.

anvanster
Enthusiast

Hi cesprov,

Where did you get this beta driver?

cesprov
Enthusiast

Sorry for the late reply.  I got the beta driver from Intel, but the 1.6.6 versioning was some sort of internal versioning.  It was officially released as 1.4.28 and can be found here:

VMware Compatibility Guide - I/O Device Search

F128
Contributor

@cesprov I have installed that driver, but unfortunately I still experience the same issue: a purple screen with the error "LINT1/NMI (motherboard nonmaskable interrupt), undiagnosed. This may be a hardware problem; please contact your hardware vendor."

Currently using firmware 5.05 on the Intel X710-DA2 and i40e ESXi driver version 1.4.28:

ethtool -i vmnic4
driver: i40e
version: 1.4.28
firmware-version: 5.05 0x80002892 1.1568.0

And to avoid any references to the KB article, I reiterate that I have also tried disabling the following hardware offloads without any success; as one can see in the KB, the purple screens described there show entirely different messages.

esxcli system settings advanced set -o /Net/UseHwTSO -i 0
esxcli system settings advanced set -o /Net/UseHwTSO6 -i 0
esxcli system settings advanced set -o /Net/TcpipDefLROEnabled -i 0
esxcli system settings advanced set -o /Net/Vmxnet3HwLRO -i 0

Did you by any chance manage to get around that issue, or are you still experiencing it?

clunsford1
Contributor

I know this is a little old, but I am attempting to upgrade our hosts from ESXi 5.5 to ESXi 6.0, and we are getting almost the exact same error as the original poster.

The only difference is that our hosts are Dell PowerEdge R815s with AMD Opteron 6176 CPUs.

F128
Contributor

I am already running iLO firmware 2.10. Still stuck at that purple screen on reboot. If I remove the X710 NIC from the server, everything works just fine...
