VMware Cloud Community
Amfab
Contributor

ESXi 6.0 PSoD issues (even with Express Patch 6)

So last weekend I encountered the infamous PSoD / VMXNET3 issue on our 6.0u2 host:

[Screenshot: PSOD_cropped.PNG]

This occurred after a newly created hardware version 11 VM with a VMXNET3 NIC was put into production. The PSoD backtrace was consistent with KB 2144685, so I installed Express Patch 6 to fix the issue. It worked fine for a week, so I assumed it was fixed...

Last night I woke up to another PSoD:

[Screenshot: PSOD2_cropped.png]

And then another shortly after booting up the "Veeam" VM again (this time with an E1000 NIC instead of VMXNET3):

[Screenshot: PSOD3_cropped.PNG]

I exported the system logs from the vSphere Client after both crashes last night, in the hope that someone could help me figure out the cause. Unfortunately, I just found out we only have a Subscription support contract with VMware, so their phone support won't help me. Perhaps someone here could?

P.S. All the other VMs on this host are hardware version 10 or older, as they were migrated from an ESXi 5.5.0 host. I've not once seen those VMs listed in a PSoD backtrace, only the new "Veeam" VM running hardware version 11. The host is back up and running sans said VM, and so far so good. Pretty sure I'll be downgrading to 5.5.0 this weekend after all this :smileyangry:

Edit: I've opened a support case with Dell ProSupport, who have offered to have a VMware expert there take a look at the system logs. I'll update here with what they find.

9 Replies
Amfab
Contributor

So Dell's expert said it was caused by a page fault in the i40e driver (for the Intel X710 quad-port NIC) and that I should update the driver and firmware. Fingers crossed :smileyshocked:

cesprov
Enthusiast

Update to the latest i40e driver:

VMware Compatibility Guide - I/O Device Search

Also, Dell should have the 5.02 X710 NIC firmware posted by now. It shows up on their site as v17.5.11.

After you have upgraded the driver and the firmware, disable TSO/TSO6/LRO as per this KB:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=21269...

cypherx
Hot Shot

These kinds of things make me very nervous about updating my five other Dell R620 ESXi 5.0 servers to 6. Those five servers use QLE3242 NICs, though, and the 10GbE interfaces are heavily used, especially when Veeam does its native NFS transport to offload data to an ExaGrid storage appliance.


You would think there would be better isolation between VMs and the kernel. One VM should not be able to take down an entire host that could be running other production machines. I wonder whether some kind of compartmentalized kernel approach, or having each VM spawn its own VMkernel (nested virtualization), could further isolate them from these kinds of things. I don't dare use hardware version 11, as I don't see any benefit from it, and you have to use the web client to make changes to it versus the simple and effective C# client.

cesprov
Enthusiast

You say you have QLE3242 NICs in your Dell R620s. Those are QLogic cards, so I assume they are not using the i40e driver? The i40e driver is for the newish line of Intel NICs, generally the X710 family, which I believe is supplanting the X5x0 series (ixgbe driver) of Intel NICs. If they don't use the i40e driver, you won't have this particular problem. That's not to say whatever driver the QLogic cards use won't have its own issues, but you shouldn't see this particular one.

We've had a Dell R630 since May/June 2015 with an X710 (a DA4, I believe, offhand), where two of the NIC ports are for iSCSI and the other two are for normal network traffic. The NICs were problematic from day one. Aside from the issues I mentioned above, which are well documented by now and should fix your immediate issue, there are still outstanding issues with that NIC that cause TX/transmit failures; these make the NIC stop transmitting and can essentially crash the host, but the VMs won't HA until you bounce the affected host.

I had to throw a fit with Dell to get them to address these issues with Intel. Dell has a beta 1.6.6 driver for the i40e that seems to have fixed those issues (or I just haven't had a recurrence yet), along with an upgrade to firmware 5.02, so I'm not sure whether it's the driver itself or the firmware. The driver's not public yet, and there's no timeframe on when it goes public.

If you run into those same issues ("TX driver issue detected, PF reset issued" in the vmkernel.log), open a Dell case and have it escalated, as the front-line techs probably won't know about this and will have you run all sorts of unrelated, time-filling tasks. Be adamant about escalation. There are people there who know about this.
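If you want to check your hosts for that symptom proactively, a simple grep for the quoted message works. The sample log line below is a mock-up (the exact surrounding format is an assumption; only the quoted error string comes from this thread) so the snippet is self-contained:

```shell
# Mock vmkernel.log line containing the error string quoted above;
# everything around the quoted string is an assumption for illustration.
echo "WARNING: i40e: TX driver issue detected, PF reset issued" > /tmp/vmkernel.sample

# On a real host, grep /var/log/vmkernel.log instead of the sample file.
# A count greater than 0 means the host has hit the TX failure.
grep -c "TX driver issue detected, PF reset issued" /tmp/vmkernel.sample
```

On an affected host you would point the grep at /var/log/vmkernel.log (and any rotated copies) rather than the sample file.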

cesprov
Enthusiast

v1.4.28 of the i40e driver is posted at VMware Compatibility Guide - I/O Device Search. It supposedly contains the fixes I mentioned above. The beta version 1.6.6 must have been an internal number or something.

sweater
Enthusiast

I'm currently dealing with this issue as well; we've had half a dozen hosts fail over the last three days.

No one knows anything about a 1.6.6 driver being available from anyone (we're using Dell FC630s with those Intel X710s).

At least there's a workaround posted for this issue, which is awesome: I love having to hack ESXi to get it to work with what should be standard NICs.

ESXi host that uses Intel Corporation Ethernet Controller X710 for 10GbE SFP+ NIC fails with a purpl...

This is a known issue affecting ESXi 5.x and 6.x.

To work around this issue, disable TSO, TSO6, and LRO on the ESXi host. For more information, see Understanding TCP Segmentation Offload (TSO) and Large Receive Offload (LRO) in a VMware environment...

To disable TSO:

  1. Run this command to determine if the hardware TSO is enabled on the host:

    esxcli system settings advanced list -o /Net/UseHwTSO
  2. Run this command to disable TSO at the host level:

    esxcli system settings advanced set -o /Net/UseHwTSO -i 0
  3. Run this command to disable TSO6 at the host level:

    esxcli system settings advanced set -o /Net/UseHwTSO6 -i 0

To disable LRO:

  1. Run this command to determine if LRO is enabled for the VMkernel adapters on the host:

    esxcli system settings advanced list -o /Net/TcpipDefLROEnabled

  2. Run this command to disable LRO for all VMkernel adapters on a host:

    esxcli system settings advanced set -o /Net/TcpipDefLROEnabled -i 0
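After running the set commands above, you can confirm each setting took by checking the "Int Value" field in the corresponding list output. The snippet below fakes that output (the exact field layout is an assumption based on typical esxcli advanced-settings output); on a real host you would pipe the actual `esxcli system settings advanced list` command instead:

```shell
# Simulated output of `esxcli system settings advanced list -o /Net/UseHwTSO`
# after disabling TSO; the field layout is an assumption for illustration.
cat <<'EOF' > /tmp/tso.out
   Path: /Net/UseHwTSO
   Type: integer
   Int Value: 0
   Default Int Value: 1
EOF

# Extract the current value, skipping the "Default Int Value" line.
# Expect 0 once TSO has been disabled.
awk -F': *' '/Int Value/ && !/Default/ {print $2}' /tmp/tso.out
```

The same check applies to /Net/UseHwTSO6 and /Net/TcpipDefLROEnabled; all three should report an Int Value of 0 once the workaround is in place.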
cesprov
Enthusiast

v1.6.6 was an internal beta version, apparently. It was released publicly as 1.4.28. Get it at VMware Compatibility Guide - I/O Device Search.

The i40e-based NICs were somewhat unstable prior to driver v1.4.28 and firmware v5.02, so make sure you're on at least those versions. Dell should have the 5.04 firmware for the X710s on their site by now, too.
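A quick way to sanity-check an installed driver against that 1.4.28 minimum is a version-aware string comparison. The installed version below is hardcoded for illustration (on a real host you would pull it from `esxcli software vib list`; the exact VIB name is an assumption):

```shell
# Hardcoded example value; on a real host something like
#   esxcli software vib list | awk '/i40e/ {print $2}'
# would supply the installed driver version (VIB name is an assumption).
installed="1.3.45"
minimum="1.4.28"

# sort -V performs a version-aware comparison (GNU coreutils);
# if the installed version sorts first and differs, it is too old.
oldest=$(printf '%s\n%s\n' "$installed" "$minimum" | sort -V | head -n1)
if [ "$oldest" = "$installed" ] && [ "$installed" != "$minimum" ]; then
  echo "i40e $installed is below $minimum: update needed"
else
  echo "i40e $installed meets the minimum"
fi
```

The same comparison works for the firmware version string if you substitute the values reported by the NIC.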

jonawang39
Contributor

Try i40e version 2.0.6

Driver version: 2.0.6

Supported ESXi release: 6.0

Compatible ESXi versions: 6.5

New hardware supported:

- Add new devices support for specific OEMs

- Add XXV710 25G device support

- Add X722 device support

Bug fixes:

- Fix duplicate multicast packet issue

- Fix PSOD caused by small TSO segmentation

VMware Compatibility Guide - I/O Device Search

TheHevy
Contributor
