VMware Cloud Community
fr33tk
Contributor

PSOD with Intel X710-DA2 SFP+ network card and the i40e driver

Hi,

We're having an issue with ESXi 5.5.

Under heavy load we are getting a PSOD; see the attached screenshot.

A glimpse at the log:

2015-04-10T11:36:04.476Z cpu0:33363)<6>i40e 0000:03:00.0: TX driver issue detected, PF reset issued

2015-04-10T11:36:06.762Z cpu9:33364)<6>i40e 0000:03:00.0: i40e_open: Registering netqueue ops

Driver and firmware versions are the current stable, VMware-certified releases.

# ethtool -i vmnic2

driver: i40e

version: 1.2.22

firmware-version: f4.33.31377 a1.2 n4.89 e191b

bus-info: 0000:81:00.0
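
For anyone comparing versions, the installed driver package can also be checked with esxcli (the grep pattern assumes the VIB carries "i40e" in its name, which may differ on your build):

```shell
# List installed VIBs and filter for the i40e driver package
# ("i40e" in the VIB name is an assumption; adjust the pattern if needed)
esxcli software vib list | grep -i i40e
```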

It would be nice if someone could share some insights.

Thanks.

13 Replies
vNEX
Expert

Hi,

Have you checked your firmware version against the latest NVM Update Utility?

https://downloadcenter.intel.com/download/24769

_________________________________________________________________________________________ If you found this or any other answer helpful, please consider to award points. (use Correct or Helpful buttons) Regards, P.
fr33tk
Contributor

Firmware was updated with the NVM Update Utility to the latest version it proposed.

firmware-version: f4.33.31377 a1.2 n4.89 e191b

vNEX
Expert

It's quite obvious from the backtrace that the #PF Exception 14 (page fault) was raised by the i40e driver while it was sending a buffer on the TX ring; the result was "TX driver issue detected, PF reset issued".

For some reason the requested page wasn't successfully loaded into memory, so the next step is to distinguish between a hardware and a software fault. You will need to compare a few samples of consecutive PSOD screens.

The rule of thumb is simple: if the error information (the stack addresses, 0x4123c...) varies between vmkernel errors (PSODs), it's likely a hardware issue.

If the error messages remain the same between failures, it's likely a software issue.


In addition, in both scenarios, check whether the same CPU or world is involved across these failures.


Please post the vmkernel.log from just before and right after the fault (if it's reproducible; hopefully it is under heavy load).

Also try to reproduce the issue with TSO or LRO (or both) disabled on that host:

# esxcli system settings advanced set -o /Net/UseHwTSO -i 0

# esxcli system settings advanced set -o /Net/TcpipDefLROEnabled -i 0

To verify they are disabled, run:

# esxcli system settings advanced list -o /Net/UseHwTSO

# esxcli system settings advanced list -o /Net/TcpipDefLROEnabled


For additional info about these settings, refer to this KB:

VMware KB: Understanding TCP Segmentation Offload (TSO) and Large Receive Offload (LRO) in a VMware ...

As a last resort, if you have valid SnS, you can file an SR with VMware Support; they will need your core dump.

VMware - How to File a Support Request Online | VMware | Česká republika

VMware KB: Extracting a core dump file from the VMKCore diagnostic partition following a purple diag...

fr33tk
Contributor

A more detailed log:

2015-04-13T10:35:55.904Z cpu0:33376)<6>i40e 0000:81:00.0: TX driver issue detected, PF reset issued

2015-04-13T10:35:57.779Z cpu8:33365)<6>i40e 0000:81:00.0: i40e_open: Registering netqueue ops

2015-04-13T10:35:57.957Z cpu8:33365)IRQ: 540: 0x41 <i40e-vmnic2-TxRx-0> exclusive, flags 0x10

2015-04-13T10:35:57.957Z cpu8:33365)VMK_VECTOR: 218: Registered handler for interrupt 0xff41, flags 0x10

2015-04-13T10:35:57.957Z cpu8:33365)<6>i40e 0000:81:00.0: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None

2015-04-13T10:35:57.968Z cpu15:32812)WARNING: ScsiDeviceIO: 1223: Device naa.624a9370c1829d1a68a5e2dc0001101c performance has deteriorated. I/O latency increased from average value of 5418 microseconds to 267228 microseconds.

2015-04-13T10:35:58.149Z cpu8:33365)IRQ: 540: 0x42 <i40e-vmnic2-TxRx-1> exclusive, flags 0x10

2015-04-13T10:35:58.149Z cpu8:33365)VMK_VECTOR: 218: Registered handler for interrupt 0xff42, flags 0x10

2015-04-13T10:35:58.149Z cpu8:33365)<6>i40e 0000:81:00.0: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None

2015-04-13T10:35:58.160Z cpu9:32806)ScsiDeviceIO: 1203: Device naa.624a9370c1829d1a68a5e2dc0001101c performance has improved. I/O latency reduced from 267228 microseconds to 52308 microseconds.

2015-04-13T10:35:58.346Z cpu10:33365)IRQ: 540: 0x43 <i40e-vmnic2-TxRx-2> exclusive, flags 0x10

2015-04-13T10:35:58.346Z cpu10:33365)VMK_VECTOR: 218: Registered handler for interrupt 0xff43, flags 0x10

2015-04-13T10:35:58.346Z cpu10:33365)<6>i40e 0000:81:00.0: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None

2015-04-13T10:35:58.546Z cpu10:33365)IRQ: 540: 0x44 <i40e-vmnic2-TxRx-3> exclusive, flags 0x10

2015-04-13T10:35:58.546Z cpu10:33365)VMK_VECTOR: 218: Registered handler for interrupt 0xff44, flags 0x10

2015-04-13T10:35:58.546Z cpu10:33365)<6>i40e 0000:81:00.0: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None

2015-04-13T10:35:58.745Z cpu10:33365)IRQ: 540: 0x45 <i40e-vmnic2-TxRx-4> exclusive, flags 0x10

2015-04-13T10:35:58.745Z cpu10:33365)VMK_VECTOR: 218: Registered handler for interrupt 0xff45, flags 0x10

2015-04-13T10:35:58.745Z cpu10:33365)<6>i40e 0000:81:00.0: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None

2015-04-13T10:35:58.942Z cpu10:33365)IRQ: 540: 0x46 <i40e-vmnic2-TxRx-5> exclusive, flags 0x10

2015-04-13T10:35:58.942Z cpu10:33365)VMK_VECTOR: 218: Registered handler for interrupt 0xff46, flags 0x10

2015-04-13T10:35:58.942Z cpu10:33365)<6>i40e 0000:81:00.0: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None

2015-04-13T10:35:59.135Z cpu11:33365)IRQ: 540: 0x47 <i40e-vmnic2-TxRx-6> exclusive, flags 0x10

2015-04-13T10:35:59.136Z cpu11:33365)VMK_VECTOR: 218: Registered handler for interrupt 0xff47, flags 0x10

2015-04-13T10:35:59.136Z cpu11:33365)<6>i40e 0000:81:00.0: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None

2015-04-13T10:35:59.345Z cpu14:33365)IRQ: 540: 0x48 <i40e-vmnic2-TxRx-7> exclusive, flags 0x10

2015-04-13T10:35:59.345Z cpu14:33365)VMK_VECTOR: 218: Registered handler for interrupt 0xff48, flags 0x10

2015-04-13T10:35:59.345Z cpu14:33365)<6>i40e 0000:81:00.0: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None

2015-04-13T10:36:00.025Z cpu0:35872)World: 14302: VC opID hostd-6cc7 maps to vmkernel opID 8c2bcc7d

2015-04-13T10:36:00.903Z cpu7:32836)<6>i40e 0000:81:00.0: i40e_free_tx_queue: Tx queue 1 not allocated

2015-04-13T10:36:00.903Z cpu7:32836)<6>i40e 0000:81:00.0: i40e_free_tx_queue: Tx queue 2 not allocated

2015-04-13T10:36:00.903Z cpu7:32836)<6>i40e 0000:81:00.0: i40e_free_tx_queue: Tx queue 3 not allocated

2015-04-13T10:36:00.903Z cpu7:32836)<6>i40e 0000:81:00.0: i40e_free_tx_queue: Tx queue 4 not allocated

2015-04-13T10:36:00.903Z cpu7:32836)<6>i40e 0000:81:00.0: i40e_free_tx_queue: Tx queue 5 not allocated

2015-04-13T10:36:00.903Z cpu7:32836)<6>i40e 0000:81:00.0: i40e_free_tx_queue: Tx queue 6 not allocated

2015-04-13T10:36:00.903Z cpu7:32836)<6>i40e 0000:81:00.0: i40e_free_tx_queue: Tx queue 7 not allocated

2015-04-13T10:36:00.903Z cpu7:32836)NetPort: 1632: disabled port 0x3000002

2015-04-13T10:36:00.903Z cpu7:32836)<6>i40e 0000:81:00.0: i40e_get_supported_feature: netq features supported: QueuePair RSS_DYN Latency Dynamic Pre-Emptible

2015-04-13T10:36:00.903Z cpu7:32836)<6>i40e 0000:81:00.0: i40e_get_supported_filter_class: supporting next generation VLANMACADDR filter

2015-04-13T10:36:00.903Z cpu7:32836)Uplink: 6529: enabled port 0x3000002

So far I have implemented the esxcli system settings advanced set -o /Net/TcpipDefLROEnabled -i 0 option, and there have been no crashes and no i40e driver messages in vmkernel.log.

This fix was implemented before your post, so the host has been running stable since Friday (04.10).

CaziBrasga
Contributor

We were getting similar PSODs every few hours with X710 NICs and the i40e driver (v1.2.22). We installed the latest VMware i40e driver, v1.2.48, and haven't had any crashes since.

gt2718
Contributor

We were getting consistent crashes nightly at the same time, probably due to scheduled processes.

Our PSOD screen captures also contained several references to i40e. We too were running the old driver, 1.2.22.

I updated to 1.2.48 and also disabled LRO and TSO. Crossing my fingers that this fixes it.

Simon_HamiltonW
Contributor

What SFP+ modules are you all using? Intel-branded ones or third-party?

Trying to figure out my options for Direct Attach cables.

gt2718
Contributor

Updated one host with the 1.2.48 driver and left one host on the 1.2.22 driver. Overnight, only the host with the 1.2.22 driver crashed.

Updated that host with the 1.2.48 driver.  No more crashes.

Looks like that fixed it.
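
For anyone else hitting this, the driver update itself is just a VIB install; a minimal sketch (the bundle path and filename below are placeholders for whatever you download from VMware):

```shell
# Install the updated i40e driver from an offline bundle
# (put the host in maintenance mode first; the path is a placeholder)
esxcli software vib install -d /vmfs/volumes/datastore1/i40e-1.2.48-offline_bundle.zip

# Reboot so the new driver module is loaded
reboot
```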

gt2718
Contributor

Not using SFP+ modules due to the short distance. We're just using 5 m passive Twinax, Dell-branded, although we have some Cisco ones I'm eventually going to switch them out with.

Simon_HamiltonW
Contributor

Thanks. I'd seen reports of X710s not showing up without Intel-branded SFP+ modules installed. I have a bunch of generic and Cisco DA cables I want to use.

CaziBrasga
Contributor

We're using third-party Dell-compatible SFP+ DACs. No problems, other than the unrelated i40e driver issue.

FragKing
Contributor

There is a new driver version (2.0.6). From the release notes:

"Fix PSOD caused by small TSO segmentation"

https://my.vmware.com/group/vmware/details?downloadGroup=DT-ESXI60-INTEL-I40E-206&productId=491#prod...

Has anyone tried this already and re-enabled TSO/LRO?
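
For reference, if the 2.0.6 driver does fix the TSO path, the earlier workaround should be revertible by writing a value of 1 back to both options (which I believe are the ESXi defaults):

```shell
# Re-enable hardware TSO and the default LRO setting
esxcli system settings advanced set -o /Net/UseHwTSO -i 1
esxcli system settings advanced set -o /Net/TcpipDefLROEnabled -i 1

# Confirm the values took effect (check the "Int Value" column)
esxcli system settings advanced list -o /Net/UseHwTSO
esxcli system settings advanced list -o /Net/TcpipDefLROEnabled
```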

Thanks

Michel

http://www.quadrotech-it.com
TheHevy
Contributor
