Hi,
We are having an issue with ESXi 5.5.
Under heavy load we get a PSOD; see the attached screenshot.
A glimpse of the log:
2015-04-10T11:36:04.476Z cpu0:33363)<6>i40e 0000:03:00.0: TX driver issue detected, PF reset issued
2015-04-10T11:36:06.762Z cpu9:33364)<6>i40e 0000:03:00.0: i40e_open: Registering netqueue ops
Driver and firmware versions are the current stable, VMware-certified ones.
# ethtool -i vmnic2
driver: i40e
version: 1.2.22
firmware-version: f4.33.31377 a1.2 n4.89 e191b
bus-info: 0000:81:00.0
Would be nice if someone shares some insights.
Thanks.
Hi,
Have you checked your firmware version against the latest NVM Update Utility?
https://downloadcenter.intel.com/download/24769
The firmware was updated with the NVM Update Utility to the newest version it offered.
firmware-version: f4.33.31377 a1.2 n4.89 e191b
It's quite obvious from the backtrace that the PF 14 (page fault, exception 14) was raised by the i40e driver while it was sending a buffer on the TX ring.
The result was: TX driver issue detected, PF reset issued
For some reason the requested page was not successfully loaded into memory, so the next step is to distinguish between a hardware and a software fault.
You will need to compare a few samples of consecutive PSOD screens.
Then it's quite simple: if the error information (stack info 0x4123c...) varies between PSODs, it's likely a hardware issue.
If the error messages remain the same between failures, it's likely a software issue.
In addition, for both scenarios, please check whether the same CPU or world is always involved across these failures.
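The comparison described above can be sketched in a few lines of Python. This is only an illustration of the rule of thumb, not a diagnostic tool: the function name, the dictionary layout, and the placeholder frame names (frame_a, frame_b, ...) are all made up for the example, and you would fill them in by hand from your own PSOD screenshots.

```python
from collections import Counter


def classify_psods(samples):
    """Apply the rule of thumb: identical stacks -> software, varying -> hardware.

    Each sample is a dict with the keys 'cpu', 'world' and 'stack'
    (a list of frame strings transcribed from one PSOD screen).
    """
    if len(samples) < 2:
        raise ValueError("need at least two PSOD samples to compare")

    # Distinct backtraces seen across all samples.
    stacks = {tuple(s["stack"]) for s in samples}
    verdict = "likely software" if len(stacks) == 1 else "likely hardware"

    return {
        "verdict": verdict,
        "distinct_stacks": len(stacks),
        # How often each CPU / world shows up across the failures.
        "cpus": dict(Counter(s["cpu"] for s in samples)),
        "worlds": dict(Counter(s["world"] for s in samples)),
    }


if __name__ == "__main__":
    # Placeholder data only; replace with values from your own screens.
    samples = [
        {"cpu": 0, "world": 33363, "stack": ["frame_a", "frame_b"]},
        {"cpu": 0, "world": 33376, "stack": ["frame_a", "frame_b"]},
    ]
    print(classify_psods(samples))
```

If the raw addresses on the screen differ but the function names match, it is probably more useful to compare the function names only.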
Please post vmkernel.log from just before and right after the fault (if it's reproducible; hopefully it is under heavy load).
Also try to reproduce the issue with TSO or LRO (or both) disabled on that host:
# esxcli system settings advanced set -o /Net/UseHwTSO -i 0
# esxcli system settings advanced set -o /Net/TcpipDefLROEnabled -i 0
To verify that they are disabled, run:
esxcli system settings advanced list -o /Net/UseHwTSO
esxcli system settings advanced list -o /Net/TcpipDefLROEnabled
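If you want to check the result on several hosts, a small Python helper can read the text printed by the list command. This is a rough sketch: the sample text below is an assumption about what the output looks like (a "Key: value" listing with an "Int Value" field), and the function names are made up for the example, so compare against what your host actually prints before relying on it.

```python
def parse_advanced_option(text):
    """Turn 'Key: value' lines into a dict (splits on the first colon only)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_disabled(text):
    """True when the option's current integer value is 0."""
    return int(parse_advanced_option(text)["Int Value"]) == 0


# Assumed shape of the command output, shortened; not copied from a real host.
SAMPLE_OUTPUT = """\
   Path: /Net/UseHwTSO
   Type: integer
   Int Value: 0
   Default Int Value: 1
"""

if __name__ == "__main__":
    print(parse_advanced_option(SAMPLE_OUTPUT)["Path"], is_disabled(SAMPLE_OUTPUT))
```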
For additional info about these settings, refer to this KB:
As a last resort, if you have a proper SnS contract, you can file an SR with VMware Support; they will need your core dump.
VMware - How to File a Support Request Online | VMware | Česká republika
More detailed log:
2015-04-13T10:35:55.904Z cpu0:33376)<6>i40e 0000:81:00.0: TX driver issue detected, PF reset issued
2015-04-13T10:35:57.779Z cpu8:33365)<6>i40e 0000:81:00.0: i40e_open: Registering netqueue ops
2015-04-13T10:35:57.957Z cpu8:33365)IRQ: 540: 0x41 <i40e-vmnic2-TxRx-0> exclusive, flags 0x10
2015-04-13T10:35:57.957Z cpu8:33365)VMK_VECTOR: 218: Registered handler for interrupt 0xff41, flags 0x10
2015-04-13T10:35:57.957Z cpu8:33365)<6>i40e 0000:81:00.0: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
2015-04-13T10:35:57.968Z cpu15:32812)WARNING: ScsiDeviceIO: 1223: Device naa.624a9370c1829d1a68a5e2dc0001101c performance has deteriorated. I/O latency increased from average value of 5418 microseconds to 267228 microseconds.
2015-04-13T10:35:58.149Z cpu8:33365)IRQ: 540: 0x42 <i40e-vmnic2-TxRx-1> exclusive, flags 0x10
2015-04-13T10:35:58.149Z cpu8:33365)VMK_VECTOR: 218: Registered handler for interrupt 0xff42, flags 0x10
2015-04-13T10:35:58.149Z cpu8:33365)<6>i40e 0000:81:00.0: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
2015-04-13T10:35:58.160Z cpu9:32806)ScsiDeviceIO: 1203: Device naa.624a9370c1829d1a68a5e2dc0001101c performance has improved. I/O latency reduced from 267228 microseconds to 52308 microseconds.
2015-04-13T10:35:58.346Z cpu10:33365)IRQ: 540: 0x43 <i40e-vmnic2-TxRx-2> exclusive, flags 0x10
2015-04-13T10:35:58.346Z cpu10:33365)VMK_VECTOR: 218: Registered handler for interrupt 0xff43, flags 0x10
2015-04-13T10:35:58.346Z cpu10:33365)<6>i40e 0000:81:00.0: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
2015-04-13T10:35:58.546Z cpu10:33365)IRQ: 540: 0x44 <i40e-vmnic2-TxRx-3> exclusive, flags 0x10
2015-04-13T10:35:58.546Z cpu10:33365)VMK_VECTOR: 218: Registered handler for interrupt 0xff44, flags 0x10
2015-04-13T10:35:58.546Z cpu10:33365)<6>i40e 0000:81:00.0: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
2015-04-13T10:35:58.745Z cpu10:33365)IRQ: 540: 0x45 <i40e-vmnic2-TxRx-4> exclusive, flags 0x10
2015-04-13T10:35:58.745Z cpu10:33365)VMK_VECTOR: 218: Registered handler for interrupt 0xff45, flags 0x10
2015-04-13T10:35:58.745Z cpu10:33365)<6>i40e 0000:81:00.0: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
2015-04-13T10:35:58.942Z cpu10:33365)IRQ: 540: 0x46 <i40e-vmnic2-TxRx-5> exclusive, flags 0x10
2015-04-13T10:35:58.942Z cpu10:33365)VMK_VECTOR: 218: Registered handler for interrupt 0xff46, flags 0x10
2015-04-13T10:35:58.942Z cpu10:33365)<6>i40e 0000:81:00.0: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
2015-04-13T10:35:59.135Z cpu11:33365)IRQ: 540: 0x47 <i40e-vmnic2-TxRx-6> exclusive, flags 0x10
2015-04-13T10:35:59.136Z cpu11:33365)VMK_VECTOR: 218: Registered handler for interrupt 0xff47, flags 0x10
2015-04-13T10:35:59.136Z cpu11:33365)<6>i40e 0000:81:00.0: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
2015-04-13T10:35:59.345Z cpu14:33365)IRQ: 540: 0x48 <i40e-vmnic2-TxRx-7> exclusive, flags 0x10
2015-04-13T10:35:59.345Z cpu14:33365)VMK_VECTOR: 218: Registered handler for interrupt 0xff48, flags 0x10
2015-04-13T10:35:59.345Z cpu14:33365)<6>i40e 0000:81:00.0: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
2015-04-13T10:36:00.025Z cpu0:35872)World: 14302: VC opID hostd-6cc7 maps to vmkernel opID 8c2bcc7d
2015-04-13T10:36:00.903Z cpu7:32836)<6>i40e 0000:81:00.0: i40e_free_tx_queue: Tx queue 1 not allocated
2015-04-13T10:36:00.903Z cpu7:32836)<6>i40e 0000:81:00.0: i40e_free_tx_queue: Tx queue 2 not allocated
2015-04-13T10:36:00.903Z cpu7:32836)<6>i40e 0000:81:00.0: i40e_free_tx_queue: Tx queue 3 not allocated
2015-04-13T10:36:00.903Z cpu7:32836)<6>i40e 0000:81:00.0: i40e_free_tx_queue: Tx queue 4 not allocated
2015-04-13T10:36:00.903Z cpu7:32836)<6>i40e 0000:81:00.0: i40e_free_tx_queue: Tx queue 5 not allocated
2015-04-13T10:36:00.903Z cpu7:32836)<6>i40e 0000:81:00.0: i40e_free_tx_queue: Tx queue 6 not allocated
2015-04-13T10:36:00.903Z cpu7:32836)<6>i40e 0000:81:00.0: i40e_free_tx_queue: Tx queue 7 not allocated
2015-04-13T10:36:00.903Z cpu7:32836)NetPort: 1632: disabled port 0x3000002
2015-04-13T10:36:00.903Z cpu7:32836)<6>i40e 0000:81:00.0: i40e_get_supported_feature: netq features supported: QueuePair RSS_DYN Latency Dynamic Pre-Emptible
2015-04-13T10:36:00.903Z cpu7:32836)<6>i40e 0000:81:00.0: i40e_get_supported_filter_class: supporting next generation VLANMACADDR filter
2015-04-13T10:36:00.903Z cpu7:32836)Uplink: 6529: enabled port 0x3000002
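To see how often the driver resets on each port, the reset lines can be pulled out of vmkernel.log with a short Python script. This is a sketch that assumes the line format shown in the excerpts in this thread; the regular expression and the function name are mine, not part of any VMware or Intel tooling.

```python
import re
from collections import defaultdict

# Matches lines such as:
# 2015-04-13T10:35:55.904Z cpu0:33376)<6>i40e 0000:81:00.0: TX driver issue detected, PF reset issued
RESET_RE = re.compile(
    r"^(?P<ts>\S+) cpu(?P<cpu>\d+):(?P<world>\d+)\)<\d+>i40e "
    r"(?P<pci>[0-9a-f:.]+): TX driver issue detected, PF reset issued"
)


def find_pf_resets(lines):
    """Group reset events by PCI address as (timestamp, cpu, world) tuples."""
    resets = defaultdict(list)
    for line in lines:
        match = RESET_RE.match(line)
        if match:
            resets[match.group("pci")].append(
                (match.group("ts"), int(match.group("cpu")), int(match.group("world")))
            )
    return dict(resets)


if __name__ == "__main__":
    # Adjust the path to wherever your copy of the log lives.
    with open("vmkernel.log", encoding="utf-8", errors="replace") as log:
        for pci, events in find_pf_resets(log).items():
            print(pci, len(events), "reset(s)")
```

Because the CPU and world numbers are kept alongside each timestamp, the same output also shows whether one CPU or world is involved every time.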
So far I have applied the esxcli system settings advanced set -o /Net/TcpipDefLROEnabled -i 0 option and have had no crashes, nor any i40e driver messages in vmkernel.log.
This fix was applied before your post, so the host has been running stable since Friday (04.10).
We were getting similar PSODs every few hours with X710 NICs and the i40e driver (v1.2.22). Installed the latest VMware i40e driver v1.2.48 and we haven't had any crashes since.
We were getting consistent crashes nightly at the same time, probably due to scheduled processes.
Our PSOD screencaps also contained several references to i40e. We too were running the old driver, 1.2.22.
I updated to 1.2.48 plus disabled LRO and TSO. Crossing my fingers that this fixes it.
What SFP+ are you all using? Intel branded ones or 3rd party?
Trying to figure out my options for Direct Attach cables.
Updated one host with the 1.2.48 driver and left one host on the 1.2.22 driver. Overnight, only the host with the 1.2.22 driver crashed.
Updated that host with the 1.2.48 driver. No more crashes.
Looks like that fixed it.
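For anyone auditing several hosts, here is a small Python sketch that reads the output of ethtool -i and flags i40e drivers older than 1.2.48. The 1.2.48 threshold only reflects what people have reported in this thread, not an official statement, and the function names are made up for the example.

```python
def parse_ethtool_info(text):
    """Parse 'key: value' lines from ethtool -i into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            info[key.strip()] = value.strip()
    return info


def version_tuple(version):
    """'1.2.22' -> (1, 2, 22) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


def needs_driver_update(text, fixed_version="1.2.48"):
    """True for an i40e driver older than fixed_version (threshold is an assumption)."""
    info = parse_ethtool_info(text)
    return (
        info.get("driver") == "i40e"
        and version_tuple(info["version"]) < version_tuple(fixed_version)
    )


# Output posted at the top of this thread.
SAMPLE = """\
driver: i40e
version: 1.2.22
firmware-version: f4.33.31377 a1.2 n4.89 e191b
bus-info: 0000:81:00.0
"""

if __name__ == "__main__":
    print(needs_driver_update(SAMPLE))
```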
Not using SFP modules due to the short distance. We're just using 5m passive Twinax cables, Dell-branded ones, although we have some Cisco ones that I'm eventually going to swap in.
Thanks, I'd seen reports of X710s not showing up without Intel-branded SFP+ modules installed. I have a bunch of generic and Cisco DA cables I want to use.
Using 3rd party Dell compatible SFP+ DAC. No problems, other than the unrelated i40e driver issue.
There is a new Driver Version (2.0.6), from the Release Notes:
Fix PSOD caused by small TSO segmentation
Has anyone tried this already and enabled TSO/LRO?
Thanks
Michel
If possible, we recommend using the latest i40en 1.5.6 driver and 6.01 firmware.
Here are the links to the downloads.