Hello All,
After installing ESXi 6.0 via "VMware-VMvisor-Installer-6.0.0.update02-4192238.x86_64-Dell_Customized-A04" on a Dell PowerEdge R730, everything works fine.
But if I install ESXi 6.5 via "VMware-VMvisor-Installer-6.5.0-4564106.x86_64-Dell_Customized-A00" on the same Dell PowerEdge R730, vmnic0 and vmnic1 (Intel X710 DP 10Gb DA/SFP+) stop working.
Does anybody know a solution?
Thanks for your replies.
[root@R730:~] esxcli network nic list
Name PCI Device Driver Admin Status Link Status Speed Duplex MAC Address MTU Description
------ ------------ ------ ------------ ----------- ----- ------ ----------------- ---- ---------------------------------------------------------
vmnic0 0000:01:00.0 i40en Up Up 10000 Full f8:bc:12:05:85:d0 1500 Intel Corporation Ethernet Controller X710 for 10GbE SFP+
vmnic1 0000:01:00.1 i40en Up Up 10000 Full f8:bc:12:05:85:d2 1500 Intel Corporation Ethernet Controller X710 for 10GbE SFP+
vmnic2 0000:0c:00.0 igbn Up Down 0 Half f8:bc:12:05:85:f0 1500 Intel Corporation Gigabit 4P X710/I350 rNDC
vmnic3 0000:0c:00.1 igbn Up Up 1000 Full f8:bc:12:05:85:f1 1500 Intel Corporation Gigabit 4P X710/I350 rNDC
vusb0 Pseudo cdce Up Up 100 Full 18:fb:7b:5d:d5:ee 1500 DellTM iDRAC Virtual NIC USB Device
[root@R730:~] esxcli network nic get -n vmnic0
Advertised Auto Negotiation: false
Advertised Link Modes: 1000BaseT/Full, 10000BaseT/Full, 10000BaseT/Full, 40000BaseCR4/Full, 40000BaseSR4/Full
Auto Negotiation: false
Cable Type:
Current Message Level: -1
Driver Info:
Bus Info: 0000:01:00:0
Driver: i40en
Firmware Version: 5.04 0x800024bc 17.5.11
Version: 1.1.0
Link Detected: true
Link Status: Up
Name: vmnic0
PHYAddress: 0
Pause Autonegotiate: false
Pause RX: false
Pause TX: false
Supported Ports:
Supports Auto Negotiation: false
Supports Pause: false
Supports Wakeon: true
Transceiver:
Virtual Address: 00:50:56:52:57:61
Wakeon: MagicPacket(tm)
Actually, the issues are related to the device driver; you can customize the image and add your own drivers to it.
Here is an example: https://www.vmguru.com/2015/04/how-to-build-a-custom-image-with-vsphere-esxi-image-builder-cli/
I actually have this same issue. Which driver needs to be loaded for the NIC to work? I found that VLAN tagging does not work. If I untag the port, everything works; however, I need multiple VLANs on the NIC.
This only happens on the X710 (X710-DA4). I have an X520 in the same server and it does not have this issue.
Solved! At least for me. BTW, I am using the Dell Customized image as well.
I had to remove the i40en driver using the following command:
esxcli software vib remove -n i40en
(Wait about two minutes for ESXi to remove the driver.)
Reboot the server.
After removing the driver, the X710 NICs defaulted to the i40e driver (without the "n"), and VLAN tagging started to work.
This worked on all four of my servers. I hope this helps you.
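For anyone repeating this, here is a small sketch to confirm the fallback after the reboot. `check_x710_driver` is a hypothetical helper name, and it assumes the column layout of `esxcli network nic list` shown earlier in the thread:

```shell
# Hypothetical helper: print "<nic> <driver>" for each X710 10GbE port,
# given `esxcli network nic list` output on stdin.
check_x710_driver() {
  awk '/X710 for 10GbE/ { print $1, $3 }'
}

# On a live host:
#   esxcli network nic list | check_x710_driver
# After removing the i40en VIB and rebooting, every line should show i40e.
```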
FYI:
Also, before removing the driver, I tried the following command, which did not change anything; after rebooting the server, i40en was still enabled. Removing the i40en VIB was the only thing that worked:
esxcli system module set --enabled=false --module=i40en
I also see issues with the X710 (dual port) after an upgrade to ESXi 6.5 (Dell customized). In my case, the NIC works normally for a few days after host boot and then silently fails. As I use VLANs extensively, I presume this is a delayed onset of the VLAN tagging issue.
I had no issues at all under ESXi 6.0 with the i40e driver. The combination worked flawlessly.
I am going to try disabling the i40en driver and fall back to the i40e, but it will take a week or so of uptime to assert with some confidence that this is the solution.
Does anyone have any better advice?
esxcli network nic get -n vmnic4
Advertised Auto Negotiation: false
Advertised Link Modes: 1000BaseT/Full, 10000BaseT/Full, 10000BaseT/Full
Auto Negotiation: false
Cable Type:
Current Message Level: -1
Driver Info:
Bus Info: 0000:06:00:0
Driver: i40en
Firmware Version: 5.05 0x80002899 17.0.12
Version: 1.1.0
Link Detected: true
Link Status: Up
Name: vmnic4
PHYAddress: 0
Pause Autonegotiate: false
Pause RX: false
Pause TX: false
Supported Ports:
Supports Auto Negotiation: false
Supports Pause: false
Supports Wakeon: true
Transceiver:
Virtual Address: 00:50:56:13:ac:e0
Wakeon: MagicPacket(tm)
Successfully disabled the i40en driver, now running the i40e driver upon reboot. So far, so good. Time will tell.
I originally thought this was a VLAN tagging issue, but further investigation revealed that the NIC goes entirely silent: link stays up, but nothing passes. VMs communicating through the same vSwitch remain in contact with each other; everything beyond the NIC is disconnected.
Much to my disgust, I also encountered the problem described in "ESXi 6.5 connectivity issue on PowerEdge R430": my standby NICs running the ntg3 driver also failed several days after I physically disconnected the problematic X710. I will likely disable ntg3 too and fall back to the tg3 driver.
Good day,
I'm sitting in the same boat on this issue.
I removed the i40en VIB and the host reverted to the i40e driver. This seemed to solve the problem, but after some time I still see host and VM disconnects.
Log extracts:
2017-04-14T19:29:25.825Z cpu39:66210)WARNING: LinNet: netdev_watchdog:3688: NETDEV WATCHDOG: vmnic1: transmit timed out
2017-04-14T19:29:25.825Z cpu39:66210)<6>i40e 0000:01:00.1: tx_timeout: VSI_seid: 390, Q 0, NTC: 0x1e9, HWB: 0x1e9, NTU: 0x57, TAIL: 0x57, INT: 0x1
2017-04-14T19:29:25.825Z cpu39:66210)<6>i40e 0000:01:00.1: tx_timeout recovery level 1, hung_queue 0
2017-04-14T19:29:25.825Z cpu39:66210)WARNING: at vmkdrivers/src_92/vmklinux_92/vmware/linux_net.c:3717/netdev_watchdog() (inside vmklinux)
2017-04-14T19:29:25.825Z cpu39:66210)Backtrace for current CPU #39, worldID=66210, fp=0x4307acd91500
2017-04-14T19:29:25.825Z cpu39:66210)0x43915511be50:[0x418033303f71]vmk_LogBacktraceMessage@vmkernel#nover+0x29 stack: 0x4307acd81e48, 0x418033a818ad, 0xe68, 0x4307acd91500, 0x439100001022
2017-04-14T19:29:25.825Z cpu39:66210)0x43915511be70:[0x418033a818ad]watchdog_work_cb@com.vmware.driverAPI#9.2+0x27d stack: 0x439100001022, 0x418033d9a3c8, 0x418033a8185a, 0x43915511bef0, 0xc0000000
2017-04-14T19:29:25.825Z cpu39:66210)0x43915511bed0:[0x418033aa2e28]vmklnx_workqueue_callout@com.vmware.driverAPI#9.2+0xe0 stack: 0x4307acdbc9c0, 0x417fc4e0c5c0, 0x8000000000001014, 0x418033a81630, 0x418033aa2e1d
2017-04-14T19:29:25.825Z cpu39:66210)0x43915511bf50:[0x4180332c93ee]helpFunc@vmkernel#nover+0x4b6 stack: 0x4301a0fff050, 0x0, 0x0, 0x0, 0x1014
2017-04-14T19:29:25.825Z cpu39:66210)0x43915511bfe0:[0x4180334c8c95]CpuSched_StartWorld@vmkernel#nover+0x99 stack: 0x0, 0x0, 0x0, 0x0, 0x0
2017-04-14T19:29:25.836Z cpu5:66220)<6>i40e 0000:01:00.1: VSI seid 390 Tx ring 0 disable timeout
2017-04-14T19:29:25.849Z cpu5:66220)<6>i40e 0000:01:00.1: VSI seid 399 Tx ring 1 disable timeout
2017-04-14T19:29:25.861Z cpu5:66220)<6>i40e 0000:01:00.1: VSI seid 400 Tx ring 2 disable timeout
2017-04-14T19:29:25.873Z cpu5:66220)<6>i40e 0000:01:00.1: VSI seid 401 Tx ring 3 disable timeout
2017-04-14T19:29:25.886Z cpu5:66220)<6>i40e 0000:01:00.1: VSI seid 402 Tx ring 4 disable timeout
2017-04-14T19:29:25.898Z cpu5:66220)<6>i40e 0000:01:00.1: VSI seid 403 Tx ring 5 disable timeout
2017-04-14T19:29:25.910Z cpu5:66220)<6>i40e 0000:01:00.1: VSI seid 404 Tx ring 6 disable timeout
2017-04-14T19:29:25.923Z cpu5:66220)<6>i40e 0000:01:00.1: VSI seid 405 Tx ring 7 disable timeout
2017-04-14T19:29:26.125Z cpu5:66220)<6>i40e 0000:01:00.1: PF reset failed, -15
2017-04-14T19:29:29.862Z cpu21:69904)NetSched: 701: 0x2000002: received a force quiesce for port 0x2000008, dropped 7 pkts
2017-04-14T19:29:30.834Z cpu44:66219)WARNING: LinNet: netdev_watchdog:3688: NETDEV WATCHDOG: vmnic0: transmit timed out
2017-04-14T19:29:30.834Z cpu44:66219)<6>i40e 0000:01:00.0: tx_timeout: VSI_seid: 391, Q 0, NTC: 0x13f, HWB: 0x13f, NTU: 0x194, TAIL: 0x194, INT: 0x1
2017-04-14T19:29:30.834Z cpu44:66219)<6>i40e 0000:01:00.0: tx_timeout recovery level 1, hung_queue 0
2017-04-14T19:29:30.845Z cpu36:66205)<6>i40e 0000:01:00.0: VSI seid 391 Tx ring 0 disable timeout
2017-04-14T19:29:30.857Z cpu36:66205)<6>i40e 0000:01:00.0: VSI seid 392 Tx ring 1 disable timeout
2017-04-14T19:29:30.870Z cpu36:66205)<6>i40e 0000:01:00.0: VSI seid 393 Tx ring 2 disable timeout
2017-04-14T19:29:30.883Z cpu36:66205)<6>i40e 0000:01:00.0: VSI seid 394 Tx ring 3 disable timeout
2017-04-14T19:29:30.895Z cpu36:66205)<6>i40e 0000:01:00.0: VSI seid 395 Tx ring 4 disable timeout
2017-04-14T19:29:30.907Z cpu36:66205)<6>i40e 0000:01:00.0: VSI seid 396 Tx ring 5 disable timeout
2017-04-14T19:29:30.920Z cpu36:66205)<6>i40e 0000:01:00.0: VSI seid 397 Tx ring 6 disable timeout
2017-04-14T19:29:30.932Z cpu36:66205)<6>i40e 0000:01:00.0: VSI seid 398 Tx ring 7 disable timeout
2017-04-14T19:29:31.384Z cpu36:66205)<6>i40e 0000:01:00.0: PF reset failed, -15
This is happening randomly on 3 hosts, which largely rules out a hardware fault.
The next step is to update firmware on the Dell M630 hosts. This will bring the X710 firmware from the current 17.5.11 to 17.5.12, and revert back to the i40en driver.
Support cases are currently open with both VMware and Dell.
I will post the outcome here.
Thanks, everyone, for the information shared so far.
My host (a Dell R430) continues to work normally after 4 days of uptime with the i40e driver. That's somewhat longer than I was seeing with i40en. Still watching and waiting.
Just an update:
I have done the firmware update and reverted back to i40en.
No luck.
So I'm back on the i40e driver with the new X710 firmware and Dell OS Driver Pack.
Fingers crossed.
I have not seen the disconnections since reverting to the i40e driver. However, I did abruptly start getting PSODs, with roughly the same frequency of occurrence.
I have since disabled TSO/LRO support. I have not seen either type of failure for the last few days, but that's hardly conclusive at this point.
After 5 fault free days, I'm fairly confident now that I've seen the last of both the disconnections and PSODs.
Action summary (in maintenance mode):
1) Disable the i40en driver
esxcli system module set --enabled=false --module=i40en
2) Disable TSO/LRO globally
esxcli system settings advanced set -o /Net/UseHwTSO -i 0
esxcli system settings advanced set -o /Net/UseHwTSO6 -i 0
esxcli system settings advanced set -o /Net/TcpipDefLROEnabled -i 0
3) Explicitly disable both HW & SW LRO on Vmxnet 3.
Probably not necessary, but I wanted LRO really, truly, most sincerely dead.
esxcli system settings advanced set -o /Net/Vmxnet3HwLRO -i 0
esxcli system settings advanced set -o /Net/Vmxnet3SwLRO -i 0
4) Reboot, then confirm use of i40e driver
esxcli system module list | grep i40e
i40e true true
i40en false false
esxcli network nic list
Name PCI Device Driver Admin Status Link Status Speed Duplex MAC Address MTU Description
------ ------------ ------ ------------ ----------- ----- ------ ----------------- ---- ---------------------------------------------------------
vmnic0 0000:02:00.0 ntg3 Up Up 1000 Full 14:18:77:43:23:78 1500 Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic1 0000:02:00.1 ntg3 Up Up 1000 Full 14:18:77:43:23:79 1500 Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic2 0000:03:00.0 ntg3 Up Up 1000 Full 14:18:77:43:23:7a 1500 Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic3 0000:03:00.1 ntg3 Up Up 1000 Full 14:18:77:43:23:7b 1500 Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic4 0000:06:00.0 i40e Up Up 10000 Full 3c:fd:fe:03:ac:e0 1500 Intel Corporation Ethernet Controller X710 for 10GbE SFP+
vmnic5 0000:06:00.1 i40e Up Up 10000 Full 3c:fd:fe:03:ac:e2 1500 Intel Corporation Ethernet Controller X710 for 10GbE SFP+
5) Confirm non-use of TSO/LRO
esxcli system settings advanced list -o /Net/UseHwTSO
Path: /Net/UseHwTSO
Type: integer
Int Value: 0
Default Int Value: 1
Min Value: 0
Max Value: 1
String Value:
Default String Value:
Valid Characters:
Description: When non-zero, use pNIC HW TSO offload if available
esxcli system settings advanced list -o /Net/UseHwTSO6
Path: /Net/UseHwTSO6
Type: integer
Int Value: 0
Default Int Value: 1
Min Value: 0
Max Value: 1
String Value:
Default String Value:
Valid Characters:
Description: When non-zero, use pNIC HW IPv6 TSO offload if available
esxcli system settings advanced list -o /Net/TcpipDefLROEnabled
Path: /Net/TcpipDefLROEnabled
Type: integer
Int Value: 0
Default Int Value: 1
Min Value: 0
Max Value: 1
String Value:
Default String Value:
Valid Characters:
Description: LRO enabled for TCP/IP
esxcli system settings advanced list -o /Net/Vmxnet3HwLRO
Path: /Net/Vmxnet3HwLRO
Type: integer
Int Value: 0
Default Int Value: 1
Min Value: 0
Max Value: 1
String Value:
Default String Value:
Valid Characters:
Description: Whether to enable HW LRO on pkts going to a LPD capable vmxnet3
esxcli system settings advanced list -o /Net/Vmxnet3SwLRO
Path: /Net/Vmxnet3SwLRO
Type: integer
Int Value: 0
Default Int Value: 1
Min Value: 0
Max Value: 1
String Value:
Default String Value:
Valid Characters:
Description: Whether to perform SW LRO on pkts going to a LPD capable vmxnet3
My ESXi version:
esxcli system version get
Product: VMware ESXi
Version: 6.5.0
Build: Releasebuild-5224529
Update: 0
Patch: 15
My R430 firmware, per IDRAC:
Integrated Dell Remote Access Controller | 2.41.40.40 | ||
Broadcom Gigabit Ethernet BCM5720 - 14:18:77:43:23:78 | 7.10.64 | ||
Broadcom Gigabit Ethernet BCM5720 - 14:18:77:43:23:79 | 7.10.64 | ||
Broadcom Gigabit Ethernet BCM5720 - 14:18:77:43:23:7A | 7.10.64 | ||
Broadcom Gigabit Ethernet BCM5720 - 14:18:77:43:23:7B | 7.10.64 | ||
Intel(R) Ethernet Converged Network Adapter X710 - 3C:FD:FE:03:AC:E0 | 17.5.12 | ||
Intel(R) Ethernet Converged Network Adapter X710 - 3C:FD:FE:03:AC:E2 | 17.5.12 | ||
BIOS | 2.3.4 | ||
Lifecycle Controller | 2.41.40.40 | ||
Dell 32 Bit uEFI Diagnostics, version 4239, 4239A29, 4239.37 | 4239A29 | ||
Dell OS Driver Pack, 16.10.10, A00 | 16.10.10 | ||
OS COLLECTOR 1.1, OSC_1.1, A00 | OSC_1.1 | ||
System CPLD | 1.0.3 | ||
PERC H330 Mini | 25.5.0.0019 | ||
Dell 12Gbps HBA | 13.17.03.00 |
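For what it's worth, steps 1-3 above can be bundled into one small maintenance-mode script. This is just a sketch of the commands already listed; `disable_i40en_and_offloads` is a name I made up, and the script falls back to echoing the commands when run somewhere without esxcli:

```shell
#!/bin/sh
# Sketch of steps 1-3 above, run from the ESXi shell while in maintenance mode.
# Falls back to a dry run (echoing the commands) when esxcli is not present.
ESXCLI=esxcli
command -v "$ESXCLI" >/dev/null 2>&1 || ESXCLI="echo esxcli"

disable_i40en_and_offloads() {
  # 1) Disable the native i40en module so i40e claims the X710 ports
  $ESXCLI system module set --enabled=false --module=i40en
  # 2) and 3) Disable TSO/LRO globally plus both HW and SW LRO on vmxnet3
  for opt in /Net/UseHwTSO /Net/UseHwTSO6 /Net/TcpipDefLROEnabled \
             /Net/Vmxnet3HwLRO /Net/Vmxnet3SwLRO; do
    $ESXCLI system settings advanced set -o "$opt" -i 0
  done
}

disable_i40en_and_offloads
# 4) Reboot, then confirm with: esxcli system module list | grep i40e
```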
That's good info. I ran into the same issues with the R630/X710. When I first got the hosts and installed ESXi 5.5, I was getting PSODs and had to disable TSO/LRO. I was hoping that after upgrading to 6.5 I would be able to re-enable it. I guess I will leave it disabled after reading this.
Had the same network outage issue with a Dell R830 and the 10G Intel X710. Out of the box, ESXi 6.5 uses the i40en driver, and we had several issues with the latest version of i40en. In the end we disabled the i40en driver, and now we hope the i40e driver will be stable; at least Dell support confirmed it should be.
Perfect. Now we have PSODs with the i40e driver too.
Could the PSOD be this? "ESXi host that uses Intel Corporation Ethernet Controller X710 for 10GbE SFP+ NIC fails with PSOD"
Disabling TSO, TSO6, and LRO:
esxcli system settings advanced set -o /Net/UseHwTSO -i 0
esxcli system settings advanced set -o /Net/UseHwTSO6 -i 0
esxcli system settings advanced set -o /Net/TcpipDefLROEnabled -i 0
Hey rzuber78,
we have been experiencing similar problems with our X710-DA4 NICs ever since we got them.
In short:
We currently have an open Support Request 17530352108 with VMware. We also had another SR 17479731106 a while ago where VMware suggested we switch from i40e 2.0.6 to i40en 1.3.1.
Unfortunately, in our current SR, the support engineer couldn't really help us other than telling us the following:
I then escalated the SR and had a brief conversation with the manager who - besides asking me whether I knew what the VMware HCL was (seriously?) - asked me to try the i40e 2.0.6 driver once again, which I refused to do, because that's the driver we had the PSODs with and the driver another support engineer explicitly told us to move away from. The manager also suggested we involve the server OEM, which - at least in our case - would be pointless, as we're using Intel retail NICs.
It's been a day since then and I have not heard from VMware support since.
I also contacted Intel support, who politely told me that Intel has not directly supported VMware drivers since ESXi 5.x.
And if you think that's bad, have a look at the following Intel Communities post, where someone has been fighting these issues with various firmware/driver combinations for more than two years: Intel X710-DA4 / VMware ESXi 6.5u1 - Malicious ... | Intel Communities. To quote directly from that post:
I've had PSOD's and NIC PF reset issues with all the NVM Firmware versions & Drivers I've tried for the past 2 years.
NVM / i40e Driver Versions I've tried.
4.42 / 1.2.48
4.53 / 1.3.38 & 1.3.45
5.02 / 1.4.26
5.04 / 1.4.28
5.05 / 2.0.6
5.05 / 1.31 (i40en)
At first Intel Engineering said many of my issues were known and kept delaying me until NVM 5.02 / 1.4.26, which they expected would resolve them. That release at least made the cards somewhat stable, but the PSOD's and NIC PF resets still happen too frequently (PSOD's occur at least once a week across one of my 12 hosts).
Quite frankly, I've come to the conclusion that neither VMware support nor Intel are willing or able to help us with that problem and our only way out is replacing all of the NICs with hardware from a different vendor.
It seems people are still having issues with this card & driver. All I can add to the discussion at this point is that my problems went away after the changes I described previously. I have not had a single PSOD or hung network for the past 3 months. They were almost daily events before.
It is possible that I am simply not stressing the network hard enough to expose additional difficulties. My average load is only 0.5 Gb/s, with spikes around 2 Gb/s.
Hi cjckalb,
Thank you for the detailed description.
The NIC behaves identically here.
We fresh-installed ESXi 6.5 onto the Dell R830 with this NIC; it used the i40en driver (i40en 1.1.0-1vmw.650.0.0.4564106) out of the box, and we ran a very light workload for 2 months without any problem.
After we migrated some more VMs onto the host, the network outage occurred after around a week, then again 10 days later.
Following the outage I upgraded i40en to the latest available version (1.3.1-5vmw.650.1.26.5969303), but within hours we had another network outage.
I did not touch the NIC firmware at all, as it was already the latest available on delivery (firmware-version: 5.05 0x80002885 17.5.12).
Then I found this forum, read what KellyGreen posted, and eventually used the same workaround he did.
I have filed cases with both Dell and VMware. Dell recommended using i40e driver version 1.4.26.
VMware support is complaining that my firmware is too new and not on the HCL: mine is 5.05 and the HCL lists only 5.02.
I have installed the net-i40e 1.4.28-1OEM.550.0.0.1331820 and had to disable the i40en driver with:
esxcli system module set --enabled=false --module=i40en
After about a day or two we hit a PSOD,
similar to the one in "ESXi host that uses Intel Corporation Ethernet Controller X710 for 10GbE SFP+ NIC fails with PSOD".
So I followed that article and disabled TSO, TSO6, and LRO:
esxcli system settings advanced set -o /Net/UseHwTSO -i 0
esxcli system settings advanced set -o /Net/UseHwTSO6 -i 0
esxcli system settings advanced set -o /Net/TcpipDefLROEnabled -i 0
After this we think the driver is stable, but just in case we have purchased some extra 10G Broadcom NICs.
Some tech details:
Our X710 NICs
[root@esx:~] vmkchdev -l | grep vmnic | grep 1572
0000:01:00.0 8086:1572 1028:1f99 vmkernel vmnic0
0000:01:00.1 8086:1572 1028:0000 vmkernel vmnic1
Driver and FW we are using:
[root@esx:~] ethtool -i vmnic0
driver: i40e
version: 2.0.6
firmware-version: 5.05 0x80002885 17.5.12
bus-info: 0000:01:00.0
With i40e 2.0.6 we still managed to get a PSOD on one of the ESXi servers with LRO/TSO/TSO6 disabled. While I suspect this might be due to our use of VMware NSX (VXLAN), that is still speculation at this point. I'm also somewhat unwilling to buy modern-day NICs and then disable all of these performance features.
In the meantime, 18 days after first opening my service request, I managed to get VMware support to actually do what their policy says they would do - contact Intel via TSAnet (Multi Vendor Support from TSANet | Vendor-Neutral Technical Support Alliance & Community). Yay. Might actually get the first status update tomorrow. However, I'm not deluded enough to expect this to lead anywhere, as that would require someone to actually fix one of the drivers. I don't think that's gonna happen unless they already have a working driver up their sleeve.
Since we cannot really afford any network instability on this multi-tenant environment, I have also ordered some Broadcom NICs, the first of which will go into testing within the next days.
P.S.: If support asks you to downgrade your firmware to 5.02 again, you might want to remind them of the fact that anything older than 5.05 is prone to a denial-of-service vulnerability (Intel® Product Security Center). At least with the retail X(L)710, the HCL also clearly states: "Firmware versions listed are the minimum supported versions."
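To act on that minimum-version point before talking to support, here is a hedged sketch for checking the NVM firmware. `nvm_ok` is a hypothetical helper, and the extraction pipe assumes the `ethtool -i` output format quoted earlier in the thread:

```shell
# Hypothetical helper: succeed only if the NVM version is at least 5.05,
# the release that fixed the denial-of-service issue mentioned above.
nvm_ok() {
  min=5.05
  [ "$(printf '%s\n%s\n' "$min" "$1" | sort -V | head -n1)" = "$min" ]
}

# On a live host, extract the NVM version (e.g. "5.05" from
# "firmware-version: 5.05 0x80002885 17.5.12"):
#   nvm=$(ethtool -i vmnic0 | awk -F': ' '/firmware-version/ {print $2}' | cut -d' ' -f1)
#   nvm_ok "$nvm" || echo "NVM older than 5.05: consider updating, not downgrading"
```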
We also have been experiencing issues with the X710s I am in the process of changing drivers and working with support. This has also caused some heartburn due to these NICs being used on our vSAN cluster.
I wrote up a quick and dirty script to change out the driver and change the advanced settings:
$vmhost = Get-VMHost vmhost1
$esxcli = Get-EsxCli -V2 -VMHost $vmhost
# Disable the i40en module so the host falls back to i40e after reboot
$a = $esxcli.system.module.set.CreateArgs()
$a.enabled = $false
$a.module = "i40en"
$esxcli.system.module.set.Invoke($a)
# Zero out each TSO/LRO advanced option; the args object is reused,
# only the option name changes between calls
$a = $esxcli.system.settings.advanced.set.CreateArgs()
$a.intvalue = 0
foreach ($opt in "/Net/UseHwTSO", "/Net/UseHwTSO6", "/Net/TcpipDefLROEnabled", "/Net/Vmxnet3HwLRO", "/Net/Vmxnet3SwLRO") {
    $a.option = $opt
    $esxcli.system.settings.advanced.set.Invoke($a)
}
Thanks to cjckalb for keeping this post updated and pushing support to get a fix!