VMware Cloud Community
CKF1028
Enthusiast

Install ESXi 6.5 on R730 with Intel X710

Hello All,

After installing ESXi 6.0 via "VMware-VMvisor-Installer-6.0.0.update02-4192238.x86_64-Dell_Customized-A04" on a Dell PowerEdge R730, everything works fine.

But if I install ESXi 6.5 via "VMware-VMvisor-Installer-6.5.0-4564106.x86_64-Dell_Customized-A00" on the same Dell PowerEdge R730, my vmnic0 and vmnic1 (Intel X710 DP 10Gb DA/SFP+) do not work.

Does anybody know a solution?

Thanks for your reply.

[root@R730:~] esxcli network nic list

Name    PCI Device    Driver  Admin Status  Link Status  Speed  Duplex  MAC Address        MTU  Description                                          

------  ------------  ------  ------------  -----------  -----  ------  -----------------  ----  ---------------------------------------------------------

vmnic0  0000:01:00.0  i40en  Up            Up          10000  Full    f8:bc:12:05:85:d0  1500  Intel Corporation Ethernet Controller X710 for 10GbE SFP+

vmnic1  0000:01:00.1  i40en  Up            Up          10000  Full    f8:bc:12:05:85:d2  1500  Intel Corporation Ethernet Controller X710 for 10GbE SFP+

vmnic2  0000:0c:00.0  igbn    Up            Down            0  Half    f8:bc:12:05:85:f0  1500  Intel Corporation Gigabit 4P X710/I350 rNDC          

vmnic3  0000:0c:00.1  igbn    Up            Up            1000  Full    f8:bc:12:05:85:f1  1500  Intel Corporation Gigabit 4P X710/I350 rNDC          

vusb0  Pseudo        cdce    Up            Up            100  Full    18:fb:7b:5d:d5:ee  1500  DellTM iDRAC Virtual NIC USB Device              

[root@R730:~] esxcli network nic get -n vmnic0

  Advertised Auto Negotiation: false

  Advertised Link Modes: 1000BaseT/Full, 10000BaseT/Full, 10000BaseT/Full, 40000BaseCR4/Full, 40000BaseSR4/Full

  Auto Negotiation: false

  Cable Type:

  Current Message Level: -1

  Driver Info:

        Bus Info: 0000:01:00:0

        Driver: i40en

        Firmware Version: 5.04 0x800024bc 17.5.11

        Version: 1.1.0

  Link Detected: true

  Link Status: Up

  Name: vmnic0

  PHYAddress: 0

  Pause Autonegotiate: false

  Pause RX: false

  Pause TX: false

  Supported Ports:

  Supports Auto Negotiation: false

  Supports Pause: false

  Supports Wakeon: true

  Transceiver:

  Virtual Address: 00:50:56:52:57:61

  Wakeon: MagicPacket(tm)


42 Replies
DavoudTeimouri
Virtuoso

The issue is related to the device driver. You can customize the image and add the drivers you need to it.

Here is an example: https://www.vmguru.com/2015/04/how-to-build-a-custom-image-with-vsphere-esxi-image-builder-cli/

-------------------------------------------------------------------------------------
Davoud Teimouri - https://www.teimouri.net - Twitter: @davoud_teimouri Facebook: https://www.facebook.com/teimouri.net/
litjjones
Contributor

I have this same issue. Which driver needs to be loaded for the NIC to work? I found that VLAN tagging does not work: if I untag the port, everything works, but I need multiple VLANs on the card.

This only happens on the X710 (X710-DA4). I have an X520 in the same server and it does not have this issue.
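When chasing a tagging symptom like this, it can help to first confirm how each port group is actually tagged before blaming the driver. On a live host you would feed `esxcli network vswitch standard portgroup list` into the filter shown in the comment; to keep this sketch self-contained it runs over sample output instead (the port group names below are made up for illustration):

```shell
# Show standard-vSwitch port groups that carry a non-zero VLAN ID.
# On a live ESXi host:
#   esxcli network vswitch standard portgroup list | awk 'NR > 2 && $NF != 0 { print $1, "-> VLAN", $NF }'
# The sample imitates that command's output (hypothetical port group names).
sample='Name        Virtual Switch  Active Clients  VLAN ID
----------  --------------  --------------  -------
Mgmt        vSwitch0                     1        0
VM-VLAN10   vSwitch0                     3       10
VM-VLAN20   vSwitch0                     2       20'

# Skip the two header lines; the VLAN ID is the last field on each row.
tagged=$(printf '%s\n' "$sample" | awk 'NR > 2 && $NF != 0 { print $1, "-> VLAN", $NF }')
printf '%s\n' "$tagged"
```

If the port groups are tagged as expected but traffic still only passes untagged, that points at the driver rather than the vSwitch config.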

litjjones
Contributor

Solved!  At least for me.  BTW, I am using the Dell Customized image as well.

I had to remove the i40en driver using the following command:

esxcli software vib remove -n i40en

(wait about two minutes for ESXi to remove the driver)

Reboot the server

After removing the driver, the X710 NICs defaulted to the i40e driver (without the "n"), and VLAN tagging started working.

This worked on all four of my servers.  I hope this helps you.

FYI:

Also, before removing the driver, I tried the following command, which did not change anything: after rebooting the server, i40en was still enabled. Removing the i40en driver was the only thing that worked.

esxcli system module set --enabled=false --module=i40en
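A quick way to check whether the disable actually stuck is to read the flags that `esxcli system module list` prints. The live command is shown in the comment; to stay runnable anywhere, this sketch applies the same filter to sample output in the format that command uses:

```shell
# Report enabled/disabled state for the i40e-family drivers.
# On a live ESXi host:  esxcli system module list | grep i40e
sample='Name   Is Loaded  Is Enabled
i40e   true       true
i40en  false      false'

# Column 2 is "Is Loaded", column 3 is "Is Enabled"; skip the header line.
state=$(printf '%s\n' "$sample" | awk 'NR > 1 { print $1, ($3 == "true" ? "enabled" : "disabled") }')
printf '%s\n' "$state"
```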

KellyGreen
Contributor

I also see issues with the X710 (dual port) after an upgrade to ESXi 6.5 (Dell customized). In my case, the NIC works normally for a few days after host boot and then silently fails. As I use VLANs extensively, I presume this is a delayed onset of the VLAN tagging issue.

I had no issues at all under ESXi 6.0 with the i40e driver.  The combination worked flawlessly.

I am going to try disabling the i40en driver and fall back to the i40e, but it will take a week or so of uptime to assert with some confidence that this is the solution.

Does anyone have any better advice?

esxcli network nic get -n vmnic4

   Advertised Auto Negotiation: false

   Advertised Link Modes: 1000BaseT/Full, 10000BaseT/Full, 10000BaseT/Full

   Auto Negotiation: false

   Cable Type:

   Current Message Level: -1

   Driver Info:

         Bus Info: 0000:06:00:0

         Driver: i40en

         Firmware Version: 5.05 0x80002899 17.0.12

         Version: 1.1.0

   Link Detected: true

   Link Status: Up

   Name: vmnic4

   PHYAddress: 0

   Pause Autonegotiate: false

   Pause RX: false

   Pause TX: false

   Supported Ports:

   Supports Auto Negotiation: false

   Supports Pause: false

   Supports Wakeon: true

   Transceiver:

   Virtual Address: 00:50:56:13:ac:e0

   Wakeon: MagicPacket(tm)

KellyGreen
Contributor

Successfully disabled the i40en driver, now running the i40e driver upon reboot. So far, so good. Time will tell. 

I originally thought this was a VLAN tagging issue, but further investigation revealed the NIC goes entirely silent. I have link but nothing else. VMs communicating through the same vSwitch remain in contact with each other; everything beyond the NIC is disconnected.

Much to my disgust, I also encountered the problem described in ESXi 6.5 connectivity issue on PowerEdge R430, as my standby NICs running the ntg3 driver also failed several days after I physically disconnected the problematic X710. I will likely disable ntg3 too and fall back to the tg3 driver.

SA_Pswieg
Contributor

Good day

I'm in the same boat on this issue.

I removed the i40en VIB and the host reverted to the i40e driver. This seemed to solve the problem at first, but over time I see host and VM disconnects.

Log extracts:

2017-04-14T19:29:25.825Z cpu39:66210)WARNING: LinNet: netdev_watchdog:3688: NETDEV WATCHDOG: vmnic1: transmit timed out

2017-04-14T19:29:25.825Z cpu39:66210)<6>i40e 0000:01:00.1: tx_timeout: VSI_seid: 390, Q 0, NTC: 0x1e9, HWB: 0x1e9, NTU: 0x57, TAIL: 0x57, INT: 0x1

2017-04-14T19:29:25.825Z cpu39:66210)<6>i40e 0000:01:00.1: tx_timeout recovery level 1, hung_queue 0

2017-04-14T19:29:25.825Z cpu39:66210)WARNING: at vmkdrivers/src_92/vmklinux_92/vmware/linux_net.c:3717/netdev_watchdog() (inside vmklinux)

2017-04-14T19:29:25.825Z cpu39:66210)Backtrace for current CPU #39, worldID=66210, fp=0x4307acd91500

2017-04-14T19:29:25.825Z cpu39:66210)0x43915511be50:[0x418033303f71]vmk_LogBacktraceMessage@vmkernel#nover+0x29 stack: 0x4307acd81e48, 0x418033a818ad, 0xe68, 0x4307acd91500, 0x439100001022

2017-04-14T19:29:25.825Z cpu39:66210)0x43915511be70:[0x418033a818ad]watchdog_work_cb@com.vmware.driverAPI#9.2+0x27d stack: 0x439100001022, 0x418033d9a3c8, 0x418033a8185a, 0x43915511bef0, 0xc0000000

2017-04-14T19:29:25.825Z cpu39:66210)0x43915511bed0:[0x418033aa2e28]vmklnx_workqueue_callout@com.vmware.driverAPI#9.2+0xe0 stack: 0x4307acdbc9c0, 0x417fc4e0c5c0, 0x8000000000001014, 0x418033a81630, 0x418033aa2e1d

2017-04-14T19:29:25.825Z cpu39:66210)0x43915511bf50:[0x4180332c93ee]helpFunc@vmkernel#nover+0x4b6 stack: 0x4301a0fff050, 0x0, 0x0, 0x0, 0x1014

2017-04-14T19:29:25.825Z cpu39:66210)0x43915511bfe0:[0x4180334c8c95]CpuSched_StartWorld@vmkernel#nover+0x99 stack: 0x0, 0x0, 0x0, 0x0, 0x0

2017-04-14T19:29:25.836Z cpu5:66220)<6>i40e 0000:01:00.1: VSI seid 390 Tx ring 0 disable timeout

2017-04-14T19:29:25.849Z cpu5:66220)<6>i40e 0000:01:00.1: VSI seid 399 Tx ring 1 disable timeout

2017-04-14T19:29:25.861Z cpu5:66220)<6>i40e 0000:01:00.1: VSI seid 400 Tx ring 2 disable timeout

2017-04-14T19:29:25.873Z cpu5:66220)<6>i40e 0000:01:00.1: VSI seid 401 Tx ring 3 disable timeout

2017-04-14T19:29:25.886Z cpu5:66220)<6>i40e 0000:01:00.1: VSI seid 402 Tx ring 4 disable timeout

2017-04-14T19:29:25.898Z cpu5:66220)<6>i40e 0000:01:00.1: VSI seid 403 Tx ring 5 disable timeout

2017-04-14T19:29:25.910Z cpu5:66220)<6>i40e 0000:01:00.1: VSI seid 404 Tx ring 6 disable timeout

2017-04-14T19:29:25.923Z cpu5:66220)<6>i40e 0000:01:00.1: VSI seid 405 Tx ring 7 disable timeout

2017-04-14T19:29:26.125Z cpu5:66220)<6>i40e 0000:01:00.1: PF reset failed, -15

2017-04-14T19:29:29.862Z cpu21:69904)NetSched: 701: 0x2000002: received a force quiesce for port 0x2000008, dropped 7 pkts

2017-04-14T19:29:30.834Z cpu44:66219)WARNING: LinNet: netdev_watchdog:3688: NETDEV WATCHDOG: vmnic0: transmit timed out

2017-04-14T19:29:30.834Z cpu44:66219)<6>i40e 0000:01:00.0: tx_timeout: VSI_seid: 391, Q 0, NTC: 0x13f, HWB: 0x13f, NTU: 0x194, TAIL: 0x194, INT: 0x1

2017-04-14T19:29:30.834Z cpu44:66219)<6>i40e 0000:01:00.0: tx_timeout recovery level 1, hung_queue 0

2017-04-14T19:29:30.845Z cpu36:66205)<6>i40e 0000:01:00.0: VSI seid 391 Tx ring 0 disable timeout

2017-04-14T19:29:30.857Z cpu36:66205)<6>i40e 0000:01:00.0: VSI seid 392 Tx ring 1 disable timeout

2017-04-14T19:29:30.870Z cpu36:66205)<6>i40e 0000:01:00.0: VSI seid 393 Tx ring 2 disable timeout

2017-04-14T19:29:30.883Z cpu36:66205)<6>i40e 0000:01:00.0: VSI seid 394 Tx ring 3 disable timeout

2017-04-14T19:29:30.895Z cpu36:66205)<6>i40e 0000:01:00.0: VSI seid 395 Tx ring 4 disable timeout

2017-04-14T19:29:30.907Z cpu36:66205)<6>i40e 0000:01:00.0: VSI seid 396 Tx ring 5 disable timeout

2017-04-14T19:29:30.920Z cpu36:66205)<6>i40e 0000:01:00.0: VSI seid 397 Tx ring 6 disable timeout

2017-04-14T19:29:30.932Z cpu36:66205)<6>i40e 0000:01:00.0: VSI seid 398 Tx ring 7 disable timeout

2017-04-14T19:29:31.384Z cpu36:66205)<6>i40e 0000:01:00.0: PF reset failed, -15

This is happening randomly on 3 hosts, so a hardware fault is largely ruled out.
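For anyone watching for a recurrence, the tx_timeout and PF-reset lines above (plus the "Malicious Driver Detection" message reported elsewhere with i40en) are easy to grep for. On a live host you would use the `tail -f` pipeline in the comment; here, to keep the sketch self-contained, the filter runs over two lines quoted from the log extract above:

```shell
# Flag the X710 failure signatures in vmkernel.log.
# On a live ESXi host:
#   tail -f /var/log/vmkernel.log | grep -E 'tx_timeout|PF reset|Malicious Driver'
log='2017-04-14T19:29:25.825Z cpu39:66210)<6>i40e 0000:01:00.1: tx_timeout recovery level 1, hung_queue 0
2017-04-14T19:29:26.125Z cpu5:66220)<6>i40e 0000:01:00.1: PF reset failed, -15'

# Count how many lines match any of the known failure signatures.
hits=$(printf '%s\n' "$log" | grep -cE 'tx_timeout|PF reset|Malicious Driver')
printf 'matched %s signature line(s)\n' "$hits"
```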

The next step is to update firmware on the Dell M630 hosts, which will take the X710 firmware from the current 17.5.11 to 17.5.12, and then revert to the i40en driver.

We have a support case open with both VMware and Dell.

I will post the outcome here.

Thanks everyone for the information shared so far.

KellyGreen
Contributor

My host (a Dell R430) continues to work normally after 4 days of uptime with the i40e driver. That's somewhat longer than I was seeing with i40en. Still watching and waiting.

SA_Pswieg
Contributor

Just an update:

I did the firmware update and reverted to i40en.

No luck.

So I'm back on the i40e driver with the new X710 firmware and Dell OS Driver Pack.


fingers crossed.

KellyGreen
Contributor

I have not seen the disconnections since reverting to the i40e driver.  However, I did abruptly start getting PSODs, with roughly the same frequency of occurrence.

See ESXi host that uses Intel Corporation Ethernet Controller X710 for 10GbE SFP+ NIC fails with PSOD (2...

I have since disabled TSO/LRO support.  I have not seen either type of failure for the last few days, but that's hardly inspiring at this point.

KellyGreen
Contributor

After 5 fault-free days, I'm fairly confident now that I've seen the last of both the disconnections and the PSODs.

Action summary (in maintenance mode):

1) Disable the i40en driver

esxcli system module set --enabled=false --module=i40en

2) Disable TSO/LRO globally

esxcli system settings advanced set -o /Net/UseHwTSO -i 0

esxcli system settings advanced set -o /Net/UseHwTSO6 -i 0

esxcli system settings advanced set -o /Net/TcpipDefLROEnabled -i 0

3) Explicitly disable both HW & SW LRO on VMXNET3.

    Probably not necessary, but I wanted LRO really, truly, most sincerely dead.

esxcli system settings advanced set -o /Net/Vmxnet3HwLRO -i 0

esxcli system settings advanced set -o /Net/Vmxnet3SwLRO -i 0
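For what it's worth, the five settings from steps 2 and 3 can also be applied in one loop rather than line by line. A minimal sketch: esxcli only exists in the ESXi shell, so outside it this just prints what it would do.

```shell
# Apply all five TSO/LRO advanced settings from steps 2 and 3 in one pass.
opts='/Net/UseHwTSO /Net/UseHwTSO6 /Net/TcpipDefLROEnabled /Net/Vmxnet3HwLRO /Net/Vmxnet3SwLRO'

for o in $opts; do
  if command -v esxcli >/dev/null 2>&1; then
    # On a real ESXi host, set the option to 0.
    esxcli system settings advanced set -o "$o" -i 0
  else
    # Off-host (no esxcli available), just show the intended change.
    printf 'would set %s = 0\n' "$o"
  fi
done
```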

4) Reboot, then confirm use of i40e driver

esxcli system module list | grep i40e

i40e                                true        true

i40en                              false       false

esxcli network nic list

Name    PCI Device    Driver  Admin Status  Link Status  Speed  Duplex  MAC Address         MTU  Description

------  ------------  ------  ------------  -----------  -----  ------  -----------------  ----  ---------------------------------------------------------

vmnic0  0000:02:00.0  ntg3    Up            Up            1000  Full    14:18:77:43:23:78  1500  Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet

vmnic1  0000:02:00.1  ntg3    Up            Up            1000  Full    14:18:77:43:23:79  1500  Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet

vmnic2  0000:03:00.0  ntg3    Up            Up            1000  Full    14:18:77:43:23:7a  1500  Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet

vmnic3  0000:03:00.1  ntg3    Up            Up            1000  Full    14:18:77:43:23:7b  1500  Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet

vmnic4  0000:06:00.0  i40e    Up            Up           10000  Full    3c:fd:fe:03:ac:e0  1500  Intel Corporation Ethernet Controller X710 for 10GbE SFP+

vmnic5  0000:06:00.1  i40e    Up            Up           10000  Full    3c:fd:fe:03:ac:e2  1500  Intel Corporation Ethernet Controller X710 for 10GbE SFP+
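A quick way to eyeball just the driver bound to each vmnic, without the full table, is to cut the Name and Driver columns. The comment shows the live pipeline; the snippet itself runs over two rows quoted from the table above, so it is self-contained:

```shell
# Print just the Name and Driver columns from "esxcli network nic list" output.
# On a live ESXi host:  esxcli network nic list | awk 'NR > 2 { print $1, $3 }'
sample='Name    PCI Device    Driver
------  ------------  ------
vmnic0  0000:02:00.0  ntg3
vmnic4  0000:06:00.0  i40e'

# Skip the header and separator rows; Driver is the third column.
drivers=$(printf '%s\n' "$sample" | awk 'NR > 2 { print $1, $3 }')
printf '%s\n' "$drivers"
```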

5) Confirm non-use of TSO/LRO

esxcli system settings advanced list -o /Net/UseHwTSO

   Path: /Net/UseHwTSO

   Type: integer

   Int Value: 0

   Default Int Value: 1

   Min Value: 0

   Max Value: 1

   String Value:

   Default String Value:

   Valid Characters:

   Description: When non-zero, use pNIC HW TSO offload if available

esxcli system settings advanced list -o /Net/UseHwTSO6

   Path: /Net/UseHwTSO6

   Type: integer

   Int Value: 0

   Default Int Value: 1

   Min Value: 0

   Max Value: 1

   String Value:

   Default String Value:

   Valid Characters:

   Description: When non-zero, use pNIC HW IPv6 TSO offload if available

esxcli system settings advanced list -o /Net/TcpipDefLROEnabled

   Path: /Net/TcpipDefLROEnabled

   Type: integer

   Int Value: 0

   Default Int Value: 1

   Min Value: 0

   Max Value: 1

   String Value:

   Default String Value:

   Valid Characters:

   Description: LRO enabled for TCP/IP

esxcli system settings advanced list -o /Net/Vmxnet3HwLRO

   Path: /Net/Vmxnet3HwLRO

   Type: integer

   Int Value: 0

   Default Int Value: 1

   Min Value: 0

   Max Value: 1

   String Value:

   Default String Value:

   Valid Characters:

   Description: Whether to enable HW LRO on pkts going to a LPD capable vmxnet3

esxcli system settings advanced list -o /Net/Vmxnet3SwLRO

   Path: /Net/Vmxnet3SwLRO

   Type: integer

   Int Value: 0

   Default Int Value: 1

   Min Value: 0

   Max Value: 1

   String Value:

   Default String Value:

   Valid Characters:

   Description: Whether to perform SW LRO on pkts going to a LPD capable vmxnet3

My ESXi version:

esxcli system version get

   Product: VMware ESXi

   Version: 6.5.0

   Build: Releasebuild-5224529

   Update: 0

   Patch: 15

My R430 firmware, per iDRAC:

Integrated Dell Remote Access Controller                              2.41.40.40
Broadcom Gigabit Ethernet BCM5720 - 14:18:77:43:23:78                 7.10.64
Broadcom Gigabit Ethernet BCM5720 - 14:18:77:43:23:79                 7.10.64
Broadcom Gigabit Ethernet BCM5720 - 14:18:77:43:23:7A                 7.10.64
Broadcom Gigabit Ethernet BCM5720 - 14:18:77:43:23:7B                 7.10.64
Intel(R) Ethernet Converged Network Adapter X710 - 3C:FD:FE:03:AC:E0  17.5.12
Intel(R) Ethernet Converged Network Adapter X710 - 3C:FD:FE:03:AC:E2  17.5.12
BIOS                                                                  2.3.4
Lifecycle Controller                                                  2.41.40.40
Dell 32 Bit uEFI Diagnostics, version 4239, 4239A29, 4239.37          4239A29
Dell OS Driver Pack, 16.10.10, A00                                    16.10.10
OS COLLECTOR 1.1, OSC_1.1, A00                                        OSC_1.1
System CPLD                                                           1.0.3
PERC H330 Mini                                                        25.5.0.0019
Dell 12Gbps HBA                                                       13.17.03.00
aeroliteflyer1
Contributor

That's good info. I ran across the same issues with the R630/X710. When I first got the hosts and installed ESXi 5.5, I was getting PSODs and had to disable TSO/LRO. I was hoping that after upgrading to 6.5 I would be able to re-enable them. I guess I will leave them disabled after reading this.

rzuber78
Contributor

We had the same network outage issue with a Dell R830 and the 10G Intel X710. Out of the box, ESXi 6.5 uses the i40en driver, and we had several issues with the latest version of it. In the end we disabled the i40en driver, and now we hope the i40e driver will be stable, as Dell support confirmed.

rzuber78
Contributor

Perfect, now we have a PSOD with the i40e driver.


rzuber78
Contributor

Could the PSOD be this? ESXi host that uses Intel Corporation Ethernet Controller X710 for 10GbE SFP+ NIC fails with PSOD

Disabling TSO, TSO6, and LRO:

esxcli system settings advanced set -o /Net/UseHwTSO -i 0

esxcli system settings advanced set -o /Net/UseHwTSO6 -i 0

esxcli system settings advanced set -o /Net/TcpipDefLROEnabled -i 0

cjckalb
Contributor

Hey rzuber78,

we have been experiencing similar problems with our X710-DA4 NICs ever since we got them.

In short:

  • i40e 2.0.6 resulted in regular PSODs.
  • i40en 1.3.1-x is causing regular network outages whenever we see any amount of traffic (watch out for "Malicious Driver Detection" in vmkernel.log!)

We currently have an open Support Request 17530352108 with VMware. We also had another SR 17479731106 a while ago where VMware suggested we switch from i40e 2.0.6 to i40en 1.3.1.

Unfortunately, in our current SR, the support engineer couldn't really help us other than telling us the following:

  • ALL current i40e(n) drivers are problematic and they could not point us in any specific direction that would solve our problems.
  • However there is no KB entry detailing these problems.
  • VMware does not directly support Intel drivers and we should open a Ticket with Intel for further support.
  • HCL entries are maintained by Intel and can only be modified and/or removed by the respective vendor.

I then escalated the SR and had a brief conversation with the manager who - besides asking me whether I knew what the VMware HCL was (seriously?) - asked me to try the i40e 2.0.6 driver once again, which I refused to do, because that's the driver we had the PSODs with and the driver another support engineer explicitly told us to move away from. The manager also suggested we involve the server OEM, which - at least in our case - would be pointless, as we're using Intel retail NICs.

It's been a day since then and I have not heard from VMware support since.

I also contacted Intel support who politely told me that Intel does not support VMware drivers directly since ESXi 5.x.

And if you think that's bad, have a look at the following Intel Communities post, where someone has been fighting these issues with various firmware/driver combinations for 2+ years: Intel X710-DA4 / VMware ESXi 6.5u1 - Malicious ... | Intel Communities. To quote directly from that post:

I've had PSODs and NIC PF reset issues with all the NVM firmware versions & drivers I've tried for the past 2 years.

NVM / i40e Driver Versions I've tried.

4.42 / 1.2.48

4.53 / 1.3.38 & 1.3.45

5.02 / 1.4.26

5.04 / 1.4.28

5.05 / 2.0.6

5.05 / 1.31 (i40en)

At first Intel Engineering said many of my issues were known and kept delaying me until NVM 5.02 / 1.4.26, which they expected would resolve them. That release at least made the cards somewhat stable, but the PSODs and NIC PF resets still happen too frequently (PSODs occur at least once a week on one of my 12 hosts).

Quite frankly, I've come to the conclusion that neither VMware support nor Intel are willing or able to help us with that problem and our only way out is replacing all of the NICs with hardware from a different vendor.

KellyGreen
Contributor

It seems people are still having issues with this card & driver.  All I can add to the discussion at this point is that my problems went away after the changes I described previously.  I have not had a single PSOD or hung network for the past 3 months.  They were almost daily events before.

It is possible that I am simply not stressing the network hard enough to expose additional difficulties.  My average load is only 0.5 Gb/s, with spikes around 2 Gb/s.

rzuber78
Contributor

Hi cjckalb,

Thank you for the detailed description.

The NIC behaves identically here.

We did a fresh install of ESXi 6.5 on the Dell R830 with this NIC; it used the i40en driver (i40en 1.1.0-1vmw.650.0.0.4564106) out of the box, and we ran a very light workload for 2 months without any problem.

After we migrated some more VMs onto the host, the network outage occurred after around a week, then again 10 days later.

Following the outage I upgraded i40en to the latest available (1.3.1-5vmw.650.1.26.5969303), but within hours we had another network outage.

I did not touch the NIC firmware at all, as it was already the latest available on delivery (firmware-version: 5.05 0x80002885 17.5.12).

Then I found this forum, read what KellyGreen posted, and eventually used the same workaround he did.

I have filed cases with both Dell and VMware. Dell recommended using i40e driver version 1.4.26.

VMware Support is complaining that my firmware is too new and not on the HCL: mine is 5.05 and the HCL lists only 5.02.

I installed net-i40e 1.4.28-1OEM.550.0.0.1331820 and had to disable the i40en driver with:

esxcli system module set --enabled=false --module=i40en

After about a day or two we got a PSOD,


similar to this ESXi host that uses Intel Corporation Ethernet Controller X710 for 10GbE SFP+ NIC fails with PSOD

So I followed that advice and disabled TSO, TSO6, and LRO:

esxcli system settings advanced set -o /Net/UseHwTSO -i 0

esxcli system settings advanced set -o /Net/UseHwTSO6 -i 0

esxcli system settings advanced set -o /Net/TcpipDefLROEnabled -i 0

After this we think the driver is stable, but in any case we have purchased some extra 10G Broadcom NICs.

Some tech details:

Our X710 NICs

[root@esx:~] vmkchdev -l | grep vmnic | grep 1572

0000:01:00.0 8086:1572 1028:1f99 vmkernel vmnic0

0000:01:00.1 8086:1572 1028:0000 vmkernel vmnic1

Driver and FW we are using:

[root@esx:~] ethtool -i vmnic0

driver: i40e

version: 2.0.6

firmware-version: 5.05 0x80002885 17.5.12

bus-info: 0000:01:00.0

cjckalb
Contributor

With i40e 2.0.6 we still managed to get a PSOD on one of the ESX servers with LRO/TSO/TSO6 disabled. While I suspect this might be due to our use of VMware NSX (VXLAN), that is still speculation at this point. I'm also somewhat unwilling to buy modern-day NICs and then disable all of their performance features.

In the meantime, 18 days after first opening my service request, I managed to get VMware support to actually do what their policy says they would do - contact Intel via TSAnet (Multi Vendor Support from TSANet | Vendor-Neutral Technical Support Alliance & Community). Yay. Might actually get the first status update tomorrow. However, I'm not deluded enough to expect this to lead anywhere, as that would require someone to actually fix one of the drivers. I don't think that's gonna happen unless they already have a working driver up their sleeve.

Since we cannot really afford any network instability on this multi-tenant environment, I have also ordered some Broadcom NICs, the first of which will go into testing within the next days.

P.S.: If support asks you to downgrade your firmware to 5.02 again, you might want to remind them of the fact that anything older than 5.05 is prone to a denial-of-service vulnerability (Intel® Product Security Center). At least with the retail X(L)710, the HCL also clearly states: "Firmware versions listed are the minimum supported versions."

TCG2
Contributor

We have also been experiencing issues with the X710s. I am in the process of changing drivers and working with support. This has also caused some heartburn, as these NICs are used in our vSAN cluster.

I wrote up a quick and dirty script to change out the driver and change the advanced settings:

$vmhost = Get-VMHost vmhost1
$esxcli = Get-EsxCli -V2 -VMHost $vmhost

# Disable the i40en module
$a = $esxcli.system.module.set.CreateArgs()
$a.enabled = $false
$a.module = "i40en"
$esxcli.system.module.set.Invoke($a)

# Zero out each TSO/LRO advanced setting (reusing $a; intvalue stays 0)
$a = $esxcli.system.settings.advanced.set.CreateArgs()
$a.option = "/Net/UseHwTSO"
$a.intvalue = 0
$esxcli.system.settings.advanced.set.Invoke($a)

$a.option = "/Net/UseHwTSO6"
$esxcli.system.settings.advanced.set.Invoke($a)

$a.option = "/Net/TcpipDefLROEnabled"
$esxcli.system.settings.advanced.set.Invoke($a)

$a.option = "/Net/Vmxnet3HwLRO"
$esxcli.system.settings.advanced.set.Invoke($a)

$a.option = "/Net/Vmxnet3SwLRO"
$esxcli.system.settings.advanced.set.Invoke($a)

Thanks to cjckalb for keeping this post updated and pushing support to get a fix!
