
    dead I/O on igb-nic (ESXi 6.7)

    BaumMeister Lurker

      Hi,

       

I'm running a homelab with ESXi 6.7 (build 13006603). I have three NICs in my host: two onboard and one Intel ET 82576 dual-port PCIe card. All NICs are assigned to the same vSwitch; at the moment only one is connected to the physical switch.

When I'm using one of the 82576 NICs and put heavy load on it (like backing up VMs via Nakivo B&R), the NIC stops working after a while and is dead/not responding. Only a reboot of the host or (much easier) physically reconnecting the NIC (cable out, cable in) solves the problem.

       

I suspected a driver issue, so I updated to the latest driver from Intel:

       

       

      [root@esxi:~] /usr/sbin/esxcfg-nics -l

      Name    PCI          Driver      Link Speed      Duplex MAC Address       MTU    Description

      vmnic0  0000:04:00.0 ne1000      Down 0Mbps      Half   00:25:90:a7:65:dc 1500   Intel Corporation 82574L Gigabit Network Connection

      vmnic1  0000:00:19.0 ne1000      Up   1000Mbps   Full   00:25:90:a7:65:dd 1500   Intel Corporation 82579LM Gigabit Network Connection

      vmnic2  0000:01:00.0 igb         Down 0Mbps      Half   90:e2:ba:1e:4d:c6 1500   Intel Corporation 82576 Gigabit Network Connection

      vmnic3  0000:01:00.1 igb         Down 0Mbps      Half   90:e2:ba:1e:4d:c7 1500   Intel Corporation 82576 Gigabit Network Connection

      [root@esxi:~] esxcli software vib list|grep igb

      net-igb                        5.2.5-1OEM.550.0.0.1331820            Intel   VMwareCertified   2019-06-16

      igbn                           0.1.1.0-4vmw.670.2.48.13006603        VMW     VMwareCertified   2019-06-07
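
(For reference, the update itself was the usual offline-bundle install followed by a reboot; the path below is a placeholder for wherever you upload the zip:)

esxcli software vib install -d /vmfs/volumes/datastore1/<igb-offline-bundle>.zip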

       

      Unfortunately this didn't solve the problem.

       

However ... this behaviour doesn't occur when I'm using one of the NICs driven by the ne1000 driver.

       

      Any idea how to solve the issue?

(... or at least dig down to its root?)

       

      Thanks a lot in advance.

       

      Regards

      Chris

       

PS: I found another thread which might be related to my problem: Stopping I/O on vmnic0 (same system behaviour, same driver).

        • 1. Re: dead I/O on igb-nic (ESXi 6.7)
          Sureshkumar M Expert
          vExpert

What does vmkernel.log say? Can you post the vmkernel logs here?

          • 2. Re: dead I/O on igb-nic (ESXi 6.7)
            anvanster Enthusiast

The igb driver 5.2.5 that you are using was released in 2014 and is quite old.

Unfortunately, your card is not supported by the newer "igbn" driver.

            • 3. Re: dead I/O on igb-nic (ESXi 6.7)
              BaumMeister Lurker

              You're right about the newer igbn driver not supporting the nic anymore.

However ... the NIC and driver I'm using are on VMware's HCL:

              VMware Compatibility Guide - I/O Device Search

              • 4. Re: dead I/O on igb-nic (ESXi 6.7)
                BaumMeister Lurker

                Sure.

Here's the log output for the relevant time slot.

The line at 12:45:39 shows when the 82576 NIC (vmnic3) went down; vmnic1 is running with the ne1000 driver.

                2019-06-17T12:20:44.190Z cpu4:2097707)DVFilter: 5964: Checking disconnected filters for timeouts

                2019-06-17T12:23:04.707Z cpu3:2097182)vmw_ahci[0000001f]: AHCI_EdgeIntrHandler:new interrupts coming, IS= 0x2, no repeat

                2019-06-17T12:30:44.190Z cpu0:2097707)DVFilter: 5964: Checking disconnected filters for timeouts

                2019-06-17T12:35:42.190Z cpu0:2098034)StorageApdHandler: 1203: APD start for 0x430c44ee76d0 [3a5eb32c-7141e730]

                2019-06-17T12:35:42.190Z cpu0:2098034)StorageApdHandler: 1203: APD start for 0x430c44ee95d0 [a16fe90b-d7095fcc]

                2019-06-17T12:35:42.190Z cpu3:2097369)StorageApdHandler: 419: APD start event for 0x430c44ee76d0 [3a5eb32c-7141e730]

                2019-06-17T12:35:42.190Z cpu0:2098034)StorageApdHandler: 1203: APD start for 0x430c44eeb4c0 [37c6519b-ec9783e7]

                2019-06-17T12:35:42.190Z cpu3:2097369)StorageApdHandlerEv: 110: Device or filesystem with identifier [3a5eb32c-7141e730] has entered the All Paths Down state.

                2019-06-17T12:35:42.190Z cpu3:2097369)StorageApdHandler: 419: APD start event for 0x430c44ee95d0 [a16fe90b-d7095fcc]

                2019-06-17T12:35:42.190Z cpu3:2097369)StorageApdHandlerEv: 110: Device or filesystem with identifier [a16fe90b-d7095fcc] has entered the All Paths Down state.

                2019-06-17T12:35:42.190Z cpu3:2097369)StorageApdHandler: 419: APD start event for 0x430c44eeb4c0 [37c6519b-ec9783e7]

                2019-06-17T12:35:42.190Z cpu3:2097369)StorageApdHandlerEv: 110: Device or filesystem with identifier [37c6519b-ec9783e7] has entered the All Paths Down state.

                2019-06-17T12:37:06.190Z cpu7:2098034)WARNING: NFS: 337: Lost connection to the server 10.0.0.199 mount point /volume1/VMs, mounted as 3a5eb32c-7141e730-0000-000000000000 ("VMs@Fuchur")

                2019-06-17T12:37:06.190Z cpu7:2098034)WARNING: NFS: 337: Lost connection to the server 10.0.0.199 mount point /volume1/VM_Backups/, mounted as a16fe90b-d7095fcc-0000-000000000000 ("VM_Backups@Fuchur")

                2019-06-17T12:37:06.190Z cpu7:2098034)WARNING: NFS: 337: Lost connection to the server 10.0.0.199 mount point /volume1/Media, mounted as 37c6519b-ec9783e7-0000-000000000000 ("Media@Fuchur")

                2019-06-17T12:38:02.191Z cpu0:2097369)StorageApdHandler: 609: APD timeout event for 0x430c44ee76d0 [3a5eb32c-7141e730]

                2019-06-17T12:38:02.191Z cpu0:2097369)StorageApdHandlerEv: 126: Device or filesystem with identifier [3a5eb32c-7141e730] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.

                2019-06-17T12:38:02.191Z cpu0:2097369)StorageApdHandler: 609: APD timeout event for 0x430c44ee95d0 [a16fe90b-d7095fcc]

                2019-06-17T12:38:02.191Z cpu0:2097369)StorageApdHandlerEv: 126: Device or filesystem with identifier [a16fe90b-d7095fcc] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.

                2019-06-17T12:38:02.191Z cpu0:2097369)StorageApdHandler: 609: APD timeout event for 0x430c44eeb4c0 [37c6519b-ec9783e7]

                2019-06-17T12:38:02.191Z cpu0:2097369)StorageApdHandlerEv: 126: Device or filesystem with identifier [37c6519b-ec9783e7] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.

                2019-06-17T12:40:44.190Z cpu0:2097707)DVFilter: 5964: Checking disconnected filters for timeouts

                2019-06-17T12:45:39.351Z cpu3:2097615)<6>igb: vmnic3 NIC Link is Down

                2019-06-17T12:45:42.732Z cpu7:2097615)<6>igb: vmnic3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None

                2019-06-17T12:45:43.190Z cpu4:2097220)NetqueueBal: 5032: vmnic3: device Up notification, reset logical space needed

                2019-06-17T12:45:43.190Z cpu4:2097220)NetPort: 1580: disabled port 0x2000004

                2019-06-17T12:45:43.190Z cpu2:2097770)NetSched: 654: vmnic3-0-tx: worldID = 2097770 exits

                2019-06-17T12:45:43.190Z cpu4:2097220)Uplink: 11689: enabled port 0x2000004 with mac 90:e2:ba:1e:4d:c7

                2019-06-17T12:45:43.190Z cpu4:2097220)NetPort: 1580: disabled port 0x2000004

                2019-06-17T12:45:43.190Z cpu4:2097220)Uplink: 11689: enabled port 0x2000004 with mac 90:e2:ba:1e:4d:c7

                2019-06-17T12:45:43.191Z cpu5:2097296)CpuSched: 699: user latency of 2102301 vmnic3-0-tx 0 changed by 2097296 NetSchedHelper -6

                2019-06-17T12:45:43.191Z cpu2:2102301)NetSched: 654: vmnic3-0-tx: worldID = 2102301 exits

                2019-06-17T12:45:43.191Z cpu5:2097296)CpuSched: 699: user latency of 2102302 vmnic3-0-tx 0 changed by 2097296 NetSchedHelper -6

                2019-06-17T12:45:48.941Z cpu3:2098034)NFS: 346: Restored connection to the server 10.0.0.199 mount point /volume1/Media, mounted as 37c6519b-ec9783e7-0000-000000000000 ("Media@Fuvchur")

                2019-06-17T12:45:48.941Z cpu4:2097369)StorageApdHandler: 507: APD exit event for 0x430c44eeb4c0 [37c6519b-ec9783e7]

                2019-06-17T12:45:48.941Z cpu3:2098034)NFS: 346: Restored connection to the server 10.0.0.199 mount point /volume1/VMs, mounted as 3a5eb32c-7141e730-0000-000000000000 ("VMs@Fuchur")

                2019-06-17T12:45:48.941Z cpu4:2097369)StorageApdHandlerEv: 117: Device or filesystem with identifier [37c6519b-ec9783e7] has exited the All Paths Down state.

                2019-06-17T12:45:48.941Z cpu4:2097369)StorageApdHandler: 507: APD exit event for 0x430c44ee76d0 [3a5eb32c-7141e730]

                2019-06-17T12:45:48.941Z cpu4:2097369)StorageApdHandlerEv: 117: Device or filesystem with identifier [3a5eb32c-7141e730] has exited the All Paths Down state.

                2019-06-17T12:45:49.613Z cpu3:2098034)NFS: 346: Restored connection to the server 10.0.0.199 mount point /volume1/VM_Backups/, mounted as a16fe90b-d7095fcc-0000-000000000000 ("VM_Backups@Fuchur")

                2019-06-17T12:45:49.613Z cpu4:2097369)StorageApdHandler: 507: APD exit event for 0x430c44ee95d0 [a16fe90b-d7095fcc]

                2019-06-17T12:45:49.613Z cpu4:2097369)StorageApdHandlerEv: 117: Device or filesystem with identifier [a16fe90b-d7095fcc] has exited the All Paths Down state.

                2019-06-17T12:49:19.476Z cpu3:2097615)<6>igb: vmnic3 NIC Link is Down

                2019-06-17T12:49:29.190Z cpu6:2098637 opID=f97c863c)World: 11943: VC opID sps-Main-767271-893-94-37-bba6 maps to vmkernel opID f97c863c

                2019-06-17T12:49:29.190Z cpu6:2098637 opID=f97c863c)SunRPC: 3303: Synchronous RPC abort for client 0x4304520bfb90 IP 10.0.0.199.8.1 proc 1 xid 0x76d7dd9e attempt 1 of 3

                2019-06-17T12:49:39.190Z cpu6:2098637 opID=f97c863c)SunRPC: 3303: Synchronous RPC abort for client 0x4304520bfb90 IP 10.0.0.199.8.1 proc 1 xid 0x76d7dda2 attempt 2 of 3

                2019-06-17T12:49:49.190Z cpu6:2098637 opID=f97c863c)SunRPC: 3303: Synchronous RPC abort for client 0x4304520bfb90 IP 10.0.0.199.8.1 proc 1 xid 0x76d7dda6 attempt 3 of 3

                2019-06-17T12:49:49.190Z cpu6:2098637 opID=f97c863c)WARNING: NFS: 2335: Failed to get attributes (I/O error)

                2019-06-17T12:49:49.190Z cpu6:2098637 opID=f97c863c)NFS: 2444: [Repeated 1 times] Failed to get object (0x451a1b49b3ce) 36 3a5eb32c 7141e730 70001 686a001 0 829c3d42 976c7782 0 0 0 0 0 :No connection

                2019-06-17T12:49:49.190Z cpu6:2098637 opID=f97c863c)NFS: 2449: Failed to get object (0x451a1751b16e) 36 37c6519b ec9783e7 70001 48001 0 829c3d42 976c7782 0 0 0 0 0 :I/O error

                2019-06-17T12:49:51.673Z cpu5:2099927)DEBUG (ne1000): checking link for adapter vmnic1

                2019-06-17T12:49:52.679Z cpu3:2097566)INFO (ne1000): vmnic1: Link is Up

                2019-06-17T12:49:52.679Z cpu3:2097566)DEBUG (ne1000): Reporting uplink 0x43044d090250 status

                2019-06-17T12:49:53.190Z cpu3:2097220)NetqueueBal: 4967: vmnic1: new netq module, reset logical space needed

                2019-06-17T12:49:53.190Z cpu3:2097220)NetqueueBal: 4996: vmnic1: plugins to call differs, reset logical space

                2019-06-17T12:49:53.190Z cpu3:2097220)NetqueueBal: 5032: vmnic1: device Up notification, reset logical space needed

                2019-06-17T12:49:53.190Z cpu3:2097220)Uplink: 537: Driver claims supporting 0 RX queues, and 0 queues are accepted.

                2019-06-17T12:49:53.190Z cpu3:2097220)Uplink: 533: Driver claims supporting 0 TX queues, and 0 queues are accepted.

                2019-06-17T12:49:53.190Z cpu3:2097220)NetPort: 1580: disabled port 0x2000008

                2019-06-17T12:49:53.190Z cpu1:2097761)NetSched: 654: vmnic1-0-tx: worldID = 2097761 exits

                2019-06-17T12:49:53.190Z cpu3:2097220)Uplink: 11689: enabled port 0x2000008 with mac 00:25:90:a7:65:dd

                2019-06-17T12:49:53.190Z cpu5:2097296)CpuSched: 699: user latency of 2102444 vmnic1-0-tx 0 changed by 2097296 NetSchedHelper -6

                2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Disabled 'Capable To Xmit Scatter-Gathered Data'

                2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Disabled 'Capable To Offload Checksum for IPv4'

                2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Disabled 'Capable To Offload TCP Segmentation for IPv4'

                2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Disabled 'Capable To Insert VLAN Tag'

                2019-06-17T12:49:53.190Z cpu3:2097220)DEBUG (ne1000): writing uplink config

                2019-06-17T12:49:53.190Z cpu3:2097220)DEBUG (ne1000): writing adapter config

                2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Disabled 'Capable To Strip VLAN Tag'

                2019-06-17T12:49:53.190Z cpu3:2097220)DEBUG (ne1000): writing uplink config

                2019-06-17T12:49:53.190Z cpu3:2097220)DEBUG (ne1000): writing adapter config

                2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Disabled 'Capable To Xmit Scatter-Gathered Across Multiple Pages'

                2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Disabled 'Capable To Offload Checksum for IPv6'

                2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Disabled 'Capable To Offload TCP Segmentation for IPv6'

                2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Enabled 'Capable To Xmit Scatter-Gathered Data'

                2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Enabled 'Capable To Offload Checksum for IPv4'

                2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Enabled 'Capable To Offload TCP Segmentation for IPv4'

                2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Enabled 'Capable To Insert VLAN Tag'

                2019-06-17T12:49:53.190Z cpu3:2097220)DEBUG (ne1000): writing uplink config

                2019-06-17T12:49:53.190Z cpu3:2097220)DEBUG (ne1000): writing adapter config

                2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Enabled 'Capable To Strip VLAN Tag'

                2019-06-17T12:49:53.190Z cpu3:2097220)DEBUG (ne1000): writing uplink config

                2019-06-17T12:49:53.190Z cpu3:2097220)DEBUG (ne1000): writing adapter config

                2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Enabled 'Capable To Xmit Scatter-Gathered Across Multiple Pages'

                2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Enabled 'Capable To Offload Checksum for IPv6'

                2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Enabled 'Capable To Offload TCP Segmentation for IPv6'

                2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Disabled 'Driver Requires No Packet Scheduling'

                2019-06-17T12:49:54.190Z cpu6:2098034)StorageApdHandler: 1203: APD start for 0x430c44ee76d0 [3a5eb32c-7141e730]

                2019-06-17T12:49:54.190Z cpu6:2098034)StorageApdHandler: 1203: APD start for 0x430c44ee95d0 [a16fe90b-d7095fcc]

                2019-06-17T12:49:54.190Z cpu6:2098034)StorageApdHandler: 1203: APD start for 0x430c44eeb4c0 [37c6519b-ec9783e7]

                2019-06-17T12:49:54.190Z cpu4:2097369)StorageApdHandler: 419: APD start event for 0x430c44ee76d0 [3a5eb32c-7141e730]

                2019-06-17T12:49:54.190Z cpu4:2097369)StorageApdHandlerEv: 110: Device or filesystem with identifier [3a5eb32c-7141e730] has entered the All Paths Down state.

                2019-06-17T12:49:54.190Z cpu4:2097369)StorageApdHandler: 419: APD start event for 0x430c44ee95d0 [a16fe90b-d7095fcc]

                2019-06-17T12:49:54.190Z cpu4:2097369)StorageApdHandlerEv: 110: Device or filesystem with identifier [a16fe90b-d7095fcc] has entered the All Paths Down state.

                2019-06-17T12:49:54.190Z cpu4:2097369)StorageApdHandler: 419: APD start event for 0x430c44eeb4c0 [37c6519b-ec9783e7]

                2019-06-17T12:49:54.190Z cpu4:2097369)StorageApdHandlerEv: 110: Device or filesystem with identifier [37c6519b-ec9783e7] has entered the All Paths Down state.

                2019-06-17T12:50:00.969Z cpu2:2098034)StorageApdHandler: 1315: APD exit for 0x430c44eeb4c0 [37c6519b-ec9783e7]

                2019-06-17T12:50:00.969Z cpu4:2097369)StorageApdHandler: 507: APD exit event for 0x430c44eeb4c0 [37c6519b-ec9783e7]

                2019-06-17T12:50:00.969Z cpu2:2098034)StorageApdHandler: 1315: APD exit for 0x430c44ee76d0 [3a5eb32c-7141e730]

                2019-06-17T12:50:00.969Z cpu4:2097369)StorageApdHandlerEv: 117: Device or filesystem with identifier [37c6519b-ec9783e7] has exited the All Paths Down state.

                2019-06-17T12:50:00.969Z cpu2:2098034)StorageApdHandler: 1315: APD exit for 0x430c44ee95d0 [a16fe90b-d7095fcc]

                2019-06-17T12:50:00.969Z cpu4:2097369)StorageApdHandler: 507: APD exit event for 0x430c44ee76d0 [3a5eb32c-7141e730]

                2019-06-17T12:50:00.969Z cpu4:2097369)StorageApdHandlerEv: 117: Device or filesystem with identifier [3a5eb32c-7141e730] has exited the All Paths Down state.

                2019-06-17T12:50:00.969Z cpu4:2097369)StorageApdHandler: 507: APD exit event for 0x430c44ee95d0 [a16fe90b-d7095fcc]

                2019-06-17T12:50:00.969Z cpu4:2097369)StorageApdHandlerEv: 117: Device or filesystem with identifier [a16fe90b-d7095fcc] has exited the All Paths Down state.

                2019-06-17T12:50:32.325Z cpu6:2099723)VSCSI: 6602: handle 8209(vscsi0:0):Destroying Device for world 2099687 (pendCom 0)

                2019-06-17T12:50:32.327Z cpu3:2099715)VSCSI: 6602: handle 8208(vscsi0:0):Destroying Device for world 2099688 (pendCom 0)

                2019-06-17T12:50:32.327Z cpu2:2099723)CBT: 723: Disconnecting the cbt device 2f0796-cbt with filehandle 3082134

                2019-06-17T12:50:32.328Z cpu3:2099715)CBT: 723: Disconnecting the cbt device 31072d-cbt with filehandle 3213101

                2019-06-17T12:50:32.342Z cpu1:2099723)CBT: 1352: Created device 41078e-cbt for cbt driver with filehandle 4261774

                2019-06-17T12:50:32.342Z cpu3:2099715)CBT: 1352: Created device 320792-cbt for cbt driver with filehandle 3278738

                2019-06-17T12:50:32.345Z cpu1:2099723)CBT: 1352: Created device 5107a4-cbt for cbt driver with filehandle 5310372

                2019-06-17T12:50:32.346Z cpu1:2099723)CBT: 723: Disconnecting the cbt device 41078e-cbt with filehandle 4261774

                2019-06-17T12:50:32.346Z cpu3:2099715)CBT: 1352: Created device 2807a7-cbt for cbt driver with filehandle 2623399

                2019-06-17T12:50:32.346Z cpu3:2099715)CBT: 723: Disconnecting the cbt device 320792-cbt with filehandle 3278738

                2019-06-17T12:50:32.346Z cpu1:2099723)CBT: 723: Disconnecting the cbt device 5107a4-cbt with filehandle 5310372

                2019-06-17T12:50:32.346Z cpu3:2099715)CBT: 723: Disconnecting the cbt device 2807a7-cbt with filehandle 2623399

                2019-06-17T12:50:32.347Z cpu3:2099715)CBT: 1352: Created device 2a07a7-cbt for cbt driver with filehandle 2754471

                2019-06-17T12:50:32.348Z cpu1:2099723)CBT: 1352: Created device 5307a4-cbt for cbt driver with filehandle 5441444

                2019-06-17T12:50:32.348Z cpu3:2099715)SVM: 5032: SkipZero 0, dstFsBlockSize -1, preallocateBlocks 0, vmfsOptimizations 0, useBitmapCopy 1, skipPlugGrain 1, destination disk grainSize 0

                2019-06-17T12:50:32.349Z cpu3:2099715)SVM: 5126: SVM_MakeDev.5126: Creating device 2a07a7-3407aa-svmmirror: Success

                2019-06-17T12:50:32.349Z cpu3:2099715)SVM: 5175: Created device 2a07a7-3407aa-svmmirror, primary 2a07a7, secondary 3407aa

                2019-06-17T12:50:32.349Z cpu3:2099715)VSCSI: 3782: handle 8212(vscsi0:0):Using sync mode due to sparse disks

                2019-06-17T12:50:32.349Z cpu3:2099715)VSCSI: 3810: handle 8212(vscsi0:0):Creating Virtual Device for world 2099688 (FSS handle 4327310) numBlocks=41943040 (bs=512)

                2019-06-17T12:50:32.349Z cpu3:2099715)VSCSI: 273: handle 8212(vscsi0:0):Input values: res=0 limit=-2 bw=-1 Shares=1000

                2019-06-17T12:50:32.349Z cpu3:2099715)Vmxnet3: 18569: indLROPktToGuest: 1, vcd->umkShared->vrrsSelected: 3 port 0x200000b

                2019-06-17T12:50:32.349Z cpu3:2099715)Vmxnet3: 18810: Using default queue delivery for vmxnet3 for port 0x200000b

                2019-06-17T12:50:32.349Z cpu1:2099723)SVM: 5032: SkipZero 0, dstFsBlockSize -1, preallocateBlocks 0, vmfsOptimizations 0, useBitmapCopy 1, skipPlugGrain 1, destination disk grainSize 0

                2019-06-17T12:50:32.349Z cpu1:2099723)SVM: 5126: SVM_MakeDev.5126: Creating device 5307a4-3b07ad-svmmirror: Success

                2019-06-17T12:50:32.349Z cpu1:2099723)SVM: 5175: Created device 5307a4-3b07ad-svmmirror, primary 5307a4, secondary 3b07ad

                2019-06-17T12:50:32.349Z cpu1:2099723)VSCSI: 3782: handle 8213(vscsi0:0):Using sync mode due to sparse disks

                2019-06-17T12:50:32.349Z cpu1:2099723)VSCSI: 3810: handle 8213(vscsi0:0):Creating Virtual Device for world 2099687 (FSS handle 3606440) numBlocks=62914560 (bs=512)

                2019-06-17T12:50:32.349Z cpu1:2099723)VSCSI: 273: handle 8213(vscsi0:0):Input values: res=0 limit=-2 bw=-1 Shares=1000

                2019-06-17T12:50:32.350Z cpu1:2099723)Vmxnet3: 18569: indLROPktToGuest: 1, vcd->umkShared->vrrsSelected: 3 port 0x200000d

                2019-06-17T12:50:32.350Z cpu1:2099723)Vmxnet3: 18810: Using default queue delivery for vmxnet3 for port 0x200000d

                2019-06-17T12:50:33.185Z cpu2:2102534)SVM: 2847: scsi0:0 Completed copy in 821 ms. vmmLeaderID = 2099688.

                2019-06-17T12:50:33.223Z cpu0:2102533)SVM: 2847: scsi0:0 Completed copy in 858 ms. vmmLeaderID = 2099687.

                2019-06-17T12:50:33.275Z cpu0:2099715)VSCSI: 6602: handle 8212(vscsi0:0):Destroying Device for world 2099688 (pendCom 0)

                2019-06-17T12:50:33.276Z cpu0:2099715)SVM: 2548: SVM Mirrored mode IO stats for device: 2a07a7-3407aa-svmmirror

                2019-06-17T12:50:33.276Z cpu0:2099715)SVM: 2552: Total # IOs mirrored: 0, Total # IOs sent only to source: 0, Total # IO deferred by lock: 0

                2019-06-17T12:50:33.276Z cpu0:2099715)SVM: 2556: Deferred IO stats - Max: 0, Total: 0, Avg: 1 (msec)

                2019-06-17T12:50:33.276Z cpu0:2099715)SVM: 2570: Destroyed device 2a07a7-3407aa-svmmirror

                2019-06-17T12:50:33.281Z cpu3:2099723)VSCSI: 6602: handle 8213(vscsi0:0):Destroying Device for world 2099687 (pendCom 0)

                2019-06-17T12:50:33.282Z cpu7:2099723)SVM: 2548: SVM Mirrored mode IO stats for device: 5307a4-3b07ad-svmmirror

                2019-06-17T12:50:33.282Z cpu7:2099723)SVM: 2552: Total # IOs mirrored: 0, Total # IOs sent only to source: 0, Total # IO deferred by lock: 0

                2019-06-17T12:50:33.282Z cpu7:2099723)SVM: 2556: Deferred IO stats - Max: 0, Total: 0, Avg: 1 (msec)

                2019-06-17T12:50:33.282Z cpu7:2099723)SVM: 2570: Destroyed device 5307a4-3b07ad-svmmirror

                2019-06-17T12:50:33.335Z cpu1:2099715)CBT: 723: Disconnecting the cbt device 2a07a7-cbt with filehandle 2754471

                2019-06-17T12:50:33.341Z cpu6:2099723)CBT: 723: Disconnecting the cbt device 5307a4-cbt with filehandle 5441444

                2019-06-17T12:50:33.350Z cpu3:2099715)CBT: 1352: Created device 6d09cd-cbt for cbt driver with filehandle 7145933

                2019-06-17T12:50:33.350Z cpu3:2099715)VSCSI: 3782: handle 8214(vscsi0:0):Using sync mode due to sparse disks

                2019-06-17T12:50:33.350Z cpu3:2099715)VSCSI: 3810: handle 8214(vscsi0:0):Creating Virtual Device for world 2099688 (FSS handle 12388969) numBlocks=41943040 (bs=512)

                2019-06-17T12:50:33.350Z cpu3:2099715)VSCSI: 273: handle 8214(vscsi0:0):Input values: res=0 limit=-2 bw=-1 Shares=1000

                2019-06-17T12:50:33.351Z cpu3:2099715)Vmxnet3: 18569: indLROPktToGuest: 1, vcd->umkShared->vrrsSelected: 3 port 0x200000b

                2019-06-17T12:50:33.351Z cpu3:2099715)Vmxnet3: 18810: Using default queue delivery for vmxnet3 for port 0x200000b

                2019-06-17T12:50:33.357Z cpu4:2099723)CBT: 1352: Created device 220ba5-cbt for cbt driver with filehandle 2231205

                2019-06-17T12:50:33.357Z cpu4:2099723)VSCSI: 3782: handle 8215(vscsi0:0):Using sync mode due to sparse disks

                2019-06-17T12:50:33.357Z cpu4:2099723)VSCSI: 3810: handle 8215(vscsi0:0):Creating Virtual Device for world 2099687 (FSS handle 1706919) numBlocks=62914560 (bs=512)

                2019-06-17T12:50:33.357Z cpu4:2099723)VSCSI: 273: handle 8215(vscsi0:0):Input values: res=0 limit=-2 bw=-1 Shares=1000

                2019-06-17T12:50:33.357Z cpu4:2099723)Vmxnet3: 18569: indLROPktToGuest: 1, vcd->umkShared->vrrsSelected: 3 port 0x200000d

                2019-06-17T12:50:33.357Z cpu4:2099723)Vmxnet3: 18810: Using default queue delivery for vmxnet3 for port 0x200000d

                • 5. Re: dead I/O on igb-nic (ESXi 6.7)
                  Sureshkumar M Expert
                  vExpert

                  Sorry for the late response.

                   

The log above doesn't give much information on why the NIC went down. We would have to enable debug logging for the driver to find out what made the NIC go down at that time. However, if we identify that the issue is caused by the driver, we can't do much apart from updating the driver/firmware, which you have already done; only the NIC vendor can help us.
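
If you want to try the debug route: the available knobs differ per driver, but you can check what the igb module exposes and set one like this. The "debug=16" below is only an example; use a parameter that the list output actually shows, and reload the module or reboot afterwards.

esxcli system module parameters list -m igb
esxcli system module parameters set -m igb -p "debug=16"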

                   

Or, if you see no issues with ne1000, you may use that driver instead of igb.

                  • 6. Re: dead I/O on igb-nic (ESXi 6.7)
                    nague Lurker

Exact same behavior here with ESXi 6.5 U3 and an Intel 82576 NIC. Everything was fine on ESXi 6.5 U2.

I've updated the igb driver from 5.0.5 to 5.2.5 (the last officially supported version). Let's say it's a "little" better: it now takes two weeks (instead of 2 days) before the NIC stops passing traffic. Plugging the Ethernet cable out/in, or remotely downing/upping the port on the switch, solves the issue.

                     

Did you find any solution to this issue? Is using the ne1000 driver with this NIC possible? How do you switch drivers?

                    • 7. Re: dead I/O on igb-nic (ESXi 6.7)
                      monderick Enthusiast

We're having the same random issue with Intel 82576 Gigabit Network Connection QP NICs on our vSphere 6.5 hosts. We opened a support ticket, and of course the suggestion is to upgrade to the 5.2.5 driver. We're going to proceed, but this thread doesn't make me confident.

                      • 8. Re: dead I/O on igb-nic (ESXi 6.7)
                        PeterCr Novice
                        vExpert

Have the same problem here under load, for example 2-3 hours into backups over the NICs.

                        Two different servers, tried both the inbox and 5.2.5 versions of the driver.

If the system is stable I can recover via the CLI by running "esxcli network nic down -n vmnic0" and "esxcli network nic up -n vmnic0", which gets the NICs back online without a reboot.
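
As a stopgap, a small watchdog along these lines could automate that recovery. This is an untested sketch; it assumes the uplink carries a vmkernel interface you can vmkping through, and vmnic0 / 10.0.0.199 are placeholders for your NIC and a host reachable only via that uplink (e.g. an NFS server).

#!/bin/sh
# Bounce the NIC whenever a ping through it stops working.
NIC=vmnic0
TARGET=10.0.0.199
while true; do
  if ! vmkping -c 2 "$TARGET" > /dev/null 2>&1; then
    esxcli network nic down -n "$NIC"
    sleep 2
    esxcli network nic up -n "$NIC"
  fi
  sleep 60
done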

                        • 9. Re: dead I/O on igb-nic (ESXi 6.7)
                          bewe Expert

Your vmnic3 was "only" down for 3 seconds; still too long, and it should not happen.

But did you ignore the APD events before the NIC went down? It seems that you lost the storage connection to your NFS server.

                          • 10. Re: dead I/O on igb-nic (ESXi 6.7)
                            DataBitz Novice

Tried ESXi 6.7 with the older 4.2.16.8 driver, same result; also confirmed it's happening on ESXi 6.5 U3.

                            • 11. Re: dead I/O on igb-nic (ESXi 6.7)
                              HobbyStudent Lurker

This is exactly the same issue I have with one of my servers. It's a Supermicro X9DRH-7TF with the onboard 1 Gbit interfaces. Both are Intel 82576; one is used for management (vmnic2), the other (vmnic3) for the guests (1x CentOS, 2x Ubuntu 18.04 LTS, 4x Windows Server 2012R2/2019) with its own vSwitch.

                               

Everything was working with ESXi 6.7 build 13473784. The problem first occurred after installing ESXi 6.7 build 15160138.

                               

                              vmnic2  0000:02:00.0 igb         Up   1000Mbps   Full   <MAC address> 1500   Intel Corporation 82576 Gigabit Network Connection

                              vmnic3  0000:02:00.1 igb         Up   1000Mbps   Full   <MAC address> 1500   Intel Corporation 82576 Gigabit Network Connection

                               

The management network is always reachable, while the other one stops passing traffic when there is heavy traffic on it (e.g. backups). The logs don't show anything, and the link is always "Up".

                              First, some Linux VMs caused

                               

                              Vmxnet3: 24934: <Linux VM>,<MAC address>, portID(83886088): Hang detected,numHangQ: 1, enableGen: 183

so I changed all the Linux VMs to e1000e (see the .vmx snippet below). No "Hang" entries in the logs since... but the problem wasn't resolved; vmnic3 still stops passing traffic without any log entry.
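
(For anyone wondering how: one way is to set the adapter type in the VM's .vmx while it is powered off; the ethernet index varies per VM:)

ethernet0.virtualDev = "e1000e"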

                               

esxcli network nic down -n vmnic3
esxcli network nic up -n vmnic3

                               

immediately gets it passing traffic again.

                               

                              net-igb                        5.0.5.1.1-5vmw.670.0.0.8169922        VMW     VMwareCertified   2019-05-09

                               

As others have stated, a driver update doesn't seem to solve the issue. Is there anything I could try to resolve this? Perhaps some extended logging?
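
In the meantime, something like this might at least show whether the error/drop counters on the uplink jump when it hangs (vmnic3 being the affected NIC here):

esxcli network nic stats get -n vmnic3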

                               

                              Edit:

The issue occurs more or less randomly, but at least every 48-72 h.

                              • 12. Re: dead I/O on igb-nic (ESXi 6.7)
                                theoha Lurker

                                Hello all,

                                 

I have the same problem ...

I'm on ESXi 6.7.0 Update 3 (Build 14320388) and am also using one of the 82576 NICs. It is working for me, but with a latency of more than 500 ms ...

For the moment I have not found a solution ... but if you have any other information, I'm all ears ...

                                 

                                Best regards,

                                Theo

                                • 13. Re: dead I/O on igb-nic (ESXi 6.7)
                                  MattSnead Lurker

I've been having very similar issues since upgrading 3 of my hosts to the latest 6.5 (v15256549). The version I was running prior to this update was very old, as I had been slacking on updates; I don't even recall what it was, but I think it was a 6.5 build from around 12/2018. As I mentioned, I have 3 hosts (of similar vintage: older Dell M610 blade servers), and they've all got dual quad-port Intel 82576s. Unfortunately the upgrade process went completely fine and gave me no indication of a problem until the very last host. I was vMotioning VMs between the 3 of them the entire time and had no issues at all. After I completed the process and went to vMotion off the last host, that vMotion failed and all hell broke loose.

                                   

I use 4 of the uplinks in a static LAG (4 active uplinks with IP-hash teaming mode on the VMware side, and a static LAG on the switch side). This configuration has been in place for almost 9 years and has worked flawlessly. My findings are completely in line with what has been mentioned here: after a period of time, either some VMs or all VMs on a host stop passing traffic. Simply downing a NIC and bringing it back up brings them back online. When some hosts don't work, it's usually (maybe always?) the last NIC in the group that has the problem. Some hosts can ping hosts that other hosts can't, and vice versa. Using a VMware IP-hash calculator (https://techslaves.org/2014/02/25/vmware-ip-hash-algorithm-calculator/) you can see which vmnic the traffic would be sent over, and therefore which NIC is the one with the problem.
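
(If you'd rather not rely on the web calculator: as I understand it the selection is (SrcIP xor DstIP) mod uplink-count, computed on the hex form of the addresses, so a quick shell check like this, with my ping pair hard-coded, gives the 0-based uplink index in my 4-NIC team:)

ip2hex() { printf '%02x%02x%02x%02x' $(echo "$1" | tr '.' ' '); }
SRC=$(ip2hex 10.100.32.25)
DST=$(ip2hex 10.212.132.50)
echo $(( (0x$SRC ^ 0x$DST) % 4 ))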

                                   

I immediately opened a support call with VMware on this issue and have gotten very little help, #1 because my servers are technically only certified up to 6.0 U3, so it's very easy for them to just blame that. However, these servers had been running on 6.5 for at least 2 years with no problem, and I wasn't about to roll back to a version that is end-of-life in 2 months. One of the things we tried that seemed to work at first was destroying the vSwitch and recreating it. That actually worked for 3-4 days without an issue; at that point I had recreated the vSwitches on the other 2 hosts and started moving some production VMs to them. Then the problems started cropping up again on all three hosts. Every time I call back in to VMware they want to blame either my old servers or my physical switches, so I had to take matters into my own hands and do some real debugging.

                                   

Last weekend I spent many hours debugging it, and this is what I found... You can use pktcap-uw to capture packets at different points through the system; this document describes the different stages: Capture Points of the pktcap-uw Utility. Using a physical server that could not ping one of my VMs (while others could), I opened a continuous ping to the VM from that physical server. I could identify the packets coming in from the non-VMware portion of the network into the VMware-related switches and eventually reaching the host. The VM receives the ping packets and replies to them. I see the return packets exit the VM and enter the vSwitch, but they never leave the vSwitch and get put on the physical adapter. Here are the steps I used:

                                   

                                  ping source 10.212.132.50

                                  ping dest 10.100.32.25

                                  dest VM name: ghost

                                   

[root@vm2:~] esxcli network vm list
World ID  Name       Num Ports  Networks
                                  --------  ---------  ---------  --------------------------------
                                   2102595  ghost              1  Data Network A
                                   2102796  Server 2           1  Data Network B
                                   2102159  Server 3           1  Data Network C
                                   2101973  Server 4           1  Data Network A
                                   2101731  Server 5           1  Data Network B
                                  
                                  
                                  [root@vm2:~] esxcli network vm port list -w 2102595
                                     Port ID: 83886095
                                     vSwitch: vSwitch3
                                     Portgroup: Data Network A
                                     DVPort ID:
                                     MAC Address: 00:50:56:bc:2e:e9
                                     IP Address: 0.0.0.0
                                     Team Uplink: all(4)
                                     Uplink Port ID: 0
                                     Active Filters:
                                     
                                  [root@vm2:~] pktcap-uw --switchport 83886095 --capture PortInput --dstip 10.212.132.50 -o- |tcpdump-uw -enr -
                                  The switch port id is 0x0500000f.
                                  The session capture point is PortInput.
                                  The session filter destination IP address is 10.212.132.50.
                                  07:09:58.812762 00:50:56:bc:2e:e9 > b8:af:67:70:92:c6, ethertype IPv4 (0x0800), length 74: 10.100.32.25 > 10.212.132.50: ICMP echo reply, id 1, seq 3323, length 40
                                  07:10:03.813107 00:50:56:bc:2e:e9 > b8:af:67:70:92:c6, ethertype IPv4 (0x0800), length 74: 10.100.32.25 > 10.212.132.50: ICMP echo reply, id 1, seq 3324, length 40
                                  07:10:08.809467 00:50:56:bc:2e:e9 > b8:af:67:70:92:c6, ethertype IPv4 (0x0800), length 74: 10.100.32.25 > 10.212.132.50: ICMP echo reply, id 1, seq 3325, length 40
                                  07:10:13.808301 00:50:56:bc:2e:e9 > b8:af:67:70:92:c6, ethertype IPv4 (0x0800), length 74: 10.100.32.25 > 10.212.132.50: ICMP echo reply, id 1, seq 3326, length 40
                                  pktcap: Dumped 4 packet to file -, dropped 0 packets.
                                  pktcap: Done.
                                  
                                  [root@vm2:~] pktcap-uw --uplink vmnic9 --capture PortOutput --dstip 10.212.132.50 -o- |tcpdump-uw -enr -
                                  The name of the uplink is vmnic9.
                                  The session capture point is PortOutput.
                                  The session filter destination IP address is 10.212.132.50.
                                  07:11:23.808640 00:50:56:bc:2e:e9 > b8:af:67:70:92:c6, ethertype IPv4 (0x0800), length 74: 10.100.32.25 > 10.212.132.50: ICMP echo reply, id 1, seq 3340, length 40
                                  07:11:28.808943 00:50:56:bc:2e:e9 > b8:af:67:70:92:c6, ethertype IPv4 (0x0800), length 74: 10.100.32.25 > 10.212.132.50: ICMP echo reply, id 1, seq 3341, length 40
                                  07:11:33.810570 00:50:56:bc:2e:e9 > b8:af:67:70:92:c6, ethertype IPv4 (0x0800), length 74: 10.100.32.25 > 10.212.132.50: ICMP echo reply, id 1, seq 3342, length 40
                                  07:11:38.809677 00:50:56:bc:2e:e9 > b8:af:67:70:92:c6, ethertype IPv4 (0x0800), length 74: 10.100.32.25 > 10.212.132.50: ICMP echo reply, id 1, seq 3343, length 40
                                  pktcap: Dumped 4 packet to file -, dropped 0 packets.
                                  pktcap: Done.
                                  
                                  [root@vm2:~] pktcap-uw --uplink vmnic9 --capture UplinkSnd --dstip 10.212.132.50 -o- |tcpdump-uw -enr -
                                  The name of the uplink is vmnic9.
                                  The session capture point is UplinkSnd.
                                  The session filter destination IP address is 10.212.132.50.
                                  pktcap: Dumped 0 packet to file -, dropped 0 packets.
                                  pktcap: Done.
                                  

... so note that in the last packet capture there are no packets captured. PortInput in my first capture is basically the vSwitch receiving the packet from the VM. PortOutput is the packet leaving the vSwitch. UplinkSnd is the vSwitch putting the packet on the physical adapter. Note that I used "--switchport 83886095" for the first capture, which theoretically captures all packets from/to that VM's switch port. I used "--uplink vmnic9" on the other two commands because at that point you're dealing with the vSwitch itself, so you have to know (or find by trial and error) the vmnic.
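
(To cut down the trial and error, a loop over the team members like this, with my LAG members and ping destination hard-coded, shows which uplink, if any, is actually sending the flow; a dead uplink simply produces no output until you Ctrl-C:)

for nic in vmnic4 vmnic5 vmnic8 vmnic9; do
  echo "=== $nic ==="
  pktcap-uw --uplink $nic --capture UplinkSnd --dstip 10.212.132.50 -c 2 -o- | tcpdump-uw -enr -
done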

                                   

Here are similar tests that produce the same result, but using the "--stage" and "--dir" switches instead. My understanding is that the "--capture" points are basically alternates to using --dir and --stage.

                                   

                                  [root@vm2:~] pktcap-uw --uplink vmnic9 --dir 0 --stage 0 -o- |tcpdump-uw -enr -|grep 10.212
                                  The name of the uplink is vmnic9.
                                  The Stage is Pre.
                                  pktcap: The output file is -.
                                  pktcap: No server port specifed, select 40524 as the port.
                                  pktcap: Local CID 2.
                                  pktcap: Listen on port 40524.
                                  pktcap: Accept...
                                  pktcap: Vsock connection from port 1152 cid 2.
                                  reading from file -, link-type EN10MB (Ethernet)
                                  07:30:38.809784 b8:af:67:70:92:c6 > 00:50:56:bc:2e:e9, ethertype IPv4 (0x0800), length 74: 10.212.132.50 > 10.100.32.25: ICMP echo request, id 1, seq 3571, length 40
                                  07:30:43.808875 b8:af:67:70:92:c6 > 00:50:56:bc:2e:e9, ethertype IPv4 (0x0800), length 74: 10.212.132.50 > 10.100.32.25: ICMP echo request, id 1, seq 3572, length 40
                                  tcpdump-uw: pcap_loop: error reading dump file: Interrupted system call
                                  pktcap: Join with dump thread failed.
                                  pktcap: Destroying session 128.
                                  pktcap:
                                  pktcap: Dumped 130 packet to file -, dropped 0 packets.
                                  pktcap: Done.
                                  
                                  [root@vm2:~] pktcap-uw --uplink vmnic9 --dir 0 --stage 1 -o- |tcpdump-uw -enr -|grep 10.212
                                  The name of the uplink is vmnic9.
                                  The Stage is Post.
                                  pktcap: The output file is -.
                                  pktcap: No server port specifed, select 40537 as the port.
                                  pktcap: Local CID 2.
                                  pktcap: Listen on port 40537.
                                  pktcap: Accept...
                                  reading from file -, link-type EN10MB (Ethernet)
                                  pktcap: Vsock connection from port 1153 cid 2.
                                  07:30:53.810564 b8:af:67:70:92:c6 > 00:50:56:bc:2e:e9, ethertype IPv4 (0x0800), length 74: 10.212.132.50 > 10.100.32.25: ICMP echo request, id 1, seq 3574, length 40
                                  07:30:58.812753 b8:af:67:70:92:c6 > 00:50:56:bc:2e:e9, ethertype IPv4 (0x0800), length 74: 10.212.132.50 > 10.100.32.25: ICMP echo request, id 1, seq 3575, length 40
                                  tcpdump-uw: pcap_loop: error reading dump file: Interrupted system call
                                  pktcap: Join with dump thread failed.
                                  pktcap: Destroying session 129.
                                  pktcap:
                                  pktcap: Dumped 91 packet to file -, dropped 0 packets.
                                  pktcap: Done.
                                  
                                  [root@vm2:~] pktcap-uw --uplink vmnic9 --dir 1 --stage 0 -o- |tcpdump-uw -enr -|grep 10.212
                                  The name of the uplink is vmnic9.
                                  The Stage is Pre.
                                  pktcap: The output file is -.
                                  pktcap: No server port specifed, select 40550 as the port.
                                  pktcap: Local CID 2.
                                  pktcap: Listen on port 40550.
                                  reading from file -, link-type EN10MB (Ethernet)
                                  pktcap: Accept...
                                  pktcap: Vsock connection from port 1154 cid 2.
                                  07:31:13.811837 00:50:56:bc:2e:e9 > b8:af:67:70:92:c6, ethertype IPv4 (0x0800), length 74: 10.100.32.25 > 10.212.132.50: ICMP echo reply, id 1, seq 3578, length 40
                                  07:31:18.813389 00:50:56:bc:2e:e9 > b8:af:67:70:92:c6, ethertype IPv4 (0x0800), length 74: 10.100.32.25 > 10.212.132.50: ICMP echo reply, id 1, seq 3579, length 40
                                  07:31:23.811731 00:50:56:bc:2e:e9 > b8:af:67:70:92:c6, ethertype IPv4 (0x0800), length 74: 10.100.32.25 > 10.212.132.50: ICMP echo reply, id 1, seq 3580, length 40
                                  tcpdump-uw: pcap_loop: error reading dump file: Interrupted system call
                                  pktcap: Join with dump thread failed.
                                  pktcap: Destroying session 130.
                                  pktcap:
                                  pktcap: Dumped 8 packet to file -, dropped 0 packets.
                                  pktcap: Done.
                                  
                                  [root@vm2:~] pktcap-uw --uplink vmnic9 --dir 1 --stage 1 -o- |tcpdump-uw -enr -|grep 10.212
                                  The name of the uplink is vmnic9.
                                  The Stage is Post.
                                  pktcap: The output file is -.
                                  pktcap: No server port specifed, select 40560 as the port.
                                  pktcap: Local CID 2.
                                  pktcap: Listen on port 40560.
                                  reading from file -, link-type EN10MB (Ethernet)
                                  pktcap: Accept...
                                  pktcap: Vsock connection from port 1155 cid 2.
                                  pktcap: Join with dump thread failed.
                                  tcpdump-uw: pcap_loop: error reading dump file: Interrupted system call
                                  pktcap: Destroying session 131.
                                  pktcap:
                                  pktcap: Dumped 0 packet to file -, dropped 0 packets.
                                  pktcap: Done.
                                  

                                   

So in this test the order is dir 0/stage 0, then dir 0/stage 1, then dir 1/stage 0, and finally dir 1/stage 1, which is where it fails. Again, these are the same tests, just different variations of the commands.

                                   

Then I found the "--trace" switch for the command, and this further proves my findings. Here is a successful trace of an ICMP packet:

                                   

                                  [root@vm2:~] pktcap-uw --trace --ip 10.100.1.5
                                  The trace session is enabled.
                                  The session filter IP(src or dst) address is 10.100.1.5.
                                  No server port specifed, select 56026 as the port.
                                  Output the packet info to console.
                                  Local CID 2.
                                  Listen on port 56026.
                                  Accept...
                                  Vsock connection from port 1207 cid 2.
                                  18:39:04.106975[1] Captured at PktFree point, TSO not enabled, Checksum not offloaded and not verified, length 74.
                                          PATH:
                                            +- [18:39:04.106955] |                        UplinkRcv |            |
                                            +- [18:39:04.106957] |                  UplinkRcvKernel |            |
                                            +- [18:39:04.106958] |                        PortInput |   83886086 |
                                            +- [18:39:04.106958] |                          IOChain |            | UplinkDoSwLRO@vmkernel#nover
                                            +- [18:39:04.106959] |               EtherswitchDispath |   83886086 |
                                            +- [18:39:04.106961] |                EtherswitchOutput |   83886095 |
                                            +- [18:39:04.106961] |                       PortOutput |   83886095 |
                                            +- [18:39:04.106962] |                          IOChain |            | VLAN_OutputProcessor@com.vmware.vswitch#1.0.0
                                            +- [18:39:04.106963] |                          IOChain |            | VSwitchDisablePT@com.vmware.vswitch#1.0.0
                                            +- [18:39:04.106968] |                           VnicRx |   83886095 |
                                            +- [18:39:04.106974] |                          PktFree |            |
                                  
                                  18:39:04.107284[2] Captured at PktFree point, TSO not enabled, Checksum not offloaded and not verified, VLAN tag 101, length 74.
                                          PATH:
                                            +- [18:39:04.107187] |                           VnicTx |   83886095 |
                                            +- [18:39:04.107189] |                        PortInput |   83886095 |
                                            +- [18:39:04.107190] |                          IOChain |            | VLAN_InputProcessor@com.vmware.vswitch#1.0.0
                                            +- [18:39:04.107192] |               EtherswitchDispath |   83886095 |
                                            +- [18:39:04.107195] |                       PortOutput |   83886084 |
                                            +- [18:39:04.107195] |                          IOChain |            | UplinkGenericOffload@vmkernel#nover
                                            +- [18:39:04.107196] |                          IOChain |            | UplinkTSO6ExtHdrs@vmkernel#nover
                                            +- [18:39:04.107197] |                          IOChain |            | UplinkCSum6ExtHdrs@vmkernel#nover
                                            +- [18:39:04.107197] |                          IOChain |            | Uplink_BuildWritableInetHeaders@vmkernel#nover
                                            +- [18:39:04.107198] |                          IOChain |            | NetSchedInput@vmkernel#nover
                                            +- [18:39:04.107200] |                  UplinkSndKernel |            |
                                            +- [18:39:04.107201] |                        UplinkSnd |            |
                                            +- [18:39:04.107281] |                          PktFree |            |
                                  

                                   

...The first packet is the ICMP being received, and the second is the reply. Note the UplinkSndKernel and UplinkSnd entries just before the PktFree.

                                   

                                  Now here is an UNsuccessful ICMP trace:

                                   

                                  [root@vm2:~] pktcap-uw --trace --ip 10.100.0.54
                                  The trace session is enabled.
                                  The session filter IP(src or dst) address is 10.100.0.54.
                                  No server port specifed, select 55910 as the port.
                                  Output the packet info to console.
                                  Local CID 2.
                                  Listen on port 55910.
                                  Accept...
                                  Vsock connection from port 1205 cid 2.
                                  18:34:21.838652[1] Captured at PktFree point, TSO not enabled, Checksum not offloaded and not verified, length 74.
                                          PATH:
                                            +- [18:34:21.838622] |                        UplinkRcv |            |
                                            +- [18:34:21.838626] |                  UplinkRcvKernel |            |
                                            +- [18:34:21.838627] |                        PortInput |   83886088 |
                                            +- [18:34:21.838628] |                          IOChain |            | UplinkDoSwLRO@vmkernel#nover
                                            +- [18:34:21.838630] |               EtherswitchDispath |   83886088 |
                                            +- [18:34:21.838633] |                EtherswitchOutput |   83886095 |
                                            +- [18:34:21.838634] |                       PortOutput |   83886095 |
                                            +- [18:34:21.838634] |                          IOChain |            | VLAN_OutputProcessor@com.vmware.vswitch#1.0.0
                                            +- [18:34:21.838636] |                          IOChain |            | VSwitchDisablePT@com.vmware.vswitch#1.0.0
                                            +- [18:34:21.838642] |                           VnicRx |   83886095 |
                                            +- [18:34:21.838651] |                          PktFree |            |
                                  
                                  18:34:21.838900[2] Captured at PktFree point, TSO not enabled, Checksum not offloaded and not verified, VLAN tag 101, length 74.
                                          PATH:
                                            +- [18:34:21.838878] |                           VnicTx |   83886095 |
                                            +- [18:34:21.838881] |                        PortInput |   83886095 |
                                            +- [18:34:21.838882] |                          IOChain |            | VLAN_InputProcessor@com.vmware.vswitch#1.0.0
                                            +- [18:34:21.838885] |               EtherswitchDispath |   83886095 |
                                            +- [18:34:21.838888] |                       PortOutput |   83886088 |
                                            +- [18:34:21.838889] |                          IOChain |            | UplinkGenericOffload@vmkernel#nover
                                            +- [18:34:21.838890] |                          IOChain |            | UplinkTSO6ExtHdrs@vmkernel#nover
                                            +- [18:34:21.838891] |                          IOChain |            | UplinkCSum6ExtHdrs@vmkernel#nover
                                            +- [18:34:21.838891] |                          IOChain |            | Uplink_BuildWritableInetHeaders@vmkernel#nover
                                            +- [18:34:21.838892] |                          IOChain |            | NetSchedInput@vmkernel#nover
                                            +- [18:34:21.838899] |                          PktFree |            |
                                  

                                   

                                  ... Note the lack of UplinkSndKernel and UplinkSnd before the PktFree.  The packet never gets put on the line.

                                   

I didn't mention it earlier, but just to put it out there: before doing all of this last weekend, I did a fresh install of the latest ESXi 6.7 just to see if something had lingered from previous upgrades. It didn't help, obviously, because here I am. All of the tests and captures above are from a fresh install of the latest 6.7 as of last weekend. I figured I would try this before forcing myself into installing a soon-to-be-obsolete 6.0 version.

                                   

Now, the interesting thing was that the igb driver that ships with 6.7 was version 5.0.5, I believe. Searching around for an update, I found this: Download VMware vSphere, which is version 5.3.3 of the driver. It is also listed as compatible with 6.7 on another page, though that link doesn't say so. After upgrading the driver to v5.3.3 I haven't had a problem... yet. However, after finding and reading this thread I am not so confident that I have the problem solved. After all, I mentioned earlier that simply recreating the vSwitch had lasted 3-4 days before, and I'm on day 3 now. I also only have 3 non-production VMs on the host right now. I am glad to see I am not the only one having this problem; maybe if enough of us bark up the right trees we can get some sort of solution. I know I need to replace my aging servers, and I plan to do so about a year from now, but they all worked fine and suited our needs before this.
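
(A quick way to double-check that the new driver actually took after the reboot is to look at what the NIC itself reports; the Driver Info section should now show 5.3.3. vmnic9 here is one of my LAG members:)

esxcli network nic get -n vmnic9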

                                  • 14. Re: dead I/O on igb-nic (ESXi 6.7)
                                    MattSnead Lurker

.... AAANNNDDD it just happened to me again. And this time it happened on the second uplink (vmnic4, 5, 8, and 9 make up the LAG, and vmnic5 faulted).
