VMware Cloud Community
BaumMeister
Contributor

dead I/O on igb-nic (ESXi 6.7)

Hi,

I'm running a homelab with ESXi 6.7 (build 13006603). My host has three NICs: two are onboard and one is an Intel ET 82576 dual-port PCIe card. All NICs are assigned to the same vSwitch, but currently only one is connected to the (physical) switch.

When I use one of the 82576 NICs and put heavy load on it (for example, backing up VMs via Nakivo B&R), the NIC stops working after a while and no longer responds. Only a reboot of the host or (much easier) physically reconnecting the NIC (cable out, cable in) solves the problem.

I suspected a driver issue, so I updated to the latest driver from Intel:

[root@esxi:~] /usr/sbin/esxcfg-nics -l

Name    PCI          Driver      Link Speed      Duplex MAC Address       MTU    Description

vmnic0  0000:04:00.0 ne1000      Down 0Mbps      Half   00:25:90:a7:65:dc 1500   Intel Corporation 82574L Gigabit Network Connection

vmnic1  0000:00:19.0 ne1000      Up   1000Mbps   Full   00:25:90:a7:65:dd 1500   Intel Corporation 82579LM Gigabit Network Connection

vmnic2  0000:01:00.0 igb         Down 0Mbps      Half   90:e2:ba:1e:4d:c6 1500   Intel Corporation 82576 Gigabit Network Connection

vmnic3  0000:01:00.1 igb         Down 0Mbps      Half   90:e2:ba:1e:4d:c7 1500   Intel Corporation 82576 Gigabit Network Connection

[root@esxi:~] esxcli software vib list|grep igb

net-igb                        5.2.5-1OEM.550.0.0.1331820            Intel   VMwareCertified   2019-06-16

igbn                           0.1.1.0-4vmw.670.2.48.13006603        VMW     VMwareCertified   2019-06-07

Unfortunately, this didn't solve the problem.

However, this behaviour doesn't occur when I use one of the NICs that run on the ne1000 driver.
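To make the driver split easier to see, here is a minimal Python sketch that groups the vmnics from the `esxcfg-nics -l` listing above by their driver column. The function name `nics_by_driver` and the whitespace-based parsing are assumptions made for illustration; the sample text is copied from the listing in this post.

```python
# Sample copied from the `esxcfg-nics -l` output earlier in this post.
SAMPLE = """\
Name    PCI          Driver      Link Speed      Duplex MAC Address       MTU    Description
vmnic0  0000:04:00.0 ne1000      Down 0Mbps      Half   00:25:90:a7:65:dc 1500   Intel Corporation 82574L Gigabit Network Connection
vmnic1  0000:00:19.0 ne1000      Up   1000Mbps   Full   00:25:90:a7:65:dd 1500   Intel Corporation 82579LM Gigabit Network Connection
vmnic2  0000:01:00.0 igb         Down 0Mbps      Half   90:e2:ba:1e:4d:c6 1500   Intel Corporation 82576 Gigabit Network Connection
vmnic3  0000:01:00.1 igb         Down 0Mbps      Half   90:e2:ba:1e:4d:c7 1500   Intel Corporation 82576 Gigabit Network Connection
"""


def nics_by_driver(listing):
    """Group (vmnic name, link state) pairs by the driver column.

    Hypothetical helper: it assumes the first four columns are
    Name, PCI, Driver and Link, separated by whitespace.
    """
    groups = {}
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines and the header row; data rows start with "vmnic".
        if len(fields) < 4 or not fields[0].startswith("vmnic"):
            continue
        name, driver, link = fields[0], fields[2], fields[3]
        groups.setdefault(driver, []).append((name, link))
    return groups


if __name__ == "__main__":
    for driver, nics in sorted(nics_by_driver(SAMPLE).items()):
        print(driver, nics)
```

With this listing, both ports of the 82576 card (vmnic2, vmnic3) end up under igb, and the two onboard NICs under ne1000.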

Any idea how to solve the issue?

(... or at least dig down to its root?)

Thanks a lot in advance.

Regards

Chris

PS: I found another thread that might be related to my problem: "Stopping I/O on vmnic0". Same system behaviour, same driver.

27 Replies
SureshKumarMuth
Commander

What does vmkernel.log say? Can you post the vmkernel logs here?

Regards,
Suresh
https://vconnectit.wordpress.com/

anvanster
Enthusiast

The igb driver 5.2.5 that you are using was released in 2014 and is quite old.

Unfortunately, your card is not supported by the newer "igbn" drivers.

BaumMeister
Contributor

You're right that the newer igbn driver no longer supports the NIC.

However, the NIC and the driver I'm using are on VMware's HCL:

VMware Compatibility Guide - I/O Device Search

BaumMeister
Contributor

Sure.

Here's the log output for the relevant time slot.

I marked the line that shows when the 82576 NIC (vmnic3) went down. vmnic1 is running with the ne1000 driver.

2019-06-17T12:20:44.190Z cpu4:2097707)DVFilter: 5964: Checking disconnected filters for timeouts

2019-06-17T12:23:04.707Z cpu3:2097182)vmw_ahci[0000001f]: AHCI_EdgeIntrHandler:new interrupts coming, IS= 0x2, no repeat

2019-06-17T12:30:44.190Z cpu0:2097707)DVFilter: 5964: Checking disconnected filters for timeouts

2019-06-17T12:35:42.190Z cpu0:2098034)StorageApdHandler: 1203: APD start for 0x430c44ee76d0 [3a5eb32c-7141e730]

2019-06-17T12:35:42.190Z cpu0:2098034)StorageApdHandler: 1203: APD start for 0x430c44ee95d0 [a16fe90b-d7095fcc]

2019-06-17T12:35:42.190Z cpu3:2097369)StorageApdHandler: 419: APD start event for 0x430c44ee76d0 [3a5eb32c-7141e730]

2019-06-17T12:35:42.190Z cpu0:2098034)StorageApdHandler: 1203: APD start for 0x430c44eeb4c0 [37c6519b-ec9783e7]

2019-06-17T12:35:42.190Z cpu3:2097369)StorageApdHandlerEv: 110: Device or filesystem with identifier [3a5eb32c-7141e730] has entered the All Paths Down state.

2019-06-17T12:35:42.190Z cpu3:2097369)StorageApdHandler: 419: APD start event for 0x430c44ee95d0 [a16fe90b-d7095fcc]

2019-06-17T12:35:42.190Z cpu3:2097369)StorageApdHandlerEv: 110: Device or filesystem with identifier [a16fe90b-d7095fcc] has entered the All Paths Down state.

2019-06-17T12:35:42.190Z cpu3:2097369)StorageApdHandler: 419: APD start event for 0x430c44eeb4c0 [37c6519b-ec9783e7]

2019-06-17T12:35:42.190Z cpu3:2097369)StorageApdHandlerEv: 110: Device or filesystem with identifier [37c6519b-ec9783e7] has entered the All Paths Down state.

2019-06-17T12:37:06.190Z cpu7:2098034)WARNING: NFS: 337: Lost connection to the server 10.0.0.199 mount point /volume1/VMs, mounted as 3a5eb32c-7141e730-0000-000000000000 ("VMs@Fuchur")

2019-06-17T12:37:06.190Z cpu7:2098034)WARNING: NFS: 337: Lost connection to the server 10.0.0.199 mount point /volume1/VM_Backups/, mounted as a16fe90b-d7095fcc-0000-000000000000 ("VM_Backups@Fuchur")

2019-06-17T12:37:06.190Z cpu7:2098034)WARNING: NFS: 337: Lost connection to the server 10.0.0.199 mount point /volume1/Media, mounted as 37c6519b-ec9783e7-0000-000000000000 ("Media@Fuchur")

2019-06-17T12:38:02.191Z cpu0:2097369)StorageApdHandler: 609: APD timeout event for 0x430c44ee76d0 [3a5eb32c-7141e730]

2019-06-17T12:38:02.191Z cpu0:2097369)StorageApdHandlerEv: 126: Device or filesystem with identifier [3a5eb32c-7141e730] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.

2019-06-17T12:38:02.191Z cpu0:2097369)StorageApdHandler: 609: APD timeout event for 0x430c44ee95d0 [a16fe90b-d7095fcc]

2019-06-17T12:38:02.191Z cpu0:2097369)StorageApdHandlerEv: 126: Device or filesystem with identifier [a16fe90b-d7095fcc] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.

2019-06-17T12:38:02.191Z cpu0:2097369)StorageApdHandler: 609: APD timeout event for 0x430c44eeb4c0 [37c6519b-ec9783e7]

2019-06-17T12:38:02.191Z cpu0:2097369)StorageApdHandlerEv: 126: Device or filesystem with identifier [37c6519b-ec9783e7] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.

2019-06-17T12:40:44.190Z cpu0:2097707)DVFilter: 5964: Checking disconnected filters for timeouts

2019-06-17T12:45:39.351Z cpu3:2097615)<6>igb: vmnic3 NIC Link is Down

2019-06-17T12:45:42.732Z cpu7:2097615)<6>igb: vmnic3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None

2019-06-17T12:45:43.190Z cpu4:2097220)NetqueueBal: 5032: vmnic3: device Up notification, reset logical space needed

2019-06-17T12:45:43.190Z cpu4:2097220)NetPort: 1580: disabled port 0x2000004

2019-06-17T12:45:43.190Z cpu2:2097770)NetSched: 654: vmnic3-0-tx: worldID = 2097770 exits

2019-06-17T12:45:43.190Z cpu4:2097220)Uplink: 11689: enabled port 0x2000004 with mac 90:e2:ba:1e:4d:c7

2019-06-17T12:45:43.190Z cpu4:2097220)NetPort: 1580: disabled port 0x2000004

2019-06-17T12:45:43.190Z cpu4:2097220)Uplink: 11689: enabled port 0x2000004 with mac 90:e2:ba:1e:4d:c7

2019-06-17T12:45:43.191Z cpu5:2097296)CpuSched: 699: user latency of 2102301 vmnic3-0-tx 0 changed by 2097296 NetSchedHelper -6

2019-06-17T12:45:43.191Z cpu2:2102301)NetSched: 654: vmnic3-0-tx: worldID = 2102301 exits

2019-06-17T12:45:43.191Z cpu5:2097296)CpuSched: 699: user latency of 2102302 vmnic3-0-tx 0 changed by 2097296 NetSchedHelper -6

2019-06-17T12:45:48.941Z cpu3:2098034)NFS: 346: Restored connection to the server 10.0.0.199 mount point /volume1/Media, mounted as 37c6519b-ec9783e7-0000-000000000000 ("Media@Fuchur")

2019-06-17T12:45:48.941Z cpu4:2097369)StorageApdHandler: 507: APD exit event for 0x430c44eeb4c0 [37c6519b-ec9783e7]

2019-06-17T12:45:48.941Z cpu3:2098034)NFS: 346: Restored connection to the server 10.0.0.199 mount point /volume1/VMs, mounted as 3a5eb32c-7141e730-0000-000000000000 ("VMs@Fuchur")

2019-06-17T12:45:48.941Z cpu4:2097369)StorageApdHandlerEv: 117: Device or filesystem with identifier [37c6519b-ec9783e7] has exited the All Paths Down state.

2019-06-17T12:45:48.941Z cpu4:2097369)StorageApdHandler: 507: APD exit event for 0x430c44ee76d0 [3a5eb32c-7141e730]

2019-06-17T12:45:48.941Z cpu4:2097369)StorageApdHandlerEv: 117: Device or filesystem with identifier [3a5eb32c-7141e730] has exited the All Paths Down state.

2019-06-17T12:45:49.613Z cpu3:2098034)NFS: 346: Restored connection to the server 10.0.0.199 mount point /volume1/VM_Backups/, mounted as a16fe90b-d7095fcc-0000-000000000000 ("VM_Backups@Fuchur")

2019-06-17T12:45:49.613Z cpu4:2097369)StorageApdHandler: 507: APD exit event for 0x430c44ee95d0 [a16fe90b-d7095fcc]

2019-06-17T12:45:49.613Z cpu4:2097369)StorageApdHandlerEv: 117: Device or filesystem with identifier [a16fe90b-d7095fcc] has exited the All Paths Down state.

2019-06-17T12:49:19.476Z cpu3:2097615)<6>igb: vmnic3 NIC Link is Down

2019-06-17T12:49:29.190Z cpu6:2098637 opID=f97c863c)World: 11943: VC opID sps-Main-767271-893-94-37-bba6 maps to vmkernel opID f97c863c

2019-06-17T12:49:29.190Z cpu6:2098637 opID=f97c863c)SunRPC: 3303: Synchronous RPC abort for client 0x4304520bfb90 IP 10.0.0.199.8.1 proc 1 xid 0x76d7dd9e attempt 1 of 3

2019-06-17T12:49:39.190Z cpu6:2098637 opID=f97c863c)SunRPC: 3303: Synchronous RPC abort for client 0x4304520bfb90 IP 10.0.0.199.8.1 proc 1 xid 0x76d7dda2 attempt 2 of 3

2019-06-17T12:49:49.190Z cpu6:2098637 opID=f97c863c)SunRPC: 3303: Synchronous RPC abort for client 0x4304520bfb90 IP 10.0.0.199.8.1 proc 1 xid 0x76d7dda6 attempt 3 of 3

2019-06-17T12:49:49.190Z cpu6:2098637 opID=f97c863c)WARNING: NFS: 2335: Failed to get attributes (I/O error)

2019-06-17T12:49:49.190Z cpu6:2098637 opID=f97c863c)NFS: 2444: [Repeated 1 times] Failed to get object (0x451a1b49b3ce) 36 3a5eb32c 7141e730 70001 686a001 0 829c3d42 976c7782 0 0 0 0 0 :No connection

2019-06-17T12:49:49.190Z cpu6:2098637 opID=f97c863c)NFS: 2449: Failed to get object (0x451a1751b16e) 36 37c6519b ec9783e7 70001 48001 0 829c3d42 976c7782 0 0 0 0 0 :I/O error

2019-06-17T12:49:51.673Z cpu5:2099927)DEBUG (ne1000): checking link for adapter vmnic1

2019-06-17T12:49:52.679Z cpu3:2097566)INFO (ne1000): vmnic1: Link is Up

2019-06-17T12:49:52.679Z cpu3:2097566)DEBUG (ne1000): Reporting uplink 0x43044d090250 status

2019-06-17T12:49:53.190Z cpu3:2097220)NetqueueBal: 4967: vmnic1: new netq module, reset logical space needed

2019-06-17T12:49:53.190Z cpu3:2097220)NetqueueBal: 4996: vmnic1: plugins to call differs, reset logical space

2019-06-17T12:49:53.190Z cpu3:2097220)NetqueueBal: 5032: vmnic1: device Up notification, reset logical space needed

2019-06-17T12:49:53.190Z cpu3:2097220)Uplink: 537: Driver claims supporting 0 RX queues, and 0 queues are accepted.

2019-06-17T12:49:53.190Z cpu3:2097220)Uplink: 533: Driver claims supporting 0 TX queues, and 0 queues are accepted.

2019-06-17T12:49:53.190Z cpu3:2097220)NetPort: 1580: disabled port 0x2000008

2019-06-17T12:49:53.190Z cpu1:2097761)NetSched: 654: vmnic1-0-tx: worldID = 2097761 exits

2019-06-17T12:49:53.190Z cpu3:2097220)Uplink: 11689: enabled port 0x2000008 with mac 00:25:90:a7:65:dd

2019-06-17T12:49:53.190Z cpu5:2097296)CpuSched: 699: user latency of 2102444 vmnic1-0-tx 0 changed by 2097296 NetSchedHelper -6

2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Disabled 'Capable To Xmit Scatter-Gathered Data'

2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Disabled 'Capable To Offload Checksum for IPv4'

2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Disabled 'Capable To Offload TCP Segmentation for IPv4'

2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Disabled 'Capable To Insert VLAN Tag'

2019-06-17T12:49:53.190Z cpu3:2097220)DEBUG (ne1000): writing uplink config

2019-06-17T12:49:53.190Z cpu3:2097220)DEBUG (ne1000): writing adapter config

2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Disabled 'Capable To Strip VLAN Tag'

2019-06-17T12:49:53.190Z cpu3:2097220)DEBUG (ne1000): writing uplink config

2019-06-17T12:49:53.190Z cpu3:2097220)DEBUG (ne1000): writing adapter config

2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Disabled 'Capable To Xmit Scatter-Gathered Across Multiple Pages'

2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Disabled 'Capable To Offload Checksum for IPv6'

2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Disabled 'Capable To Offload TCP Segmentation for IPv6'

2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Enabled 'Capable To Xmit Scatter-Gathered Data'

2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Enabled 'Capable To Offload Checksum for IPv4'

2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Enabled 'Capable To Offload TCP Segmentation for IPv4'

2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Enabled 'Capable To Insert VLAN Tag'

2019-06-17T12:49:53.190Z cpu3:2097220)DEBUG (ne1000): writing uplink config

2019-06-17T12:49:53.190Z cpu3:2097220)DEBUG (ne1000): writing adapter config

2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Enabled 'Capable To Strip VLAN Tag'

2019-06-17T12:49:53.190Z cpu3:2097220)DEBUG (ne1000): writing uplink config

2019-06-17T12:49:53.190Z cpu3:2097220)DEBUG (ne1000): writing adapter config

2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Enabled 'Capable To Xmit Scatter-Gathered Across Multiple Pages'

2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Enabled 'Capable To Offload Checksum for IPv6'

2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Enabled 'Capable To Offload TCP Segmentation for IPv6'

2019-06-17T12:49:53.190Z cpu3:2097220)INFO (ne1000): vmnic1: Disabled 'Driver Requires No Packet Scheduling'

2019-06-17T12:49:54.190Z cpu6:2098034)StorageApdHandler: 1203: APD start for 0x430c44ee76d0 [3a5eb32c-7141e730]

2019-06-17T12:49:54.190Z cpu6:2098034)StorageApdHandler: 1203: APD start for 0x430c44ee95d0 [a16fe90b-d7095fcc]

2019-06-17T12:49:54.190Z cpu6:2098034)StorageApdHandler: 1203: APD start for 0x430c44eeb4c0 [37c6519b-ec9783e7]

2019-06-17T12:49:54.190Z cpu4:2097369)StorageApdHandler: 419: APD start event for 0x430c44ee76d0 [3a5eb32c-7141e730]

2019-06-17T12:49:54.190Z cpu4:2097369)StorageApdHandlerEv: 110: Device or filesystem with identifier [3a5eb32c-7141e730] has entered the All Paths Down state.

2019-06-17T12:49:54.190Z cpu4:2097369)StorageApdHandler: 419: APD start event for 0x430c44ee95d0 [a16fe90b-d7095fcc]

2019-06-17T12:49:54.190Z cpu4:2097369)StorageApdHandlerEv: 110: Device or filesystem with identifier [a16fe90b-d7095fcc] has entered the All Paths Down state.

2019-06-17T12:49:54.190Z cpu4:2097369)StorageApdHandler: 419: APD start event for 0x430c44eeb4c0 [37c6519b-ec9783e7]

2019-06-17T12:49:54.190Z cpu4:2097369)StorageApdHandlerEv: 110: Device or filesystem with identifier [37c6519b-ec9783e7] has entered the All Paths Down state.

2019-06-17T12:50:00.969Z cpu2:2098034)StorageApdHandler: 1315: APD exit for 0x430c44eeb4c0 [37c6519b-ec9783e7]

2019-06-17T12:50:00.969Z cpu4:2097369)StorageApdHandler: 507: APD exit event for 0x430c44eeb4c0 [37c6519b-ec9783e7]

2019-06-17T12:50:00.969Z cpu2:2098034)StorageApdHandler: 1315: APD exit for 0x430c44ee76d0 [3a5eb32c-7141e730]

2019-06-17T12:50:00.969Z cpu4:2097369)StorageApdHandlerEv: 117: Device or filesystem with identifier [37c6519b-ec9783e7] has exited the All Paths Down state.

2019-06-17T12:50:00.969Z cpu2:2098034)StorageApdHandler: 1315: APD exit for 0x430c44ee95d0 [a16fe90b-d7095fcc]

2019-06-17T12:50:00.969Z cpu4:2097369)StorageApdHandler: 507: APD exit event for 0x430c44ee76d0 [3a5eb32c-7141e730]

2019-06-17T12:50:00.969Z cpu4:2097369)StorageApdHandlerEv: 117: Device or filesystem with identifier [3a5eb32c-7141e730] has exited the All Paths Down state.

2019-06-17T12:50:00.969Z cpu4:2097369)StorageApdHandler: 507: APD exit event for 0x430c44ee95d0 [a16fe90b-d7095fcc]

2019-06-17T12:50:00.969Z cpu4:2097369)StorageApdHandlerEv: 117: Device or filesystem with identifier [a16fe90b-d7095fcc] has exited the All Paths Down state.

2019-06-17T12:50:32.325Z cpu6:2099723)VSCSI: 6602: handle 8209(vscsi0:0):Destroying Device for world 2099687 (pendCom 0)

2019-06-17T12:50:32.327Z cpu3:2099715)VSCSI: 6602: handle 8208(vscsi0:0):Destroying Device for world 2099688 (pendCom 0)

2019-06-17T12:50:32.327Z cpu2:2099723)CBT: 723: Disconnecting the cbt device 2f0796-cbt with filehandle 3082134

2019-06-17T12:50:32.328Z cpu3:2099715)CBT: 723: Disconnecting the cbt device 31072d-cbt with filehandle 3213101

2019-06-17T12:50:32.342Z cpu1:2099723)CBT: 1352: Created device 41078e-cbt for cbt driver with filehandle 4261774

2019-06-17T12:50:32.342Z cpu3:2099715)CBT: 1352: Created device 320792-cbt for cbt driver with filehandle 3278738

2019-06-17T12:50:32.345Z cpu1:2099723)CBT: 1352: Created device 5107a4-cbt for cbt driver with filehandle 5310372

2019-06-17T12:50:32.346Z cpu1:2099723)CBT: 723: Disconnecting the cbt device 41078e-cbt with filehandle 4261774

2019-06-17T12:50:32.346Z cpu3:2099715)CBT: 1352: Created device 2807a7-cbt for cbt driver with filehandle 2623399

2019-06-17T12:50:32.346Z cpu3:2099715)CBT: 723: Disconnecting the cbt device 320792-cbt with filehandle 3278738

2019-06-17T12:50:32.346Z cpu1:2099723)CBT: 723: Disconnecting the cbt device 5107a4-cbt with filehandle 5310372

2019-06-17T12:50:32.346Z cpu3:2099715)CBT: 723: Disconnecting the cbt device 2807a7-cbt with filehandle 2623399

2019-06-17T12:50:32.347Z cpu3:2099715)CBT: 1352: Created device 2a07a7-cbt for cbt driver with filehandle 2754471

2019-06-17T12:50:32.348Z cpu1:2099723)CBT: 1352: Created device 5307a4-cbt for cbt driver with filehandle 5441444

2019-06-17T12:50:32.348Z cpu3:2099715)SVM: 5032: SkipZero 0, dstFsBlockSize -1, preallocateBlocks 0, vmfsOptimizations 0, useBitmapCopy 1, skipPlugGrain 1, destination disk grainSize 0

2019-06-17T12:50:32.349Z cpu3:2099715)SVM: 5126: SVM_MakeDev.5126: Creating device 2a07a7-3407aa-svmmirror: Success

2019-06-17T12:50:32.349Z cpu3:2099715)SVM: 5175: Created device 2a07a7-3407aa-svmmirror, primary 2a07a7, secondary 3407aa

2019-06-17T12:50:32.349Z cpu3:2099715)VSCSI: 3782: handle 8212(vscsi0:0):Using sync mode due to sparse disks

2019-06-17T12:50:32.349Z cpu3:2099715)VSCSI: 3810: handle 8212(vscsi0:0):Creating Virtual Device for world 2099688 (FSS handle 4327310) numBlocks=41943040 (bs=512)

2019-06-17T12:50:32.349Z cpu3:2099715)VSCSI: 273: handle 8212(vscsi0:0):Input values: res=0 limit=-2 bw=-1 Shares=1000

2019-06-17T12:50:32.349Z cpu3:2099715)Vmxnet3: 18569: indLROPktToGuest: 1, vcd->umkShared->vrrsSelected: 3 port 0x200000b

2019-06-17T12:50:32.349Z cpu3:2099715)Vmxnet3: 18810: Using default queue delivery for vmxnet3 for port 0x200000b

2019-06-17T12:50:32.349Z cpu1:2099723)SVM: 5032: SkipZero 0, dstFsBlockSize -1, preallocateBlocks 0, vmfsOptimizations 0, useBitmapCopy 1, skipPlugGrain 1, destination disk grainSize 0

2019-06-17T12:50:32.349Z cpu1:2099723)SVM: 5126: SVM_MakeDev.5126: Creating device 5307a4-3b07ad-svmmirror: Success

2019-06-17T12:50:32.349Z cpu1:2099723)SVM: 5175: Created device 5307a4-3b07ad-svmmirror, primary 5307a4, secondary 3b07ad

2019-06-17T12:50:32.349Z cpu1:2099723)VSCSI: 3782: handle 8213(vscsi0:0):Using sync mode due to sparse disks

2019-06-17T12:50:32.349Z cpu1:2099723)VSCSI: 3810: handle 8213(vscsi0:0):Creating Virtual Device for world 2099687 (FSS handle 3606440) numBlocks=62914560 (bs=512)

2019-06-17T12:50:32.349Z cpu1:2099723)VSCSI: 273: handle 8213(vscsi0:0):Input values: res=0 limit=-2 bw=-1 Shares=1000

2019-06-17T12:50:32.350Z cpu1:2099723)Vmxnet3: 18569: indLROPktToGuest: 1, vcd->umkShared->vrrsSelected: 3 port 0x200000d

2019-06-17T12:50:32.350Z cpu1:2099723)Vmxnet3: 18810: Using default queue delivery for vmxnet3 for port 0x200000d

2019-06-17T12:50:33.185Z cpu2:2102534)SVM: 2847: scsi0:0 Completed copy in 821 ms. vmmLeaderID = 2099688.

2019-06-17T12:50:33.223Z cpu0:2102533)SVM: 2847: scsi0:0 Completed copy in 858 ms. vmmLeaderID = 2099687.

2019-06-17T12:50:33.275Z cpu0:2099715)VSCSI: 6602: handle 8212(vscsi0:0):Destroying Device for world 2099688 (pendCom 0)

2019-06-17T12:50:33.276Z cpu0:2099715)SVM: 2548: SVM Mirrored mode IO stats for device: 2a07a7-3407aa-svmmirror

2019-06-17T12:50:33.276Z cpu0:2099715)SVM: 2552: Total # IOs mirrored: 0, Total # IOs sent only to source: 0, Total # IO deferred by lock: 0

2019-06-17T12:50:33.276Z cpu0:2099715)SVM: 2556: Deferred IO stats - Max: 0, Total: 0, Avg: 1 (msec)

2019-06-17T12:50:33.276Z cpu0:2099715)SVM: 2570: Destroyed device 2a07a7-3407aa-svmmirror

2019-06-17T12:50:33.281Z cpu3:2099723)VSCSI: 6602: handle 8213(vscsi0:0):Destroying Device for world 2099687 (pendCom 0)

2019-06-17T12:50:33.282Z cpu7:2099723)SVM: 2548: SVM Mirrored mode IO stats for device: 5307a4-3b07ad-svmmirror

2019-06-17T12:50:33.282Z cpu7:2099723)SVM: 2552: Total # IOs mirrored: 0, Total # IOs sent only to source: 0, Total # IO deferred by lock: 0

2019-06-17T12:50:33.282Z cpu7:2099723)SVM: 2556: Deferred IO stats - Max: 0, Total: 0, Avg: 1 (msec)

2019-06-17T12:50:33.282Z cpu7:2099723)SVM: 2570: Destroyed device 5307a4-3b07ad-svmmirror

2019-06-17T12:50:33.335Z cpu1:2099715)CBT: 723: Disconnecting the cbt device 2a07a7-cbt with filehandle 2754471

2019-06-17T12:50:33.341Z cpu6:2099723)CBT: 723: Disconnecting the cbt device 5307a4-cbt with filehandle 5441444

2019-06-17T12:50:33.350Z cpu3:2099715)CBT: 1352: Created device 6d09cd-cbt for cbt driver with filehandle 7145933

2019-06-17T12:50:33.350Z cpu3:2099715)VSCSI: 3782: handle 8214(vscsi0:0):Using sync mode due to sparse disks

2019-06-17T12:50:33.350Z cpu3:2099715)VSCSI: 3810: handle 8214(vscsi0:0):Creating Virtual Device for world 2099688 (FSS handle 12388969) numBlocks=41943040 (bs=512)

2019-06-17T12:50:33.350Z cpu3:2099715)VSCSI: 273: handle 8214(vscsi0:0):Input values: res=0 limit=-2 bw=-1 Shares=1000

2019-06-17T12:50:33.351Z cpu3:2099715)Vmxnet3: 18569: indLROPktToGuest: 1, vcd->umkShared->vrrsSelected: 3 port 0x200000b

2019-06-17T12:50:33.351Z cpu3:2099715)Vmxnet3: 18810: Using default queue delivery for vmxnet3 for port 0x200000b

2019-06-17T12:50:33.357Z cpu4:2099723)CBT: 1352: Created device 220ba5-cbt for cbt driver with filehandle 2231205

2019-06-17T12:50:33.357Z cpu4:2099723)VSCSI: 3782: handle 8215(vscsi0:0):Using sync mode due to sparse disks

2019-06-17T12:50:33.357Z cpu4:2099723)VSCSI: 3810: handle 8215(vscsi0:0):Creating Virtual Device for world 2099687 (FSS handle 1706919) numBlocks=62914560 (bs=512)

2019-06-17T12:50:33.357Z cpu4:2099723)VSCSI: 273: handle 8215(vscsi0:0):Input values: res=0 limit=-2 bw=-1 Shares=1000

2019-06-17T12:50:33.357Z cpu4:2099723)Vmxnet3: 18569: indLROPktToGuest: 1, vcd->umkShared->vrrsSelected: 3 port 0x200000d

2019-06-17T12:50:33.357Z cpu4:2099723)Vmxnet3: 18810: Using default queue delivery for vmxnet3 for port 0x200000d

SureshKumarMuth
Commander

Sorry for the late response.

The log above does not give more information on why the NIC went down. We would have to enable debug logging for the driver to find out what made the NIC go down at that time. However, if we establish that the issue is caused by the driver, we can't do much apart from updating the driver/firmware, which you have already done. Only the NIC vendor can help us with that.

Or, if you see no issues with ne1000, you may use that driver instead of igb.

Regards,
Suresh
https://vconnectit.wordpress.com/

nague
Contributor

Exactly the same behavior here with ESXi 6.5 U3 and an Intel 82576 NIC. Everything was fine in ESXi 6.5 U2.

I've updated the igb driver from 5.0.5 to 5.2.5 (the last officially supported version). Let's say it's a "little" better: it now takes two weeks (instead of 2 days) before the NIC stops passing traffic. Unplugging and replugging the Ethernet cable, or remotely taking the port down and up on the switch, solves the issue.

Did you find any solution to this issue? Using the ne1000 driver with this NIC is possible, right? How do I switch drivers?

monderick
Enthusiast

We're having the same random issue with Intel Corporation 82576 Gigabit Network Connection QP NICs on our vSphere 6.5 hosts. We opened a support ticket, and of course the suggestion is to upgrade to the 5.2.5 driver. We're going to proceed, but this thread doesn't make me confident.

PeterCr
Enthusiast

I have the same problem here under load, for example 2-3 hours into backups over the NICs.

This happens on two different servers, and I tried both the inbox and 5.2.5 versions of the driver.

If the system is stable, I can recover via the CLI by running "esxcli network nic down -n vmnic0" and "esxcli network nic up -n vmnic0", which brings the NICs back online without a reboot.
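As a rough illustration of that recovery step, the following Python sketch wraps the down/up pair in one function. It assumes the full command form `esxcli network nic down -n <vmnic>` and `esxcli network nic up -n <vmnic>`; the helper names `bounce_commands` and `bounce` and the two-second pause are hypothetical choices, not values taken from this thread or from VMware documentation.

```python
import subprocess
import time


def bounce_commands(vmnic):
    """Return the down/up command pair for one uplink, e.g. "vmnic0"."""
    if not vmnic.startswith("vmnic") or not vmnic[5:].isdigit():
        raise ValueError("expected a name like vmnic0, got %r" % vmnic)
    return [
        ["esxcli", "network", "nic", "down", "-n", vmnic],
        ["esxcli", "network", "nic", "up", "-n", vmnic],
    ]


def bounce(vmnic, pause_seconds=2.0, run=subprocess.check_call):
    """Take the uplink down, wait briefly, then bring it back up.

    `run` is injectable so the sequence can be dry-run without esxcli.
    """
    down, up = bounce_commands(vmnic)
    run(down)
    # The pause length is a guess for this sketch, not a documented value.
    time.sleep(pause_seconds)
    run(up)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    bounce("vmnic0", pause_seconds=0, run=print)
```

Passing `run=print` gives a dry run; on a host, the default `subprocess.check_call` would execute the commands and raise if either one fails.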

berndweyand
Expert

Your vmnic3 was down for "only" 3 seconds. That is still too long and should not happen.

But did you overlook the APD events before the NIC went down? It seems that you lost the storage connection to your NFS server.
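The ordering of those events can be checked with a small Python sketch that pulls the APD and link-state lines out of the vmkernel.log excerpt and computes the gaps between them. The sample lines are copied from the log posted earlier in the thread; the event labels and the helper names `timeline` and `first` are made up for this illustration.

```python
import re
from datetime import datetime

# A few lines copied from the vmkernel.log excerpt earlier in the thread.
SAMPLE = """\
2019-06-17T12:35:42.190Z cpu0:2098034)StorageApdHandler: 1203: APD start for 0x430c44ee76d0 [3a5eb32c-7141e730]
2019-06-17T12:38:02.191Z cpu0:2097369)StorageApdHandler: 609: APD timeout event for 0x430c44ee76d0 [3a5eb32c-7141e730]
2019-06-17T12:45:39.351Z cpu3:2097615)<6>igb: vmnic3 NIC Link is Down
2019-06-17T12:45:42.732Z cpu7:2097615)<6>igb: vmnic3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
2019-06-17T12:45:48.941Z cpu4:2097369)StorageApdHandler: 507: APD exit event for 0x430c44ee76d0 [3a5eb32c-7141e730]
"""

# Patterns for the events of interest; the labels are chosen for this sketch.
EVENTS = [
    ("apd_start", re.compile(r"APD start for ")),
    ("apd_timeout", re.compile(r"APD timeout event for ")),
    ("apd_exit", re.compile(r"APD exit event for ")),
    ("link_down", re.compile(r"NIC Link is Down")),
    ("link_up", re.compile(r"NIC Link is Up")),
]


def timeline(log_text):
    """Return (timestamp, label) pairs for matching lines, in file order."""
    found = []
    for line in log_text.splitlines():
        stamp, _, rest = line.partition(" ")
        for label, pattern in EVENTS:
            if pattern.search(rest):
                when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")
                found.append((when, label))
                break
    return found


def first(events, label):
    """Timestamp of the first event carrying the given label."""
    return next(when for when, name in events if name == label)


if __name__ == "__main__":
    ev = timeline(SAMPLE)
    print("APD start -> link down:", first(ev, "link_down") - first(ev, "apd_start"))
    print("link down -> link up:  ", first(ev, "link_up") - first(ev, "link_down"))
```

On these lines, the APD start comes almost ten minutes before the logged link-down, and the link itself is down for a little over three seconds, which matches the reading above that storage connectivity was lost well before the link flap appears in the log.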

DataBitz
Enthusiast

I tried ESXi 6.7 with the older 4.2.16.8 driver, with the same result. I also confirmed that it happens on ESXi 6.5 U3.

HobbyStudent
Contributor

This is exactly the same issue I have with one of my servers. It's a Supermicro X9DRH-7TF with the onboard 1 Gbit interfaces. Both are Intel 82576; one is used for management (vmnic2), the other one (vmnic3) for the guests (1x CentOS, 2x Ubuntu 18.04 LTS, 4x Windows Server 2012R2/2019) with its own vSwitch.

Everything was working with ESXi 6.7 Build 13473784. The problem first occurred after installing ESXi 6.7 Build 15160138.

vmnic2  0000:02:00.0 igb         Up   1000Mbps   Full   <MAC address> 1500   Intel Corporation 82576 Gigabit Network Connection

vmnic3  0000:02:00.1 igb         Up   1000Mbps   Full   <MAC address> 1500   Intel Corporation 82576 Gigabit Network Connection

The management network is always reachable, while the other one stops passing traffic when there is heavy traffic on it (e.g. backups). The logs don't show anything, and the link is always "Up".

At first, some Linux VMs caused this log entry:

Vmxnet3: 24934: <Linux VM>,<MAC address>, portID(83886088): Hang detected,numHangQ: 1, enableGen: 183

I changed all Linux VMs to e1000e. There has been no "Hang" in the logs since, but the problem wasn't resolved: vmnic3 still stops passing traffic without any log entry.

esxcli network nic down -n vmnic3
esxcli network nic up -n vmnic3

After taking the NIC down and up like this, it immediately starts passing traffic again.

net-igb                        5.0.5.1.1-5vmw.670.0.0.8169922        VMW     VMwareCertified   2019-05-09

As others have stated, a driver update does not seem to solve the issue. Is there anything I could try to resolve it? Perhaps some extended logging?

Edit:

The issue occurs more or less randomly, but at least every 48-72 h.

theoha
Contributor

Hello all,

I have the same problem.

I'm running ESXi 6.7.0 Update 3 (Build 14320388), also with one of the 82576 NICs. It works for me, but with a latency of more than 500 ms.

So far I have not found a solution, but if you have any other information, I'm all ears.

Best regards,

Theo

MattSnead
Contributor

I've been having very similar issues since upgrading 3 of my hosts to the latest 6.5 (v15256549). The version I was running prior to this update was very old, as I had been slacking on updates. I don't even recall what version it was, but I think it was a 6.5 version from around 12/2018. As I mentioned, I have 3 hosts (of similar vintage: older Dell M610 blade servers) and they've all got dual-quad port Intel 82576's. Unfortunately, the upgrade process went completely fine and gave me no indication of a problem until the very last host. I was vMotioning VMs between the 3 of them the entire time and had no issues at all. It was only after I completed the process and went to vMotion on the last host that the vMotion failed and all hell broke loose.

I use 4 of the uplinks in a static LAG (4 active uplinks with IP hash teaming mode on the VMware side, and a static LAG on the switch side). This configuration has been in place for almost 9 years and has worked flawlessly. My findings are completely in line with what has been mentioned here: after a period of time, either some VMs or all VMs on a host stop passing traffic. Simply downing a NIC and bringing it back up restores connectivity. When some hosts don't work, it's usually (maybe always?) the last NIC in the group that has a problem. Some hosts can ping hosts that other hosts can't, and vice versa. Using a VMware IP hash calculator (https://techslaves.org/2014/02/25/vmware-ip-hash-algorithm-calculator/ ) you can see which vmnic the traffic would be sent over, and so which NIC is the one with the problem.
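For readers without the calculator at hand, here is a minimal Python sketch of the commonly described form of the IP hash: XOR the source and destination IPv4 addresses and take the result modulo the number of active uplinks. This formula is an assumption based on public descriptions of the teaming policy, not something confirmed in this thread, and the function name `ip_hash_uplink` is made up. How the resulting index maps to a specific vmnic depends on the uplink order of the team, which this sketch does not model.

```python
import ipaddress


def ip_hash_uplink(src_ip, dst_ip, uplink_count):
    """Return the zero-based uplink index chosen for a src/dst pair.

    Assumed formula: (src XOR dst) mod uplink_count on the 32-bit
    IPv4 addresses. With 4 uplinks only the last octets matter,
    because 4 divides 256.
    """
    if uplink_count < 1:
        raise ValueError("uplink_count must be at least 1")
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return (src ^ dst) % uplink_count


if __name__ == "__main__":
    # Ping source and destination used later in this post, 4 active uplinks.
    print(ip_hash_uplink("10.212.132.50", "10.100.32.25", 4))
```

For the ping source and destination used later in this post (10.212.132.50 and 10.100.32.25) with 4 uplinks, this yields index 3, i.e. the fourth uplink in zero-based counting. XOR is symmetric, so the request and the reply hash to the same index.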

I started a support call with VMware on this issue immediately, and I've gotten very little help, mainly because my servers are technically only certified up to 6.0 U3, so it's very easy for them just to blame that. However, these servers have been running 6.5 for at least 2 years with no problem, and I wasn't about to roll back to a version that is end of life in 2 months. One of the things we tried that seemed to work at first was destroying the vSwitch and creating a new one. That actually worked for 3-4 days without an issue. At that point I had recreated the vSwitches on the other 2 hosts and started moving some production VMs to them. Then the problems started cropping up again on all three hosts. Every time I call back in to VMware, they want to blame either my old servers or my physical switches, so I had to take matters into my own hands and do some real debugging.

Last weekend I spent many hours debugging it, and this is what I found. You can use pktcap-uw to capture packets at different points in the system. This document describes the different stages of pktcap-uw: Capture Points of the pktcap-uw Utility. Using a physical server that could not ping one of my virtual hosts (while others could), I opened a continuous ping to the VM from that physical server. I could identify the packets coming in from the non-VMware-related portion of the network into the VMware-related switches and eventually reaching the host. The host receives the ping packets and replies to them. I see the return packets exit the VM and enter the vSwitch, but they never leave the vSwitch and get put on the physical adapter. Here are the steps I used:

ping source 10.212.132.50

ping dest 10.100.32.25

dest VM name: ghost

[root@vm2:~] esxcli network vm list

World ID  Name       Num Ports  Networks

--------  ---------  ---------  --------------------------------

2102595  ghost              1  Data Network A

2102796  Server 2           1  Data Network B

2102159  Server 3           1  Data Network C

2101973  Server 4           1  Data Network A

2101731  Server 5           1  Data Network B

[root@vm2:~] esxcli network vm port list -w 2102595

   Port ID: 83886095

   vSwitch: vSwitch3

   Portgroup: Data Network A

   DVPort ID:

   MAC Address: 00:50:56:bc:2e:e9

   IP Address: 0.0.0.0

   Team Uplink: all(4)

   Uplink Port ID: 0

   Active Filters:

  

[root@vm2:~] pktcap-uw --switchport 83886095 --capture PortInput --dstip 10.212.132.50 -o- |tcpdump-uw -enr -

The switch port id is 0x0500000f.

The session capture point is PortInput.

The session filter destination IP address is 10.212.132.50.

07:09:58.812762 00:50:56:bc:2e:e9 > b8:af:67:70:92:c6, ethertype IPv4 (0x0800), length 74: 10.100.32.25 > 10.212.132.50: ICMP echo reply, id 1, seq 3323, length 40

07:10:03.813107 00:50:56:bc:2e:e9 > b8:af:67:70:92:c6, ethertype IPv4 (0x0800), length 74: 10.100.32.25 > 10.212.132.50: ICMP echo reply, id 1, seq 3324, length 40

07:10:08.809467 00:50:56:bc:2e:e9 > b8:af:67:70:92:c6, ethertype IPv4 (0x0800), length 74: 10.100.32.25 > 10.212.132.50: ICMP echo reply, id 1, seq 3325, length 40

07:10:13.808301 00:50:56:bc:2e:e9 > b8:af:67:70:92:c6, ethertype IPv4 (0x0800), length 74: 10.100.32.25 > 10.212.132.50: ICMP echo reply, id 1, seq 3326, length 40

pktcap: Dumped 4 packet to file -, dropped 0 packets.

pktcap: Done.

[root@vm2:~] pktcap-uw --uplink vmnic9 --capture PortOutput --dstip 10.212.132.50 -o- |tcpdump-uw -enr -

The name of the uplink is vmnic9.

The session capture point is PortOutput.

The session filter destination IP address is 10.212.132.50.

07:11:23.808640 00:50:56:bc:2e:e9 > b8:af:67:70:92:c6, ethertype IPv4 (0x0800), length 74: 10.100.32.25 > 10.212.132.50: ICMP echo reply, id 1, seq 3340, length 40

07:11:28.808943 00:50:56:bc:2e:e9 > b8:af:67:70:92:c6, ethertype IPv4 (0x0800), length 74: 10.100.32.25 > 10.212.132.50: ICMP echo reply, id 1, seq 3341, length 40

07:11:33.810570 00:50:56:bc:2e:e9 > b8:af:67:70:92:c6, ethertype IPv4 (0x0800), length 74: 10.100.32.25 > 10.212.132.50: ICMP echo reply, id 1, seq 3342, length 40

07:11:38.809677 00:50:56:bc:2e:e9 > b8:af:67:70:92:c6, ethertype IPv4 (0x0800), length 74: 10.100.32.25 > 10.212.132.50: ICMP echo reply, id 1, seq 3343, length 40

pktcap: Dumped 4 packet to file -, dropped 0 packets.

pktcap: Done.

[root@vm2:~] pktcap-uw --uplink vmnic9 --capture UplinkSnd --dstip 10.212.132.50 -o- |tcpdump-uw -enr -

The name of the uplink is vmnic9.

The session capture point is UplinkSnd.

The session filter destination IP address is 10.212.132.50.

pktcap: Dumped 0 packet to file -, dropped 0 packets.

pktcap: Done.

... so note that the last packet capture shows no packets at all.  PortInput in my first capture is basically the vswitch receiving the packet from the VM.  PortOutput is the packet leaving the vswitch.  UplinkSnd is the vswitch putting the packet on the physical adapter.  Note that I used "--switchport 83886095" for the first capture, which theoretically captures all packets to/from that VM's switch port (the Port ID from the port list above).  I used "--uplink vmnic9" on the other two commands because at that point you're dealing with the vswitch itself.  So you have to know (or find by trial and error) the vmnic.
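
The three-step walk above can be written down once and reused for other VMs. The following is only a minimal sketch, assuming the same flags used in the captures above (--switchport, --uplink, --capture, --dstip, -o-). The helper names port_id_hex and capture_commands are made up for illustration and are not part of pktcap-uw; the script only prints commands, it does not run them. It also shows that the "0x0500000f" printed by pktcap-uw is simply the decimal Port ID 83886095 in hex.

```python
def port_id_hex(port_id):
    """Format a decimal switch port ID the way pktcap-uw echoes it back."""
    return "0x{:08x}".format(port_id)


def capture_commands(port_id, uplink, dst_ip):
    """Build the three captures from the post, in the order a reply travels.

    Hypothetical helper: it only assembles command strings from flags that
    appear in the transcripts above.
    """
    reader = "-o- | tcpdump-uw -enr -"
    steps = [
        ("--switchport {}".format(port_id), "PortInput"),  # VM hands packet to vswitch
        ("--uplink {}".format(uplink), "PortOutput"),      # packet leaving the vswitch
        ("--uplink {}".format(uplink), "UplinkSnd"),       # packet handed to the physical NIC
    ]
    return [
        "pktcap-uw {} --capture {} --dstip {} {}".format(selector, point, dst_ip, reader)
        for selector, point in steps
    ]


if __name__ == "__main__":
    print(port_id_hex(83886095))
    for command in capture_commands(83886095, "vmnic9", "10.212.132.50"):
        print(command)
```

If the first two commands show the echo replies and the third shows nothing, you are looking at the same symptom as in the captures above.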

Here are similar tests that produce the same result, but using different "[stage]s" and "[dir]ection" switches for the command instead.  My understanding is that the "--capture" points are basically alternates to using --dir and --stage.

[root@vm2:~] pktcap-uw --uplink vmnic9 --dir 0 --stage 0 -o- |tcpdump-uw -enr -|grep 10.212

The name of the uplink is vmnic9.

The Stage is Pre.

pktcap: The output file is -.

pktcap: No server port specifed, select 40524 as the port.

pktcap: Local CID 2.

pktcap: Listen on port 40524.

pktcap: Accept...

pktcap: Vsock connection from port 1152 cid 2.

reading from file -, link-type EN10MB (Ethernet)

07:30:38.809784 b8:af:67:70:92:c6 > 00:50:56:bc:2e:e9, ethertype IPv4 (0x0800), length 74: 10.212.132.50 > 10.100.32.25: ICMP echo request, id 1, seq 3571, length 40

07:30:43.808875 b8:af:67:70:92:c6 > 00:50:56:bc:2e:e9, ethertype IPv4 (0x0800), length 74: 10.212.132.50 > 10.100.32.25: ICMP echo request, id 1, seq 3572, length 40

tcpdump-uw: pcap_loop: error reading dump file: Interrupted system call

pktcap: Join with dump thread failed.

pktcap: Destroying session 128.

pktcap:

pktcap: Dumped 130 packet to file -, dropped 0 packets.

pktcap: Done.

[root@vm2:~] pktcap-uw --uplink vmnic9 --dir 0 --stage 1 -o- |tcpdump-uw -enr -|grep 10.212

The name of the uplink is vmnic9.

The Stage is Post.

pktcap: The output file is -.

pktcap: No server port specifed, select 40537 as the port.

pktcap: Local CID 2.

pktcap: Listen on port 40537.

pktcap: Accept...

reading from file -, link-type EN10MB (Ethernet)

pktcap: Vsock connection from port 1153 cid 2.

07:30:53.810564 b8:af:67:70:92:c6 > 00:50:56:bc:2e:e9, ethertype IPv4 (0x0800), length 74: 10.212.132.50 > 10.100.32.25: ICMP echo request, id 1, seq 3574, length 40

07:30:58.812753 b8:af:67:70:92:c6 > 00:50:56:bc:2e:e9, ethertype IPv4 (0x0800), length 74: 10.212.132.50 > 10.100.32.25: ICMP echo request, id 1, seq 3575, length 40

tcpdump-uw: pcap_loop: error reading dump file: Interrupted system call

pktcap: Join with dump thread failed.

pktcap: Destroying session 129.

pktcap:

pktcap: Dumped 91 packet to file -, dropped 0 packets.

pktcap: Done.

[root@vm2:~] pktcap-uw --uplink vmnic9 --dir 1 --stage 0 -o- |tcpdump-uw -enr -|grep 10.212

The name of the uplink is vmnic9.

The Stage is Pre.

pktcap: The output file is -.

pktcap: No server port specifed, select 40550 as the port.

pktcap: Local CID 2.

pktcap: Listen on port 40550.

reading from file -, link-type EN10MB (Ethernet)

pktcap: Accept...

pktcap: Vsock connection from port 1154 cid 2.

07:31:13.811837 00:50:56:bc:2e:e9 > b8:af:67:70:92:c6, ethertype IPv4 (0x0800), length 74: 10.100.32.25 > 10.212.132.50: ICMP echo reply, id 1, seq 3578, length 40

07:31:18.813389 00:50:56:bc:2e:e9 > b8:af:67:70:92:c6, ethertype IPv4 (0x0800), length 74: 10.100.32.25 > 10.212.132.50: ICMP echo reply, id 1, seq 3579, length 40

07:31:23.811731 00:50:56:bc:2e:e9 > b8:af:67:70:92:c6, ethertype IPv4 (0x0800), length 74: 10.100.32.25 > 10.212.132.50: ICMP echo reply, id 1, seq 3580, length 40

tcpdump-uw: pcap_loop: error reading dump file: Interrupted system call

pktcap: Join with dump thread failed.

pktcap: Destroying session 130.

pktcap:

pktcap: Dumped 8 packet to file -, dropped 0 packets.

pktcap: Done.

[root@vm2:~] pktcap-uw --uplink vmnic9 --dir 1 --stage 1 -o- |tcpdump-uw -enr -|grep 10.212

The name of the uplink is vmnic9.

The Stage is Post.

pktcap: The output file is -.

pktcap: No server port specifed, select 40560 as the port.

pktcap: Local CID 2.

pktcap: Listen on port 40560.

reading from file -, link-type EN10MB (Ethernet)

pktcap: Accept...

pktcap: Vsock connection from port 1155 cid 2.

pktcap: Join with dump thread failed.

tcpdump-uw: pcap_loop: error reading dump file: Interrupted system call

pktcap: Destroying session 131.

pktcap:

pktcap: Dumped 0 packet to file -, dropped 0 packets.

pktcap: Done.

So in this test, the first test is dir 0/stage 0, then dir 0/stage 1, then dir 1/stage 0, then finally dir 1/stage 1 where it fails.  Again, same tests just different variations of the commands.
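
To keep the four combinations straight, here is a small lookup sketch. It assumes, as the captures above suggest, that --dir 0 is the receive direction (the echo requests show up there) and --dir 1 is transmit (the echo replies), and that --stage 0/1 are Pre/Post as the tool itself prints. The counts are the number of 10.212.x lines grep printed in each run above before the capture was interrupted, so they are only illustrative; the names are made up for this example.

```python
DIRECTIONS = {0: "Rx", 1: "Tx"}  # assumption: inferred from which packets each run showed
STAGES = {0: "Pre", 1: "Post"}   # matches "The Stage is Pre." / "The Stage is Post."

# ((dir, stage), matching lines seen) in the order the tests were run above
OBSERVED = [((0, 0), 2), ((0, 1), 2), ((1, 0), 3), ((1, 1), 0)]


def label(direction, stage):
    """Human-readable name for a --dir/--stage pair."""
    return "{} {}".format(DIRECTIONS[direction], STAGES[stage])


def first_silent_point(observed):
    """Return the label of the first capture point that saw no packets, or None."""
    for (direction, stage), count in observed:
        if count == 0:
            return label(direction, stage)
    return None


if __name__ == "__main__":
    for (direction, stage), count in OBSERVED:
        print("{:8s} {}".format(label(direction, stage), count))
    print("first silent point:", first_silent_point(OBSERVED))
```

With the numbers from the runs above this points at the transmit side, post stage, which lines up with the empty UplinkSnd capture earlier.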

Then I found the "trace" switch for that command and this further proves my findings.  Here is a successful trace of an ICMP packet:

[root@vm2:~] pktcap-uw --trace --ip 10.100.1.5

The trace session is enabled.

The session filter IP(src or dst) address is 10.100.1.5.

No server port specifed, select 56026 as the port.

Output the packet info to console.

Local CID 2.

Listen on port 56026.

Accept...

Vsock connection from port 1207 cid 2.

18:39:04.106975[1] Captured at PktFree point, TSO not enabled, Checksum not offloaded and not verified, length 74.

        PATH:

          +- [18:39:04.106955] |                        UplinkRcv |            |

          +- [18:39:04.106957] |                  UplinkRcvKernel |            |

          +- [18:39:04.106958] |                        PortInput |   83886086 |

          +- [18:39:04.106958] |                          IOChain |            | UplinkDoSwLRO@vmkernel#nover

          +- [18:39:04.106959] |               EtherswitchDispath |   83886086 |

          +- [18:39:04.106961] |                EtherswitchOutput |   83886095 |

          +- [18:39:04.106961] |                       PortOutput |   83886095 |

          +- [18:39:04.106962] |                          IOChain |            | VLAN_OutputProcessor@com.vmware.vswitch#1.0.0

          +- [18:39:04.106963] |                          IOChain |            | VSwitchDisablePT@com.vmware.vswitch#1.0.0

          +- [18:39:04.106968] |                           VnicRx |   83886095 |

          +- [18:39:04.106974] |                          PktFree |            |

18:39:04.107284[2] Captured at PktFree point, TSO not enabled, Checksum not offloaded and not verified, VLAN tag 101, length 74.

        PATH:

          +- [18:39:04.107187] |                           VnicTx |   83886095 |

          +- [18:39:04.107189] |                        PortInput |   83886095 |

          +- [18:39:04.107190] |                          IOChain |            | VLAN_InputProcessor@com.vmware.vswitch#1.0.0

          +- [18:39:04.107192] |               EtherswitchDispath |   83886095 |

          +- [18:39:04.107195] |                       PortOutput |   83886084 |

          +- [18:39:04.107195] |                          IOChain |            | UplinkGenericOffload@vmkernel#nover

          +- [18:39:04.107196] |                          IOChain |            | UplinkTSO6ExtHdrs@vmkernel#nover

          +- [18:39:04.107197] |                          IOChain |            | UplinkCSum6ExtHdrs@vmkernel#nover

          +- [18:39:04.107197] |                          IOChain |            | Uplink_BuildWritableInetHeaders@vmkernel#nover

          +- [18:39:04.107198] |                          IOChain |            | NetSchedInput@vmkernel#nover

          +- [18:39:04.107200] |                  UplinkSndKernel |            |

          +- [18:39:04.107201] |                        UplinkSnd |            |

          +- [18:39:04.107281] |                          PktFree |            |

...The first packet is the ICMP being received, and the second is the reply.  Note the UplinkSndKernel and UplinkSnd entries in the reply's path, just before the PktFree.

Now here is an UNsuccessful ICMP trace:

[root@vm2:~] pktcap-uw --trace --ip 10.100.0.54

The trace session is enabled.

The session filter IP(src or dst) address is 10.100.0.54.

No server port specifed, select 55910 as the port.

Output the packet info to console.

Local CID 2.

Listen on port 55910.

Accept...

Vsock connection from port 1205 cid 2.

18:34:21.838652[1] Captured at PktFree point, TSO not enabled, Checksum not offloaded and not verified, length 74.

        PATH:

          +- [18:34:21.838622] |                        UplinkRcv |            |

          +- [18:34:21.838626] |                  UplinkRcvKernel |            |

          +- [18:34:21.838627] |                        PortInput |   83886088 |

          +- [18:34:21.838628] |                          IOChain |            | UplinkDoSwLRO@vmkernel#nover

          +- [18:34:21.838630] |               EtherswitchDispath |   83886088 |

          +- [18:34:21.838633] |                EtherswitchOutput |   83886095 |

          +- [18:34:21.838634] |                       PortOutput |   83886095 |

          +- [18:34:21.838634] |                          IOChain |            | VLAN_OutputProcessor@com.vmware.vswitch#1.0.0

          +- [18:34:21.838636] |                          IOChain |            | VSwitchDisablePT@com.vmware.vswitch#1.0.0

          +- [18:34:21.838642] |                           VnicRx |   83886095 |

          +- [18:34:21.838651] |                          PktFree |            |


18:34:21.838900[2] Captured at PktFree point, TSO not enabled, Checksum not offloaded and not verified, VLAN tag 101, length 74.

        PATH:

          +- [18:34:21.838878] |                           VnicTx |   83886095 |

          +- [18:34:21.838881] |                        PortInput |   83886095 |

          +- [18:34:21.838882] |                          IOChain |            | VLAN_InputProcessor@com.vmware.vswitch#1.0.0

          +- [18:34:21.838885] |               EtherswitchDispath |   83886095 |

          +- [18:34:21.838888] |                       PortOutput |   83886088 |

          +- [18:34:21.838889] |                          IOChain |            | UplinkGenericOffload@vmkernel#nover

          +- [18:34:21.838890] |                          IOChain |            | UplinkTSO6ExtHdrs@vmkernel#nover

          +- [18:34:21.838891] |                          IOChain |            | UplinkCSum6ExtHdrs@vmkernel#nover

          +- [18:34:21.838891] |                          IOChain |            | Uplink_BuildWritableInetHeaders@vmkernel#nover

          +- [18:34:21.838892] |                          IOChain |            | NetSchedInput@vmkernel#nover

          +- [18:34:21.838899] |                          PktFree |            |

... Note the lack of UplinkSndKernel and UplinkSnd before the PktFree.  The packet never gets put on the line.
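
Comparing the two traces by eye gets tedious, so here is a rough sketch of a parser for saved "pktcap-uw --trace" output. It assumes the text layout shown above (a "Captured at" header line per packet, followed by "+- [time] | Name | port |" hop lines). The heuristic is an assumption of mine: a packet that has VnicTx in its path but neither UplinkSnd nor VnicRx was sent by a VM and never reached an uplink or another VM. The function names and the abridged SAMPLE are for illustration only.

```python
import re

HEADER = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d+\[(\d+)\] Captured at")
HOP = re.compile(r"^\s*\+-\s*\[[^\]]+\]\s*\|\s*(\S+)\s*\|")

# Abridged from the unsuccessful trace above (some hops left out for brevity)
SAMPLE = """\
18:34:21.838652[1] Captured at PktFree point, TSO not enabled, Checksum not offloaded and not verified, length 74.
        PATH:
          +- [18:34:21.838622] |                        UplinkRcv |            |
          +- [18:34:21.838642] |                           VnicRx |   83886095 |
          +- [18:34:21.838651] |                          PktFree |            |
18:34:21.838900[2] Captured at PktFree point, TSO not enabled, Checksum not offloaded and not verified, VLAN tag 101, length 74.
        PATH:
          +- [18:34:21.838878] |                           VnicTx |   83886095 |
          +- [18:34:21.838888] |                       PortOutput |   83886088 |
          +- [18:34:21.838892] |                          IOChain |            | NetSchedInput@vmkernel#nover
          +- [18:34:21.838899] |                          PktFree |            |
"""


def parse_trace(text):
    """Return {packet number: [hop names in order]} from trace text."""
    packets = {}
    current = None
    for line in text.splitlines():
        header = HEADER.match(line.strip())
        if header:
            current = int(header.group(1))
            packets[current] = []
            continue
        hop = HOP.match(line)
        if hop and current is not None:
            packets[current].append(hop.group(1))
    return packets


def dropped_before_uplink(packets):
    """Packet numbers sent by a VM that reached neither an uplink nor a VM."""
    return [
        number
        for number, hops in sorted(packets.items())
        if "VnicTx" in hops and "UplinkSnd" not in hops and "VnicRx" not in hops
    ]


if __name__ == "__main__":
    print(dropped_before_uplink(parse_trace(SAMPLE)))
```

Run against the successful trace, the same check should come back empty, since the reply there passes through UplinkSnd.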

I didn't mention it earlier, but just to put it out there: before doing all of this this past weekend, I actually fresh installed the latest ESXi 6.7 just to see if it was something that had lingered from previous upgrades.  But it didn't help, obviously, because here I am.  All of the tests and captures above are on a fresh install of the latest 6.7 as of last weekend.  I figured I would try this before forcing myself into installing a soon-to-be obsolete 6.0 version.

Now, the interesting thing was that the igb driver that ships inside 6.7 was version 5.0.5, I believe.  So searching around for an update to that I found this: Download VMware vSphere  which is version 5.3.3 of the driver.  This is also listed as compatible with version 6.7 on another page but doesn't say it on that link.  After upgrading the driver to v5.3.3 I haven't had a problem... yet.  However, after finding and reading this thread I am not so confident that I have the problem solved.  After all, I mentioned earlier that simply recreating the vswitch had lasted 3-4 days before.  I'm on day 3 now.  I also only have 3 non-production VMs on the host right now.  I am glad to see I am not the only one having the problem, and maybe if enough of us bark up the right trees we can get some sort of solution to this.  I know I need to replace my aging servers, and have it in the plans to do so about a year from now, but they all worked fine and suited our needs before this.

MattSnead
Contributor

.... AAANNNDDD it just happened to me again.   And it happened on the second uplink this time (vmnic4, 5, 8, and 9 make up the LAG, and vmnic5 faulted this time).

MattSnead
Contributor

FYI.. in case anyone finds this post in the future.  I have not found any solution to make this work on the latest versions of 6.5 or 6.7.  The only solutions I have found were rolling back to 6.0 (fully patched as of this writing is fine) or to 6.5 build 10719125.  Something in 6.5 build 10884925 is what's breaking it for me.  If you install a fresh 6.5 U2 you can create a custom baseline that only includes updates released before 11/27/2018 (11/26 or earlier).  That will take you up to build 10719125.

horfor
Contributor

Hello,

We experienced the same issue with ESXi 6.7 and quad port cards:

Vendor:Intel Corporation
Vendor ID:0x8086
Device ID:0x10e8
Sub-Vendor ID:0x8086
Sub-Device ID:0xa02c
Device name:82576 Gigabit Network Connection

VMware informed us that the support for this card was dropped in ESXi 6.7:

https://www.vmware.com/resources/compatibility/detail.php?deviceCategory=io&productid=12997&deviceCa...

As we have just a few of these cards we decided to replace them.

leotog
Contributor

I have exactly the same problem. Any solution?

VirtualSlam
Contributor

I've had this same issue and this forum post has been helpful in my troubleshooting. I had issues with both a 4 port 82576 and a 2 port 82575EB card in my 2 host lab environment. I was using the original 5.0.5 igb driver that is included with ESXi 6.7u3 when I first experienced the issue: the 2nd host, the one not hosting vCenter, kept dropping its connection to vCenter. Previously I noticed it mainly with vMotions, but then I started noticing it with any high traffic functions, even in VMs. Issues happened more often when I had 2 uplink ports on a vSwitch for redundancy. Multi-NIC vMotion is set up as well.

So I went from igb 5.0.5 to 5.2.5 and, like others have said, the issue persisted. I was going to try 5.3.3 even though another user had mentioned having the same issue with that. However, I started acquiring every version of the igb driver that I could find. I found that 5.3.2 was the last version that is similar in size to 5.3.0 and 5.3.1, so I tried 5.3.2 instead. So far I have not had any of the issues I was seeing before. With the other driver versions I would see the issue in about 50% of vMotions and within a few minutes of a high transaction operation. The testing so far includes a gigabit speed backup that maxes out 1 uplink, which would have failed before with the other drivers.
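
Since several driver versions are being compared in this thread, it helps to confirm which igb VIB is actually installed after each swap. The sketch below parses saved output of "esxcli software vib list" (the sample lines are the ones posted at the top of this thread) and pulls out the driver release. It assumes the whitespace-separated name/version columns shown there; the function names are made up for illustration. Note that the listing can contain more than one igb-related VIB, so "esxcfg-nics -l" is still the place to see which driver a given vmnic is bound to.

```python
# Lines copied from the "esxcli software vib list|grep igb" output at the top of the thread
SAMPLE = """\
net-igb                        5.2.5-1OEM.550.0.0.1331820            Intel   VMwareCertified   2019-06-16
igbn                           0.1.1.0-4vmw.670.2.48.13006603        VMW     VMwareCertified   2019-06-07
"""


def igb_vibs(vib_list_text):
    """Return {vib name: version string} for VIBs whose name contains 'igb'."""
    found = {}
    for line in vib_list_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and "igb" in fields[0]:
            found[fields[0]] = fields[1]
    return found


def driver_release(version):
    """Turn '5.2.5-1OEM.550.0.0.1331820' into (5, 2, 5); only the part before '-'."""
    return tuple(int(part) for part in version.split("-")[0].split("."))


if __name__ == "__main__":
    for name, version in sorted(igb_vibs(SAMPLE).items()):
        print(name, driver_release(version))
```

Comparing the tuple against (5, 3, 2) is then a quick way to tell whether a host is on the release reported as stable in the post above.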

Also, pfSense was having issues and reporting the error "vmx0: watchdog timeout on queue 0" while pushing a decent amount of internet traffic, though not maxing out my connection. I could only get around it by using an e1000 nic instead of vmxnet3. Now that is working with vmxnet3 as well.

Time will tell though, but I thought I'd share my early results.

igb 5.3.2 download:

https://my.vmware.com/web/vmware/details?downloadGroup=DT-ESXI55-INTEL-IGB-532&productId=323

VirtualSlam
Contributor

Scratch the pfSense part. It still has issues with vmxnet3, but I was only half suspecting that it was a related issue. Everything else still looks good with 5.3.2.
