eth0: tx hang

Kelvin0431 · ‎06-01-2016

One node hit a tx hang while running MR jobs. It is one of the datanodes in our Hadoop cluster, and we were importing databases with Sqoop when we got the errors below. Any ideas?

Jun 1 14:01:19 kernel: ------------[ cut here ]------------

Jun 1 14:01:19 kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26b/0x280() (Not tainted)

Jun 1 14:01:19 kernel: Hardware name: VMware Virtual Platform

Jun 1 14:01:19 kernel: NETDEV WATCHDOG: eth0 (vmxnet3): transmit queue 2 timed out

Jun 1 14:01:19 kernel: Modules linked in: nfs lockd fscache auth_rpcgss nfs_acl sunrpc autofs4 8021q garp stp llc vsock(U) ipv6 microcode vmware_balloon sg vmci(U) i2c_piix4 i2c_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif sr_mod cdrom vmxnet3 mptspi mptscsih mptbase scsi_transport_spi pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ip6t_REJECT]

Jun 1 14:01:19 kernel: Pid: 18580, comm: java Not tainted 2.6.32-431.el6.x86_64 #1

Jun 1 14:01:19 kernel: Call Trace:

Jun 1 14:01:19 kernel: <IRQ> [<ffffffff81071e27>] ? warn_slowpath_common+0x87/0xc0

Jun 1 14:01:19 kernel: [<ffffffff81071f16>] ? warn_slowpath_fmt+0x46/0x50

Jun 1 14:01:19 kernel: [<ffffffff8147b74b>] ? dev_watchdog+0x26b/0x280

Jun 1 14:01:19 kernel: [<ffffffff81083e75>] ? internal_add_timer+0xb5/0x110

Jun 1 14:01:19 kernel: [<ffffffff8147b4e0>] ? dev_watchdog+0x0/0x280

Jun 1 14:01:19 kernel: [<ffffffff81084b07>] ? run_timer_softirq+0x197/0x340

Jun 1 14:01:19 kernel: [<ffffffff810ac8f5>] ? tick_dev_program_event+0x65/0xc0

Jun 1 14:01:19 kernel: [<ffffffff8107a8e1>] ? __do_softirq+0xc1/0x1e0

Jun 1 14:01:19 kernel: [<ffffffff810ac9ca>] ? tick_program_event+0x2a/0x30

Jun 1 14:01:19 kernel: [<ffffffff8100c30c>] ? call_softirq+0x1c/0x30

Jun 1 14:01:19 kernel: [<ffffffff8100fa75>] ? do_softirq+0x65/0xa0

Jun 1 14:01:19 kernel: [<ffffffff8107a795>] ? irq_exit+0x85/0x90

Jun 1 14:01:19 kernel: [<ffffffff815310aa>] ? smp_apic_timer_interrupt+0x4a/0x60

Jun 1 14:01:19 kernel: [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20

Jun 1 14:01:19 kernel: <EOI>

Jun 1 14:01:19 kernel: ---[ end trace 808af6e00c97548a ]---

Jun 1 14:01:19 kernel: vmxnet3 0000:03:00.0: eth0: tx hang

Jun 1 14:01:24 kernel: vmxnet3 0000:03:00.0: eth0: resetting

Jun 1 14:01:24 kernel: vmxnet3 0000:03:00.0: eth0: intr type 3, mode 0, 9 vectors allocated

Jun 1 14:01:24 kernel: vmxnet3 0000:03:00.0: eth0: NIC Link is Up 10000 Mbps
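For anyone tracking how often this recurs, the watchdog line in the log above can be picked out of syslog with a short script. This is only a minimal sketch: the function name `find_tx_timeouts` is made up for illustration, and the pattern assumes the message format matches the "NETDEV WATCHDOG" line quoted in this post.

```python
import re

# Pattern for the kernel's "NETDEV WATCHDOG" message, modelled on the
# log line quoted above (format assumed to be the same on other hosts).
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(lines):
    """Return (interface, driver, queue) for each watchdog timeout line.

    `find_tx_timeouts` is a hypothetical helper name, not part of any tool.
    """
    hits = []
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            hits.append(
                (match.group("iface"), match.group("driver"), int(match.group("queue")))
            )
    return hits
```

Feeding it the lines of the system log (for example, from an open file handle on the syslog file) gives one tuple per timeout, which makes it easy to see whether the same queue or interface is always involved.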

DavoudTeimouri · ‎06-02-2016

Read this KB: https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=20551...

Then disable TSO on your guest and check again. If the problem is not resolved, change your NIC to E1000.

If the problem still exists, you can disable TSO on ESXi as a test, but I don't recommend it.
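On a Linux guest, TSO is usually checked with `ethtool -k <iface>` and turned off with `ethtool -K <iface> tso off`. The sketch below wraps those two commands. It assumes `ethtool` is installed, that the script runs as root, and that the feature is reported as `tcp-segmentation-offload` (older ethtool builds print the name with spaces, which the parser normalises). The function names are hypothetical.

```python
import subprocess


def tso_enabled(ethtool_k_output):
    """Parse `ethtool -k <iface>` output.

    Returns True or False for the TSO state, or None if the feature
    line is not present in the output.
    """
    for line in ethtool_k_output.splitlines():
        name, sep, value = line.strip().partition(":")
        # Normalise "tcp segmentation offload" to the hyphenated form.
        if sep and name.strip().replace(" ", "-") == "tcp-segmentation-offload":
            words = value.split()  # value may carry a trailing "[fixed]"
            return words[0] == "on" if words else None
    return None


def disable_tso_command(iface):
    """Build the ethtool command that turns TSO off for `iface`."""
    return ["ethtool", "-K", iface, "tso", "off"]


def disable_tso(iface):
    """Turn TSO off if it is currently on; return True if a change was made."""
    out = subprocess.run(
        ["ethtool", "-k", iface], capture_output=True, text=True, check=True
    ).stdout
    if tso_enabled(out):
        subprocess.run(disable_tso_command(iface), check=True)
        return True
    return False
```

Calling `disable_tso("eth0")` would apply the change to the interface from the log. A setting made with `ethtool -K` does not survive a reboot, so it has to be reapplied at boot if it turns out to help.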

-------------------------------------------------------------------------------------
Davoud Teimouri - https://www.teimouri.net - Twitter: @davoud_teimouri Facebook: https://www.facebook.com/teimouri.net/

Kelvin0431 · ‎06-02-2016

Thanks, Davoud. I will try your suggestion and continue monitoring this guest.

cesprov · ‎06-08-2016

Are you on ESXi 6.0, and if so, are you on 6.0U1a or later?

NETDEV WATCHDOG timeout error and ESXi 6.0 loses network connectivity (2124669) | VMware KB

rembertv · ‎06-02-2017

Hi,

We are using VMware ESXi 5.5.0 build-1331820 and are facing the same issue (tx hang) with the same backtrace in the vmxnet3 driver.

Please let us know whether a patch is available.

As mentioned, TSO is disabled by default in our guest VM.

Thanks in advance.
