One of our nodes hit a tx hang while running MR jobs. The node is one of the datanodes in our Hadoop cluster, and we were importing databases with Sqoop when we got the errors below. Any ideas?
Jun 1 14:01:19 kernel: ------------[ cut here ]------------
Jun 1 14:01:19 kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26b/0x280() (Not tainted)
Jun 1 14:01:19 kernel: Hardware name: VMware Virtual Platform
Jun 1 14:01:19 kernel: NETDEV WATCHDOG: eth0 (vmxnet3): transmit queue 2 timed out
Jun 1 14:01:19 kernel: Modules linked in: nfs lockd fscache auth_rpcgss nfs_acl sunrpc autofs4 8021q garp stp llc vsock(U) ipv6 microcode vmware_balloon sg vmci(U) i2c_piix4 i2c_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif sr_mod cdrom vmxnet3 mptspi mptscsih mptbase scsi_transport_spi pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ip6t_REJECT]
Jun 1 14:01:19 kernel: Pid: 18580, comm: java Not tainted 2.6.32-431.el6.x86_64 #1
Jun 1 14:01:19 kernel: Call Trace:
Jun 1 14:01:19 kernel: <IRQ> [<ffffffff81071e27>] ? warn_slowpath_common+0x87/0xc0
Jun 1 14:01:19 kernel: [<ffffffff81071f16>] ? warn_slowpath_fmt+0x46/0x50
Jun 1 14:01:19 kernel: [<ffffffff8147b74b>] ? dev_watchdog+0x26b/0x280
Jun 1 14:01:19 kernel: [<ffffffff81083e75>] ? internal_add_timer+0xb5/0x110
Jun 1 14:01:19 kernel: [<ffffffff8147b4e0>] ? dev_watchdog+0x0/0x280
Jun 1 14:01:19 kernel: [<ffffffff81084b07>] ? run_timer_softirq+0x197/0x340
Jun 1 14:01:19 kernel: [<ffffffff810ac8f5>] ? tick_dev_program_event+0x65/0xc0
Jun 1 14:01:19 kernel: [<ffffffff8107a8e1>] ? __do_softirq+0xc1/0x1e0
Jun 1 14:01:19 kernel: [<ffffffff810ac9ca>] ? tick_program_event+0x2a/0x30
Jun 1 14:01:19 kernel: [<ffffffff8100c30c>] ? call_softirq+0x1c/0x30
Jun 1 14:01:19 kernel: [<ffffffff8100fa75>] ? do_softirq+0x65/0xa0
Jun 1 14:01:19 kernel: [<ffffffff8107a795>] ? irq_exit+0x85/0x90
Jun 1 14:01:19 kernel: [<ffffffff815310aa>] ? smp_apic_timer_interrupt+0x4a/0x60
Jun 1 14:01:19 kernel: [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
Jun 1 14:01:19 kernel: <EOI>
Jun 1 14:01:19 kernel: ---[ end trace 808af6e00c97548a ]---
Jun 1 14:01:19 kernel: vmxnet3 0000:03:00.0: eth0: tx hang
Jun 1 14:01:24 kernel: vmxnet3 0000:03:00.0: eth0: resetting
Jun 1 14:01:24 kernel: vmxnet3 0000:03:00.0: eth0: intr type 3, mode 0, 9 vectors allocated
Jun 1 14:01:24 kernel: vmxnet3 0000:03:00.0: eth0: NIC Link is Up 10000 Mbps
Read this KB: https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=20551...
Then disable TSO in your guest and check again. If the problem is not resolved, change your NIC to E1000.
If the problem still exists, you can disable TSO on ESXi as a test, but I don't suggest it.
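For anyone following the TSO suggestion: on a Linux guest the offload settings are normally inspected with `ethtool -k eth0` and TSO is turned off with `ethtool -K eth0 tso off` (that setting does not survive a reboot on its own). Below is a rough Python sketch for checking the current TSO state from the `ethtool -k` output. The function names `tso_enabled` and `query_tso` are made up for illustration, and the sketch assumes the feature line is spelled either `tcp-segmentation-offload` or `tcp segmentation offload`, depending on the ethtool version.

```python
import re
import subprocess


def tso_enabled(ethtool_output):
    """Return True/False for the TSO state found in `ethtool -k` output.

    Returns None when no TSO line is present in the text.
    """
    for line in ethtool_output.splitlines():
        # Accept both the hyphenated and the space-separated spelling
        # (assumption: these are the two forms ethtool prints).
        m = re.match(r"\s*tcp[- ]segmentation[- ]offload:\s*(on|off)\b", line)
        if m:
            return m.group(1) == "on"
    return None


def query_tso(iface="eth0"):
    """Run `ethtool -k <iface>` and report the TSO state (needs ethtool installed)."""
    result = subprocess.run(
        ["ethtool", "-k", iface],
        capture_output=True,
        text=True,
        check=True,
    )
    return tso_enabled(result.stdout)
```

Calling `query_tso("eth0")` in the guest before and after the change is one way to confirm the setting actually took effect.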
Thanks, Davoud. I will try your suggestion and continue monitoring this guest.
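For the monitoring part, a small script can count how often the hang recurs. This is only a sketch: it assumes the kernel messages keep the same wording as in the trace posted above (`vmxnet3 <pci address>: <iface>: tx hang` and `... resetting`), and the names `EVENT_RE` and `count_events` are hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
#   Jun 1 14:01:19 kernel: vmxnet3 0000:03:00.0: eth0: tx hang
#   Jun 1 14:01:24 kernel: vmxnet3 0000:03:00.0: eth0: resetting
EVENT_RE = re.compile(
    r"kernel: vmxnet3 \S+: (?P<iface>\w+): (?P<event>tx hang|resetting)\s*$"
)


def count_events(lines):
    """Count vmxnet3 'tx hang' and 'resetting' messages per interface."""
    counts = Counter()
    for line in lines:
        m = EVENT_RE.search(line)
        if m:
            counts[(m.group("iface"), m.group("event"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_tx_hang.py /path/to/kernel/log
    with open(sys.argv[1], errors="replace") as fh:
        for (iface, event), n in sorted(count_events(fh).items()):
            print(f"{iface}: {event} x{n}")
```

Point it at wherever the guest writes kernel messages (commonly `/var/log/messages` on this kind of system) and compare the counts before and after disabling TSO.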
Are you on 6.0 and if so, are you on 6.0U1a or later?
NETDEV WATCHDOG timeout error and ESXi 6.0 loses network connectivity (2124669) | VMware KB
Hi,
We are using VMware ESXi 5.5.0 build-1331820 and are facing the same issue (tx hang) with the same backtrace in the vmxnet3 driver.
Please let us know when a patch will be available.
As mentioned, TSO is disabled by default in our guest VM.
Thanks in advance.