VMware Cloud Community
cesprov
Enthusiast

Fake Tx hang detected with timeout of 160 seconds

I just upgraded 8 ESXi 5.5U2 hosts to 6.0b (build 2809209).  The first host I upgraded (a Dell R910) ran fine for about a week and then died in the middle of the night on Sunday.  I came in to find it hung: I couldn't SSH to it, it was unresponsive on the console, and it showed as disconnected in vCenter, with all of its VMs having been restarted on other hosts by HA.  I had to power it off via the iDRAC, and it came back up fine.  Syslog output stopped 9 minutes before the events in vCenter showing it going down, so I had no logs to check for what happened beforehand.  I chalked it up to an anomaly and put it back into production.  Less than 24 hours later, I awoke to pages from our monitoring system about VMs on the same host being unreachable.  This time the host was still responsive and showed as up in vCenter, but I could not open the console for any of the VMs on it.  I was able to SSH into the host, and this was in vmkernel.log:

2015-08-11T11:14:52.338Z cpu23:33245)<6>ixgbe 0000:41:00.0: vmnic4: Fake Tx hang detected with timeout of 160 seconds

2015-08-11T11:14:53.340Z cpu23:33256)WARNING: LinNet: netdev_watchdog:3678: NETDEV WATCHDOG: vmnic5: transmit timed out

2015-08-11T11:14:53.340Z cpu23:33256)<6>ixgbe 0000:41:00.1: vmnic5: Fake Tx hang detected with timeout of 160 seconds

2015-08-11T11:14:53.340Z cpu23:33256)WARNING: LinNet: netdev_watchdog:3678: NETDEV WATCHDOG: vmnic4: transmit timed out

2015-08-11T11:14:53.340Z cpu23:33256)<6>ixgbe 0000:41:00.0: vmnic4: Fake Tx hang detected with timeout of 160 seconds

2015-08-11T11:14:54.342Z cpu19:33251)WARNING: LinNet: netdev_watchdog:3678: NETDEV WATCHDOG: vmnic5: transmit timed out

2015-08-11T11:14:54.342Z cpu19:33251)<6>ixgbe 0000:41:00.1: vmnic5: Fake Tx hang detected with timeout of 160 seconds

2015-08-11T11:14:54.342Z cpu19:33251)WARNING: LinNet: netdev_watchdog:3678: NETDEV WATCHDOG: vmnic4: transmit timed out

2015-08-11T11:14:54.342Z cpu19:33251)<6>ixgbe 0000:41:00.0: vmnic4: Fake Tx hang detected with timeout of 160 seconds

These repeated over and over, several times a second.  The host locked up again shortly afterwards and had to be rebooted to force the VMs to fail over to other hosts via HA.
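A quick way to see how fast these messages are firing is to tally them by second and by NIC straight out of vmkernel.log.  A rough sketch (it only assumes the line format quoted above):

```python
import re
from collections import Counter

# Matches the ixgbe lines quoted above, e.g.:
# 2015-08-11T11:14:52.338Z cpu23:33245)<6>ixgbe 0000:41:00.0: vmnic4: Fake Tx hang detected ...
HANG_RE = re.compile(r"^(\S+?)\.\d+Z .*?(vmnic\d+): Fake Tx hang")

def tally(lines):
    """Count 'Fake Tx hang' events per (timestamp truncated to the second, vmnic)."""
    counts = Counter()
    for line in lines:
        m = HANG_RE.match(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts

# On a host, something like:
#   for key, n in sorted(tally(open("/var/log/vmkernel.log")).items()):
#       print(key, n)
```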

Both vmnic4 and vmnic5 are ports on the same dual-port Intel X520-2 NIC (the Intel version, not the Dell re-branded one).  We have two of these NICs in each host; the other NIC's ports are vmnic6 and vmnic7.  vmnic4 and vmnic6 go to our LAN, vmnic5 and vmnic7 to our iSCSI network.  These NICs use the ixgbe driver (ethtool reports driver version 3.21.6iov, the latest, with firmware version 0x61c10001).  TSO and LRO are turned off due to issues we had previously.  I spent yesterday upgrading all the firmware on the problem host, but I can't find newer firmware for the Intel X520-2; Dell appears to have a recent release for its version, but that doesn't apply to the Intel-branded cards.
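For anyone wanting to compare setups, the driver/firmware versions and the host-wide TSO/LRO state can be read from an ESXi 6.x shell like this (standard esxcli/ethtool invocations; substitute your own vmnic names):

```sh
# Driver name/version and NIC firmware for one port
esxcli network nic get -n vmnic4

# Same details via ethtool
ethtool -i vmnic4

# Host-wide hardware TSO and LRO toggles (Int Value 1 = enabled)
esxcli system settings advanced list -o /Net/UseHwTSO
esxcli system settings advanced list -o /Net/TcpipDefLROEnabled
```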

The problem host has been back in production with an extremely light load for more than 24 hours so far, and I am increasing the load steadily to see if it eventually bombs again.

Googling "Fake Tx hang detected" turns up a lot of older hits, mostly Linux ixgbe driver issues.  Nothing really VMware-related, and nothing that seems relevant.

Any ideas?  I find it hard to believe the NIC itself suddenly went bad, as this host ran for years without issues until we upgraded to 6.0b.  I have another R910 purchased at the same time that I am wary of upgrading; I can't have two problem hosts, as that would cause capacity issues within our cluster.


Accepted Solutions
cesprov
Enthusiast

When I opened an SR with VMware, I was told there is no workaround and that the only solution was to downgrade to 5.5U2.  Despite that, I found out through other channels that there is a workaround script, which appears to change the handling of CPU interrupts from automatic to manual; the automatic interrupt handling is supposedly the cause of this issue.  Why VMware is handing that script out to some people and not others, I don't know; hopefully it was just that the tech who worked my case didn't know about the script at the time.
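For what it's worth, VMware hasn't published that script, so the following is only my guess at the kind of knob it touches: the VMkernel has a boot-time setting for automatic interrupt balancing, and disabling it (then placing interrupts by hand) would match the "automatic to manual" description.  This is an assumption, not the actual support script; check with VMware support before changing kernel settings:

```sh
# ASSUMPTION: illustrates the described behavior; NOT VMware's actual workaround script.
# Show the current interrupt-balancing kernel setting
esxcli system settings kernel list -o intrBalancingEnabled

# Disable automatic interrupt balancing (takes effect after a reboot)
esxcli system settings kernel set -s intrBalancingEnabled -v FALSE
```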

EDIT:  I should add that since applying that script to our hosts, we haven't seen the issue recur, whereas I had three crashes in the first week.  /knockonwood

4 Replies
cesprov
Enthusiast

After posting this, I found several other threads in this community describing the same "transmit timed out" issue.  Many of them suggest this is a somewhat-known-but-not-yet-publicly-acknowledged problem with ESXi 6, with a fix supposedly forthcoming and the only immediate remedy being a downgrade to 5.5U2.  I have opened a case with VMware to see if I can get more detail.

effex805
Contributor

Exact same issue, especially the "somewhat-known-but-not-yet-publicly-acknowledged" part! :)

BBAVMWARE
Contributor

Same here with an Intel X540.  Two cases already opened for the same issue.
