tedklugman (Contributor)

vMotion fails (often) with powered-on guests

Environment: three ESXi 6.5 hosts. Two have just been upgraded to 6.5 P04 (build 15256549); one is still at 6.5 U2 (build 8294253) but will be upgraded shortly. The vCenter Server Appliance has just been upgraded to 6.5.0.32300.

Hardware is three essentially identical old HP desktop PCs (8300 SFF) with 32GB RAM and 500GB SSDs. Each has a single onboard Intel 82579LM GigE NIC, and that single NIC provides everything: VMware management and vMotion (which share the same IP), guest network connectivity, and Intel AMT for remote management (on a separate IP). Everything is on a single VLAN and subnet. The network is a Cisco 3560G switch. There is no shared storage (other than an NFS mount for ISOs and such), so vMotion is done from local storage on host A to local storage on host B. vMotion speed averages over 900 Mbps, so essentially wire speed. The network is otherwise clean and quiet.

As I was preparing to do the ESXi upgrades, I was vMotioning VMs around so I could power-cycle the hosts. More often than not, running guests would fail to vMotion after a period of time (2-10 minutes): data would be flying along, and then just stop. This continues to happen even after I've upgraded the hosts.

The error usually seen in the vSphere client is: "Failed waiting for data. Error 195887137. Timeout."


After a number of trials, I've dug up what appears to be an indication of the problem: events in vmkernel.log on the source host.

Consistently, the source host reports the following (with nothing significant for minutes before it):

2020-05-19T00:37:29.280Z cpu7:65944)INFO (ne1000): hardware TX hang detected on vmnic0

2020-05-19T00:37:29.280Z cpu7:65944)DEBUG (ne1000): resetting adapter of vmnic0

This is followed by a bunch of log entries that indicate a reset of the network interface. The switch reports the same - the link flaps, and connectivity is lost to the source host and all of its guests for a few seconds. No drops or errors are seen on the switch on either the source or destination port.
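For anyone hunting for the same entries, this is roughly how to pull them out on the host itself (a quick shell sketch; the exact path for the rotated logs can vary with your scratch configuration):

# search the live vmkernel log for the hang/reset messages
grep -i "hardware tx hang" /var/log/vmkernel.log
# check the rotated, compressed copies too
zcat /var/run/log/vmkernel.*.gz 2>/dev/null | grep -i "hardware tx hang"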

The destination reports a timeout error, but about six seconds after the above messages (and yes, the host clocks are in sync via NTP). So the messages above are the FIRST indication that something's failing, although I suspect that if I were doing a packet capture, I'd see the network stream stop a few seconds earlier.
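If I get around to confirming that, something like this should do it. pktcap-uw ships with ESXi, but the flags here are from memory, so double-check them; and the output path is just an example (/tmp on ESXi is a small ramdisk, so a datastore path is safer):

# capture transmit-side traffic on the uplink while a vMotion runs
pktcap-uw --uplink vmnic0 --dir 1 -o /vmfs/volumes/datastore1/vmnic0-tx.pcap
# Ctrl-C to stop; open the pcap in Wireshark and look for where TX traffic stops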

Surprisingly, I haven't been able to find anything that refers specifically to "hardware TX hang detected". I haven't found anything in any logfile (host or guest) that occurs just before that message, which leads me to believe this is a NIC/driver issue. I have some dual-port PCI NICs (also Intel) on order, which would fix the problem IF the issue is related to any of the following:

- Combination with Intel AMT on the same interface

- Combination with too much VMware stuff (management, vmotion, guests) on the same interface

- Overall lousiness of a desktop embedded NIC (but hey, it's Intel?)

But new NICs wouldn't address the issue if there's something I need to tweak in the ne1000 driver, or if my crappy desktop hosts just aren't up to the task.

Sure - this is (mostly) cheap consumer hardware, and it's a hanky-janky environment with no shared storage or separate networks. But it's standard stuff - no Realtek nonsense, for instance.

One thought I had was to try the e1000 driver instead of ne1000, though that sort of seems like a step backwards.
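For reference, checking which driver is actually bound, and forcing the fallback, would look something like this. The module disable is a sketch from memory: it assumes the legacy vmklinux e1000 driver is still present in the 6.5 image, and it needs a reboot to take effect:

# see which driver each vmnic is using
esxcli network nic list
esxcli network nic get -n vmnic0
# disable the native ne1000 module so the host falls back to the legacy driver
esxcli system module set --enabled=false --module=ne1000
# reboot the host for the change to take effect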

Any brighter ideas?

4 Replies
tedklugman (Contributor) [Accepted Solution]

Replying to my own post.

I found a lot of similar experiences on hardware running various flavors of Linux, but none running ESXi. All reported failing NICs with the message "Detected Hardware Unit Hang".

The fix always seemed to be to turn off TSO (TCP Segmentation Offload).

I did this on my three hosts and restarted them.

And since then, so far, I've been moving stuff around for hours with no failures. Anecdotally, it actually looks like things might be going a tiny bit faster.

Advanced Properties -> Net.UseHwTSO = 0
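The equivalent from the command line, for anyone who prefers it (the advanced-option set/list commands are standard; the per-NIC TSO check at the end is optional and may report differently depending on the driver):

# disable hardware TSO host-wide
esxcli system settings advanced set -o /Net/UseHwTSO -i 0
# verify the new value
esxcli system settings advanced list -o /Net/UseHwTSO
# optionally, check per-NIC TSO state
esxcli network nic tso get

I rebooted the hosts afterwards, as above.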

scott28tt (VMware Employee)

Problem solved?


tedklugman (Contributor)

I think so. I'll mark my own reply as correct. I wonder if I get credit for that. 🙂

scott28tt (VMware Employee)

Helps everyone - those experiencing the same or a similar problem can see what you did.

