We recently upgraded from 6.7 to 7.0.3 (build 19482537), and we had never had any similar problems with vMotion before. Normally, when a network failure affects the ESXi hosts, they go back to normal as soon as the Cisco ports or the entire network environment re-balances.
Yesterday, after such a network incident, we had problems vMotioning several VMs onto two ESXi hosts. I looked through the vpxa, hostd and vmkernel logs and found:
Failed waiting for data. Error 195887167. Connection closed by remote host, possibly due to timeout.
VMotionStream [-1407778881:4151649780786036937] failed to read stream keepalive: Connection closed by remote host, possibly due to timeout
cpu34:2591196)WARNING: Migrate: 6460: 4151649780786036937 D: Migration considered a failure by the VMX. It is most likely a timeout, but check the VMX log for the true error.
There are also a lot of entries such as:
Cannot open file "/vmfs/volumes/5ed12ccc-e4651386-16a9-bc97e148c8ec/VMXXX/VMXXX.vmx": Device or resource busy
or:
il3: 4994: Lock failed on file: VMXXX.vmx on vol 'ST0CML1-VMFS2' with FD: <FD c57 r1>
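In case it helps anyone compare hosts, here is a rough Python sketch for sorting log lines into the symptoms quoted above and grouping them by migration ID. The patterns are copied from my excerpts only, the names (`SYMPTOMS`, `classify`) are made up for illustration, and the assumption that the migration ID is a 16-20 digit number is mine, so treat it as a starting point rather than a proper parser.

```python
import re
import sys
from collections import defaultdict

# Assumption: migration IDs appear as long decimal numbers (16-20 digits),
# as in the excerpts above. Other ESXi builds may log them differently.
MIGRATION_ID = re.compile(r"\b(\d{16,20})\b")

# Symptom patterns copied from the quoted log lines; not an exhaustive list.
SYMPTOMS = {
    "keepalive_timeout": re.compile(r"failed to read stream keepalive"),
    "vmx_failure": re.compile(r"Migration considered a failure by the VMX"),
    "file_busy": re.compile(r"Device or resource busy"),
    "lock_failed": re.compile(r"Lock failed on file"),
}


def classify(lines):
    """Group lines by symptom, and collect the symptoms seen per migration ID."""
    by_symptom = defaultdict(list)
    by_migration = defaultdict(set)
    for line in lines:
        for name, pattern in SYMPTOMS.items():
            if pattern.search(line):
                by_symptom[name].append(line.rstrip())
                match = MIGRATION_ID.search(line)
                if match:
                    by_migration[match.group(1)].add(name)
    return by_symptom, by_migration


if __name__ == "__main__":
    # Usage: python classify_logs.py vmkernel.log hostd.log ...
    collected = []
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            collected.extend(handle)
    symptoms, migrations = classify(collected)
    for name, hits in symptoms.items():
        print(f"{name}: {len(hits)} line(s)")
    for migration_id, names in migrations.items():
        print(f"migration {migration_id}: {', '.join(sorted(names))}")
```

Running it over copies of the logs from a failing host and from a freshly rebooted one should at least show whether the lock errors only appear on the hosts that went through the network incident.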
Based on some Cisco log entries, I decided to replace the SFP modules in one ESXi host (and also replaced the corresponding module in the Cisco switch), but I was still unable to vMotion any VMs.
So far the only workaround seems to be a reboot: after the reboot, the vMotion problems are gone, which suggests there are no configuration problems (MTU mismatch, etc.). Not a single VM gets stuck at 20% any more when being moved onto another host. Maybe there's a bug in 7.0.3?
The VMware cluster environment consists of ESXi 6.7.0, 16075168 hosts, vCenter 6.7.0 Build 16046713, and vSphere Client version 184.108.40.206000