Huawei Switches vSphere 6.5 Regular Loss of Connectivity using LACP and LBT
This is fixed, just posting to hopefully help others.
I had a client with a pair of Huawei S6720-54C-EI-48S switches, stacked as one logical switch and running software version V200R008C00SPC500 (old, I know), that was losing network connectivity every few minutes. The physical ports always stayed up, but no network traffic was passing (i.e. PINGs to vmk0 and to VMs on the connected VM port groups were dropped).
They had two 10 Gbps ports connected (one from each physical switch) to vmnic0 and vmnic1 of each ESXi 6.5 host. They had a vDS with LACP/LAG configured, and they had also configured LACP Eth-Trunks on the Huawei switches. The LACP load distribution algorithm was set to src-dst-ip on both the vDS and the Huawei Eth-Trunks.
Approximately every 6 minutes (give or take 10 seconds) a continuous PING to a VM would drop for 20-30 seconds and then it would recover. You could vMotion the VM to a new host and get the usual single PING drop and then within a few minutes the PING would drop again, but for 20-30 seconds.
It did not matter whether both vmnics were connected or just one; the dropout pattern was the same.
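For anyone trying to characterise a similar dropout pattern, timing the gaps in a continuous PING is easier with a small script than by eye. The following is a minimal Python sketch and was not part of the original troubleshooting: it assumes you already have one (timestamp, reply received) pair per probe, and find_outages and intervals_between are hypothetical helper names. The sample data is synthetic, shaped like the pattern described above (a drop of about 25 seconds roughly every 6 minutes).

```python
from typing import Iterable, List, Tuple


def find_outages(samples: Iterable[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Return (start, duration) in seconds for each run of failed probes.

    samples: (timestamp_seconds, reply_received) pairs in time order.
    An outage starts at the first failed probe and ends at the next
    successful one. An outage still open at the end of the data is
    reported up to the last sample seen.
    """
    outages = []
    start = None
    last_ts = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                            # first lost probe of a new outage
        elif ok and start is not None:
            outages.append((start, ts - start))   # connectivity is back
            start = None
        last_ts = ts
    if start is not None:
        outages.append((start, last_ts - start))
    return outages


def intervals_between(outages: List[Tuple[float, float]]) -> List[float]:
    """Seconds from the start of each outage to the start of the next."""
    return [b[0] - a[0] for a, b in zip(outages, outages[1:])]


# Synthetic example: one probe per second, lost for 25 s at t=100 and t=460.
samples = [(float(t), not (100 <= t < 125 or 460 <= t < 485)) for t in range(800)]
outages = find_outages(samples)
print(outages)                     # [(100.0, 25.0), (460.0, 25.0)]
print(intervals_between(outages))  # [360.0]
```

A steady interval between outage starts points at a timer somewhere, whereas irregular gaps would suggest something load or traffic dependent.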
We decided to reconfigure from LACP to LBT (load-based teaming) to see if this made any difference. It took quite some time, but we removed the LACP Eth-Trunk configurations from the switches as well as the LACP/LAG from the vDS. We were now running happily with LBT, but if we tested vmnic failover and/or vMotioned a VM, we would get a dropout again, but not for the same duration! We would also get 'random' dropouts on vmk0 once the host had both vmnics up and a few VMs running on it.
The longest drop after a vMotion was just under 20 minutes before it recovered and this led us to look at the Huawei switch again. It turns out that on the Huawei switch the default aging timeout of dynamic ARP entries is 1200 seconds (20 minutes). Once we realised this, we could prove it was an ARP issue by manually clearing the ARP entry for a VM after we had vMotioned it and we would instantly get a PING response back.
The Huawei switches were effectively ignoring the RARP that the ESXi host sends during a vMotion, and we only got a PING response back once the dynamic ARP entry eventually timed out.
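The behaviour described above can be pictured with a toy model. This is a minimal Python sketch under stated assumptions, not Huawei's actual implementation: it assumes the switch's ARP entry caches the egress port alongside the MAC address, that a frame from a moved VM (such as the RARP) updates only the MAC table by default, and that enabling MAC-triggered ARP update copies the new port into the matching ARP entry. ToySwitch and all of its methods are hypothetical names, the IP and MAC values are made up, and the ARP entry-check setting is not modelled.

```python
class ToySwitch:
    """Toy L3 switch: a MAC table plus an ARP table that caches the egress port."""

    def __init__(self, arp_aging=1200, mac_update_arp=False):
        self.arp_aging = arp_aging            # seconds; 1200 matches the default quoted above
        self.mac_update_arp = mac_update_arp  # assumption: models MAC-triggered ARP update
        self.mac_table = {}                   # mac -> port
        self.arp_table = {}                   # ip -> [mac, port, learned_at]

    def learn_arp(self, ip, mac, port, now):
        """Learn a dynamic ARP entry (and the matching MAC entry)."""
        self.mac_table[mac] = port
        self.arp_table[ip] = [mac, port, now]

    def see_frame(self, mac, port):
        """A frame from `mac` arrives on `port`, e.g. the RARP after a vMotion."""
        self.mac_table[mac] = port
        if self.mac_update_arp:
            for entry in self.arp_table.values():
                if entry[0] == mac:
                    entry[1] = port           # follow the MAC to its new port

    def egress_port(self, ip, now):
        """Port that routed traffic to `ip` would use, or None if the entry has aged out."""
        entry = self.arp_table.get(ip)
        if entry is None or now - entry[2] >= self.arp_aging:
            self.arp_table.pop(ip, None)      # aged out; a real switch would re-ARP here
            return None
        return entry[1]


def port_after_vmotion(mac_update_arp, seconds_after_learning):
    """VM is learned behind host1 at t=0, then moves to host2 and sends a RARP."""
    sw = ToySwitch(mac_update_arp=mac_update_arp)
    sw.learn_arp("10.0.0.5", "00:50:56:aa:bb:cc", "host1", now=0)
    sw.see_frame("00:50:56:aa:bb:cc", "host2")
    return sw.egress_port("10.0.0.5", now=seconds_after_learning)


print(port_after_vmotion(False, 60))    # stale port until the entry ages out
print(port_after_vmotion(False, 1200))  # entry expired, next packet triggers a new ARP
print(port_after_vmotion(True, 60))     # entry follows the MAC table straight away
```

In this model the length of the outage is simply whatever was left on the ARP aging timer at the moment of the move, which is consistent with the worst case being just under 20 minutes and with a manual ARP clear restoring connectivity immediately.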
After some digging around the Huawei site/documentation, we found that we needed to apply the "mac-address update arp enable" and "undo arp anti-attack entry-check enable" commands, as we were NOT using an Eth-Trunk for load balancing.
Once both settings were made, we could vMotion VMs and/or fail over vmnics at will with just a single PING drop every time.
The client was more than happy to run with LBT, so we never got to the bottom of the LACP dropouts.
Message was edited by: Martin Cooper
Corrected typos in both commands 😞