I have encountered an issue with VXLAN and LACP, and I'm wondering if anybody else has seen it or can replicate it.
The issue appears to be that the VXLAN network stack does not monitor the LACP port status. If one of the ports goes down, outbound VXLAN traffic is not rerouted to a working link. In my case, it affected ARP, which caused all communications to fail. The other non-VXLAN port groups in the LACP trunk do not exhibit this problem.
Steps to reproduce the problem (in general):
1. Configure a VDS with LACP with 3 trunked ports.
2. Migrate the management, vSAN, and vMotion port groups into the LACP trunk.
3. Configure the cluster for VXLAN using the VDS.
4. Confirm LACP link status with “esxcli network vswitch dvs vmware lacp status get”
5. Confirm proper operation of each port group (ping something from the ESXi host)
6. Unplug one of the LACP ports
7. Repeat steps 4 and 5
8. Plug the port back in
9. Repeat steps 6-8 for each of the remaining ports in the trunk group.
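For anyone trying to replicate this, the two confirmation steps (LACP status, then a ping per port group) could be scripted roughly as below from the ESXi shell. This is only a sketch: the function names are made up for illustration, the vmk interface names and target addresses you pass in are placeholders for your own environment, and it assumes vmkping exits non-zero when the pings get no reply. The `++netstack=vxlan` option is what sends the ping through the VXLAN TCP/IP stack rather than the default one.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; vmk names and targets are placeholders.

# PING_CMD can be overridden for a dry run off-host; on ESXi it defaults to vmkping.
PING_CMD="${PING_CMD:-vmkping}"

# show_lacp: print the current LACP state (the command from the steps above).
show_lacp() {
    esxcli network vswitch dvs vmware lacp status get
}

# check_vmk <vmk> <target> [netstack]
# Sends three pings out of one vmkernel interface and prints OK or FAIL.
# Pass "vxlan" as the third argument to test the VTEP interface.
check_vmk() {
    vmk="$1"
    target="$2"
    stack="$3"
    if [ -n "$stack" ]; then
        $PING_CMD ++netstack="$stack" -I "$vmk" -c 3 "$target" >/dev/null 2>&1
        rc=$?
    else
        $PING_CMD -I "$vmk" -c 3 "$target" >/dev/null 2>&1
        rc=$?
    fi
    if [ "$rc" -eq 0 ]; then
        echo "OK   $vmk -> $target"
    else
        echo "FAIL $vmk -> $target"
    fi
}
```

Typical use would be to call `show_lacp`, then something like `check_vmk vmk0 <gateway>` for management and `check_vmk vmk3 <remote VTEP> vxlan` for VXLAN, before and after pulling each cable (vmk0 and vmk3 are only example names).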
When an LACP port goes down, I expect only a momentary loss of communications, as long as at least one port remains up. What I'm seeing is a permanent loss of communications until the affected port comes back online.
Software:
NSX 6.3.3
vCenter, ESXi 6.5U1
Network Hardware:
Brocade VDX6740
Intel X710, Intel 82599EB, Mellanox ConnectX-4 Lx (each tested separately)
Thanks,
Erik