I have encountered an issue with VXLAN and LACP, and I'm wondering if anybody else has seen it or can replicate it.
The issue appears to be that the VXLAN network stack does not monitor the LACP port status. If one of the ports goes down, outbound network traffic does not get rerouted to a working link. In my case it affected ARP, which caused all communications to fail. The other, non-VXLAN port groups on the same LACP trunk do not exhibit this problem.
Steps to reproduce the problem (in general):
When an LACP port goes down, I expect only a momentary loss of communications, provided at least one port remains up. What I'm seeing is a permanent loss of communications until the affected port comes back online.
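To make the expected behavior concrete, here is a minimal Python sketch of what I mean by "rerouted to a working link". It is only an illustration, not how ESXi or the VXLAN stack actually selects uplinks. The function name, the port names, and the CRC32 hash are all made up for the example.

```python
import zlib
from typing import Dict, Optional


def pick_uplink(flow_key: str, port_up: Dict[str, bool]) -> Optional[str]:
    """Pick an uplink for a flow, considering only ports that are currently up.

    Hypothetical helper: flow_key stands in for whatever the real
    load-balancing hash uses (MAC, IP, port, ...).
    """
    # Only links that are up are eligible to carry traffic.
    active = sorted(name for name, up in port_up.items() if up)
    if not active:
        return None  # every link is down, nothing can be sent
    # Hash the flow onto one of the remaining active links.
    index = zlib.crc32(flow_key.encode()) % len(active)
    return active[index]


# Example: vmnic1 has gone down, so no flow should be placed on it.
ports = {"vmnic0": True, "vmnic1": False, "vmnic2": True}
for flow in ("arp-request", "vm-a->vm-b", "vm-c->gateway"):
    print(flow, "->", pick_uplink(flow, ports))
```

With this kind of selection, a port going down only costs the time it takes to notice the link state change. What I observed instead looks as if the down port were still in the list of eligible links.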
vCenter, ESXi 6.5U1
Intel X710, Intel 82599EB, Mellanox ConnectX-4 Lx (each tested separately)
The nice people at support had a quick answer for my problem.
In my attempt to plan for the future, on the server side, I had four network ports configured for LACP. The network switch was configured for three ports and three cables were plugged in.
In other places where I've done this, I've never had a problem. It seems that VXLAN does not handle the LACP port count being different on the server vs. the switch.
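For anyone who wants to sanity-check their own setup, the comparison boils down to something like the Python sketch below. It assumes you have already collected the three counts by hand. The function and parameter names are hypothetical, and it does not query vCenter or the switch.

```python
from typing import List


def lag_warnings(host_ports: int, switch_ports: int, cabled_ports: int) -> List[str]:
    """Return warnings for a LAG whose port counts do not line up.

    All three counts are supplied by the caller (hypothetical helper,
    no vCenter or switch API is involved).
    """
    warnings = []
    if host_ports != switch_ports:
        warnings.append(
            "host LAG has %d ports but switch LAG has %d"
            % (host_ports, switch_ports)
        )
    if cabled_ports < host_ports:
        warnings.append(
            "only %d of %d host LAG ports are cabled"
            % (cabled_ports, host_ports)
        )
    return warnings


# My setup: four ports on the server, three on the switch, three cables.
for line in lag_warnings(host_ports=4, switch_ports=3, cabled_ports=3):
    print("WARNING:", line)
```

In my case both warnings fire. Matching the server-side port count to the switch (three and three) is the configuration where neither does.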