5 Replies Latest reply on May 9, 2019 1:40 PM by pbhite

    NSX with LACP/LAG - Intermittent Connectivity Loss

    pbhite Novice

      We are working on an NSX deployment across two sites and three clusters: cross-vCenter, with a UDLR controller at each site. All clusters use a vDS with LAGs on the relevant interfaces to a Juniper QFX top-of-rack switch. We are seeing issues where VMs are intermittently unable to egress traffic. There is an HA pair of ESGs at each site for this purpose, configured with local egress. When it works, it works correctly: traffic egresses at the right spots. VMs on the same segment can always ping each other, regardless of where they are, but certain ones can't egress via either ESG. The furthest they can ping is the uplink interface of the UDLR. If we vMotion a VM to a new host, it might start working, but it's entirely random: sometimes we vMotion from host A to host B and egress starts working, then we vMotion back to A and it keeps working, then we vMotion to host B again and it stops working. The issue occurs at both sites/clusters.

       

      At one point earlier in the deployment we couldn't ping half the VTEP interfaces and found that the cause was the load balancing policy that the NSX-created port groups were using. Fixing that brought them all up consistently. I still suspect there is some LACP/LAG issue here, but I'm not certain. I've read through a few guides on NSX configuration with vDS, but some of the terminology that is used interchangeably has me confused. Maybe someone can point out what we might want to change in our load balancing policies here:

       

      Site 1:

      Cluster A / DVS A

      • Version 6.5
      • LACP Enabled / Active (v1)
      • Load balancing for all port groups set to "Route based on originating virtual port"

      Cluster B / DVS B (Management/Edge)

      • Version 6.5
      • LACPv2/Enhanced LACP
      • Load balancing mode: Source and destination IP address, TCP/UDP port and VLAN
      • Edge gateways for Site 1 live here, VMs do not

       

      Site 2:

      Cluster C / DVS C

      • Version 6.7
      • LACPv2/Enhanced LACP
      • Load balancing mode: Source and destination IP address, TCP/UDP port and VLAN (tried swapping to virtual port)
      • Edge gateways and VMs coexist here
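
      As a rough illustration of the two load balancing policy names that appear in the lists above, the sketch below contrasts per-port pinning ("Route based on originating virtual port") with per-flow hashing ("Source and destination IP address, TCP/UDP port and VLAN"). It is a toy model only: the uplink names, the port ID, the IP addresses, and the use of CRC32 and a modulo are assumptions made for illustration, and it does not reproduce the actual ESXi or Juniper QFX selection logic.

```python
import zlib

# Hypothetical two-link team; the names are placeholders, not real vmnic names.
UPLINKS = ["uplink1", "uplink2"]


def pick_by_virtual_port(port_id, uplinks=UPLINKS):
    """Toy model of port-based teaming.

    Every frame from a given virtual port is pinned to the same uplink.
    The modulo is an illustrative assumption, not the real algorithm.
    """
    return uplinks[port_id % len(uplinks)]


def pick_by_flow_hash(src_ip, dst_ip, src_port, dst_port, vlan, uplinks=UPLINKS):
    """Toy model of hash-based LAG member selection.

    The uplink is chosen per flow from a hash of the header fields.
    CRC32 is a stand-in; the real hash functions are not reproduced here.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{vlan}".encode()
    return uplinks[zlib.crc32(key) % len(uplinks)]


if __name__ == "__main__":
    # One VM (hypothetical virtual port 7) talking to several destination ports.
    for dst_port in (53, 80, 443, 8080):
        pinned = pick_by_virtual_port(7)
        hashed = pick_by_flow_hash("10.0.0.10", "192.0.2.1", 40000, dst_port, 100)
        print(f"dst port {dst_port}: port-based={pinned} flow-hash={hashed}")
```

      In this model the port-based choice never changes for a given VM port, while the hash-based choice can differ from one flow to the next.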