10 Replies Latest reply on Mar 28, 2017 1:13 AM by Bayu Wibowo

    VXLAN interfaces intermittently disconnect

    iforbes Hot Shot

      Hi. This is an odd issue. Lately I've noticed that the ESXi server hosting the DLR and/or ESG control VMs will intermittently have only its VXLAN interfaces disconnected. No other interfaces on that ESXi server are affected, and ESXi servers that don't host those control VMs have no issues. I can't figure out what is causing this, and it creates a number of problems. Since it's not an ESXi failure (just specific network interfaces going down), HA doesn't kick in to migrate the VMs to another host. The VMs on the affected host just sit there until I'm alerted (i.e. network interface redundancy lost), and then I vMotion them away from the host. Rebooting the affected ESXi host resolves the problem and the interfaces come back up.

      My servers are Cisco UCS blades, and all interfaces are created as vNICs in UCSM and presented to ESXi as vmnics. As mentioned, no other vmnics on the ESXi host are affected.

        • 1. Re: VXLAN interfaces intermittently disconnect
          Hans Roeder Enthusiast

          What version of NSX are you currently running?

          Also, I'd suggest opening a Service Request with VMware, since this sounds pretty serious.

          • 2. Re: VXLAN interfaces intermittently disconnect
            iforbes Hot Shot

            Running 6.3.0.5007049. It's deployed in a lab, so it's not affecting production. It's a big enough issue to be a roadblock to production deployment, though.

            • 3. Re: VXLAN interfaces intermittently disconnect
              Bayu Wibowo Master
              User Moderators, Community Warriors

              Hi, when you say VXLAN interfaces, are you referring to VXLAN port groups, the VTEP vmkernel interfaces, or something else? Could you explain more about this?

              Do you have any dynamic routing configured?
              Do you have vPC between the Fabric Interconnects and the upstream physical switches?

              When designing NSX + UCS, I find these three design guides very helpful:

              NSX+Cisco Nexus 7000/UCS Design Guide

              Reference Design: Deploying NSX with Cisco UCS and Nexus 9000 Infrastructure

              https://www.vce.com/asset/documents/vxblock-nsx-6-1-4-architecture-overview.pdf

              Bayu Wibowo | vExpert NSX, VCIX6-DCV/NV, Cisco Champion, AWS-SAA
              Author of VMware NSX Cookbook http://bit.ly/NSXCookbook
              https://nz.linkedin.com/in/bayupw | twitter @bayupw
              • 4. Re: VXLAN interfaces intermittently disconnect
                iforbes Hot Shot

                Hi. My VXLAN interfaces use the same physical uplinks as the VTEP interfaces: 2 dedicated physical uplinks in an active/standby NIC team. Yes, I do have OSPF configured between my DLR and ESG, and from the ESG to the physical core. I don't have OSPF configured on the core yet. Yes, vPC is configured between the FIs and the core.

                • 5. Re: VXLAN interfaces intermittently disconnect
                  iforbes Hot Shot

                  So, it definitely had something to do with the NSX side. While testing multi-tenancy I had created an additional DLR and ESG. After I deleted those from the environment, everything became stable again. I have no idea why, and it's a bit concerning that additional instances would cause issues, but things are stable again.

                  • 6. Re: VXLAN interfaces intermittently disconnect
                    Bayu Wibowo Master

                    Is this a new setup? Any IP conflicts?
                    How many VTEPs do you have, and what load balancing policy do you use for the VTEPs?
                    Have you tested that the load balancing policy and failover for the VTEPs work properly?

                    As per the design guides in my earlier reply, some physical switches don't support routing over vPC, and you need a non-vPC link for the North-South routing.

                    But you mentioned that you haven't configured any routing to the physical core router, so I think this is probably not the issue.

                     

                    I had a similar issue involving UCS vNICs, pinning configuration, and the physical network configuration.

                    For example, vmnic0 was pinned to the first Fabric Interconnect and vmnic1 to the second Fabric Interconnect.

                    In my case, due to a misconfiguration, vmnic0 couldn't talk to vmnic1. Everything worked normally, but when an ESXi host was using vmnic1, the hosts couldn't communicate.

                    It was fixed by redesigning and reconfiguring the vNICs and the physical network.

                    But that was on NSX 6.2, not NSX 6.3.
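
                    To make the pinning scenario above concrete, here is a minimal Python sketch. It only models the logic and does not query UCSM or ESXi. The host names, the `PINNING` map, and the `inter_fabric_l2_ok` flag are all hypothetical. The assumption is that two VTEPs can reach each other if their active uplinks are pinned to the same fabric, or if the upstream network carries the transport VLAN between both fabrics.

```python
from itertools import combinations

# Hypothetical pinning map: which Fabric Interconnect each vmnic is pinned to.
PINNING = {"vmnic0": "FI-A", "vmnic1": "FI-B"}


def broken_pairs(active_uplink, inter_fabric_l2_ok):
    """Return host pairs whose VTEPs cannot reach each other.

    active_uplink: dict of host name -> vmnic currently carrying VTEP traffic.
    inter_fabric_l2_ok: True if the upstream switches carry the transport
        VLAN between both fabrics (an assumption of this model).
    """
    broken = []
    for a, b in combinations(sorted(active_uplink), 2):
        same_fabric = PINNING[active_uplink[a]] == PINNING[active_uplink[b]]
        # Cross-fabric traffic has to go through the upstream network.
        if not same_fabric and not inter_fabric_l2_ok:
            broken.append((a, b))
    return broken


# Hypothetical example: esx3 has failed over to vmnic1.
hosts = {"esx1": "vmnic0", "esx2": "vmnic0", "esx3": "vmnic1"}
print(broken_pairs(hosts, inter_fabric_l2_ok=False))
```

                    With a healthy upstream path (`inter_fabric_l2_ok=True`) the same call returns an empty list, which matches the "worked normally until a host used vmnic1" symptom.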

                     

                    Please post an update if you find the root cause.

                    Even if it is a lab (not production), as long as you have the license and support, I believe you can still open a support request with VMware Support, though probably at normal Severity 3 or maybe 2.

                    • 7. Re: VXLAN interfaces intermittently disconnect
                      iforbes Hot Shot

                      So, it's still happening, but at least I've narrowed it down. It's 100% the DLR control VM that for some reason causes the interfaces I've dedicated to VXLAN/VTEP to go DOWN. In my setup I have a dedicated vDS with 2 physical uplinks dedicated to VTEP/VXLAN traffic. The 2 uplinks are in an active/standby NIC team (explicit failover order). Something happens when this DLR VM resides on an ESXi server. After a period of time, the ESXi server loses network redundancy because at least one of the 2 uplinks is marked as DOWN. After some more time the other interface also gets marked DOWN, and then network connectivity is lost since both interfaces are down.

                       

                      Could it be some sort of traffic coming from this vm is flooding the physical interface causing the switch port to get marked as down? When I reboot the ESXi server, the interfaces come back. If I migrate the vm to another ESXi server, after a period of time the exact same thing happens. Is there a way I can figure out why this is happening?
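
                      One way to start narrowing down the timing would be to snapshot the uplink state periodically and log when redundancy is lost. Below is a rough Python sketch of that idea. It parses text laid out like the output of `esxcli network nic list`; the column order (name first, link status fifth) is an assumption, and the sample rows, vmnic numbers, and driver name are purely illustrative, not taken from this environment.

```python
# Illustrative sample only; not real output from the affected host.
SAMPLE = """\
Name    PCI Device    Driver  Admin Status  Link Status  Speed  Duplex  MAC Address        MTU   Description
------  ------------  ------  ------------  -----------  -----  ------  -----------------  ----  -----------
vmnic0  0000:06:00.0  enic    Up            Up           10000  Full    00:25:b5:00:00:0a  1500  Cisco VIC
vmnic4  0000:0a:00.0  enic    Up            Down         0      Half    00:25:b5:00:00:0e  9000  Cisco VIC
vmnic5  0000:0b:00.0  enic    Up            Up           10000  Full    00:25:b5:00:00:0f  9000  Cisco VIC
"""


def down_uplinks(nic_list_text, watched):
    """Return the watched vmnics whose link status field reads 'Down'."""
    down = []
    for line in nic_list_text.splitlines():
        fields = line.split()
        # Data rows start with the NIC name; header and separator rows do not.
        if len(fields) < 5 or not fields[0].startswith("vmnic"):
            continue
        name, link_status = fields[0], fields[4]
        if name in watched and link_status.lower() == "down":
            down.append(name)
    return down


def redundancy_state(down, watched):
    """Classify the team: ok, redundancy lost, or connectivity lost."""
    if not down:
        return "ok"
    if len(down) < len(watched):
        return "redundancy lost"
    return "connectivity lost"


vxlan_uplinks = {"vmnic4", "vmnic5"}  # hypothetical VXLAN/VTEP uplinks
down = down_uplinks(SAMPLE, vxlan_uplinks)
print(down, redundancy_state(down, vxlan_uplinks))
```

                      Timestamps from a loop like this could then be lined up against the switch and host logs for the same period.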

                      • 8. Re: VXLAN interfaces intermittently disconnect
                        Bayu Wibowo Master

                        Do you have any bridging configured?

                        I had an issue with DLR control VMs with HA and bridging.

                        The issue was that the DLR control VMs were in a split-brain scenario and advertising a duplicate MAC address throughout the network.

                        I could also see duplicate MAC errors in the physical switch logs.
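
                        As a rough illustration of what to look for, the Python sketch below flags MAC addresses that show up behind more than one switch port. The `(mac, port)` observations are hypothetical and would have to be collected from your own switch MAC tables or logs; this is not a parser for any particular switch output. Keep in mind that a MAC appearing on two ports is not always an error (a vMotion legitimately moves a MAC), so treat hits as candidates to investigate.

```python
from collections import defaultdict


def macs_on_multiple_ports(observations):
    """Return {mac: sorted list of ports} for MACs seen on more than one port.

    observations: iterable of (mac, port) tuples, e.g. gathered from
    switch MAC address tables (collection is outside this sketch).
    """
    ports_by_mac = defaultdict(set)
    for mac, port in observations:
        # Normalize case so the same MAC is not counted twice.
        ports_by_mac[mac.lower()].add(port)
    return {
        mac: sorted(ports)
        for mac, ports in ports_by_mac.items()
        if len(ports) > 1
    }


# Hypothetical observations: the first MAC is seen behind two ports.
seen = [
    ("00:50:56:AA:BB:01", "Eth1/1"),
    ("00:50:56:aa:bb:01", "Eth1/2"),
    ("00:50:56:aa:bb:02", "Eth1/1"),
]
print(macs_on_multiple_ports(seen))
```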

                        I'm not sure whether you are hitting the same issue. Open an SR with VMware Support if you can reproduce it.

                        • 9. Re: VXLAN interfaces intermittently disconnect
                          iforbes Hot Shot

                          Hi Bayu. Yes, I have bridging deployed, and dual control VMs in active/passive. I'll open a case, but how did you resolve it? Is there an easy way to destroy the passive DLR node?

                          • 10. Re: VXLAN interfaces intermittently disconnect
                            Bayu Wibowo Master

                            In my case it was on NSX 6.1.x.

                            The customer decided to remove NSX bridging and not extend the physical L2 VLAN to VXLAN.

                            Later on we found that there was a bug in that particular version, which should have been solved by upgrading to a newer version.

                            But the customer didn't upgrade and instead removed NSX bridging from their environment.

                            It's worth checking with VMware Support/GSS to see whether you've hit a known issue or something else.
