VMware Networking Community
iforbes
Hot Shot

VXLAN interfaces intermittently disconnect

Hi. This is an odd issue. I've been noticing lately that the ESXi host running the DLR and/or ESG control VMs will intermittently have only its VXLAN interfaces disconnected. No other interfaces on that ESXi host are affected, and if a host doesn't house those control VMs, it has no issues. I can't figure out what is causing this unusual behaviour, but it's not good, as it causes a bunch of problems. Since it's not an ESXi failure (just specific network interfaces going down), HA doesn't kick in to migrate those VMs to another host. So VMs on the affected host just sit there until I'm alerted (e.g. "network interface redundancy lost"), and then I vMotion the VMs away from that host. A reboot of the affected ESXi host resolves the problem and the interfaces are magically back up.

My servers are Cisco UCS blades, and all interfaces are created as vNICs in UCSM and presented to ESXi as vmnics. As mentioned, no other vmnics on the ESXi host are affected.
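In case it's useful to anyone following along, here is a rough pyVmomi sketch I've been using to spot this faster than waiting for the redundancy-lost alarm. The vCenter address, credentials and the vmnic names dedicated to VXLAN/VTEP are placeholders for my lab, so adjust them for your environment.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

VCENTER = "vcenter.lab.local"                      # placeholder
USER = "administrator@vsphere.local"               # placeholder
PWD = "changeme"                                   # placeholder
VTEP_VMNICS = {"vmnic2", "vmnic3"}                 # the uplinks dedicated to VXLAN/VTEP in my lab

ctx = ssl._create_unverified_context()             # lab only; self-signed certs
si = SmartConnect(host=VCENTER, user=USER, pwd=PWD, sslContext=ctx)
try:
    content = si.RetrieveContent()
    hosts = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in hosts.view:
        for pnic in host.config.network.pnic:
            if pnic.device not in VTEP_VMNICS:
                continue
            # linkSpeed is None when the physical link is down
            state = "UP" if pnic.linkSpeed else "DOWN"
            print("%-25s %-8s %s" % (host.name, pnic.device, state))
finally:
    Disconnect(si)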

0 Kudos
12 Replies
hansroeder
Enthusiast

What version of NSX are you currently running?

Also, my suggestion would be to open up a Service Request with VMware, since this sounds pretty serious.

0 Kudos
iforbes
Hot Shot

Running 6.3.0 (build 5007049). It's deployed in a lab, so it's not affecting production. It's a big enough issue to be a roadblock to a production deployment, though.

0 Kudos
bayupw
Leadership

Hi, when you say VXLAN interfaces, are you referring to VXLAN portgroups, the VTEP vmkernel interfaces, or something else?

Could you explain a bit more about this?

Do you have any dynamic routing configured?
Do you have vPC between the Fabric Interconnects and the upstream physical switches?

When designing NSX + UCS, I find these three design guides very helpful:

NSX+Cisco Nexus 7000/UCS Design Guide

Reference Design: Deploying NSX with Cisco UCS and Nexus 9000 Infrastructure

https://www.vce.com/asset/documents/vxblock-nsx-6-1-4-architecture-overview.pdf

Bayu Wibowo | VCIX6-DCV/NV
Author of VMware NSX Cookbook http://bit.ly/NSXCookbook
https://github.com/bayupw/PowerNSX-Scripts
https://nz.linkedin.com/in/bayupw | twitter @bayupw
0 Kudos
iforbes
Hot Shot

Hi. My VXLAN interfaces use the same physical uplinks as the VTEP interfaces. They are two dedicated physical uplinks in an active/standby NIC team. Yes, I do have OSPF configured between my DLR and ESG, and from the ESG towards the physical core, but I don't have OSPF configured on the core itself yet. And yes, vPC is configured between the FIs and the core.
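For completeness, this is how I'm double-checking the teaming policy on those portgroups from pyVmomi rather than clicking through the vDS UI. A rough sketch only: the vCenter details are placeholders, and I'm assuming the portgroups use the standard VMwareDVSPortSetting you get on a vDS.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()             # lab only
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="changeme", sslContext=ctx)  # placeholders
try:
    content = si.RetrieveContent()
    pgs = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.dvs.DistributedVirtualPortgroup], True)
    for pg in pgs.view:
        policy = pg.config.defaultPortConfig.uplinkTeamingPolicy
        if policy is None or policy.uplinkPortOrder is None:
            continue
        print(pg.name)
        print("  policy :", policy.policy.value)    # expect 'failover_explicit'
        print("  active :", list(policy.uplinkPortOrder.activeUplinkPort))
        print("  standby:", list(policy.uplinkPortOrder.standbyUplinkPort))
finally:
    Disconnect(si)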

0 Kudos
iforbes
Hot Shot

So, it definitely has something to do with the NSX side. While testing multi-tenancy I had created an additional DLR and ESG. Once I deleted those from the environment, everything became stable again. No idea why, and it's a bit concerning that additional instances of those would cause issues, but things are back to being stable.

0 Kudos
bayupw
Leadership

Is this a new setup? Any IP conflicts?
How many VTEPs do you have, and what load balancing policy do you use for them?
Have you tested that the load balancing policy and failover for the VTEPs work properly?

As per the design guides in my earlier reply, some physical switches don't support routing over vPC, and you need a non-vPC link for north-south routing.

But you mentioned that you haven't configured any routing to the physical core router yet, so this is probably not the issue.

I had a similar issue involving the UCS vNIC and pinning configuration and the physical network configuration.

For example, vmnic0 pinned to the first Fabric Interconnect and vmnic1 pinned to the second Fabric Interconnect.

In my case, due to a misconfiguration, vmnic0 couldn't talk to vmnic1. So everything worked normally, but as soon as an ESXi host was using vmnic1, the VTEPs couldn't communicate.

It was fixed by redesigning the vNICs and the physical network, plus a reconfiguration.

But that was on NSX 6.2, not NSX 6.3.
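If you want to rule out that kind of pinning problem, a quick test is a VTEP-to-VTEP vmkping over the vxlan netstack from each host while you fail the uplinks over. Here is a rough Python/paramiko sketch; the host names, VTEP IPs, credentials and the VTEP vmk name are placeholders, so change them for your lab.

import paramiko

# host -> its VTEP IP (placeholders)
HOSTS = {
    "esxi-01.lab.local": "10.10.10.11",
    "esxi-02.lab.local": "10.10.10.12",
    "esxi-03.lab.local": "10.10.10.13",
}
USER, PWD = "root", "changeme"       # placeholders
VTEP_VMK = "vmk1"                    # placeholder; check with esxcfg-vmknic -l

for host in HOSTS:
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(host, username=USER, password=PWD)
    for peer, vtep_ip in HOSTS.items():
        if peer == host:
            continue
        # -d -s 1572 also verifies the jumbo MTU path end to end
        cmd = "vmkping ++netstack=vxlan -I %s -d -s 1572 -c 3 %s" % (VTEP_VMK, vtep_ip)
        _, out, _ = ssh.exec_command(cmd)
        result = out.read().decode()
        ok = " 0% packet loss" in result
        print("%s -> %s (%s): %s" % (host, peer, vtep_ip, "OK" if ok else "FAIL"))
    ssh.close()

Run it once with both fabrics active, then again after failing over to the other uplink; if it only fails on one fabric, the pinning or the upstream L2 path is the problem.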

Please update if you found the root cause.

Even if it is a lab (not production), as long as you have the license and support I believe you can still open a support request with VMware Support, though probably at normal Severity 3 or maybe 2.

Bayu Wibowo | VCIX6-DCV/NV
Author of VMware NSX Cookbook http://bit.ly/NSXCookbook
https://github.com/bayupw/PowerNSX-Scripts
https://nz.linkedin.com/in/bayupw | twitter @bayupw
0 Kudos
iforbes
Hot Shot

So, it's still happening, but at least I've narrowed it down. It's 100% the DLR control VM that, for some reason, causes the interfaces I've dedicated to VXLAN/VTEP to go DOWN. In my setup I have a dedicated vDS with two physical uplinks dedicated to VTEP/VXLAN traffic. The two uplinks are in an active/standby NIC team (using explicit failover order). Something happens when this DLR VM resides on an ESXi server: after a period of time, the ESXi server loses network redundancy because at least one of the two uplinks gets marked as DOWN. After some more time the other uplink also gets marked DOWN, and then network connectivity is lost since both interfaces are down.

Could it be that some sort of traffic coming from this VM is flooding the physical interface, causing the switch port to be marked as down? When I reboot the ESXi server, the interfaces come back. If I migrate the VM to another ESXi server, after a period of time the exact same thing happens there. Is there a way I can figure out why this is happening?
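In case it helps anyone reproduce what I'm seeing: I've been pulling the recent link state messages for those two uplinks straight off the affected host over SSH, with a quick sketch like the one below (host, credentials and vmnic names are placeholders for my lab).

import paramiko

HOST, USER, PWD = "esxi-01.lab.local", "root", "changeme"   # placeholders
VTEP_VMNICS = ["vmnic2", "vmnic3"]                          # placeholders

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(HOST, username=USER, password=PWD)
for nic in VTEP_VMNICS:
    # vobd.log and vmkernel.log both record physical link state changes
    cmd = ("grep -hi '%s' /var/log/vobd.log /var/log/vmkernel.log "
           "| grep -i link | tail -n 20" % nic)
    _, out, _ = ssh.exec_command(cmd)
    print("=== %s ===" % nic)
    print(out.read().decode())
ssh.close()

That at least tells me whether the host thinks the link physically dropped or whether something upstream is blocking the port.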

0 Kudos
bayupw
Leadership

Do you have any bridging configured?

I had an issue with the DLR control VM when HA and bridging were both enabled.

The issue was that the DLR control VMs got into a split-brain scenario and advertised duplicate MAC addresses throughout the network.

I could also see duplicate MAC errors in the physical switch logs.
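One way to see it from the physical side is to ask the switch where it is learning the MAC that shows up in those errors. Below is a small netmiko sketch; the switch details, credentials and the MAC address are placeholders (use the MAC flagged in your switch logs), and I'm assuming NX-OS here.

from netmiko import ConnectHandler

switch = {
    "device_type": "cisco_nxos",       # placeholder; adjust for your platform
    "host": "core-sw-01.lab.local",    # placeholder
    "username": "admin",               # placeholder
    "password": "changeme",            # placeholder
}
MAC = "0050.56aa.bbcc"                 # placeholder; use the MAC from the switch log errors

conn = ConnectHandler(**switch)
# If the same MAC keeps moving between the ports facing two different ESXi
# hosts, that points at the split-brain scenario.
print(conn.send_command("show mac address-table address %s" % MAC))
print(conn.send_command("show logging last 50"))
conn.disconnect()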

Not sure if you are having the same issue; open an SR with VMware Support if you can reproduce it.

Bayu Wibowo | VCIX6-DCV/NV
Author of VMware NSX Cookbook http://bit.ly/NSXCookbook
https://github.com/bayupw/PowerNSX-Scripts
https://nz.linkedin.com/in/bayupw | twitter @bayupw
0 Kudos
iforbes
Hot Shot

Hi Bayu. Yes, I have bridging deployed, and dual control VMs in active/passive. I'll open a case, but how did you resolve it? Is there an easy way to destroy the passive DLR node?

0 Kudos
bayupw
Leadership

In my case it was on NSX 6.1.x.

The customer decided to remove NSX bridging and not extend the physical L2 VLAN to VXLAN.

Later on we found that there was a bug in that particular version which should be solved by upgrading to a newer version.

But the customer didn't upgrade and instead removed NSX bridging from their environment.

It's worth checking with VMware Support/GSS to see whether you've hit a known issue or something else.

Bayu Wibowo | VCIX6-DCV/NV
Author of VMware NSX Cookbook http://bit.ly/NSXCookbook
https://github.com/bayupw/PowerNSX-Scripts
https://nz.linkedin.com/in/bayupw | twitter @bayupw
0 Kudos
iktech00
Contributor

I've got a customer who's experiencing similar issues: HA DLR VMs end up in split-brain due to an external STP event. The hosts that have the DLR control VMs on them seem to constantly cause switch ports to be blocked because BPDU guard is triggered. The only way to stop it is to kill one of the HA nodes (in one instance, this had to be done on all tenant HA DLR VMs). VMware says the control VMs don't take part in STP, but the only way we seem to be able to resolve this issue is by killing the HA VM node. The customer is running NSX 6.4.1. We've already raised this with VMware and they said the issue was not related to NSX, however I'm struggling to understand why the environment stabilises when we kill the HA control VMs... any ideas?
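One thing I've been checking in parallel is whether the guest BPDU filter (the Net.BlockGuestBPDU advanced setting) is enabled on the hosts, since with BPDU guard on the upstream ports any BPDU leaking out of a guest will err-disable the port. This read-only pyVmomi sketch just reports the setting per host; the vCenter details are placeholders.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()             # lab/test only
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="changeme", sslContext=ctx)  # placeholders
try:
    content = si.RetrieveContent()
    hosts = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in hosts.view:
        for opt in host.configManager.advancedOption.QueryOptions("Net.BlockGuestBPDU"):
            # 1 = drop BPDUs coming from guests, 0 = pass them through
            print("%-25s %s = %s" % (host.name, opt.key, opt.value))
finally:
    Disconnect(si)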

0 Kudos
Nick_Andreev
Expert

Hi @iforbes,

What switch brand are you using? Check the switch logs; if the switch is shutting ports down, you will see it there.

If it's a split-brain / duplicate MAC issue, a simple test in your case would be to disable HA on your DLR CVM. You can do that under Manage > Settings > High Availability: click Edit next to High Availability Configuration and disable HA.

NSX Manager will delete the second CVM appliance and the CVM will run in non-HA mode.

[screenshot: High Availability Configuration settings]
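If you prefer to script it, you can also pull the current HA config from the NSX API before and after the change to confirm what's actually deployed. A minimal sketch below; the NSX Manager address, credentials and edge ID are placeholders, and the endpoint is the one I remember from the NSX-v API guide, so double-check it for your version.

import requests

NSX_MGR = "nsxmgr.lab.local"           # placeholder
EDGE_ID = "edge-1"                     # placeholder; your DLR's edge ID
AUTH = ("admin", "changeme")           # placeholders

url = "https://%s/api/4.0/edges/%s/highavailability/config" % (NSX_MGR, EDGE_ID)
resp = requests.get(url, auth=AUTH, verify=False)   # lab only; self-signed cert
resp.raise_for_status()
print(resp.text)   # XML; check the HA 'enabled' flag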

---
If you found my answers helpful please consider marking them as helpful or correct.
VCIX-DCV, VCIX-NV, VCAP-CMA | vExpert '16, '17, '18
Blog: http://niktips.wordpress.com | Twitter: @nick_andreev_au
0 Kudos