VMware Networking Community
Czernobog
Expert

NSX-T 3 - overlay tunnel degraded after switch from N-VDS to VDS 7 - separating endpoint and overlay

Last week I wanted to migrate workloads from the N-VDS to the VDS7, following a vSphere 7 upgrade.

The ESXi hosts in my test environment have 4 NICs each; the initial configuration was vmnic0 + vmnic1 on the VDS and vmnic2 + vmnic3 on the N-VDS.

The migration and the new uplink assignment completed successfully; at least I did not find an error in the uplink profile configuration of the transport nodes. First vmnic0 and vmnic1 were used in the profile; after the new VMkernel adapters were online, I assigned the hosts' vmnic2 and vmnic3 to the VDS.

What immediately became obvious was that the overlay tunnel status switched to degraded. Falling back to the N-VDS configuration remediated the issue. I checked my configuration and looked for clues in the documentation, but could not find a solution.

Finally I came upon this blog post, where the summary explains the behaviour well:

"So when you're using an N-VDS or VDS for NSX-T and you're placing an Edge on the same switch, you have to put the Edge overlay in a different subnet. The Geneve traffic that originates from the Edge is not allowed to pass a switch that's hosting a tunnel endpoint for ESXi (VMK10)."

Following this advice, I created a new VDS with only one port group for the overlay traffic and connected two vmnics of the host to it. After this, the NSX Edge's NIC dedicated to the overlay tunnel connection was attached to this new port group, and the tunnel was re-established.
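
To verify the tunnel state after a change like this, the NSX-T API is an option besides the UI; a minimal sketch with curl, assuming a hypothetical manager hostname and transport node UUID (the response lists each tunnel with its status):

curl -k -u admin https://nsx-mgr.lab.local/api/v1/transport-nodes/<node-uuid>/tunnels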

This poses a problem however, because in production I have hosts with only two NICs. This means I would have to split the vmnics between two distributed switches, which would result in a non-redundant setup.

Is there a way to separate the overlay and endpoint traffic while still placing all vmkernels on the same VDS?

19 Replies
Czernobog
Expert

Ah, nice, I've just downloaded the upgrade bundle and will see if it works.

shank89
Expert

If you would like a bit more understanding of what the issue was and why it occurred in the past, you may find this article useful as well.

NSX-T 3.1 Tunnel Endpoints | Inter TEP

Shashank Mohan

VCIX-NV 2022 | VCP-DCV2019 | CCNP Specialist

https://lab2prod.com.au
LinkedIn https://www.linkedin.com/in/shankmohan/
Twitter @ShankMohan
Author of NSX-T Logical Routing: https://link.springer.com/book/10.1007/978-1-4842-7458-3
Czernobog
Expert

I've deployed a fresh 3.1 environment, configured my transport nodes, Tier-0, Tier-1 and attached a few segments, but as soon as I add a VM to a segment, the tunnel becomes degraded.

I want to run a "collapsed" environment now, where the TEP vmkernels run on the VDS too, and getting this configuration to work was my goal with the 3.1 update.

I guess the error is somewhere in the transport node (TEP) configuration, which I have to figure out now...

shank89
Expert

It sounds like you want to configure inter-TEP; the article above walks through the configuration and how to test that your tunnels are working.

Have you been through it?

Czernobog
Expert

I have followed another guide, yes; your link returns a 404.

shank89
Expert

Interesting. If you still want some details around this, try this one; it should work.

https://www.lab2prod.com.au/2020/11/nsx-t-inter-tep.html

 

engyak
Enthusiast

What do your BFD statuses report under the edge cluster?

Czernobog
Expert

"0 - No Diagnostic"

shank89
Expert

If you would like some more assistance, I'm happy to look at it over Zoom; just let me know.

We can go through the steps you've tried and work forward.

Czernobog
Expert

Thank you for the offer; however, this is not something that would be allowed in my environment. :) I'll have to wait for GSS to finally respond to my SR.

shank89
Expert

Have you run all the vmkpings to check that it all works?

engyak
Enthusiast

So, normally this would indicate that the tunnel source and destination could not communicate at the MTU specified on the N-VDS. This is pretty common in new builds because the actual MTU check doesn't occur until the tunnel is needed.

It'd be really cool if we could have some kind of "post-implementation network test"!

Anyhow, for the tunnel that is failing you want to run a vmkping against it with the DF bit set and the payload sized to your overlay MTU (subtract 28 bytes for the IP and ICMP headers, so -s 1572 probes a 1600-byte MTU):

vmkping -d -I vmk10 -s 1572 <the other end>

 

https://kb.vmware.com/s/article/1003728
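
If the ping fails outright, it can also help to confirm the configured MTUs first; two standard ESXi shell checks (generic commands, not specific to this thread):

esxcli network ip interface list         # each vmk with its MTU and netstack
esxcli network vswitch dvs vmware list   # the distributed switch and its configured MTU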

shank89
Expert

That will work; I generally use vmkping ++netstack=vxlan <dstIP> -s 8972 -d (8972 plus the 28 bytes of headers probes a 9000-byte MTU).

engyak
Enthusiast

Yep - the use of `vxlan` in this case feels...distasteful...so I don't use it 🙂 (the netstack kept its VXLAN-era name even though NSX-T tunnels are Geneve).

Czernobog
Expert

The issue is resolved for me - when placing the edges on NSX-prepared hosts, the Geneve tunnel traffic has to be placed on a separate NSX segment. So, if someone has the same issue: create a new VLAN-backed NSX segment, add it to a VLAN transport zone, and add that transport zone to your edges. This way you will be able to select the segment for your tunnel traffic.
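
For reference, a minimal sketch of the same segment creation through the Policy API, with a hypothetical manager hostname, segment ID, VLAN ID, and transport zone path (adjust all of these to your environment):

curl -k -u admin -X PATCH \
  https://nsx-mgr.lab.local/policy/api/v1/infra/segments/edge-overlay \
  -H 'Content-Type: application/json' \
  -d '{"display_name": "edge-overlay", "vlan_ids": ["1647"], "transport_zone_path": "/infra/sites/default/enforcement-points/default/transport-zones/<vlan-tz-id>"}'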

The NSX design reference guide has just recently been updated to match NSX 3.0 (still no 3.1...); I hope the business unit steps up in this regard.

edit: I have no idea how to mark the post as "resolved" after the recent VMTN update, so I'll just mark my answer as the correct one.

acancro
Contributor

You're hitting the same bug that the rest of us always hit when getting started with this platform. Host and Edge TEPs cannot coexist on the same switch with the same physical network adapters.

VMware won't admit this is a bug, but they "fixed" it in 3.1.  🙄

One workaround we were using was to put the Edge TEP on a Standard vSwitch with separate physical NICs, and leave the Host TEPs on the VDS.  This works fine, and you can then have a single Geneve VLAN, but traffic between Host and Edge TEP still needs to go through the top of rack switch, even if it's on the same physical host.

If you couldn't afford the extra physical NICs, you had to use an external router to route between your Host TEP and Edge TEP networks, which was, of course, ridiculous. After all, we're building software-defined networks here!

Again, however, VMware has "fixed" this in 3.1 and you should be able to put everything on a single network now.
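
For anyone reproducing the pre-3.1 workaround above, a rough sketch of the standard vSwitch side from the ESXi shell, with hypothetical switch, port group, uplink, and VLAN values:

esxcli network vswitch standard add -v vSwitchEdgeTEP
esxcli network vswitch standard set -v vSwitchEdgeTEP -m 1600          # Geneve needs 1600+
esxcli network vswitch standard uplink add -v vSwitchEdgeTEP -u vmnic4
esxcli network vswitch standard portgroup add -v vSwitchEdgeTEP -p EdgeTEP
esxcli network vswitch standard portgroup set -p EdgeTEP --vlan-id 1648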

Art Cancro, VCP-NV 2020
shank89
Expert

Technically it's not a bug; as you mentioned, the feature just didn't exist in earlier versions.

The explanation can be seen in the link that I posted earlier.
