VMware Networking Community
Erilliot
Contributor

NSX-T overlay issues

Good day guys,

I posted a few weeks ago about an external connectivity issue but did not get a response. I would really appreciate any help I can get now, as time is running out for my thesis completion. Here is a brief summary of my overall setup:

1) I am using Cisco UCS as my hardware environment. The UCS has 4 blade servers, each with 2 physical NICs, each NIC going to a specific fabric interconnect; both fabric interconnects terminate on a Catalyst 3560G switch, which is BGP-enabled and has a system MTU of 1600. The blade servers host:

- One server for the vCenter 6.7 appliance. One VDS is configured for NSX-T with a management port group, an overlay port group and an edge-uplink port group, which provides external network connectivity to my Catalyst 3560 switch over a dedicated VLAN. All the VLANs also have SVIs configured on the Catalyst.

- One server housing the NSX-T 2.3 Manager, Controller and Edge VM. The NSX-T Manager and Controller only use the management port group, which has a single uplink to vmnic0. The Edge uses both vmnics: the edge-uplink VLAN port group has vmnic0 active and vmnic1 standby, while the overlay port group has vmnic1 active and vmnic0 standby. Promiscuous mode is enabled on the overlay port group.

- The last 2 blade servers are not connected to the vCenter VDS; their first NIC is connected to the ESXi standard vSwitch and the remaining NIC is used by NSX-T (the N-VDS). Their NSX-T uplink profile uses a failover-order teaming policy.

----- The Edge successfully formed a BGP neighbourship with the Catalyst switch via the edge-uplink VLAN. It was also added to the overlay using the overlay port group, connected successfully, and can ping the ESXi host TEPs. All TEPs are reachable.
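
For context, this is roughly how TEP reachability was checked from the ESXi hosts (a sketch only; the TEP address below is a placeholder for my lab addressing):

    # List the vmkernel NICs, including the NSX TEP vmk on the vxlan netstack (shows its MTU)
    esxcfg-vmknic -l

    # Ping a remote TEP using the vxlan netstack (placeholder TEP IP)
    vmkping ++netstack=vxlan 192.168.200.12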

Problem:

I noticed that although all TEPs are reachable (the 2 hypervisor TEPs and the edge TEP), overlay VMs on the hypervisor nodes cannot ping across hypervisors. VMs on the same hypervisor connected to the same logical switch can ping each other, and VMs on different logical switches on the same hypervisor can also ping each other.

BUT: when they are on different hypervisors, they cannot ping each other even when connected to the same logical switch. Overlay VMs also cannot ping beyond their gateways. They cannot reach the external network even though the service router of the Tier-0 logical router has the VM networks in its routing table. The external network can ping the gateways of the logical switches but nothing beyond them. I also noticed that when I attach a DHCP server to the VM segment (the VMs run Ubuntu 16.04), the VMs lease IP addresses for exactly 1 second before the lease expires. This is unusual, as the lease period is set to the default 8640000 seconds.
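
For what it's worth, this is how I have been checking the lease the VM actually receives (a sketch; the interface name eth0 is a placeholder for the VM's NIC):

    # Release and re-request a lease verbosely; the DHCPACK/bound lines show the lease and renewal times actually granted
    sudo dhclient -r -v eth0
    sudo dhclient -v eth0

    # The client lease database also records the renew/rebind/expire timestamps
    cat /var/lib/dhcp/dhclient.leases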

I have tried everything I know and have run out of ideas. Please, any help at all is appreciated, as it is vital for completing my studies. Find attached my UCS vNIC template screenshots for possible diagnosis.

10 Replies
Sreec
VMware Employee

I noticed that although all TEPs are reachable (the 2 hypervisor TEPs and the edge TEP), overlay VMs on the hypervisor nodes cannot ping across hypervisors. VMs on the same hypervisor connected to the same logical switch can ping each other, and VMs on different logical switches on the same hypervisor can also ping each other.

BUT: when they are on different hypervisors, they cannot ping each other even when connected to the same logical switch. Overlay VMs also cannot ping beyond their gateways.

1) When you performed the above test, how were the vNICs mapped to the fabric interconnects? Were both hypervisors' active vNICs going to the same FI or to different FIs? Please test both scenarios to rule out the blade profile and FI connectivity part.

2) Can you confirm the TEP VLAN is configured properly on the DVS, the blade profiles and the Catalyst? A few quick checks are sketched below.
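
These are only command sketches, assuming standard ESXi and Catalyst CLIs:

    # On each ESXi host: confirm the TEP vmk exists on the vxlan netstack and note its MTU
    esxcfg-vmknic -l

    # On the Catalyst: confirm the TEP VLAN exists and is carried on the trunks towards the fabric interconnects
    show vlan brief
    show interfaces trunk

In UCSM, also confirm the same VLAN is allowed on both vNIC templates used by the hosts.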

Cheers,
Sree | VCIX-5X| VCAP-5X| VExpert 7x|Cisco Certified Specialist
Please KUDO helpful posts and mark the thread as solved if answered
Erilliot
Contributor

Thanks so much for your reply.

1) I had vNIC 1 (vmnic0) going to fabric A and vNIC 2 (vmnic1) going to fabric B, and both hypervisors were using vNIC 2 for the NSX-T N-VDS, meaning they were both utilizing fabric B. I just tested by switching which fabric the NSX-T vmnic uses on the hypervisor hosts, but got similar results. I would like to state that both UCS vNICs have identical configurations. I also want to state that I applied QoS settings, on both the UCS and the VDS, according to this guide: https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/products/nsx/vmware-nsx-on-cisco-n...

I did not touch those settings on the fabric (transport) nodes; I left everything at the defaults since they use the N-VDS anyway. I think the QoS settings on the VDS should affect only the Edge VM, as it is the only fabric member using the vSphere VDS. Or do you think the QoS settings may be a factor here?

2) I confirmed the VLAN from the VDS, to the UCS, to the Catalyst; all seems to be OK. I think if it wasn't, my TEPs wouldn't have been able to ping each other. Do you think I am missing something? (See also the MTU check sketched at the end of this post.)

I would also like to mention that in the QoS settings applied to the blade vNICs, I limited the bandwidth (rate section) to 1 Gb/s because my upstream Catalyst is only 1 Gb. Might that be an issue as well?
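
One check I still plan to run, based on what I have read about overlay MTU (a sketch only; the TEP IP is a placeholder): a normal TEP ping can succeed even when the path MTU is too small for the Geneve-encapsulated frames, so the test should use a near-1600-byte packet with the don't-fragment bit set.

    # 1572-byte payload + 28 bytes of ICMP/IP headers is roughly a 1600-byte frame;
    # -d sets the don't-fragment bit, so this fails if anything in the path drops frames larger than 1500 bytes
    vmkping ++netstack=vxlan -d -s 1572 192.168.200.12

If the small ping works but this one fails, the overlay frames are being dropped somewhere in the UCS/Catalyst path even though the TEPs look reachable.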

Erilliot
Contributor

Please find the relevant screenshots:

Erilliot
Contributor

Thanks a million!

Sreec
VMware Employee

1) I had vNIC 1 (vmnic0) going to fabric A and vNIC 2 (vmnic1) going to fabric B, and both hypervisors were using vNIC 2 for the NSX-T N-VDS, meaning they were both utilizing fabric B. I just tested by switching which fabric the NSX-T vmnic uses on the hypervisor hosts, but got similar results. I would like to state that both UCS vNICs have identical configurations. I also want to state that I applied QoS settings, on both the UCS and the VDS, according to this guide: https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/products/nsx/vmware-nsx-on-cisco-n...

I did not touch those settings on the fabric (transport) nodes; I left everything at the defaults since they use the N-VDS anyway. I think the QoS settings on the VDS should affect only the Edge VM, as it is the only fabric member using the vSphere VDS. Or do you think the QoS settings may be a factor here?

Even though that design guide is for NSX-V, the MTU changes are still required for overlay communication, and the QoS MTU settings look fine. Since this is a simple L2 test going via the fabric, if your NICs are pinned to the same fabric and you still have a connectivity issue, to me this is purely a UCS/vSphere/NSX-T configuration issue.

2) I confirmed the VLAN from the VDS, to the UCS, to the Catalyst; all seems to be OK. I think if it wasn't, my TEPs wouldn't have been able to ping each other. Do you think I am missing something?

So if I'm not wrong, the TEP transport VLAN is 200? Are you sure the uplink profile in NSX-T is also configured with VLAN 200? What is the use of the vxlanDVportgroup? One way to double-check the uplink profiles is sketched below.
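
This is only a sketch; the manager address and user are placeholders, and I am going from memory on the 2.3 API field names:

    # Pull the uplink (host switch) profiles from the NSX-T Manager API and check the transport VLAN on each
    curl -k -u admin 'https://nsx-mgr.lab.local/api/v1/host-switch-profiles'

In the JSON output, the profile attached to the host transport nodes should show the TEP VLAN (e.g. "transport_vlan": 200), and the edge profile should be consistent with wherever its VLAN tagging is done.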

I would also like to mention that in the QoS settings applied to the blade vNICs, I limited the bandwidth (rate section) to 1 Gb/s because my upstream Catalyst is only 1 Gb. Might that be an issue as well?

QoS is optional; setting a rate limit should not result in a complete connectivity failure.

Cheers,
Sree | VCIX-5X| VCAP-5X| VExpert 7x|Cisco Certified Specialist
Please KUDO helpful posts and mark the thread as solved if answered
Erilliot
Contributor

Once again, thanks for your response.

1) Since the traffic is not traversing hypervisors, I think it has something to do with the underlay, most likely the UCS, but I just can't find what it is. I say this because, as mentioned earlier, VMs on the same logical switch and on different logical switches on the same host can ping each other, but not across hypervisors. The external network, via the Catalyst, can ping the VMs' default gateways but not the VMs themselves. A strange thing happened as well: when I entered a static route on my Tier-1 (the one connecting the VM logical switch) towards the external network, with the next hop set to the VMs' default gateway, the VM was able to ping the external network, although the external network still couldn't ping beyond the VMs' gateways.

2) The vxlanDVportgroup is the name of the transport port group (VLAN 200) on the VDS used by the Edge for the overlay. But as I said, I don't think the problem lies with the VDS, because the fabric nodes (excluding the Edge) are not using the VDS at all. I used the default NSX-T Edge VM uplink profile for the edge nodes; it uses failover-order teaming with only 1 active uplink, and it is untagged because tagging is done at the VDS (vxlanDVportgroup) level. For the hypervisor hosts participating in the overlay, their uplink profile has VLAN 200 specified, also with failover-order teaming. To narrow down where the cross-host traffic dies, I plan to capture on the uplinks as sketched below.
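
This is only a sketch (vmnic1 and the file path are just examples for my hosts): while a cross-host VM ping is running, capture on the receiving host's N-VDS uplink and check whether the Geneve-encapsulated frames ever arrive.

    # Capture traffic on the N-VDS uplink of the receiving host
    pktcap-uw --uplink vmnic1 -o /tmp/geneve-in.pcap

    # Check the capture for Geneve frames (UDP 6081) coming from the other host's TEP
    tcpdump-uw -r /tmp/geneve-in.pcap udp port 6081

If the Geneve frames show up on the sending host's uplink but never on the receiving host's, the drop is in the UCS/Catalyst underlay rather than in NSX-T.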

KaBalint
Enthusiast

Hi!

You opened this question a few months ago; do you have any update?
Is the overlay working? Have you found a solution?

Kind Regards,

Balint

daphnissov
Immortal

If you're having overlay issues going across TEPs and between hypervisors, it usually comes down to one of three possibilities:

  1. The MTU in your physical switching infrastructure is not set to 1600 or greater.
  2. The uplink profile is not specifying the transport VLAN correctly.
  3. External infrastructure firewalling.
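
For point 1, a couple of quick checks (a sketch; the TEP address is a placeholder, and if I remember right a system MTU change on the 3560 platform only takes effect after a reload):

    ! On the Catalyst 3560G: verify the MTU actually in effect, not just what is in the running config
    show system mtu

    ! From the switch, ping a host TEP with a full-size, non-fragmentable packet
    ping 192.168.200.11 size 1600 df-bit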
kucziakos
Contributor

Hi Balint!

Are the edge routers unreachable as well?

Regards,

akos

KaBalint
Enthusiast

Hi

In our case the tunnels are up between the edge transport node and the ESXi transport nodes (ESXi hosts), but the tunnels are down between the ESXi transport nodes.

Kind Regards,

Balint
