VMware Networking Community
wperdak
Enthusiast

NSX vxlan VMs cannot reach each other via routed fabric

Hello Community

I'm designing NSX over a routed underlay and am stuck on a bizarre issue that is driving me to insomnia. I suspect I'm missing something trivial, but the truth is that VMs cannot reach each other over NSX VXLAN when they sit in pods separated by the L3 fabric.

A summary of my infrastructure is below, plus an HLD diagram.

Pod = two racks in the DC with a pair of L2/L3 ToR switches

vCenter:   6.5U1

NSX:       6.4.1

Underlay:  Nexus 9k

Topology

  • L3 leaf/spine partial mesh
  • two L2 pods separated by the L3 routed fabric
  • NSX transport subnets routed between pods using OSPF
  • vMotion and storage routed via the L3 fabric
  • vCenter management stretched between pods via Cisco VXLAN
  • inter-DC link is DWDM
  • latency is less than 0.5 ms

NSX transport layout

  • unique VTEP subnet per pod
  • same VLAN ID on both pods (local to each pod)
  • default gateways on the ToR switches per pod for the NSX transport VLAN
  • VTEPs use the vxlan stack pointing to the default gateway, which is the transport VLAN SVI on the ToR switches
  • single transport zone across compute clusters

vCenter layout

  • single vCenter
  • single compute cluster per pod
  • stretched management cluster across pods
  • ESXi hosts are dual-homed (trunk ports)
  • DVS uplinks are active/standby, with explicit failover as the load-balancing setting

Issue

  • VMs cannot reach each other over VXLAN when placed in different pods
  • VMs can ping each other when placed on different hosts in the same pod

Additional observations

  • no errors across any stack (Cisco, vCenter, NSX, storage); all green and happy
  • VTEP IPs are pingable between hosts on the vxlan stack
  • MTU 9000 end-to-end
  • NSX Controllers see the VXLAN and the VMs
  • ESXi hosts see the controllers and the vxlan stack
  • OSPF propagates both transport subnets
  • ICMP packets leave the ESXi hosts and are never seen again
  • a trace from NSX shows the packet arriving at the local host's VTEP, which knows the peer's VTEP IP, but nothing happens after that

Also, when I move the NSX transport VLAN into Cisco VXLAN (which basically stretches this VLAN between the pods), the VMs start to communicate.

[Image: pastedImage_21.png (HLD diagram)]

cheers

Woj

17 Replies
Sreec
VMware Employee

First of all, nice summary :)

From the summary I don't think you have routing or basic connectivity issues (when you move the NSX transport VLAN into Cisco VXLAN, which basically stretches that VLAN between the pods, the VMs start to communicate). But a few points need clarifying.

  • VTEP IP is reachable between hosts on the vxlan stack

Is this test within the pod or across the pods?

  • MTU 9000 end-to-end

I hope the MTU is set correctly at the DVS level as well.

What is the VXLAN replication mode used in the architecture ?

Can you please do a packet capture at the vNIC level of the source and destination VTEPs when you initiate a ping, e.g. from a VM running on Host-A in Pod 1 to a VM running on Host-B in Pod 2? If you have dual NICs for the VTEP, please use one NIC to simplify the test.

VMware Knowledge Base

Cheers,
Sree | VCIX-5X| VCAP-5X| VExpert 7x|Cisco Certified Specialist
Please KUDO helpful posts and mark the thread as solved if answered
wperdak
Enthusiast

thanks :)

Yes, routing seems to be OK, which is why this is so bizarre: it should just work as intended...

I can reach all VTEPs when pinging the vxlan stack from each ESXi host (local and across pods), and the controllers responsible for the VXLAN see the VTEPs and the VMs in it.

No, we don't use dual VTEPs; just a single VTEP per host.

Yes, MTU is 9000 all the way up to the VM vNIC.

The replication mode for the transport zone is Unicast.

When I do a ping packet trace from the ESXi host uplink, this is the packet structure:

scenario 1 (working)

ping within a single pod (VMs on two hosts)

src VM1 IP, dst VM2 IP (from the subnet allocated to the VXLAN)

src MAC of VM1, dst MAC of VM2

VXLAN encapsulation:

src Host1 VTEP IP, dst Host2 VTEP IP (IPs from the pod's transport VLAN subnet)

src MAC of host1 vmk, dst MAC of host2 vmk

The request packet leaves host 1 and is seen arriving at host 2.

The reply packet is also seen arriving back at host 1.

scenario 2 (not working)

src Pod1 VM1 IP, dst Pod2 VM2 IP (from the subnet allocated to the VXLAN)

src MAC of Pod1 VM1, dst MAC of Pod2 VM2

VXLAN encapsulation:

src Pod1 Host1 VTEP IP, dst Pod2 Host2 VTEP IP (IPs from the pods' local transport VLAN subnets)

src MAC of the local ToR switch SVI, dst MAC of host2 vmk (the src MAC appears to be the switch VLAN interface, which is the default gateway for the local transport subnet)

The request packet leaves pod 1 host 1 and is gone; it is never seen arriving at host 2 in pod 2.

No reply packets.

The underlay routing is very simple (OSPF area 0): the SVIs for the VTEP, iSCSI and vMotion subnets are injected into OSPF and propagated,

so on the ToR switches I see the redistributed subnets from both pods and can ping the SVI IPs of the switches on both sides.
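For illustration, the VXLAN framing described in the scenarios above can be sketched in a few lines of Python (a minimal sketch of the RFC 7348 header only, using the thread's VNI 60004; this is not NSX code):

```python
import struct

VXLAN_PORT = 4789  # IANA-assigned VXLAN UDP port (RFC 7348)

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header: flags word with the I bit set,
    then the 24-bit VNI shifted into the upper bytes of the second word."""
    return struct.pack("!II", 0x08 << 24, vni << 8)

def parse_vni(header: bytes) -> int:
    """Extract the 24-bit VNI back out of a VXLAN header."""
    _, word2 = struct.unpack("!II", header)
    return word2 >> 8

# The on-wire packet is: outer Ethernet / outer IP (VTEP to VTEP) /
# UDP dst 4789 / this 8-byte header / inner Ethernet frame (VM to VM).
hdr = vxlan_header(60004)
assert len(hdr) == 8 and parse_vni(hdr) == 60004
```

The outer addressing is what differs between the two scenarios: in scenario 1 both VTEP IPs are in the same subnet, while in scenario 2 the outer packet must be routed between the two pods' transport subnets.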

Sreec
VMware Employee

From your explanation it looks like the packets are dropping at the SVI or host level. May I know the output of the command below?

1) esxcli network vswitch dvs vmware vxlan network mac list --vds-name xxxxx --vxlan-id=xxxxx

Please check this on the source and destination hosts where the VMs are running (across the pods).

wperdak
Enthusiast

I get the following on both hosts when entering the command:

esxcli network vswitch dvs vmware vxlan network mac list --vds-name dvs --vxlan-id=60004

Error: Unknown command or namespace network vswitch dvs vmware vxlan network mac list

[root@pod1host1:~] esxcli network vswitch dvs vmware

Usage: esxcli network vswitch dvs vmware {cmd} [cmd options]

Available Namespaces:

  lacp                  A set of commands for LACP related operations

Available Commands:

  list                  List the VMware vSphere Distributed Switch currently configured on the ESXi host.

[root@pod1host1:~]

NSX doesn't indicate any errors in host preparation; all green.

Both hosts also have no issues with controller communication and participate in the correct VXLAN (similar output from pod2host1):

[root@pod1host1:~] net-vdl2 -l

VXLAN Global States:

        Control plane Out-Of-Sync:      No

        UDP port:       4789

VXLAN VDS:     dvs

        VDS ID: 50 21 1b ae 8d 31 e2 71-c3 66 36 91 af 41 ba 6f

        MTU:    9000

        Segment ID:     63.128

        Gateway IP:     63.254

        Gateway MAC:    00:00:0c:9f:f6:ba

        Vmknic count:   1

                VXLAN vmknic:   vmk4

                        VDS port ID:    73

                        Switch port ID: 50331665

                        Endpoint ID:    0

                        VLAN ID:        1722

                        IP:             63.229

                        Netmask:        255.255.255.128

                        Segment ID:     .128

                        IP acquire timeout:     0

                        MTEP Tx Mac:     00:00:00:00:00:00

                        Multicast group count:  0

        Network count:  1

                VXLAN network:  60004

                        Multicast IP:   N/A (headend replication)

                        Control plane:  Enabled (multicast proxy,ARP proxy)

                        Controller:     64.138 (up)

                        Controller Disconnected Mode: no

                        MAC entry count:        2

                        ARP entry count:        0

                        Port count:     1

esxcli network ip connection list | grep tcp | grep 1234

tcp     0   0 IP:17073         IP:1234   ESTABLISHED   1578339  newreno  netcpa-worker       
tcp     0   0  IP:29752         IP:1234   ESTABLISHED   1578355  newreno  netcpa-worker       
tcp     0   0  IP:27336         IP:1234   ESTABLISHED   1578364  newreno  netcpa-worker        
Sreec
VMware Employee

Please double-check the command:

[Image: pastedImage_0.png]

wperdak
Enthusiast

Same result with the other two commands:

Error: Unknown command or namespace network vswitch dvs vmware vxlan network arp list

I checked the host health status from NSX Manager for all hosts and everything looks OK:

NSXManager> sh host host-xy health-status detail

wperdak
Enthusiast

I had to restart the ESXi services (hostd and vpxa) to get the vxlan commands working.

Here are the outputs:

pod1host1

[root@pod1host1:~] esxcli network vswitch dvs vmware vxlan network mac list --vds-name HMS-OC-DVS --vxlan-id=60004

Inner MAC          Outer MAC          Outer IP        Flags 

-----------------  -----------------  --------------  --------

00:50:56:a1:b0:5d  00:50:56:63:0c:24   62.130  00001101        -local host vtep

00:50:56:a1:8b:cc  00:50:56:60:a5:81  63.129  00001111          -remote host vtep

[root@pod1host1:~] esxcli network vswitch dvs vmware vxlan network arp list --vds-name HMS-OC-DVS --vxlan-id=60004

IP          MAC                Flags 

----------  -----------------  --------

10.10.10.2  00:50:56:a1:8b:cc  00001101

pod2host2

[root@pod2host2:~] esxcli network vswitch dvs vmware vxlan network mac list --vds-name HMS-OC-DVS --vxlan-id=60004

Inner MAC          Outer MAC          Outer IP        Flags 

-----------------  -----------------  --------------  --------

00:50:56:a1:be:4b  00:50:56:68:ba:9f  63.229  00000001          -local host vtep

00:50:56:a1:d4:1c  00:50:56:61:f1:76  62.132  00001011          -remote host vtep

00:50:56:a1:c0:96  00:50:56:6a:e9:c9  62.131  00000001          -remote host vtep

[root@pod2host2:~] esxcli network vswitch dvs vmware vxlan network arp list --vds-name HMS-OC-DVS --vxlan-id=60004

IP          MAC                Flags 

----------  -----------------  --------

10.10.10.1  00:50:56:a1:d4:1c  00000101

It looks to me that at the NSX/VTEP level the controllers and hosts know about the remote VTEPs and the peers' VMs.

Sreec
VMware Employee

May I know what the IP & MAC entries below are?

10.10.10.2  00:50:56:a1:8b:cc

10.10.10.1  00:50:56:a1:d4:1c

wperdak
Enthusiast

Those are the IP and MAC addresses of the VMs placed on VXLAN 60004.

.1 is in pod 1

.2 is in pod 2

Sreec
VMware Employee

You are right, VTEP learning looks fine. What exact result do you get for the ICMP packet (VM1 to VM2)? Also, can we have traceroute output and routing table output from both VMs, plus the DFW rules?

wperdak
Enthusiast

Here is a packet capture of traffic that leaves pod1host1 heading to the other side:

[Image: pastedImage_0.png]

Trace from 10.10.10.1 to 10.10.10.2 (as the VXLAN is 'stretched L2', the VMs are unaware of any L3 underlay anyway, I think):

[Image: pastedImage_1.png]

DFW: the default rule is allow any/any, as we don't use DFW yet.

If it is the case that the ToR Cisco 9k is dropping the packet, how do I articulate the issue to Cisco TAC? From the Cisco perspective the VTEPs can reach each other on both sides over OSPF, and properly VLAN-tagged packets are arriving at the switches.

Is it possible that the UTEPs in unicast mode are misbehaving?

Sreec
VMware Employee

Like I said earlier, the encapsulation part is fine and we are sending the packet to the right destination. I understand that within the pod it works fine; across the pods the VXLAN replication mode certainly comes into play. Since we are using unicast mode there is no direct dependency on the physical network either, but you are right, head-end replication should work fine. If feasible, clear the ARP tables on the ToR 9ks during the test and check the interface counters and ARP tables.

Note: better to turn off iptables as well for the time being.
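As a side note, "unicast" (head-end) replication as mentioned above means the source host replicates BUM frames itself rather than relying on underlay multicast. A simplified sketch of the target selection, using hypothetical VTEP IPs and the per-remote-subnet proxy (MTEP/UTEP) optimization that NSX unicast mode uses:

```python
from collections import defaultdict

def unicast_replication_targets(local_vtep: str, vteps: dict) -> list:
    """Simplified NSX unicast-mode replication for a BUM frame:
    - send a unicast copy to every other VTEP in the source VTEP's subnet;
    - for each remote transport subnet, send one copy to a single proxy
      VTEP, which re-replicates to the hosts in its own subnet.
    `vteps` maps VTEP IP -> transport subnet name."""
    local_subnet = vteps[local_vtep]
    by_subnet = defaultdict(list)
    for ip, subnet in vteps.items():
        if ip != local_vtep:
            by_subnet[subnet].append(ip)
    targets = list(by_subnet.pop(local_subnet, []))
    # one proxy per remote subnet (pick deterministically for the sketch)
    targets += [sorted(peers)[0] for peers in by_subnet.values()]
    return targets

# Hypothetical layout: two VTEPs per pod, one transport subnet per pod.
pods = {"63.229": "pod1", "63.230": "pod1", "62.130": "pod2", "62.131": "pod2"}
print(sorted(unicast_replication_targets("63.229", pods)))
# every local peer, plus exactly one proxy in the remote pod
```

The point of the sketch: even in unicast mode, a BUM frame still has to cross the routed fabric as an ordinary unicast UDP datagram to a remote VTEP, so the underlay only needs plain IP reachability between the transport subnets.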

wperdak
Enthusiast

I did a packet capture on the switches and see packets from the other pod arriving on the northbound routed ports and then leaving via the ports where the hosts are connected.

That indicates the incoming packets are being mishandled by the ESXi host, the DVS, or something at the vCenter layer, as they are not reaching the VMs (tcpdump on the VMs shows nothing).

The VMs are CentOS 7 with iptables and firewalld disabled.

Example port config on the switch (FEX 2k):

interface Ethernet102/1/1

  description host1(data vnic0)

  switchport mode trunk

  switchport trunk allowed vlan 1-3259,3261-4094

  no shutdown

Hosts are active/standby without any vPC on the switch side.

LB on the ESXi uplinks: use explicit failover order.

The vxlan transport portgroup has all three security settings set to accept.

Sreec
VMware Employee

The port config looks fine. How about the DLR routing table on the pod 1 and pod 2 ESXi hosts? Are you sure the routes are showing up correctly there?

jamib
Contributor

You said in your post: "Yes MTU is set up to VM vnic and is 9000".

If that is correct you might be hitting MTU issues. VXLAN, per the standard in both the VMware and Cisco implementations, sets the DF bit, so a packet that would need fragmentation is dropped instead. If the VM's vNIC is set to 9000 MTU, the additional overhead of the VXLAN encapsulation can push the packet over the transport MTU, meaning it is dropped because the DF bit is set. Also, you mentioned the Cisco infrastructure uses VXLAN too (assuming BGP EVPN), so the same rules apply; you need to account for the additional overhead there as well.

You should have the following MTUs configured:

Cisco infrastructure = 9216

VMware VDS = 9000

VMware VM vNIC = between 1500 and 8900
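The arithmetic behind those numbers can be sanity-checked with a quick sketch, assuming the standard 50-byte VXLAN overhead (14-byte outer Ethernet + 20-byte outer IPv4 + 8-byte UDP + 8-byte VXLAN header, with no outer VLAN tag):

```python
# Per-layer VXLAN encapsulation overhead, in bytes:
OUTER_ETH, OUTER_IP, UDP, VXLAN = 14, 20, 8, 8
OVERHEAD = OUTER_ETH + OUTER_IP + UDP + VXLAN  # 50 bytes

def max_inner_mtu(transport_mtu: int, layers: int = 1) -> int:
    """Largest inner-frame MTU that survives `layers` of VXLAN
    encapsulation without fragmentation (DF is set, so oversize = drop)."""
    return transport_mtu - layers * OVERHEAD

# One layer of NSX VXLAN over a 9000-byte VDS/underlay MTU:
print(max_inner_mtu(9000))     # a 9000-byte vNIC MTU is already too big
# NSX VXLAN re-encapsulated by the fabric's own VXLAN (two layers),
# against the fabric's 9216-byte MTU:
print(max_inner_mtu(9216, 2))
```

With one layer the ceiling is 8950, and with the Cisco fabric adding a second encapsulation it is 9116 against a 9216 underlay, which is why a vNIC MTU of 8900 or less leaves comfortable headroom in both cases.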

Sreec
VMware Employee

If it were an MTU issue, he should be hitting the same issue within the pod, which is not the case; within the pod everything works fine.

wperdak
Enthusiast

The issue is solved; thanks all for your contributions.

The solution was to change the NSX VXLAN UDP port to anything other than 4789 (we went back to the old default, 8472). There was a conflict because we also run a Cisco VXLAN configuration on the Cisco underlay switches, and that destabilized the fabric.
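The conflict is plausible: recent NSX versions and the Cisco fabric VTEPs both default to the IANA-assigned VXLAN port 4789 (RFC 7348), so a VXLAN-enabled underlay switch can treat transit NSX datagrams on that port as its own fabric traffic instead of routing them through untouched. A toy sketch of the distinction (an illustration of the failure mode, not actual switch behavior code):

```python
IANA_VXLAN_PORT = 4789  # RFC 7348 port, used by the Cisco fabric VTEPs
NSX_LEGACY_PORT = 8472  # pre-RFC default that the fix reverted to

def fabric_claims_packet(udp_dst_port: int,
                         fabric_vxlan_port: int = IANA_VXLAN_PORT) -> bool:
    """A switch that is itself a VXLAN VTEP special-cases UDP datagrams
    on its own VXLAN port; traffic on any other port is just routed."""
    return udp_dst_port == fabric_vxlan_port

assert fabric_claims_packet(4789)      # NSX on 4789: intercepted by the fabric
assert not fabric_claims_packet(8472)  # NSX on 8472: routed transparently
```

This also fits the symptoms in the thread: within a pod the NSX packets never traversed the fabric's VXLAN data plane, so only the inter-pod (routed) traffic disappeared.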
