VMware Networking Community
wperdak
Enthusiast

NSX vxlan VMs cannot reach each other via routed fabric

Hello Community

I'm designing NSX over a routed underlay and am stuck on a bizarre issue that is driving me to insomnia. I suspect I'm missing something trivial, but the truth is that VMs cannot reach each other over NSX VXLAN when they sit in pods separated by the L3 fabric.

A summary of my infrastructure is below, plus an HLD diagram.

Pod = two racks in the DC with a pair of L2/L3 ToR switches

vCenter:   6.5U1

NSX:       6.4.1

Underlay:  Nexus 9k

Topology

  • L3 leaf/spine partial mesh
  • two L2 pods separated by the L3 routed fabric
  • NSX transport subnets routed between pods using OSPF
  • vMotion and storage routed via the L3 fabric
  • vCenter management stretched between pods via Cisco VXLAN
  • inter-DC link is DWDM
  • latency is less than 0.5 ms

NSX transport layout

  • unique VTEP subnet per pod
  • same VLAN ID on both pods (local to each pod)
  • default gateways on the ToR switches per pod for the NSX transport VLAN
  • VTEPs use the vxlan stack pointing to the default gateway, which is the transport VLAN SVI on the ToR switches
  • single transport zone across compute clusters

vCenter layout

  • single vCenter
  • single compute cluster per pod
  • stretched management cluster across pods
  • ESXi hosts are dual-homed (trunk ports)
  • DVS uplinks are active/standby, with explicit failover as the load-balancing setting

Issue

  • VMs cannot reach each other over VXLAN when placed in different pods
  • VMs can ping each other when placed on different hosts in the same pod

Additional observations

  • no errors across any stack (Cisco, vCenter, NSX, storage); all green and happy
  • VTEP IPs are pingable between hosts on the vxlan stack
  • MTU 9000 end-to-end
  • NSX Controllers see the VXLAN and the VMs
  • ESXi hosts see the controllers and the vxlan stack
  • OSPF propagates both transport subnets
  • ICMP packets leave the ESXi hosts and are never seen again
  • a trace from NSX shows the packet arriving at the local host's VTEP, which knows the peer's VTEP IP, but nothing happens after that

Also, when I move the NSX transport VLAN into Cisco VXLAN (which basically stretches this VLAN between the pods), the VMs start to communicate.

[Image: pastedImage_21.png (HLD diagram)]

cheers

Woj

17 Replies
Sreec
VMware Employee

First of all, nice summary :)

From the summary I don't think you have routing or basic connectivity issues (when you move the NSX transport VLAN into Cisco VXLAN, which basically stretches that VLAN between the pods, the VMs start to communicate). But a few points need clarifying.

  • VTEP IP is reachable between hosts on the vxlan stack

Is this test within the pod or across the pods?

  • MTU 9000 end-to-end

I hope the MTU is set correctly at the DVS level as well.

What is the VXLAN replication mode used in the architecture ?

Can you please do a packet capture at the vNIC level of the source and destination VTEPs when you initiate a ping, e.g. from a VM running on Host-A in Pod 1 to a VM running on Host-B in Pod 2? If you have dual NICs for the VTEP, please use one NIC to simplify the test.

VMware Knowledge Base

Cheers,
Sree | VCIX-5X| VCAP-5X| VExpert 7x|Cisco Certified Specialist
Please KUDO helpful posts and mark the thread as solved if answered
wperdak
Enthusiast

thanks :)

Yes, routing seems to be OK, which is why this is so bizarre: it should just work as intended...

I can reach all VTEPs when pinging the vxlan stack from each ESXi host (local and across pods), and the controllers responsible for the VXLAN see the VTEPs and the VMs in it.

No, we don't use dual VTEPs; just a single VTEP per host.

Yes, MTU is 9000 all the way up to the VM vNIC.

The replication mode for the transport zone is Unicast.

When I do a ping packet trace from the ESXi host uplink, this is the packet structure:

scenario 1 (working)

ping within a single pod (VMs on two hosts)

src VM1 IP, dst VM2 IP (from the subnet allocated to the VXLAN)

src MAC of VM1, dst MAC of VM2

VXLAN encapsulation:

src Host1 VTEP IP, dst Host2 VTEP IP (IPs from the pod's transport VLAN subnet)

src MAC of host1 vmk, dst MAC of host2 vmk

The request packet leaves host 1 and is seen arriving at host 2.

The reply packet is also seen arriving back at host 1.

scenario 2 (not working)

src Pod1 VM1 IP, dst Pod2 VM2 IP (from the subnet allocated to the VXLAN)

src MAC of Pod1 VM1, dst MAC of Pod2 VM2

VXLAN encapsulation:

src Pod1 Host1 VTEP IP, dst Pod2 Host2 VTEP IP (IPs from the pods' local transport VLAN subnets)

src MAC of the local ToR switch SVI, dst MAC of host2 vmk (the src MAC appears to be the switch VLAN interface, which is the default gateway for the local transport subnet)

The request packet leaves pod 1 host 1 and is gone; it is never seen arriving at host 2 in pod 2.

No reply packets.

The underlay routing is very simple (OSPF area 0): the SVIs for the VTEP, iSCSI and vMotion subnets are injected into OSPF and propagated,

so on the ToR switches I see the redistributed subnets from both pods and can ping the SVI IPs of the switches on both sides.
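For illustration, the VXLAN framing described in the scenarios above can be sketched in a few lines of Python (a minimal sketch of the RFC 7348 header only, using the thread's VNI 60004; this is not NSX code):

```python
import struct

VXLAN_PORT = 4789  # IANA-assigned VXLAN UDP port (RFC 7348)

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header: flags word with the I bit set,
    then the 24-bit VNI shifted into the upper bytes of the second word."""
    return struct.pack("!II", 0x08 << 24, vni << 8)

def parse_vni(header: bytes) -> int:
    """Extract the 24-bit VNI back out of a VXLAN header."""
    _, word2 = struct.unpack("!II", header)
    return word2 >> 8

# The on-wire packet is: outer Ethernet / outer IP (VTEP to VTEP) /
# UDP dst 4789 / this 8-byte header / inner Ethernet frame (VM to VM).
hdr = vxlan_header(60004)
assert len(hdr) == 8 and parse_vni(hdr) == 60004
```

The outer addressing is what differs between the two scenarios: in scenario 1 both VTEP IPs are in the same subnet, while in scenario 2 the outer packet must be routed between the two pods' transport subnets.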

Sreec
VMware Employee

From your explanation it looks like the packets are dropping at the SVI or host level. May I know the output of the command below?

1) esxcli network vswitch dvs vmware vxlan network mac list --vds-name xxxxx --vxlan-id=xxxxx

Please check this on the source and destination hosts where the VMs are running (across the pods).

wperdak
Enthusiast

I get the following on both hosts when entering the command:

esxcli network vswitch dvs vmware vxlan network mac list --vds-name dvs --vxlan-id=60004

Error: Unknown command or namespace network vswitch dvs vmware vxlan network mac list

[root@pod1host1:~] esxcli network vswitch dvs vmware

Usage: esxcli network vswitch dvs vmware {cmd} [cmd options]

Available Namespaces:

  lacp                  A set of commands for LACP related operations

Available Commands:

  list                  List the VMware vSphere Distributed Switch currently configured on the ESXi host.

[root@pod1host1:~]

NSX doesn't indicate any errors in host preparation; all green.

Both hosts also have no issues with controller communication and participate in the correct VXLAN (similar output from pod2host1):

[root@pod1host1:~] net-vdl2 -l

VXLAN Global States:

        Control plane Out-Of-Sync:      No

        UDP port:       4789

VXLAN VDS:     dvs

        VDS ID: 50 21 1b ae 8d 31 e2 71-c3 66 36 91 af 41 ba 6f

        MTU:    9000

        Segment ID:     63.128

        Gateway IP:     63.254

        Gateway MAC:    00:00:0c:9f:f6:ba

        Vmknic count:   1

                VXLAN vmknic:   vmk4

                        VDS port ID:    73

                        Switch port ID: 50331665

                        Endpoint ID:    0

                        VLAN ID:        1722

                        IP:             63.229

                        Netmask:        255.255.255.128

                        Segment ID:     .128

                        IP acquire timeout:     0

                        MTEP Tx Mac:     00:00:00:00:00:00

                        Multicast group count:  0

        Network count:  1

                VXLAN network:  60004

                        Multicast IP:   N/A (headend replication)

                        Control plane:  Enabled (multicast proxy,ARP proxy)

                        Controller:     64.138 (up)

                        Controller Disconnected Mode: no

                        MAC entry count:        2

                        ARP entry count:        0

                        Port count:     1

esxcli network ip connection list | grep tcp | grep 1234

tcp     0   0 IP:17073         IP:1234   ESTABLISHED   1578339  newreno  netcpa-worker       
tcp     0   0  IP:29752         IP:1234   ESTABLISHED   1578355  newreno  netcpa-worker       
tcp     0   0  IP:27336         IP:1234   ESTABLISHED   1578364  newreno  netcpa-worker        
Sreec
VMware Employee

Please double-check the command:

[Image: pastedImage_0.png]

wperdak
Enthusiast

Same result with the other two commands:

Error: Unknown command or namespace network vswitch dvs vmware vxlan network arp list

I checked the host health status from NSX Manager for all hosts and everything looks OK:

NSXManager> sh host host-xy health-status detail

wperdak
Enthusiast

I had to restart the ESXi services (hostd and vpxa) to get the vxlan commands working.

Here are the outputs:

pod1host1

[root@pod1host1:~] esxcli network vswitch dvs vmware vxlan network mac list --vds-name HMS-OC-DVS --vxlan-id=60004

Inner MAC          Outer MAC          Outer IP        Flags 

-----------------  -----------------  --------------  --------

00:50:56:a1:b0:5d  00:50:56:63:0c:24   62.130  00001101        -local host vtep

00:50:56:a1:8b:cc  00:50:56:60:a5:81  63.129  00001111          -remote host vtep

[root@pod1host1:~] esxcli network vswitch dvs vmware vxlan network arp list --vds-name HMS-OC-DVS --vxlan-id=60004

IP          MAC                Flags 

----------  -----------------  --------

10.10.10.2  00:50:56:a1:8b:cc  00001101

pod2host2

[root@pod2host2:~] esxcli network vswitch dvs vmware vxlan network mac list --vds-name HMS-OC-DVS --vxlan-id=60004

Inner MAC          Outer MAC          Outer IP        Flags 

-----------------  -----------------  --------------  --------

00:50:56:a1:be:4b  00:50:56:68:ba:9f  63.229  00000001          -local host vtep

00:50:56:a1:d4:1c  00:50:56:61:f1:76  62.132  00001011          -remote host vtep

00:50:56:a1:c0:96  00:50:56:6a:e9:c9  62.131  00000001          -remote host vtep

[root@pod2host2:~] esxcli network vswitch dvs vmware vxlan network arp list --vds-name HMS-OC-DVS --vxlan-id=60004

IP          MAC                Flags 

----------  -----------------  --------

10.10.10.1  00:50:56:a1:d4:1c  00000101

It looks to me that at the NSX/VTEP level the controllers and hosts know about the remote VTEPs and the peers' VMs.

Sreec
VMware Employee

May I know what the IP & MAC entries below are?

10.10.10.2  00:50:56:a1:8b:cc

10.10.10.1  00:50:56:a1:d4:1c

wperdak
Enthusiast

Those are the IP and MAC addresses of the VMs placed on VXLAN 60004.

.1 is in pod 1

.2 is in pod 2

Sreec
VMware Employee

You are right, VTEP learning looks fine. What exact result do you get for the ICMP packet (VM1 to VM2)? Also, can we have traceroute output and routing table output from both VMs, plus the DFW rules?

wperdak
Enthusiast

Here is a packet capture of traffic that leaves pod1host1 heading to the other side:

[Image: pastedImage_0.png]

Trace from 10.10.10.1 to 10.10.10.2 (as the VXLAN is 'stretched L2', the VMs are unaware of any L3 underlay anyway, I think):

[Image: pastedImage_1.png]

DFW: the default rule is allow any/any, as we don't use DFW yet.

If it is the case that the ToR Cisco 9k is dropping the packet, how do I articulate the issue to Cisco TAC? From the Cisco perspective the VTEPs can reach each other on both sides over OSPF, and properly VLAN-tagged packets are arriving at the switches.

Is it possible that the UTEPs in unicast mode are misbehaving?

Sreec
VMware Employee

Like I said earlier, the encapsulation part is fine and we are sending the packet to the right destination. I understand that within the pod it works fine; across the pods the VXLAN replication mode certainly comes into play. Since we are using unicast mode there is no direct dependency on the physical network either, but you are right, head-end replication should work fine. If feasible, clear the ARP tables on the ToR 9ks during the test and check the interface counters and ARP tables.

Note: better to turn off iptables as well for the time being.
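As a side note, "unicast" (head-end) replication as mentioned above means the source host replicates BUM frames itself rather than relying on underlay multicast. A simplified sketch of the target selection, using hypothetical VTEP IPs and the per-remote-subnet proxy (MTEP/UTEP) optimization that NSX unicast mode uses:

```python
from collections import defaultdict

def unicast_replication_targets(local_vtep: str, vteps: dict) -> list:
    """Simplified NSX unicast-mode replication for a BUM frame:
    - send a unicast copy to every other VTEP in the source VTEP's subnet;
    - for each remote transport subnet, send one copy to a single proxy
      VTEP, which re-replicates to the hosts in its own subnet.
    `vteps` maps VTEP IP -> transport subnet name."""
    local_subnet = vteps[local_vtep]
    by_subnet = defaultdict(list)
    for ip, subnet in vteps.items():
        if ip != local_vtep:
            by_subnet[subnet].append(ip)
    targets = list(by_subnet.pop(local_subnet, []))
    # one proxy per remote subnet (pick deterministically for the sketch)
    targets += [sorted(peers)[0] for peers in by_subnet.values()]
    return targets

# Hypothetical layout: two VTEPs per pod, one transport subnet per pod.
pods = {"63.229": "pod1", "63.230": "pod1", "62.130": "pod2", "62.131": "pod2"}
print(sorted(unicast_replication_targets("63.229", pods)))
# every local peer, plus exactly one proxy in the remote pod
```

The point of the sketch: even in unicast mode, a BUM frame still has to cross the routed fabric as an ordinary unicast UDP datagram to a remote VTEP, so the underlay only needs plain IP reachability between the transport subnets.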

wperdak
Enthusiast

I did a packet capture on the switches and see packets from the other pod arriving on the northbound routed ports and then leaving via the ports where the hosts are connected.

That indicates the incoming packets are being mishandled by the ESXi host, the DVS, or something at the vCenter layer, as they are not reaching the VMs (tcpdump on the VMs shows nothing).

The VMs are CentOS 7 with iptables and firewalld disabled.

Example port config on the switch (FEX 2k):

interface Ethernet102/1/1

  description host1(data vnic0)

  switchport mode trunk

  switchport trunk allowed vlan 1-3259,3261-4094

  no shutdown

Hosts are active/standby without any vPC on the switch side.

LB on the ESXi uplinks: use explicit failover order.

The vxlan transport portgroup has all three security settings set to accept.

Sreec
VMware Employee

The port config looks fine. How about the DLR routing table on the pod 1 and pod 2 ESXi hosts? Are you sure the routes are showing up correctly there?

jamib
Contributor

You said in your post: "Yes MTU is set up to VM vnic and is 9000".

If that is correct you might be hitting MTU issues. VXLAN, per the standard in both the VMware and Cisco implementations, sets the DF bit, so a packet that would need fragmentation is dropped instead. If the VM's vNIC is set to 9000 MTU, the additional overhead of the VXLAN encapsulation can push the packet over the transport MTU, meaning it is dropped because the DF bit is set. Also, you mentioned the Cisco infrastructure uses VXLAN too (assuming BGP EVPN), so the same rules apply; you need to account for the additional overhead there as well.

You should have the following MTUs configured:

Cisco infrastructure = 9216

VMware VDS = 9000

VMware VM vNIC = between 1500 and 8900
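The arithmetic behind those numbers can be sanity-checked with a quick sketch, assuming the standard 50-byte VXLAN overhead (14-byte outer Ethernet + 20-byte outer IPv4 + 8-byte UDP + 8-byte VXLAN header, with no outer VLAN tag):

```python
# Per-layer VXLAN encapsulation overhead, in bytes:
OUTER_ETH, OUTER_IP, UDP, VXLAN = 14, 20, 8, 8
OVERHEAD = OUTER_ETH + OUTER_IP + UDP + VXLAN  # 50 bytes

def max_inner_mtu(transport_mtu: int, layers: int = 1) -> int:
    """Largest inner-frame MTU that survives `layers` of VXLAN
    encapsulation without fragmentation (DF is set, so oversize = drop)."""
    return transport_mtu - layers * OVERHEAD

# One layer of NSX VXLAN over a 9000-byte VDS/underlay MTU:
print(max_inner_mtu(9000))     # a 9000-byte vNIC MTU is already too big
# NSX VXLAN re-encapsulated by the fabric's own VXLAN (two layers),
# against the fabric's 9216-byte MTU:
print(max_inner_mtu(9216, 2))
```

With one layer the ceiling is 8950, and with the Cisco fabric adding a second encapsulation it is 9116 against a 9216 underlay, which is why a vNIC MTU of 8900 or less leaves comfortable headroom in both cases.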

Sreec
VMware Employee

If it were an MTU issue, he should be hitting the same issue within the pod, which is not the case; within the pod everything works fine.

wperdak
Enthusiast

The issue is solved; thanks all for your contributions.

The solution was to change the NSX VXLAN UDP port to anything other than 4789 (we went back to the old default, 8472). There was a conflict because we also run a Cisco VXLAN configuration on the Cisco underlay switches, and that destabilized the fabric.
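The conflict is plausible: recent NSX versions and the Cisco fabric VTEPs both default to the IANA-assigned VXLAN port 4789 (RFC 7348), so a VXLAN-enabled underlay switch can treat transit NSX datagrams on that port as its own fabric traffic instead of routing them through untouched. A toy sketch of the distinction (an illustration of the failure mode, not actual switch behavior code):

```python
IANA_VXLAN_PORT = 4789  # RFC 7348 port, used by the Cisco fabric VTEPs
NSX_LEGACY_PORT = 8472  # pre-RFC default that the fix reverted to

def fabric_claims_packet(udp_dst_port: int,
                         fabric_vxlan_port: int = IANA_VXLAN_PORT) -> bool:
    """A switch that is itself a VXLAN VTEP special-cases UDP datagrams
    on its own VXLAN port; traffic on any other port is just routed."""
    return udp_dst_port == fabric_vxlan_port

assert fabric_claims_packet(4789)      # NSX on 4789: intercepted by the fabric
assert not fabric_claims_packet(8472)  # NSX on 8472: routed transparently
```

This also fits the symptoms in the thread: within a pod the NSX packets never traversed the fabric's VXLAN data plane, so only the inter-pod (routed) traffic disappeared.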
