Hello Community
I'm designing NSX over a routed underlay and I'm stuck on a bizarre issue that's driving me into insomnia. I suspect I'm missing something trivial, but the truth is that VMs cannot reach each other via NSX VXLAN when they sit in pods separated by the L3 fabric.
I've pulled together a summary of my infrastructure below, plus an HLD diagram.
Pod: two racks in the DC with a pair of L2/L3 ToR switches
vCenter: 6.5U1
NSX: 6.4.1
Underlay: Nexus 9k
Topology
NSX transport layout
vCenter layout
Issue
Addons
Also, when I move the NSX transport VLAN into Cisco VXLAN (which basically stretches this VLAN between the pods), the VMs start to communicate.
Image
cheers
Woj
The issue is solved, thanks all for your contributions.
The solution was to change the NSX VXLAN UDP port to something other than 4789 (we went back to the old default, 8472). There was a conflict because we are also running a Cisco VXLAN configuration on the Cisco underlay switches, and the overlapping port made the fabric unstable.
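For anyone hitting the same conflict: if memory serves, NSX-v exposes the VXLAN UDP port via a REST API on NSX Manager, and the change is then pushed to all prepared hosts. A sketch (the `nsxmgr` hostname and credentials are placeholders, verify the endpoint against your NSX version's API guide):

```shell
# Change the NSX VXLAN UDP port back to the legacy 8472 so it no longer
# collides with the Cisco fabric's own VXLAN on 4789.
curl -k -u 'admin:password' -X PUT \
  "https://nsxmgr/api/2.0/vdn/config/vxlan/udp/port/8472"
```

Afterwards `net-vdl2 -l` on the hosts should show the new UDP port.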
First of all, nice summary.
From the summary I don't think you have routing or basic connectivity issues ("when I move the NSX transport VLAN into Cisco VXLAN... the VMs start to communicate"). But a few points to clarify:
Is this test within the pod or across the pods?
I hope the MTU is set correctly even at the DVS level.
What VXLAN replication mode is used in the architecture?
Can you please do a packet capture at the vNIC level of the source and destination VTEPs when you initiate a ping? For example: a VM running on Host-A in Pod 1 to a VM running on Host-B in Pod 2. If you have dual NICs for the VTEP, please use one NIC to simplify the test.
thanks
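For reference, the VTEP-level capture could look something like this on ESXi (a sketch from memory, so double-check the `pktcap-uw` flags on your build; `vmnic0` and the file names are assumptions, use your actual VTEP uplink):

```shell
# On the source host (e.g. pod1host1): capture outbound UDP on the uplink
# (--dir 1 = transmit path, --proto 0x11 = UDP, which carries the VXLAN encap).
pktcap-uw --uplink vmnic0 --dir 1 --proto 0x11 -o /tmp/pod1host1-out.pcap

# On the destination host (e.g. pod2host2): capture the receive path.
pktcap-uw --uplink vmnic0 --dir 0 --proto 0x11 -o /tmp/pod2host2-in.pcap

# Decode the captures, filtering on the VXLAN port:
tcpdump-uw -nr /tmp/pod1host1-out.pcap udp port 4789
```

Comparing the two files shows whether the encapsulated frames leave one pod and never arrive at the other.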
Yes, routing seems to be OK, which is why this is so bizarre, as it should just work as intended...
I can reach all VTEPs when pinging from the vxlan stack on each ESXi host (locally and across pods); the controllers responsible for the VXLAN also see the VTEPs and the VMs in that VXLAN.
No, we don't use dual VTEPs, just a single VTEP per host.
Yes, the MTU is set all the way up to the VM vNIC and is 9000.
The replication mode for the transport zone is Unicast.
When I do a packet trace of the ping at the ESXi host uplink, this is the packet structure:
scenario 1 (working)
ping within a single pod (VMs on two different hosts)
src VM1 IP, dst VM2 IP (from the subnet allocated to the VXLAN)
src MAC of VM1, dst MAC of VM2
VXLAN encapsulation:
src Host1 VTEP IP, dst Host2 VTEP IP (IPs from the single pod's transport VLAN subnet)
src MAC of host1 vmk, dst MAC of host2 vmk
The request packet leaves host 1 and I see it arriving at host 2.
I also see the reply packet coming back in to host 1.
scenario 2 (not working)
src Pod1 VM1 IP, dst Pod2 VM2 IP (from the subnet allocated to the VXLAN)
src MAC of Pod1 VM1, dst MAC of Pod2 VM2
VXLAN encapsulation:
src Pod1 Host1 VTEP IP, dst Pod2 Host2 VTEP IP (IPs from each pod's local transport VLAN subnet)
src MAC of the local ToR switch's SVI, dst MAC of host2 vmk (the source MAC appears to be the switch VLAN interface, which is the default gateway for the local transport zone subnet)
The request packet leaves pod 1 host 1 and is gone; it is never seen arriving at host 2 in pod 2.
No reply packets.
The underlay routing is very simple (OSPF area 0): the SVIs for the VTEP, iSCSI and vMotion subnets are injected into OSPF and propagated,
so on the ToR switches I see the redistributed subnets from both pods and I can ping the SVI IPs of the switches on both sides.
From your explanation it looks like packets are dropping at the SVI or host level. May I know the output of the command below?
1) esxcli network vswitch dvs vmware vxlan network mac list --vds-name xxxxx --vxlan-id=xxxxx
You should check this on the source and destination hosts where the VMs are running (across the pods).
I'm getting the following on both hosts when entering the command:
esxcli network vswitch dvs vmware vxlan network mac list --vds-name dvs --vxlan-id=60004
Error: Unknown command or namespace network vswitch dvs vmware vxlan network mac list
[root@pod1host1:~] esxcli network vswitch dvs vmware
Usage: esxcli network vswitch dvs vmware {cmd} [cmd options]
Available Namespaces:
lacp A set of commands for LACP related operations
Available Commands:
list List the VMware vSphere Distributed Switch currently configured on the ESXi host.
[root@pod1host1:~]
NSX doesn't indicate any errors on host preparation, all green,
even though both hosts have no issues with controller communication and participate in the correct VXLAN (similar output from pod2host1):
[root@pod1host1:~] net-vdl2 -l
VXLAN Global States:
Control plane Out-Of-Sync: No
UDP port: 4789
VXLAN VDS: dvs
VDS ID: 50 21 1b ae 8d 31 e2 71-c3 66 36 91 af 41 ba 6f
MTU: 9000
Segment ID: 63.128
Gateway IP: 63.254
Gateway MAC: 00:00:0c:9f:f6:ba
Vmknic count: 1
VXLAN vmknic: vmk4
VDS port ID: 73
Switch port ID: 50331665
Endpoint ID: 0
VLAN ID: 1722
IP: 63.229
Netmask: 255.255.255.128
Segment ID: .128
IP acquire timeout: 0
MTEP Tx Mac: 00:00:00:00:00:00
Multicast group count: 0
Network count: 1
VXLAN network: 60004
Multicast IP: N/A (headend replication)
Control plane: Enabled (multicast proxy,ARP proxy)
Controller: 64.138 (up)
Controller Disconnected Mode: no
MAC entry count: 2
ARP entry count: 0
Port count: 1
esxcli network ip connection list| grep tcp | grep 1234
tcp | 0 | 0 IP:17073 | IP:1234 ESTABLISHED 1578339 newreno netcpa-worker | |
tcp | 0 | 0 IP:29752 | IP:1234 ESTABLISHED 1578355 newreno netcpa-worker | |
tcp | 0 | 0 IP:27336 | IP:1234 ESTABLISHED 1578364 newreno netcpa-worker |
Please double check the command
Same with the other two commands:
Error: Unknown command or namespace network vswitch dvs vmware vxlan network arp list
I checked the host health status from NSX Manager for all hosts and all looks OK:
NSXManager> sh host host-xy health-status detail
I had to restart the ESXi services (hostd and vpxa) to get the vxlan commands working.
Here are the outputs:
pod1host1
[root@pod1host1:~] esxcli network vswitch dvs vmware vxlan network mac list --vds-name HMS-OC-DVS --vxlan-id=60004
Inner MAC Outer MAC Outer IP Flags
----------------- ----------------- -------------- --------
00:50:56:a1:b0:5d 00:50:56:63:0c:24 62.130 00001101 -local host vtep
00:50:56:a1:8b:cc 00:50:56:60:a5:81 63.129 00001111 -remote host vtep
[root@pod1host1:~] esxcli network vswitch dvs vmware vxlan network arp list --vds-name HMS-OC-DVS --vxlan-id=60004
IP MAC Flags
---------- ----------------- --------
10.10.10.2 00:50:56:a1:8b:cc 00001101
pod2host2
[root@pod2host2:~] esxcli network vswitch dvs vmware vxlan network mac list --vds-name HMS-OC-DVS --vxlan-id=60004
Inner MAC Outer MAC Outer IP Flags
----------------- ----------------- -------------- --------
00:50:56:a1:be:4b 00:50:56:68:ba:9f 63.229 00000001 -local host vtep
00:50:56:a1:d4:1c 00:50:56:61:f1:76 62.132 00001011 -remote host vtep
00:50:56:a1:c0:96 00:50:56:6a:e9:c9 62.131 00000001 -remote host vtep
[root@pod2host2:~] esxcli network vswitch dvs vmware vxlan network arp list --vds-name HMS-OC-DVS --vxlan-id=60004
IP MAC Flags
---------- ----------------- --------
10.10.10.1 00:50:56:a1:d4:1c 00000101
It looks to me like, at the NSX/VTEP level, the controllers and hosts know about the remote VTEPs and the peer VMs.
May I know what the IPs & MACs below are?
10.10.10.2 00:50:56:a1:8b:cc
10.10.10.1 00:50:56:a1:d4:1c
Those are the IP and MAC addresses of the VMs placed on VXLAN 60004:
.1 is in pod1
.2 is in pod2
You are right, VTEP learning looks fine. What is the exact message you are getting for the ICMP packet (VM1 to VM2)? Also, can we have traceroute output and routing table output (from both VMs), and the DFW rules?
Here is a packet capture of the traffic that leaves pod1host1 and should be going to the other side:
traceroute from 10.10.10.1 to 10.10.10.2 (as the VXLAN is 'stretched L2', the VMs are unaware of any L3 underlay anyway, I think)
DFW: the default rule is allow any/any, as we don't use DFW yet.
If it is the case that the ToR Cisco 9k is dropping the packets, how do I articulate the issue to Cisco TAC? From the Cisco perspective the VTEPs are reachable on both sides over OSPF and properly VLAN-tagged packets are arriving at the switches.
Is it possible that the UTEPs in unicast mode are misbehaving?
Like I said earlier, the encapsulation part is fine and we are sending the packet to the right destination. I understand that within the pod it works fine, and across the pods the VXLAN replication mode certainly comes into play. Since we are using unicast mode there is no direct dependency on the physical network either, but you are right, head-end replication should work fine. If feasible, clear the ARP tables on the ToR 9ks during the test and check the interface counters and ARP tables.
Note: better to turn off iptables as well for the time being.
I did a packet capture on the switches, and I see packets from the other pod arriving on the northbound routed ports of the switches and then leaving via the ports where the hosts are connected.
That indicates the incoming packets are being mishandled by the ESXi host, the DVS, or something at the vCenter layer, as the incoming packets never reach the VMs (tcpdump shows nothing).
The VMs are CentOS 7 with iptables and firewalld disabled.
Example of a port config on the switch (FEX 2k):
interface Ethernet102/1/1
description host1(data vnic0)
switchport mode trunk
switchport trunk allowed vlan 1-3259,3261-4094
no shutdown
Hosts are active/standby without any vPC on the switch side.
Load balancing on the ESXi uplinks: explicit failover order.
VXLAN transport portgroup security: all three options set to accept.
The port config looks fine. How about the DLR routing tables on the Pod 1 and Pod 2 ESXi hosts? Are you sure the routes are showing correctly there?
You said in your post: "Yes MTU is set up to VM vnic and is 9000."
If this is correct you might be hitting MTU issues. VXLAN, per the standard in both the VMware and Cisco implementations, sets the DF bit so packets are dropped rather than fragmented. If the VM's vNIC is set to 9000 MTU, the additional overhead of the VXLAN encapsulation pushes the outer packet over the underlay MTU, and it will be dropped because the DF bit is set. Also, you mentioned the Cisco infrastructure uses VXLAN too (assuming BGP EVPN), so the same rules apply: you need to account for the additional overhead there as well.
You should have the following MTUs configured:
Cisco infrastructure = 9216
VMware VDS = 9000
VMware VM vNIC = between 1500 and 8900
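To put numbers on the overhead argument, a quick sketch (assuming an IPv4 outer header and no outer 802.1Q tag):

```shell
# VXLAN wraps the whole inner Ethernet frame in outer headers:
#   outer Ethernet 14 + outer IPv4 20 + UDP 8 + VXLAN 8 bytes
eth=14; ipv4=20; udp=8; vxlan=8
overhead=$((eth + ipv4 + udp + vxlan))
echo "encap overhead: ${overhead} bytes"

# Largest inner frame that fits a 9000-byte underlay MTU without fragmenting:
underlay_mtu=9000
echo "max inner frame: $((underlay_mtu - overhead)) bytes"
```

With a 50-byte overhead, a 9000 underlay MTU leaves 8950 for the inner frame; recommending 8900 on the vNIC simply leaves a safety margin (e.g. for an outer VLAN tag or IPv6 outer header).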
If it were an MTU issue, he should be hitting the same problem within the pod, which is not the case. Within the pod everything works fine.