Hi everybody,
I'm doing my first implementation of NSX-T and have an issue with the Tier-0 and Tier-1 Gateways that I think is caused by a mistake of mine.
I deployed a Tier-0 Gateway that has only one uplink, to a physical core switch, using a VLAN segment as the uplink (192.168.16.0/24). For its downlinks I have two LIFs on overlay segments (10.10.101.0/24 and 192.168.17.0/24), plus a third link connected to a Tier-1 Gateway (the auto-plumbed NSX network).
On the Tier-1 Gateway, the uplink is the auto-plumbed network and the downlinks are two overlay segments (10.10.102.0/24 and 10.10.103.0/24).
The problem I'm facing: both gateways route traffic between their own downlinks without problems. That is, hosts on 192.168.17.0/24 can ping hosts on 10.10.101.0/24 and also the auto-plumbed network (100.64.112.0/31). The same happens on the Tier-1 Gateway: hosts on 10.10.102.0/24 can ping hosts on 10.10.103.0/24. But nothing can ping hosts or gateway interfaces that require traversing the other gateway. For example, hosts on overlay segments connected to the Tier-1 Gateway cannot ping hosts on the Tier-0 Gateway's overlay segment, nor can they ping the inter-gateway interfaces.
The question is: could this be happening because each gateway is not actually a single device but two components, the DR and the SR, and the DRs don't know how to reach their SR halves?
One more thing: I'm using neither BGP nor static routes. What I have observed is that when I choose to advertise the connected subnets on the Tier-1 Gateway, from the VRF CLI of the Tier-0 Gateway SR I can see those subnets in the "get route" output flagged as "t1c" (Tier-1 Connected). But from that same Tier-0 SR CLI, if I ping those LIFs on the Tier-1 DR I cannot reach them.
It seems that there is no connectivity between the DR and SR entities, although there is a subnet between them (169.254.0.0/24).
The other problem is that I don't know what the Tier-1 Gateway's routing table looks like, because the "get route" command is only available on the Tier-0 Gateway and not on the Tier-1 one. This might be a conceptual error on my part; I don't understand why that is.
All the documents and videos I have read and watched show these kinds of implementations with BGP enabled, and perhaps that would solve all the issues I'm having. But because the customer doesn't currently have a core switch that supports BGP, I preferred not to use a dynamic routing protocol. If it turns out the problem is that I'm not using BGP, I could enable it just for the internal virtual networking without advertising any routes to the physical network. That is, I could use BGP for the NSX networking and static routes between the Edge and the physical network (I know it's not optimal). For the moment, production will have only one or two NSX segments, which is manageable through static routes. In the future the customer will replace the core with a BGP-capable one.
I'd appreciate any help figuring out why this is going on.
I attach a network diagram below.
Thanks in advance.
Guido.
I finally solved the problem. I had to open a support ticket with VMware.
The problem was that the tunnels between the Host Transport Nodes and the Edge Nodes were not up and running. However, in the NSX Manager GUI they appeared as up and green! There is a bug there: the GUI shows tunnels as UP and running when they are actually down, with no connectivity between the VTEPs.
Those tunnels were down because there was no layer-3 connectivity between the Host Transport Node VTEPs and the Edge Node VTEP. This infrastructure was running on an HP C7000 chassis with two Virtual Connect switches, and there was a problem in the configuration of the core switch's LAG ports. On the core, the ports connected to the HP Virtual Connect switches should not have been configured as a LAG (port-channel) but as single ports.
We unconfigured the LAG and everything began to work.
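For anyone hitting the same thing, the change on the core looked roughly like this (a hedged, Cisco-style sketch; the interface range and channel-group number are hypothetical, and other vendors' syntax will differ):

```shell
# Print the illustrative config fragment; the real change is applied
# on the core switch console, not in a shell.
CFG='interface range TenGigabitEthernet1/0/1 - 2
 no channel-group 10 mode active
 switchport mode trunk'
echo "$CFG"
```

The key point is simply that each core-facing link to the Virtual Connect modules must run as an independent switchport, since Virtual Connect was not terminating a port-channel on its side.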
Thanks for your help.
Guido.
Hi,
I am not seeing this behavior.
When I look on the T0-SR, I see:
t1c> * 10.1.192.0/24 [3/0] via 100.64.32.7, downlink-500, 00:00:02
When I move the segment to the T0 I see:
t0c> * 10.1.192.0/24 is directly connected, downlink-529, 00:01:11
In both cases I can ping a host in the segment from behind a T1.
You can see some routes on the T1-DR using "get forwarding"
T1-DR > get forwarding
Logical Router
UUID VRF LR-ID Name Type
b5c5f619-4368-4f1f-b316-7a1fd6c1f92d 10 8 DR-T1-test DISTRIBUTED_ROUTER_TIER1
IPv4 Forwarding Table
IP Prefix Gateway IP Type UUID Gateway MAC
0.0.0.0/0 169.254.0.2 route d58d74b5-6537-4c68-9293-6ad10fd97a4f 02:50:56:56:53:00
169.254.0.0/28 route d58d74b5-6537-4c68-9293-6ad10fd97a4f
169.254.0.1/32 route fc58ab32-deee-5d9d-b1f3-8c7aedc10a98
Are all MTUs OK? Do you see Geneve tunnels?
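A quick way to sanity-check the underlay path MTU between TEP addresses (a sketch; the target IP is a placeholder for one of your TEPs, and Geneve needs roughly 50+ bytes of headroom on top of the guest frame, hence the usual 1600 underlay MTU):

```shell
# ICMP payload that exercises a 1600-byte underlay MTU:
# subtract the 20-byte IP header and 8-byte ICMP header.
MTU=1600
PAYLOAD=$((MTU - 28))
echo "$PAYLOAD"
# On a Linux host, send non-fragmentable pings of that size
# (-M do sets the Don't Fragment bit; 192.168.30.12 is hypothetical):
# ping -M do -s "$PAYLOAD" -c 3 192.168.30.12
```

If the large ping fails while a default 56-byte ping succeeds, something on the path is dropping jumbo-ish frames.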
Chris,
First of all, thank you for your answer.
I'll paste the output of some commands:
From Tier-0 SR:
nsxtesg0(tier0_sr)> get route
Flags: t0c - Tier0-Connected, t0s - Tier0-Static, b - BGP,
t0n - Tier0-NAT, t1s - Tier1-Static, t1c - Tier1-Connected,
t1n: Tier1-NAT, t1l: Tier1-LB VIP, t1ls: Tier1-LB SNAT,
t1d: Tier1-DNS FORWARDER, t1ipsec: Tier1-IPSec, isr: Inter-SR,
> - selected route, * - FIB route
Total number of routes: 9
t0c> * 10.10.101.0/24 is directly connected, downlink-337, 1d16h22m
t1c> * 10.10.102.0/24 [3/0] via 100.64.112.1, downlink-294, 1d06h53m
t1c> * 10.10.103.0/24 [3/0] via 100.64.112.1, downlink-294, 1d06h53m
t0c> * 100.64.112.0/31 is directly connected, downlink-294, 1d16h22m
t0c> * 169.254.0.0/24 is directly connected, downlink-344, 1d16h22m
t0c> * 192.168.16.0/24 is directly connected, uplink-346, 1d16h22m
t0c> * 192.168.17.0/24 is directly connected, downlink-328, 1d16h22m
t0c> * fcef:2313:2800:800::/64 is directly connected, downlink-294, 1d16h22m
t0c> * fe80::/64 is directly connected, downlink-344, 1d16h22m
The two "t1c" routes appeared when I selected to redistribute connected routes on the Tier-1 Gateway. Before that I had only seen the "t0c" routes.
If I ping from this Tier-0 SR to 192.168.16.1, the uplink interface that connects to the physical routers (it's on a VLAN-backed segment), I reach the destination, which is logical since it's a directly connected interface:
nsxtesg0(tier0_sr)> ping 192.168.16.1
PING 192.168.16.1 (192.168.16.1): 56 data bytes
64 bytes from 192.168.16.1: icmp_seq=0 ttl=255 time=7.479 ms
64 bytes from 192.168.16.1: icmp_seq=1 ttl=255 time=1.731 ms
64 bytes from 192.168.16.1: icmp_seq=2 ttl=255 time=1.995 ms
64 bytes from 192.168.16.1: icmp_seq=3 ttl=255 time=1.234 ms
Moreover, I also reach the core switch's interface (192.168.16.2):
nsxtesg0(tier0_sr)> ping 192.168.16.2
PING 192.168.16.2 (192.168.16.2): 56 data bytes
64 bytes from 192.168.16.2: icmp_seq=0 ttl=64 time=0.635 ms
64 bytes from 192.168.16.2: icmp_seq=1 ttl=64 time=0.549 ms
But if I ping one of the LIFs that is also "directly connected," I cannot reach it. For example, 10.10.101.1:
nsxtesg0(tier0_sr)> ping 10.10.101.1
PING 10.10.101.1 (10.10.101.1): 56 data bytes
^C
--- 10.10.101.1 ping statistics ---
5 packets transmitted, 0 packets received, 100.0% packet loss
At this point I thought this was happening because that overlay segment is actually directly connected to the T0 DR, not to the SR. So I moved to the Tier-0 DR:
From Tier-0 DR:
nsxtesg0> get logical-router
Logical Router
UUID VRF LR-ID Name Type Ports
736a80e3-23f6-5a2d-81d6-bbefb2786666 0 0 TUNNEL 4
b958c9bb-0d97-4c73-b39b-73b045e5ed76 3 1026 SR-RosarioT-1-Gateway SERVICE_ROUTER_TIER1 5
224c5db1-d92f-488f-9787-cb736b2cd396 4 2 DR-RosarioT-0-Gateway DISTRIBUTED_ROUTER_TIER0 6
002b63f3-688b-4fc7-84f1-bed9b00cc035 6 4 SR-RosarioT-0-Gateway SERVICE_ROUTER_TIER0 5
e6eb4986-7e44-4dfa-b1d7-1da0c8ac3d41 8 1025 DR-RosarioT-1-Gateway DISTRIBUTED_ROUTER_TIER1 5
nsxtesg0> vrf 4
nsxtesg0(vrf)> ping 10.10.101.1
PING 10.10.101.1 (10.10.101.1): 56 data bytes
64 bytes from 10.10.101.1: icmp_seq=0 ttl=64 time=0.677 ms
64 bytes from 10.10.101.1: icmp_seq=1 ttl=64 time=1.041 ms
64 bytes from 10.10.101.1: icmp_seq=2 ttl=64 time=0.665 ms
And from the DR I can reach the Tier-0 LIFs. So I thought it might be a problem in the connection between the SR and the DR. For some reason, the SR knows that it has a directly connected LIF, but it doesn't know how to reach it. Very strange behaviour, because on a normal router, if an interface is directly connected, the router can always reach it. (I don't have much experience working with VRFs; perhaps this is normal behaviour. I don't think so, though, because you don't usually have a router split into two parts, one DR and one SR.)
The last test on the Tier-0 Gateway: if I ping from host 10.10.101.11 to host 192.168.17.11 (two segments connected by the Tier-0 Gateway), it works OK.
From Tier-1 SR:
If I move to Tier-1 SR:
nsxtesg0> vrf 3
nsxtesg0(tier1_sr)> get forwarding
Logical Router
UUID VRF LR-ID Name Type
b958c9bb-0d97-4c73-b39b-73b045e5ed76 3 1026 SR-RosarioT-1-Gateway SERVICE_ROUTER_TIER1
IPv4 Forwarding Table
IP Prefix Gateway IP Type UUID Gateway MAC
0.0.0.0/0 100.64.112.0 route ad8ae731-1476-4b0f-b6f7-772b7cfdeb23
10.10.102.0/24 route b9ba19cf-af79-4c88-b6a3-8749d767708a
10.10.102.1/32 route ccb76982-83e8-5ef4-8d6f-067542430bab
10.10.103.0/24 route b4d4b235-70f1-4c25-99cf-4c205ecb5044
10.10.103.1/32 route ccb76982-83e8-5ef4-8d6f-067542430bab
100.64.112.0/31 route ad8ae731-1476-4b0f-b6f7-772b7cfdeb23
100.64.112.1/32 route 15f2f498-b4ad-59de-a31c-1d165cd3ecff
127.0.0.1/32 route 2e52e6b9-d924-45fb-aa80-7568525e3630
169.254.0.0/28 route fe859aa2-5095-42d7-973c-2b8dfccad54c
169.254.0.1/32 route ccb76982-83e8-5ef4-8d6f-067542430bab
169.254.0.2/32 route 15f2f498-b4ad-59de-a31c-1d165cd3ecff
IPv6 Forwarding Table
IP Prefix Gateway IP Type UUID Gateway MAC
::/0 fcef:2313:2800:800::1 route ad8ae731-1476-4b0f-b6f7-772b7cfdeb23
::1/128 route 2e52e6b9-d924-45fb-aa80-7568525e3630
fcef:2313:2800:800::/64 route ad8ae731-1476-4b0f-b6f7-772b7cfdeb23
fcef:2313:2800:800::2/128 route 15f2f498-b4ad-59de-a31c-1d165cd3ecff
As you can see, I don't have any entries for the segments directly connected to the Tier-0 DR or the Tier-0 SR, only a default route. Something strange here. I can ping the LIF of the overlay segment directly connected to the Tier-0 DR, but I cannot ping a host on that segment:
nsxtesg0(tier1_sr)> ping 10.10.101.1
PING 10.10.101.1 (10.10.101.1): 56 data bytes
64 bytes from 10.10.101.1: icmp_seq=0 ttl=64 time=1.428 ms
64 bytes from 10.10.101.1: icmp_seq=1 ttl=64 time=1.262 ms
64 bytes from 10.10.101.1: icmp_seq=2 ttl=64 time=1.439 ms
nsxtesg0(tier1_sr)> ping 10.10.101.11
PING 10.10.101.11 (10.10.101.11): 56 data bytes
--- 10.10.101.11 ping statistics ---
4 packets transmitted, 0 packets received, 100.0% packet loss
And I also cannot ping its own LIF, although it is directly connected to it (the same behaviour as on the Tier-0 SR):
nsxtesg0(tier1_sr)> ping 10.10.103.1
PING 10.10.103.1 (10.10.103.1): 56 data bytes
--- 10.10.103.1 ping statistics ---
5 packets transmitted, 0 packets received, 100.0% packet loss
But I can ping the uplink interface that connects to the Tier-0 Gateway:
nsxtesg0(tier1_sr)> ping 100.64.112.1
PING 100.64.112.1 (100.64.112.1): 56 data bytes
64 bytes from 100.64.112.1: icmp_seq=0 ttl=64 time=0.708 ms
64 bytes from 100.64.112.1: icmp_seq=1 ttl=64 time=1.432 ms
64 bytes from 100.64.112.1: icmp_seq=2 ttl=64 time=1.384 ms
From Tier-1 DR:
Lastly, if I move to the Tier-1 DR I get:
nsxtesg0> vrf 8
nsxtesg0(vrf)> get forwarding
Logical Router
UUID VRF LR-ID Name Type
e6eb4986-7e44-4dfa-b1d7-1da0c8ac3d41 8 1025 DR-RosarioT-1-Gateway DISTRIBUTED_ROUTER_TIER1
IPv4 Forwarding Table
IP Prefix Gateway IP Type UUID Gateway MAC
0.0.0.0/0 100.64.112.0 route ad8ae731-1476-4b0f-b6f7-772b7cfdeb23 02:50:56:56:44:52
10.10.102.0/24 route b9ba19cf-af79-4c88-b6a3-8749d767708a
10.10.102.1/32 route ccb76982-83e8-5ef4-8d6f-067542430bab
10.10.103.0/24 route b4d4b235-70f1-4c25-99cf-4c205ecb5044
10.10.103.1/32 route ccb76982-83e8-5ef4-8d6f-067542430bab
100.64.112.0/31 route ad8ae731-1476-4b0f-b6f7-772b7cfdeb23
100.64.112.1/32 route 15f2f498-b4ad-59de-a31c-1d165cd3ecff
127.0.0.1/32 route 2e52e6b9-d924-45fb-aa80-7568525e3630
169.254.0.0/28 route fe859aa2-5095-42d7-973c-2b8dfccad54c
169.254.0.1/32 route ccb76982-83e8-5ef4-8d6f-067542430bab
169.254.0.2/32 route 15f2f498-b4ad-59de-a31c-1d165cd3ecff
IPv6 Forwarding Table
IP Prefix Gateway IP Type UUID Gateway MAC
::/0 fcef:2313:2800:800::1 route ad8ae731-1476-4b0f-b6f7-772b7cfdeb23
::1/128 route 2e52e6b9-d924-45fb-aa80-7568525e3630
fcef:2313:2800:800::/64 route ad8ae731-1476-4b0f-b6f7-772b7cfdeb23
fcef:2313:2800:800::2/128 route 15f2f498-b4ad-59de-a31c-1d165cd3ecff
As you can see, it's the same forwarding table as on the Tier-1 SR (which makes sense).
But from here I can ping its LIFs:
nsxtesg0(vrf)> ping 10.10.103.1
PING 10.10.103.1 (10.10.103.1): 56 data bytes
64 bytes from 10.10.103.1: icmp_seq=0 ttl=64 time=0.742 ms
64 bytes from 10.10.103.1: icmp_seq=1 ttl=64 time=0.885 ms
64 bytes from 10.10.103.1: icmp_seq=2 ttl=64 time=1.233 ms
--- 10.10.103.1 ping statistics ---
4 packets transmitted, 3 packets received, 25.0% packet loss
round-trip min/avg/max/stddev = 0.742/0.953/1.233/0.206 ms
But I cannot ping its uplink interface that connects to the Tier-0 Gateway:
nsxtesg0(vrf)> ping 100.64.112.1
PING 100.64.112.1 (100.64.112.1): 56 data bytes
--- 100.64.112.1 ping statistics ---
6 packets transmitted, 0 packets received, 100.0% packet loss
So, as I said in my first post, it seems that there is something broken in the connection between the gateways' DR and SR components.
I think the MTUs are OK; everything is at 1600 except the VLAN segments. I don't think this would cause issues right now, because I'm testing with standard pings that have 56 bytes of payload.
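For reference, the arithmetic behind that small-ping reasoning can be sketched like this (using the base Geneve encapsulation overhead and ignoring Geneve options and the outer Ethernet header):

```shell
# Outer IP (20) + outer UDP (8) + Geneve base header (8) = 36 bytes
# of encapsulation on top of the inner IP packet.
ENCAP=$((20 + 8 + 8))
# A standard 56-byte ping: inner IP (20) + ICMP (8) + 56 payload, encapsulated.
SMALL=$((20 + 8 + 56 + ENCAP))
# A full-size 1500-byte guest IP packet, encapsulated.
FULL=$((1500 + ENCAP))
echo "$SMALL $FULL"
```

So a default ping rides the tunnel in a tiny frame, well under any plausible MTU, while a full-size guest frame needs the underlay to carry more than 1500 bytes; that's consistent with small pings hiding an MTU problem rather than proving there isn't one.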
Some tunnels are down, but I think they are in that state because there is no traffic on them, since they come up when I create new VMs or generate traffic.
One question, Chris, since you say you don't observe this behaviour in your infrastructure: are you using BGP or just static routes? At the moment I haven't activated any routing protocol in my infrastructure. Perhaps activating BGP and redistributing routes on both the Tier-0 and Tier-1 Gateways would solve the interconnection problem I'm having between the gateways' SR and DR components. I don't know, just an idea.
Thank you.
Guido.
Hi,
I indeed have a Tier-0 with BGP running.
But for testing purposes I created a second Tier-0 without any BGP enabled.
I connected a segment to that Tier-0 and also connected the T1, with its segment, to that Tier-0.
I am using physical edge nodes. But that should not make any difference.
I am able to ping without any problems.
Is it maybe a firewall rule on your VM? Windows especially blocks stuff by default.
Double-check your MTU settings, including the setting on the vSwitch (when using virtual edge nodes).
I'll check the uplink profiles of the edge. In my case the edge node is virtual. I'm using two N-VDS switches in the edge node, one for the overlay traffic and one for the VLAN-backed segment. Perhaps there's something wrong there.
What I'm testing is a straightforward configuration; I shouldn't be having these problems. And the edge node configuration might have something to do with it, because my problems are with the connection between the gateways' DR and SR components, and those connections take place inside the edge node.
All the VMs I'm testing with are Lubuntu, a light version of Ubuntu. They have no firewall activated. On the other hand, I'm also having problems pinging the routers' interfaces from inside the gateways themselves!
I'll let you know if I find the problem.
Thanks for everything Chris.
Guido.
Check the MTU on the vSwitch the virtual edge is connected to (also on your physical switch).
A lot of issues appear when MTU settings are wrong.
I think I found the problem. There are some Geneve tunnels between the transport nodes and edge nodes that are not coming up.
You asked me to check the tunnel status, and I saw that some of the tunnels were down; but I had read somewhere that those tunnels come up and go down depending on whether the transport nodes have VMs connected to logical switches. That is, if no VMs are connected to a logical switch (segment) on a transport node, that node doesn't bring up the Geneve tunnel to the other transport nodes, because it makes no sense to maintain a tunnel that won't carry any traffic.
What I hadn't noticed was that two of my transport nodes had never brought up their tunnels to the edge nodes; their tunnels to the other transport nodes were fine, but not the ones to the edge nodes. When I used the trace tool in the NSX GUI I realized the traffic was getting stuck at the tunnel that didn't exist!
My infrastructure for now is three fully collapsed nodes, used simultaneously as transport, management, and edge nodes. What I don't understand is why only one of them has its Geneve tunnels up to the edge transport node while the other two don't. Everything I've read about sharing a host between edge and host transport node roles says you cannot use the same VLAN ID for overlay traffic when the host transport node VTEPs and the edge transport node VTEPs share the same virtual switch. I am using the same VLAN ID and layer-3 segment for all the VTEPs, but I do NOT use the same virtual switch for them inside the host: the host transport node uses an N-VDS with two physical vmnics (vmnic2 and vmnic3), and the edge transport node VM is connected to a separate vSphere VDS with another two physical vmnics (vmnic0 and vmnic1). I'll open another thread to see if someone knows what could be happening.
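In case it helps anyone debugging the same symptom, TEP-to-TEP reachability can be tested directly from an ESXi host instead of trusting the GUI (a hedged sketch; the TEP IP is a placeholder, and the `vxlan` netstack name is an assumption to verify with "esxcli network ip interface list" on your host):

```shell
# Build a vmkping that sources from the TEP netstack, sets the DF bit
# (-d) and uses a payload sized for a 1600-byte underlay MTU, so it
# also validates the MTU end to end. Run the printed command on ESXi.
SIZE=$((1600 - 28))   # 1600 minus IP (20) + ICMP (8) headers
echo "vmkping ++netstack=vxlan -d -s $SIZE 192.168.30.12"
```

If that fails between a host TEP and an edge TEP while small pings work, the underlay is dropping large frames, regardless of what tunnel status the GUI reports.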
Thank you very much.
Guido.
Where are you getting the icons/stencils for the drawings?
Hi,
They're in the NSX-T GUI itself.
Yeah, I know; I was hoping there were external icons for concept diagrams.