
Dear readers

Welcome to this new blog post talking about static routing with the NSX-T Tier-0 Gateway. The majority of our customers use BGP for the Tier-0 Gateway to Top of Rack (ToR) switch connectivity to exchange IP prefixes. For those customers who prefer static routing, this blog post covers the two design options.

  • Design Option 1: Static Routing using SVI as Next Hop with NSX-T Edge Node in Active/Active Mode to support ECMP for North/South
  • Design Option 2: Static Routing using SVI as Next Hop with NSX-T Edge Node in Active/Standby Mode using HA VIP

I have the impression that the second design option, a Tier-0 Gateway with two NSX-T Edge Nodes in Active/Standby mode using HA VIP, is widely known, but the first option with NSX-T Edge Nodes in Active/Active mode leveraging ECMP with static routing is pretty unknown. This first option is, for example, also a valid Enterprise PKS (new name is Tanzu Kubernetes Grid Integrated Edition - TKGI) design option (with shared Tier-1 Gateway) and can be used with vSphere 7 with Kubernetes (Project Pacific) as well, where BGP is not allowed or not preferred. I am sure the reader is aware that a Tier-0 Gateway in Active/Active mode cannot be enabled for stateful services (e.g. Edge firewall).

 

Before we start configuring these two design options, we need to describe the overall lab topology, the physical and logical setup, and the NSX-T Edge Node setup including the main NSX-T Edge Node installation steps. For both options we will configure only a single N-VDS on the NSX-T Edge Node. This is not a requirement, but it is considered a pretty simple design option. The other popular design options typically consist of three embedded N-VDS on the NSX-T Edge Node for design option 1 and two embedded N-VDS on the NSX-T Edge Node for design option 2.

 

Logical Lab Topology

The lab setup is pretty simple. For an easy comparison between the two options, I have configured both design options in parallel. The most relevant part for this blog post is between the two Tier-0 Gateways and the two ToR switches acting as Layer 3 Leaf switches. The configuration and design for the Tier-1 Gateway and the compute vSphere cluster hosting the eight workload Ubuntu VMs is identical for both design options. There is only a single Tier-1 Gateway per Tier-0 Gateway configured, each with two overlay segments. The eight workload Ubuntu VMs are installed on a separate compute vSphere cluster called NY-CLUSTER-COMPUTE1 with only two ESXi hosts and are evenly distributed across the two ESXi hosts. Those two compute ESXi hosts are prepared with NSX-T and have only a single overlay Transport Zone configured. The four NSX-T Edge Node VMs are running on another vSphere cluster, called NY-CLUSTER-EDGE1. This vSphere cluster again has only two ESXi hosts. A third vSphere cluster called NY-CLUSTER-MGMT is used for the management components, like vCenter and the NSX-T managers. Details about the compute and management vSphere clusters are not relevant for this blog post and hence are deliberately omitted.

The diagram below shows the NSX-T logical topology, the most relevant vSphere objects and the underlying NSX-T overlay and VLAN segments (for the NSX-T Edge Node North/South connectivity).

Overall Lab Topology Combined.png

 

Physical Setup

Let's first have a look at the physical setup used for our four NSX-T VM-based Edge Nodes. Understanding the physical setup is no less important than understanding the logical setup. Two Nexus 3048 ToR switches configured as Layer 3 Leaf switches are used. They have a Layer 3 connection towards a single spine (not shown) and two Layer 2 trunks combined into a single port channel with LACP between the two ToR switches. Two ESXi hosts (ny-esx50a and ny-esx51a) with 4 pNICs in total are assigned to two different virtual Distributed Switches (vDS). Please note, the Nexus 3048 switches are not configured with Cisco vPC, even though this would also be a valid option.

Networking – Physical Diagram.png

The relevant physical links for the NSX-T Edge Nodes connectivity are the four green links only connected to vDS2.
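As a side note, the inter-switch connection mentioned above can be reproduced with a few lines of NX-OS configuration. The sketch below is only an illustration and not taken from the lab configuration; the port channel number (10) and the member interfaces (Ethernet1/51-52) are assumptions, and the allowed VLAN list is limited to the Edge-relevant VLANs described later in this post.

feature lacp
!
! InterSwitch Layer 2 trunk to the other ToR switch (interfaces and port channel number assumed)
interface Ethernet1/51-52
  switchport mode trunk
  switchport trunk allowed vlan 60,151,160,161
  channel-group 10 mode active
!
interface port-channel10
  switchport mode trunk
  switchport trunk allowed vlan 60,151,160,161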

 

Those two ESXi hosts (ny-esx50a and ny-esx51a) are NOT prepared for NSX-T. The two ESXi hosts belong to a single vSphere Cluster exclusively used for NSX-T Edge Node VMs. There are a few good reasons NOT to prepare ESXi hosts with NSX-T when they host only NSX-T Edge Node VMs:

  • It is not required
  • Better NSX-T upgrade-ability (you don't need to evacuate the NSX-T VM-based Edge Nodes with vMotion during the host NSX-T software upgrade to enter maintenance mode; every vMotion of an NSX-T VM-based Edge Node causes a short, unnecessary data plane glitch)
  • Shorter NSX-T upgrade cycles (for every NSX-T upgrade you only need to upgrade the ESXi hosts which are used for the payload VMs and the NSX-T VM-based Edge Nodes themselves, but not the ESXi hosts where your Edge Nodes are deployed)
  • vSphere HA can be turned off (do we want to move a highly loaded packet forwarding node like an NSX-T Edge Node with vMotion in a host vSphere HA event? No, I don't think so - the routing HA model reacts faster to a failure event)
  • Simplified DRS settings (do we want to move an NSX-T VM-based Edge Node with vMotion to balance the resources?)
  • Typically a resource pool is not required

We should never underestimate how important smooth upgrade cycles are. Upgrade cycles are time-consuming events and are typically required multiple times per year.

Having the ESXi hosts NOT prepared for NSX-T is considered best practice and should be the approach in any NSX-T deployment which can afford a dedicated vSphere Cluster only for NSX-T VM-based Edge Nodes. Installing NSX-T on the ESXi hosts where you have deployed your NSX-T VM-based Edge Nodes (called a collapsed design) is valid too and appropriate for customers who have a low number of ESXi hosts and want to keep the CAPEX costs low.

 

ESXi Host vSphere Networking

The first virtual Distributed Switch (vDS1) is used for the host vmkernel networking only. The typical vmkernel interfaces are attached to three different port groups. The second virtual Distributed Switch (vDS2) is used for the NSX-T VM-based Edge Node networking only. All virtual Distributed Switch port groups are tagged with the appropriate VLAN id, with the exception of the three uplink trunk port groups (more details later). Both virtual Distributed Switches are configured for MTU 9000 bytes, and I am using different Geneve Tunnel End Point (TEP) VLANs for the compute ESXi hosts (VLAN 150 for ny-esx70a and ny-esx71a) and for the NSX-T VM-based Edge Nodes (VLAN 151) running on the ESXi hosts (ny-esx50a and ny-esx51a). In such a setup this is not a requirement, but it helps to distribute the BUM traffic replication effort leveraging the hierarchical 2-Tier replication mode. The "dummy" port group is used to connect the unused NSX-T Edge Node fast path interfaces (fp-ethX); the attachment to a dummy port group avoids NSX-T reporting the interface with admin status down.

 

Table 1 - vDS Setup Overview

Name in Diagram: vDS1
vDS Name: NY-vDS-ESX5x-EDGE1
Physical Interfaces: vmnic0 and vmnic1
Port Groups:

  • NY-vDS-PG-ESX5x-EDGE1-VMK0-Mgmt50
  • NY-vDS-PG-ESX5x-EDGE1-VMK1-vMotion51
  • NY-vDS-PG-ESX5x-EDGE1-VMK2-ipStorage52

Name in Diagram: vDS2
vDS Name: NY-vDS-ESX5x-EDGE2
Physical Interfaces: vmnic2 and vmnic3
Port Groups:

  • NY-vDS-PG-ESX5x-EDGE2-EDGE-Mgmt60 (Uplink 1 active, Uplink 2 standby)
  • NY-vDS-PG-ESX5x-EDGE2-EDGE-TrunkA (Uplink 1 active, Uplink 2 unused)
  • NY-vDS-PG-ESX5x-EDGE2-EDGE-TrunkB (Uplink 1 unused, Uplink 2 active)
  • NY-vDS-PG-ESX5x-EDGE2-EDGE-TrunkC (Uplink 1 active, Uplink 2 active)
  • NY-vDS-PG-ESX5x-EDGE2-Dummy999 (Uplink 1 and Uplink 2 unused)

 

The combined diagram below shows the most relevant NY-vDS-ESX5x-EDGE2 port group settings regarding VLAN trunking and Teaming and Failover.

vDS2 trunk port groups A and B and C.png

 

Logical VLAN Setup

The ToR switches are configured with the four relevant VLANs (60, 151, 160 and 161) for the NSX-T Edge Nodes and the associated Switched Virtual Interfaces (SVI). The VLANs 151, 160 and 161 (VLAN 161 is not used in design option 2) are carried over the three vDS trunk port groups (NY-vDS-PG-ESX5x-EDGE2-EDGE-TrunkA, NY-vDS-PG-ESX5x-EDGE2-EDGE-TrunkB and NY-vDS-PG-ESX5x-EDGE2-EDGE-TrunkC). The SVIs on the Nexus 3048 for Edge Management (VLAN 60) and for the Edge Node TEP (VLAN 151) are configured with HSRPv2 with a VIP of .254. The two SVIs on the Nexus 3048 for the Uplink VLANs (160 and 161) are configured without HSRP. VLAN 999 as the dummy VLAN does not exist on the ToR switches. The Tier-1 Gateways are not shown in the diagrams below.
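To illustrate the SVI setup described above, the NX-OS sketch below shows how it could look on NY-N3K-LEAF-10. Only the HSRP VIP of .254 (VLAN 151) and the SVI IP 172.16.160.254 (VLAN 160, used later as static route Next Hop) are taken from this lab; the physical SVI address on VLAN 151 and the /24 subnet masks are assumptions, and the Edge Management SVI (VLAN 60) follows the same HSRPv2 pattern as VLAN 151.

feature interface-vlan
feature hsrp
!
vlan 60,151,160,161
!
! Edge Node TEP VLAN - HSRPv2 with a VIP of .254 (the physical SVI IP .252 is an assumption)
interface Vlan151
  no shutdown
  ip address 172.16.151.252/24
  hsrp version 2
  hsrp 151
    ip 172.16.151.254
!
! Uplink VLAN 160 - no HSRP, the SVI itself is the Next Hop for the Tier-0 static routes
interface Vlan160
  no shutdown
  ip address 172.16.160.254/24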

 

Please note, the dotted lines to SVI161 and SVI160 respectively indicate that the VLAN/SVI configuration exists on the ToR switch, but is not used for static routing in design option 1 (Active/Active ECMP with static routing).

And the dotted line to SVI161 in design option 2 indicates that the VLAN/SVI configuration exists on the ToR switches, but is not used for static routing with Active/Standby and HA VIP. More details about the static routing are shown in a later step.

Networking – Logical VLAN Diagram Option 1&2.png

 

 

NSX-T Edge Node Deployment

The NSX-T Edge Node deployment option with the single Edge Node N-VDS is simple and has been discussed in one of my other blog posts. In this lab exercise I have done an NSX-T Edge Node ova installation, followed by the "join" command and the final step of the NSX-T Edge Transport Node configuration. The NSX-T UI installation option is valid as well, but my personal preference is the ova deployment option. The most relevant steps for such an NSX-T Edge Node setup are the correct placement of the dot1q tagging and the correct mapping of the NSX-T Edge Node interfaces to the virtual Distributed Switch (vDS2) trunk port groups (A & B for option 1 and C for option 2), as shown in the diagrams below.

 

The diagram below shows the NSX-T Edge Node overall setup and the network selection for the NSX-T Edge Node 20 & 21 during the ova deployment for the design option 1:

Networking – NSX-T Edge Combined Design 1.png

 

The diagram below shows the NSX-T Edge Node overall setup and the network selection for the NSX-T Edge Node 22 & 23 during the ova deployment for the design option 2:

Networking – NSX-T Edge Combined Design 2.png

After the successful ova deployment, the "join" command must be used to connect the management plane of the NSX-T Edge Nodes to the NSX-T managers. The "join" command requires the NSX-T manager thumbprint. Jump with SSH to the first NSX-T manager and read the API thumbprint. Then jump via SSH to every ova-deployed NSX-T Edge Node and execute the "join" command. The two steps are shown in the table below:

 

Table 2 - NSX-T Edge Node "join" to the NSX-T Managers

Step
Command Example
Device
Comments
Read API Thumbprint

ny-nsxt-manager-21> get certificate api thumbprint

ea90e8cc7adb6d66994a9ecc0a930ad4bfd1d09f668a3857e252ee8f74ba1eb4

first NSX-T manager

N/A
Join the NSX-T Manager for each NSX-T Edge Node

ny-nsxt-edge-node-20> join management-plane ny-nsxt-manager-21.corp.local thumbprint ea90e8cc7adb6d66994a9ecc0a930ad4bfd1d09f668a3857e252ee8f74ba1eb4 username admin

Password for API user:

Node successfully registered as Fabric Node: 437e2972-bc40-11ea-b89c-005056970bf2

 

ny-nsxt-edge-node-20>

 

--- do the same for all other NSX-T Edge Nodes ---

on all previously deployed NSX-T Edge Nodes (deployed through ova)

NSX-T will sync the configuration with the two other NSX-T managers

Do not join using the NSX-T manager VIP FQDN/IP

 

The resulting UI after the "join" command is shown below. The configuration state must be "Configure NSX".

NSX-T View after Edge Join.png

 

NSX-T Edge Transport Node Configuration

Before we can start with the NSX-T Edge Transport Node configuration, we need to be sure that the Uplink Profiles are ready. The two design options require two different Uplink Profiles. The two diagrams below show the two different Uplink Profiles for the NSX-T Edge Transport Nodes:

NY-EDGE-UPLINK-PROFILE-COMBINED.png

The Uplink Profile "NY-EDGE-UPLINK-PROFILE-SRC-ID-TEP-VLAN151" is used for design option 1 and is required for Multi-TEP with the teaming policy "LOADBALANCE_SRCID" with two Active Uplinks (EDGE-UPLINK01 and EDGE-UPLINK02). Two additional named teaming policies are configured for a proper ECMP dataplane forwarding; please see blog post "Single NSX-T Edge Node N-VDS with correct VLAN pinning" for more details. I am using the same named teaming configuration for design option 1 as in the other blog post where I have used BGP instead of static routing. As mentioned already, the dot1q tagging (Transport VLAN = 151) for the two TEP interfaces is required as part of this Uplink Profile configuration.

 

The Uplink Profile "NY-EDGE-UPLINK-PROFILE-FAILOVER-TEP-VLAN151" is used for design option 2 and requires the teaming policy "FAILOVER_ORDER" with only a single Active Uplink (EDGE-UPLINK01). Named teaming policies are not required. Again the dot1q tagging for the single TEP interface (Transport VLAN = 151) is required as part of this Uplink Profile configuration.

 

The NSX-T Edge Transport Node configuration itself is straightforward and is shown in the two diagrams below for a single NSX-T Edge Transport Node per design option.

Edge Transport Node Combined.png

NSX-T Edge Transport Node 20 & 21 (design option 1) are using the previous configured Uplink Profile "NY-EDGE-UPLINK-PROFILE-SRC-ID-TEP-VLAN151". Two static TEP IP addresses are configured and the two Uplinks (EDGE-UPLINK01 & EDGE-UPLINK02) are mapped to the fast path interfaces (fp-eth0 & fp-eth1).

 

NSX-T Edge Transport Node 22 & 23 (design option 2) are using the previous configured Uplink Profile "NY-EDGE-UPLINK-PROFILE-FAILOVER-TEP-VLAN151". A single static TEP IP address is configured and the single Uplink (EDGE-UPLINK01) is mapped to the fast path interface (fp-eth0).

 

Please note, the required configuration of the two NSX-T Transport Zones and the single N-VDS switch is not shown.

 

The NSX-T Edge Transport Nodes ny-nsxt-edge-node-20 and ny-nsxt-edge-node-21 are assigned to the NSX-T Edge cluster NY-NSXT-EDGE-CLUSTER01, and the NSX-T Edge Transport Nodes ny-nsxt-edge-node-22 and ny-nsxt-edge-node-23 are assigned to the NSX-T Edge cluster NY-NSXT-EDGE-CLUSTER02. This NSX-T Edge cluster configuration is also not shown.

 

NSX-T Tier-0 Gateway Configuration

The base NSX-T Tier-0 Gateway configuration is straightforward and is shown in the two diagrams below.

The Tier-0 Gateway NY-T0-GATEWAY-01 (design option 1) is configured in Active/Active mode along with the association with the NSX-T Edge Cluster NY-NSXT-EDGE-CLUSTER01.

The Tier-0 Gateway NY-T0-GATEWAY-02 (design option 2) is configured in Active/Standby mode along with the association with the NSX-T Edge Cluster NY-NSXT-EDGE-CLUSTER02. In this example preemptive mode is selected and the first NSX-T Edge Transport Node (ny-nsxt-edge-node-22) is the preferred Edge Transport Node (the active node when both nodes are up and running).

NY-T0-Gateway Combined Design 1&2.png

The next step of the Tier-0 Gateway configuration covers the Layer 3 interfaces (LIF) for the northbound connectivity towards the ToR switches.

The next two diagrams show the IP topology including the ToR switch IP configuration along with the resulting NSX-T Tier-0 Gateway Layer 3 interface configuration for design option 1 (A/A ECMP).

Networking – IP Diagram Combined Option 1.png

The next diagram shows the IP topology including the ToR switch IP configuration along with the resulting NSX-T Tier-0 Gateway interface configuration for design option 2 (A/S HA VIP).

Networking – IP Diagram Combined Option 2.png

The HA VIP configuration requires that both NSX-T Edge Transport Node interfaces belong to the same Layer 2 segment. Here I am using the previously configured Layer 3 interfaces (LIF); both belong to the same VLAN segment 160 (NY-T0-VLAN-SEGMENT-160).

NY-T0-Gateway-02-HA VIP Design 2.png

 

All the previous steps are probably known by the majority of readers. However, the next step is about the static routing configuration; these steps highlight the relevant configuration to achieve ECMP with two NSX-T Edge Transport Nodes in Active/Active mode.

 

Design Option 1 Static Routing (A/A ECMP)

The first step in design option 1 is the Tier-0 static route configuration for northbound traffic. The most common way is to configure default routes northbound.

Two default routes, each with a different Next Hop (172.16.160.254 and 172.16.161.254), are configured on the NY-T0-GATEWAY-01. This is the first step to achieve ECMP for northbound traffic towards the ToR switches. The diagram below shows the corresponding NSX-T Tier-0 Gateway static routing configuration. Please keep in mind that at the NSX-T Edge Transport Node level, each Edge Transport Node will have two default route entries. This is shown in the table below.

The difference between the logical construct configuration (Tier-0 Gateway) and the "physical" construct configuration (the Edge Transport Nodes) might already be known, as we have the same behavior with BGP. This approach limits configuration errors. With BGP we typically configure only two BGP peers towards the two ToR switches, but each NSX-T Edge Transport Node gets two BGP sessions realized.

 

The diagram below shows the setup with the two default routes (in black) northbound.

Networking – IP StaticRouting North Diagram Combined Option 1.png

 

Please note, the configuration steps to configure the Tier-1 Gateway (NY-T1-GATEWAY-GREEN) and to connect it to the Tier-0 Gateway are not shown.

 

Table 3 - NSX-T Edge Transport Node Routing Table for Design Option 1 (A/A ECMP)

ny-nsxt-edge-node-20 (Service Router)
ny-nsxt-edge-node-21 (Service Router)

ny-nsxt-edge-node-20(tier0_sr)> get route 0.0.0.0/0

 

Flags: t0c - Tier0-Connected, t0s - Tier0-Static, b - BGP,

t0n - Tier0-NAT, t1s - Tier1-Static, t1c - Tier1-Connected,

t1n: Tier1-NAT, t1l: Tier1-LB VIP, t1ls: Tier1-LB SNAT,

t1d: Tier1-DNS FORWARDER, t1ipsec: Tier1-IPSec, isr: Inter-SR,

> - selected route, * - FIB route

 

Total number of routes: 1

 

t0s> * 0.0.0.0/0 [1/0] via 172.16.160.254, uplink-307, 03:29:43

t0s> * 0.0.0.0/0 [1/0] via 172.16.161.254, uplink-309, 03:29:43

ny-nsxt-edge-node-20(tier0_sr)>

ny-nsxt-edge-node-21(tier0_sr)> get route 0.0.0.0/0

 

Flags: t0c - Tier0-Connected, t0s - Tier0-Static, b - BGP,

t0n - Tier0-NAT, t1s - Tier1-Static, t1c - Tier1-Connected,

t1n: Tier1-NAT, t1l: Tier1-LB VIP, t1ls: Tier1-LB SNAT,

t1d: Tier1-DNS FORWARDER, t1ipsec: Tier1-IPSec, isr: Inter-SR,

> - selected route, * - FIB route

 

Total number of routes: 1

 

t0s> * 0.0.0.0/0 [1/0] via 172.16.160.254, uplink-292, 03:30:42

t0s> * 0.0.0.0/0 [1/0] via 172.16.161.254, uplink-306, 03:30:42

ny-nsxt-edge-node-21(tier0_sr)>

 

The second step is to configure static routing southbound from the ToR switches towards the NSX-T Edge Transport Nodes. This step is required to achieve ECMP for southbound traffic. Each ToR switch is configured with four static routes in total to forward traffic to the destination overlay networks within NSX-T. As we can easily see, each NSX-T Edge Transport Node is used twice as Next Hop for the static route entries.

Networking – IP StaticRouting South Diagram Option 1.png

Table 4 - Nexus ToR Switches Static Routing Configuration and Resulting Routing Table for Design Option 1 (A/A ECMP)

NY-N3K-LEAF-10
NY-N3K-LEAF-11

ip route 172.16.240.0/24 Vlan160 172.16.160.20

ip route 172.16.240.0/24 Vlan160 172.16.160.21

 

ip route 172.16.241.0/24 Vlan160 172.16.160.20

ip route 172.16.241.0/24 Vlan160 172.16.160.21

ip route 172.16.240.0/24 Vlan161 172.16.161.20

ip route 172.16.240.0/24 Vlan161 172.16.161.21

 

ip route 172.16.241.0/24 Vlan161 172.16.161.20

ip route 172.16.241.0/24 Vlan161 172.16.161.21

NY-N3K-LEAF-10# show ip route static

IP Route Table for VRF "default"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>

 

172.16.240.0/24, ubest/mbest: 2/0

    *via 172.16.160.20, Vlan160, [1/0], 03:26:44, static

    *via 172.16.160.21, Vlan160, [1/0], 03:26:58, static

172.16.241.0/24, ubest/mbest: 2/0

    *via 172.16.160.20, Vlan160, [1/0], 03:26:44, static

    *via 172.16.160.21, Vlan160, [1/0], 03:26:58, static

---snip---

 

NY-N3K-LEAF-10#

NY-N3K-LEAF-11# show ip route static

IP Route Table for VRF "default"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>

 

172.16.240.0/24, ubest/mbest: 2/0

    *via 172.16.161.20, Vlan161, [1/0], 03:27:39, static

    *via 172.16.161.21, Vlan161, [1/0], 03:27:51, static

172.16.241.0/24, ubest/mbest: 2/0

    *via 172.16.161.20, Vlan161, [1/0], 03:27:39, static

    *via 172.16.161.21, Vlan161, [1/0], 03:27:51, static

---snip---

 

NY-N3K-LEAF-11#


Again, these steps are straightforward and they show how we can achieve ECMP with static routing for North/South traffic. But what will happen if, for example, one of the two NSX-T Edge Transport Nodes is down? Let's assume ny-nsxt-edge-node-20 is down. Traffic from the Spine switches will still be forwarded to both ToR switches, and once the ECMP hash is calculated, the traffic is forwarded to one of the four Next Hops (the four Edge Transport Node Layer 3 interfaces). Based on the hash calculation, it could be Next Hop 172.16.160.20 or 172.16.161.20; both interfaces belong to ny-nsxt-edge-node-20. This traffic will be blackholed and dropped! But why do the ToR switches still announce the overlay networks 172.16.240.0/24 and 172.16.241.0/24 to the Spine switches? The reason is simple: for both ToR switches the static route entries are still valid, as VLAN 160/161 and/or the Next Hop are still UP. So from the ToR switch routing table perspective all is fine. These static route entries will potentially never go down, as the Next Hop IP addresses belong to VLAN 160 or VLAN 161, and these VLANs stay in the state UP as long as a single physical port is UP and part of one of these VLANs (assuming the ToR switch is up and running). Even when all attached ESXi hosts are down, the InterSwitch link between the ToR switches is still UP and hence VLAN 160 and VLAN 161 are still UP. Please keep in mind, with BGP this problem does not exist, as we have BGP keepalives, and once the NSX-T Edge Transport Node is down, the ToR switch tears down the BGP session and invalidates the local route entries.
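For example, the following NX-OS show commands can be used on the ToR switch to confirm this behavior (output omitted here): the SVI remains UP as long as at least one port in the VLAN is UP, and consequently the static route entries remain installed.

show interface Vlan160
show ip route 172.16.240.0/24
show ip route static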

But how can we solve the blackholing issue with static routing? The answer is Bidirectional Forwarding Detection (BFD) for static routes.

 

What is BFD?

BFD is nothing else than a purpose-built keepalive protocol that routing protocols and first hop redundancy protocols (e.g. HSRP or VRRP) typically subscribe to. Various protocols can piggyback on a single BFD session. In the context of NSX-T, BFD can detect link failures in sub-seconds (NSX-T Bare Metal Edge Nodes with 3 x 50ms) or near sub-seconds (NSX-T VM-based Edge Nodes with 3 x 500ms). All protocols have some way of detecting failure, usually timer-related. Tuning these timers can theoretically get you sub-second failure detection too, but this produces unnecessarily high overhead, as these protocols weren't designed with that in mind. BFD was specifically built for fast failure detection while maintaining low CPU load. Please keep in mind, if you have, for example, BGP running between two physical routers, there is no need for BFD sessions for link failure detection, as the routing protocol will detect the link-down event instantly. But for two routers (e.g. Tier-0 Gateways) connected through intermediate Layer 2/3 nodes (physical infra, vDS, etc.) where the routing protocol cannot detect a link-down event, the failure must be detected through a dead timer. Welcome to the virtual world!! BFD was later enhanced with the capability to support static routing too, even though the driver for using BFD with static routing was not low CPU load and fast failure detection, but rather extending static routes with a keepalive mechanism.

 

So how can we apply BFD for static routing in our lab? There are multiple configuration steps required.

Before we can associate BFD with the static routes on the NSX-T Tier-0 Gateway NY-T0-GATEWAY-01, the creation of a BFD profile for static routes is required. This is shown in the diagram below. I am using the same BFD parameters (Interval = 500ms and Declare Dead Multiple = 3) that NSX-T 3.0 defines as default for BFD registered for BGP.

NY-T0-Gateway-01-BFD-Profile Design 1.png

The next step is the configuration of BFD peers for static routing at Tier-0 Gateway level. I am using the same Next Hop IP addresses (172.16.160.254 and 172.16.161.254) for the BFD peers as I have used for the static routes northbound towards the ToR switches. Again, this BFD peer configuration is done at Tier-0 Gateway level, but the realization of the BFD peers happens at Edge Transport Node level. On each of the two NSX-T Edge Transport Nodes (Service Router) two BFD sessions are realized. The appropriate BFD peer source interface on the Tier-0 Gateway (the Layer 3 LIF) is automatically selected by NSX-T, but as you can see, NSX-T allows you to specify the BFD source interface too.

NY-T0-Gateway-01-BFD for staticRouting with Design 1.png

The table below shows the global BFD timer configuration and the BFD peers with source and peer (destination) IP.

Table 5 - NSX-T Edge Transport Node BFD Configuration

ny-nsxt-edge-node-20 (Service Router)ny-nsxt-edge-node-21 (Service Router)

ny-nsxt-edge-node-20(tier0_sr)> get bfd-config

Logical Router

UUID           : 1cfd7da2-f37c-4108-8f19-7725822f0552

vrf            : 2

lr-id          : 8193

name           : SR-NY-T0-GATEWAY-01

type           : PLR-SR

 

Global BFD configuration

    Enabled        : True

    Min RX Interval: 500

    Min TX Interval: 500

    Min RX TTL     : 255

    Multiplier     : 3

 

 

Port               : 64a2e029-ad69-4ce1-a40e-def0956a9d2d

 

Session BFD configuration

 

   Source         : 172.16.160.20

    Peer           : 172.16.160.254

    Enabled        : True

    Min RX Interval: 500

    Min TX Interval: 500

    Min RX TTL     : 255

    Multiplier     : 3

 

 

Port               : 371a9b3f-d669-493a-a46b-161d3536b261

 

Session BFD configuration

 

    Source         : 172.16.161.20

    Peer           : 172.16.161.254

    Enabled        : True

    Min RX Interval: 500

    Min TX Interval: 500

    Min RX TTL     : 255

    Multiplier     : 3

 

ny-nsxt-edge-node-20(tier0_sr)>

ny-nsxt-edge-node-21(tier0_sr)> get bfd-config

Logical Router

UUID           : a2ea4cbc-c486-46a1-a663-c9c5815253af

vrf            : 1

lr-id          : 8194

name           : SR-NY-T0-GATEWAY-01

type           : PLR-SR

 

Global BFD configuration

    Enabled        : True

    Min RX Interval: 500

    Min TX Interval: 500

    Min RX TTL     : 255

    Multiplier     : 3

 

 

Port               : a5454564-ef1c-4e30-922f-9876b9df38df

 

Session BFD configuration

 

   Source         : 172.16.160.21

    Peer           : 172.16.160.254

    Enabled        : True

    Min RX Interval: 500

    Min TX Interval: 500

    Min RX TTL     : 255

    Multiplier     : 3

 

 

Port               : 8423e83b-0a69-44f4-90d1-07d8ece4f55e

 

Session BFD configuration

 

   Source         : 172.16.161.21

    Peer           : 172.16.161.254

    Enabled        : True

    Min RX Interval: 500

    Min TX Interval: 500

    Min RX TTL     : 255

    Multiplier     : 3

 

ny-nsxt-edge-node-21(tier0_sr)>

 

BFD in general, and for static routing as well, requires that the peering side is configured with BFD too, to ensure BFD keepalives are sent out and replied to respectively. Once the BFD peers are configured on the Tier-0 Gateway, the ToR switches require the appropriate BFD peer configuration too. This is shown in the table below. Each ToR switch gets two BFD peer configurations, one for each of the NSX-T Edge Transport Nodes.

Table 6 - Nexus ToR Switches BFD for Static Routing Configuration

NY-N3K-LEAF-10
NY-N3K-LEAF-11

feature bfd

!

ip route static bfd Vlan160 172.16.160.20

ip route static bfd Vlan160 172.16.160.21

feature bfd

!

ip route static bfd Vlan161 172.16.161.20

ip route static bfd Vlan161 172.16.161.21

 

Once both ends of the BFD peers are configured correctly, the BFD sessions should come up and the static route should be installed into the routing table.
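The BFD intervals on the Nexus side are not shown in the configuration above. If the platform defaults do not match the 500ms x 3 used in the NSX-T BFD profile, the timers can be aligned globally or per interface. The lines below are only a sketch and not taken from the lab configuration; the values simply mirror the NSX-T side, and "no ip redirects" is the usual recommendation on BFD-enabled interfaces.

! Align the BFD timers with the NSX-T Edge profile (500ms interval, multiplier 3) - assumed values
interface Vlan160
  no ip redirects
  bfd interval 500 min_rx 500 multiplier 3
!
! NY-N3K-LEAF-11 would use the equivalent configuration on interface Vlan161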

The table below shows the two BFD neighbors for the static routing (interface Vlan160 and Vlan161 respectively). The BFD neighbor on interface Eth1/49 is the BFD peer towards the Spine switch and is registered for OSPF. The NX-OS operating system does not mention "static routing" as the registered protocol; it shows "netstack" - reason unknown.

Table 7 - Nexus ToR Switches BFD for Static Routing Configuration and Verification

NY-N3K-LEAF-10/11

NY-N3K-LEAF-10# show bfd neighbors

 

OurAddr         NeighAddr       LD/RD                 RH/RS           Holdown(mult)     State       Int                   Vrf                 

172.16.160.254  172.16.160.20   1090519041/2635291218 Up              1099(3)           Up          Vlan160               default                      

172.16.160.254  172.16.160.21   1090519042/3842218904 Up              1413(3)           Up          Vlan160               default               

172.16.3.18     172.16.3.17     1090519043/1090519041 Up              5629(3)           Up          Eth1/49               default             

NY-N3K-LEAF-10#

NY-N3K-LEAF-11# show bfd neighbors

 

OurAddr         NeighAddr       LD/RD                 RH/RS           Holdown(mult)     State       Int                   Vrf                 

172.16.161.254  172.16.161.20   1090519041/591227029  Up              1384(3)           Up          Vlan161               default                      

172.16.161.254  172.16.161.21   1090519042/2646176019 Up              1385(3)           Up          Vlan161               default              

172.16.3.22     172.16.3.21     1090519043/1090519042 Up              4696(3)           Up          Eth1/49               default             

NY-N3K-LEAF-11#

NY-N3K-LEAF-10# show bfd neighbors details

 

OurAddr         NeighAddr       LD/RD                 RH/RS           Holdown(mult)     State       Int                   Vrf                   

172.16.160.254  172.16.160.20   1090519041/2635291218 Up              1151(3)           Up          Vlan160               default                        

 

Session state is Up and not using echo function

Local Diag: 0, Demand mode: 0, Poll bit: 0, Authentication: None

MinTxInt: 500000 us, MinRxInt: 500000 us, Multiplier: 3

Received MinRxInt: 500000 us, Received Multiplier: 3

Holdown (hits): 1500 ms (0), Hello (hits): 500 ms (22759)

Rx Count: 20115, Rx Interval (ms) min/max/avg: 83/1921/437 last: 348 ms ago

Tx Count: 22759, Tx Interval (ms) min/max/avg: 386/386/386 last: 24 ms ago

Registered protocols:  netstack

Uptime: 0 days 2 hrs 26 mins 39 secs, Upcount: 1

Last packet: Version: 1                - Diagnostic: 0

             State bit: Up             - Demand bit: 0

             Poll bit: 0               - Final bit: 0

             Multiplier: 3             - Length: 24

             My Discr.: -1659676078    - Your Discr.: 1090519041

             Min tx interval: 500000   - Min rx interval: 500000

             Min Echo interval: 0      - Authentication bit: 0

Hosting LC: 1, Down reason: None, Reason not-hosted: None

 

 

 

OurAddr         NeighAddr       LD/RD                 RH/RS           Holdown(mult)     State       Int                   Vrf                   

172.16.160.254  172.16.160.21   1090519042/3842218904 Up              1260(3)           Up          Vlan160               default                        

 

Session state is Up and not using echo function

Local Diag: 0, Demand mode: 0, Poll bit: 0, Authentication: None

MinTxInt: 500000 us, MinRxInt: 500000 us, Multiplier: 3

Received MinRxInt: 500000 us, Received Multiplier: 3

Holdown (hits): 1500 ms (0), Hello (hits): 500 ms (22774)

Rx Count: 20105, Rx Interval (ms) min/max/avg: 0/1813/438 last: 239 ms ago

Tx Count: 22774, Tx Interval (ms) min/max/avg: 386/386/386 last: 24 ms ago

Registered protocols:  netstack

Uptime: 0 days 2 hrs 26 mins 46 secs, Upcount: 1

Last packet: Version: 1                - Diagnostic: 0

             State bit: Up             - Demand bit: 0

             Poll bit: 0               - Final bit: 0

             Multiplier: 3             - Length: 24

             My Discr.: -452748392     - Your Discr.: 1090519042

             Min tx interval: 500000   - Min rx interval: 500000

             Min Echo interval: 0      - Authentication bit: 0

Hosting LC: 1, Down reason: None, Reason not-hosted: None

 

 

 

OurAddr         NeighAddr       LD/RD                 RH/RS           Holdown(mult)     State       Int                   Vrf                   

172.16.3.18     172.16.3.17     1090519043/1090519041 Up              5600(3)           Up          Eth1/49               default               

 

Session state is Up and using echo function with 500 ms interval

Local Diag: 0, Demand mode: 0, Poll bit: 0, Authentication: None

MinTxInt: 500000 us, MinRxInt: 2000000 us, Multiplier: 3

Received MinRxInt: 2000000 us, Received Multiplier: 3

Holdown (hits): 6000 ms (0), Hello (hits): 2000 ms (5309)

Rx Count: 5309, Rx Interval (ms) min/max/avg: 7/2101/1690 last: 399 ms ago

Tx Count: 5309, Tx Interval (ms) min/max/avg: 1689/1689/1689 last: 249 ms ago

Registered protocols:  ospf

Uptime: 0 days 2 hrs 29 mins 29 secs, Upcount: 1

Last packet: Version: 1                - Diagnostic: 0

             State bit: Up             - Demand bit: 0

             Poll bit: 0               - Final bit: 0

             Multiplier: 3             - Length: 24

             My Discr.: 1090519041     - Your Discr.: 1090519043

             Min tx interval: 500000   - Min rx interval: 2000000

             Min Echo interval: 500000 - Authentication bit: 0

Hosting LC: 1, Down reason: None, Reason not-hosted: None

 

NY-N3K-LEAF-10#

NY-N3K-LEAF-11# show bfd neighbors details

 

OurAddr         NeighAddr       LD/RD                 RH/RS           Holdown(mult)     State       Int                   Vrf                   

172.16.161.254  172.16.161.20   1090519041/591227029  Up              1235(3)           Up          Vlan161               default                        

 

Session state is Up and not using echo function

Local Diag: 0, Demand mode: 0, Poll bit: 0, Authentication: None

MinTxInt: 500000 us, MinRxInt: 500000 us, Multiplier: 3

Received MinRxInt: 500000 us, Received Multiplier: 3

Holdown (hits): 1500 ms (0), Hello (hits): 500 ms (22634)

Rx Count: 19972, Rx Interval (ms) min/max/avg: 93/1659/438 last: 264 ms ago

Tx Count: 22634, Tx Interval (ms) min/max/avg: 386/386/386 last: 127 ms ago

Registered protocols:  netstack

Uptime: 0 days 2 hrs 25 mins 47 secs, Upcount: 1

Last packet: Version: 1                - Diagnostic: 0

             State bit: Up             - Demand bit: 0

             Poll bit: 0               - Final bit: 0

             Multiplier: 3             - Length: 24

             My Discr.: 591227029      - Your Discr.: 1090519041

             Min tx interval: 500000   - Min rx interval: 500000

             Min Echo interval: 0      - Authentication bit: 0

Hosting LC: 1, Down reason: None, Reason not-hosted: None

 

 

 

OurAddr         NeighAddr       LD/RD                 RH/RS           Holdown(mult)     State       Int                   Vrf                   

172.16.161.254  172.16.161.21   1090519042/2646176019 Up              1162(3)           Up          Vlan161               default                        

 

Session state is Up and not using echo function

Local Diag: 0, Demand mode: 0, Poll bit: 0, Authentication: None

MinTxInt: 500000 us, MinRxInt: 500000 us, Multiplier: 3

Received MinRxInt: 500000 us, Received Multiplier: 3

Holdown (hits): 1500 ms (0), Hello (hits): 500 ms (22652)

Rx Count: 20004, Rx Interval (ms) min/max/avg: 278/1799/438 last: 337 ms ago

Tx Count: 22652, Tx Interval (ms) min/max/avg: 386/386/386 last: 127 ms ago

Registered protocols:  netstack

Uptime: 0 days 2 hrs 25 mins 58 secs, Upcount: 1

Last packet: Version: 1                - Diagnostic: 0

             State bit: Up             - Demand bit: 0

             Poll bit: 0               - Final bit: 0

             Multiplier: 3             - Length: 24

             My Discr.: -1648791277    - Your Discr.: 1090519042

             Min tx interval: 500000   - Min rx interval: 500000

             Min Echo interval: 0      - Authentication bit: 0

Hosting LC: 1, Down reason: None, Reason not-hosted: None

 

 

 

OurAddr         NeighAddr       LD/RD                 RH/RS           Holdown(mult)     State       Int                   Vrf                   

172.16.3.22     172.16.3.21     1090519043/1090519042 Up              4370(3)           Up          Eth1/49               default               

 

Session state is Up and using echo function with 500 ms interval

Local Diag: 0, Demand mode: 0, Poll bit: 0, Authentication: None

MinTxInt: 500000 us, MinRxInt: 2000000 us, Multiplier: 3

Received MinRxInt: 2000000 us, Received Multiplier: 3

Holdown (hits): 6000 ms (0), Hello (hits): 2000 ms (5236)

Rx Count: 5236, Rx Interval (ms) min/max/avg: 553/1698/1690 last: 1629 ms ago

Tx Count: 5236, Tx Interval (ms) min/max/avg: 1689/1689/1689 last: 1020 ms ago

Registered protocols:  ospf

Uptime: 0 days 2 hrs 27 mins 26 secs, Upcount: 1

Last packet: Version: 1                - Diagnostic: 0

             State bit: Up             - Demand bit: 0

             Poll bit: 0               - Final bit: 0

             Multiplier: 3             - Length: 24

             My Discr.: 1090519042     - Your Discr.: 1090519043

             Min tx interval: 500000   - Min rx interval: 2000000

             Min Echo interval: 500000 - Authentication bit: 0

Hosting LC: 1, Down reason: None, Reason not-hosted: None

 

NY-N3K-LEAF-11#

 

The table below shows the BFD sessions on the Tier-0 Gateway Service Router (SR). The CLI output shows the BFD peers and source IP addresses along with the state. Please note, BFD does not require that both ends of the BFD peer are configured with identical interval and multiplier values, but for troubleshooting reasons identical parameters are recommended.

Table 8 - NSX-T Edge Transport Node BFD Verification

ny-nsxt-edge-node-20 (Service Router)ny-nsxt-edge-node-21 (Service Router)

ny-nsxt-edge-node-20(tier0_sr)> get bfd-sessions

BFD Session

Dest_port                     : 3784

Diag                          : No Diagnostic

Encap                         : vlan

Forwarding                    : last true (current true)

Interface                     : 64a2e029-ad69-4ce1-a40e-def0956a9d2d

Keep-down                     : false

Last_cp_diag                  : No Diagnostic

Last_cp_rmt_diag              : No Diagnostic

Last_cp_rmt_state             : up

Last_cp_state                 : up

Last_fwd_state                : UP

Last_local_down_diag          : No Diagnostic

Last_remote_down_diag         : No Diagnostic

Last_up_time                  : 2020-07-07 15:42:23

Local_address                 : 172.16.160.20

Local_discr                   : 2635291218

Min_rx_ttl                    : 255

Multiplier                    : 3

Received_remote_diag          : No Diagnostic

Received_remote_state         : up

Remote_address                : 172.16.160.254

Remote_admin_down             : false

Remote_diag                   : No Diagnostic

Remote_discr                  : 1090519041

Remote_min_rx_interval        : 500

Remote_min_tx_interval        : 500

Remote_multiplier             : 3

Remote_state                  : up

Router                        : 1cfd7da2-f37c-4108-8f19-7725822f0552

Router_down                   : false

Rx_cfg_min                    : 500

Rx_interval                   : 500

Service-link                  : false

Session_type                  : LR_PORT

State                         : up

Tx_cfg_min                    : 500

Tx_interval                   : 500

 

 

BFD Session

Dest_port                     : 3784

Diag                          : No Diagnostic

Encap                         : vlan

Forwarding                    : last true (current true)

Interface                     : 371a9b3f-d669-493a-a46b-161d3536b261

Keep-down                     : false

Last_cp_diag                  : No Diagnostic

Last_cp_rmt_diag              : No Diagnostic

Last_cp_rmt_state             : up

Last_cp_state                 : up

Last_fwd_state                : UP

Last_local_down_diag          : No Diagnostic

Last_remote_down_diag         : No Diagnostic

Last_up_time                  : 2020-07-07 15:42:24

Local_address                 : 172.16.161.20

Local_discr                   : 591227029

Min_rx_ttl                    : 255

Multiplier                    : 3

Received_remote_diag          : No Diagnostic

Received_remote_state         : up

Remote_address                : 172.16.161.254

Remote_admin_down             : false

Remote_diag                   : No Diagnostic

Remote_discr                  : 1090519041

Remote_min_rx_interval        : 500

Remote_min_tx_interval        : 500

Remote_multiplier             : 3

Remote_state                  : up

Router                        : 1cfd7da2-f37c-4108-8f19-7725822f0552

Router_down                   : false

Rx_cfg_min                    : 500

Rx_interval                   : 500

Service-link                  : false

Session_type                  : LR_PORT

State                         : up

Tx_cfg_min                    : 500

Tx_interval                   : 500

 

ny-nsxt-edge-node-20(tier0_sr)>

ny-nsxt-edge-node-21(tier0_sr)> get bfd-sessions

BFD Session

Dest_port                     : 3784

Diag                          : No Diagnostic

Encap                         : vlan

Forwarding                    : last true (current true)

Interface                     : a5454564-ef1c-4e30-922f-9876b9df38df

Keep-down                     : false

Last_cp_diag                  : No Diagnostic

Last_cp_rmt_diag              : No Diagnostic

Last_cp_rmt_state             : up

Last_cp_state                 : up

Last_fwd_state                : UP

Last_local_down_diag          : No Diagnostic

Last_remote_down_diag         : No Diagnostic

Last_up_time                  : 2020-07-07 15:42:15

Local_address                 : 172.16.160.21

Local_discr                   : 3842218904

Min_rx_ttl                    : 255

Multiplier                    : 3

Received_remote_diag          : No Diagnostic

Received_remote_state         : up

Remote_address                : 172.16.160.254

Remote_admin_down             : false

Remote_diag                   : No Diagnostic

Remote_discr                  : 1090519042

Remote_min_rx_interval        : 500

Remote_min_tx_interval        : 500

Remote_multiplier             : 3

Remote_state                  : up

Router                        : a2ea4cbc-c486-46a1-a663-c9c5815253af

Router_down                   : false

Rx_cfg_min                    : 500

Rx_interval                   : 500

Service-link                  : false

Session_type                  : LR_PORT

State                         : up

Tx_cfg_min                    : 500

Tx_interval                   : 500

 

 

BFD Session

Dest_port                     : 3784

Diag                          : No Diagnostic

Encap                         : vlan

Forwarding                    : last true (current true)

Interface                     : 8423e83b-0a69-44f4-90d1-07d8ece4f55e

Keep-down                     : false

Last_cp_diag                  : No Diagnostic

Last_cp_rmt_diag              : No Diagnostic

Last_cp_rmt_state             : up

Last_cp_state                 : up

Last_fwd_state                : UP

Last_local_down_diag          : No Diagnostic

Last_remote_down_diag         : No Diagnostic

Last_up_time                  : 2020-07-07 15:42:15

Local_address                 : 172.16.161.21

Local_discr                   : 2646176019

Min_rx_ttl                    : 255

Multiplier                    : 3

Received_remote_diag          : No Diagnostic

Received_remote_state         : up

Remote_address                : 172.16.161.254

Remote_admin_down             : false

Remote_diag                   : No Diagnostic

Remote_discr                  : 1090519042

Remote_min_rx_interval        : 500

Remote_min_tx_interval        : 500

Remote_multiplier             : 3

Remote_state                  : up

Router                        : a2ea4cbc-c486-46a1-a663-c9c5815253af

Router_down                   : false

Rx_cfg_min                    : 500

Rx_interval                   : 500

Service-link                  : false

Session_type                  : LR_PORT

State                         : up

Tx_cfg_min                    : 500

Tx_interval                   : 500

 

ny-nsxt-edge-node-21(tier0_sr)>

 

I would really like to emphasize that static routing with NSX-T Edge Transport Nodes in A/A mode must use BFD to avoid blackholing. In case BFD for static routing is not supported on the ToR switches, I highly recommend using A/S mode with HA VIP instead, or switching back to BGP.

 

 

 

Design Option 2 - Static Routing (A/S HA VIP)

The first step in design option 2 is the Tier-0 static route configuration for northbound traffic. The most common way is to configure a default route northbound. The diagram below shows the setup with the default route (in black) northbound. As already mentioned, HA VIP requires that both NSX-T Edge Transport Node interfaces belong to the same Layer 2 segment (NY-T0-VLAN-SEGMENT-160). A single default route with two different Next Hops (172.16.160.253 and 172.16.160.254) is configured on the NY-T0-GATEWAY-02. With this design we can also achieve ECMP for northbound traffic towards the ToR switches. The diagram below shows the corresponding NSX-T Tier-0 Gateway static routing configuration. Please keep in mind again that at the NSX-T Edge Transport Node level, each Edge Transport Node will have two default route entries (matching the two Next Hops), even though only a single default route is configured at Tier-0 Gateway level. This is shown in the table below.

Networking – IP StaticRouting North Diagram Combined Option 2.png

Please note, the configuration steps to configure the Tier-1 Gateway (NY-T1-GATEWAY-BLUE) and to connect it to the Tier-0 Gateway are not shown.

 

 

Table 9 - NSX-T Edge Transport Node Routing Table for Design Option 2 (A/S HA VIP)

ny-nsxt-edge-node-22 (Service Router)
ny-nsxt-edge-node-23 (Service Router)

ny-nsxt-edge-node-22(tier0_sr)> get route 0.0.0.0/0

 

Flags: t0c - Tier0-Connected, t0s - Tier0-Static, b - BGP,

t0n - Tier0-NAT, t1s - Tier1-Static, t1c - Tier1-Connected,

t1n: Tier1-NAT, t1l: Tier1-LB VIP, t1ls: Tier1-LB SNAT,

t1d: Tier1-DNS FORWARDER, t1ipsec: Tier1-IPSec, isr: Inter-SR,

> - selected route, * - FIB route

 

Total number of routes: 1

 

t0s> * 0.0.0.0/0 [1/0] via 172.16.160.253, uplink-278, 00:00:27

t0s> * 0.0.0.0/0 [1/0] via 172.16.160.254, uplink-278, 00:00:27

ny-nsxt-edge-node-22(tier0_sr)>

ny-nsxt-edge-node-23(tier0_sr)> get route 0.0.0.0/0

 

Flags: t0c - Tier0-Connected, t0s - Tier0-Static, b - BGP,

t0n - Tier0-NAT, t1s - Tier1-Static, t1c - Tier1-Connected,

t1n: Tier1-NAT, t1l: Tier1-LB VIP, t1ls: Tier1-LB SNAT,

t1d: Tier1-DNS FORWARDER, t1ipsec: Tier1-IPSec, isr: Inter-SR,

> - selected route, * - FIB route

 

Total number of routes: 1

 

t0s> * 0.0.0.0/0 [1/0] via 172.16.160.253, uplink-279, 00:00:57

t0s> * 0.0.0.0/0 [1/0] via 172.16.160.254, uplink-279, 00:00:57

ny-nsxt-edge-node-23(tier0_sr)>

 

The second step is to configure static routing southbound from the ToR switches towards the NSX-T Edge Transport Nodes. Each ToR switch is configured with two static routes to forward traffic to the destination overlay networks (overlay segments 172.16.242.0/24 and 172.16.243.0/24) within NSX-T. For each of these static routes the Next Hop is the NSX-T Tier-0 Gateway HA VIP.

Networking – IP StaticRouting South Diagram Option 2.png

The table below shows the static routing configuration on the ToR switch and the resulting routing table. The Next Hop is the Tier-0 Gateway HA VIP 172.16.160.24 for all static routes.

Table 10 - Nexus ToR Switches Static Routing Configuration and Resulting Routing Table for Design Option 2 (A/S HA VIP)

NY-N3K-LEAF-10
NY-N3K-LEAF-11

ip route 172.16.242.0/24 Vlan160 172.16.160.24

ip route 172.16.243.0/24 Vlan160 172.16.160.24

ip route 172.16.242.0/24 Vlan160 172.16.160.24

ip route 172.16.243.0/24 Vlan160 172.16.160.24

NY-N3K-LEAF-10# show ip route static

IP Route Table for VRF "default"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>

 

172.16.240.0/24, ubest/mbest: 2/0

    *via 172.16.160.20, Vlan160, [1/0], 02:51:34, static

    *via 172.16.160.21, Vlan160, [1/0], 02:51:41, static

172.16.241.0/24, ubest/mbest: 2/0

    *via 172.16.160.20, Vlan160, [1/0], 02:51:34, static

    *via 172.16.160.21, Vlan160, [1/0], 02:51:41, static

172.16.242.0/24, ubest/mbest: 1/0

    *via 172.16.160.24, Vlan160, [1/0], 02:55:42, static

172.16.243.0/24, ubest/mbest: 1/0

    *via 172.16.160.24, Vlan160, [1/0], 02:55:42, static

 

NY-N3K-LEAF-10#

NY-N3K-LEAF-11# show ip route static

IP Route Table for VRF "default"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>

 

172.16.240.0/24, ubest/mbest: 2/0

    *via 172.16.161.20, Vlan161, [1/0], 02:53:04, static

    *via 172.16.161.21, Vlan161, [1/0], 02:53:12, static

172.16.241.0/24, ubest/mbest: 2/0

    *via 172.16.161.20, Vlan161, [1/0], 02:53:04, static

    *via 172.16.161.21, Vlan161, [1/0], 02:53:12, static

172.16.242.0/24, ubest/mbest: 1/0

    *via 172.16.160.24, Vlan160, [1/0], 02:55:03, static

172.16.243.0/24, ubest/mbest: 1/0

    *via 172.16.160.24, Vlan160, [1/0], 02:55:03, static

 

NY-N3K-LEAF-11#

 

Failover Sanity Checks

The table below shows the resulting static routing tables on both ToR switches for different failover cases.

Table 11 - Failover Sanity Check

Failover Case
NY-N3K-LEAF-10 (Routing Table)
NY-N3K-LEAF-11 (Routing Table)
Comments
All NSX-T Edge Transport Nodes are UP

NY-N3K-LEAF-10# show ip route static

IP Route Table for VRF "default"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>

 

172.16.240.0/24, ubest/mbest: 2/0

    *via 172.16.160.20, Vlan160, [1/0], 00:58:27, static

    *via 172.16.160.21, Vlan160, [1/0], 00:58:43, static

172.16.241.0/24, ubest/mbest: 2/0

    *via 172.16.160.20, Vlan160, [1/0], 00:58:27, static

    *via 172.16.160.21, Vlan160, [1/0], 00:58:43, static

172.16.242.0/24, ubest/mbest: 1/0

    *via 172.16.160.24, Vlan160, [1/0], 01:02:47, static

172.16.243.0/24, ubest/mbest: 1/0

    *via 172.16.160.24, Vlan160, [1/0], 01:02:47, static

NY-N3K-LEAF-10#

NY-N3K-LEAF-11# show ip route static

IP Route Table for VRF "default"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>

 

172.16.240.0/24, ubest/mbest: 2/0

    *via 172.16.161.20, Vlan161, [1/0], 00:59:10, static

    *via 172.16.161.21, Vlan161, [1/0], 00:59:25, static

172.16.241.0/24, ubest/mbest: 2/0

    *via 172.16.161.20, Vlan161, [1/0], 00:59:10, static

    *via 172.16.161.21, Vlan161, [1/0], 00:59:25, static

172.16.242.0/24, ubest/mbest: 1/0

    *via 172.16.160.24, Vlan160, [1/0], 01:01:21, static

172.16.243.0/24, ubest/mbest: 1/0

    *via 172.16.160.24, Vlan160, [1/0], 01:01:21, static

NY-N3K-LEAF-11#

NSX-T Edge Transport Node ny-nsxt-edge-node-20 is DOWN; all other NSX-T Edge Transport Nodes are UP

NY-N3K-LEAF-10# show ip route static

IP Route Table for VRF "default"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>

 

172.16.240.0/24, ubest/mbest: 1/0

    *via 172.16.160.21, Vlan160, [1/0], 01:01:01, static

172.16.241.0/24, ubest/mbest: 1/0

    *via 172.16.160.21, Vlan160, [1/0], 01:01:01, static

172.16.242.0/24, ubest/mbest: 1/0

    *via 172.16.160.24, Vlan160, [1/0], 01:05:05, static

172.16.243.0/24, ubest/mbest: 1/0

    *via 172.16.160.24, Vlan160, [1/0], 01:05:05, static

NY-N3K-LEAF-10#

NY-N3K-LEAF-11# show ip route static

IP Route Table for VRF "default"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>

 

172.16.240.0/24, ubest/mbest: 1/0

    *via 172.16.161.21, Vlan161, [1/0], 01:01:21, static

172.16.241.0/24, ubest/mbest: 1/0

    *via 172.16.161.21, Vlan161, [1/0], 01:01:21, static

172.16.242.0/24, ubest/mbest: 1/0

    *via 172.16.160.24, Vlan160, [1/0], 01:03:17, static

172.16.243.0/24, ubest/mbest: 1/0

    *via 172.16.160.24, Vlan160, [1/0], 01:03:17, static

NY-N3K-LEAF-11#

Route entries with ny-nsxt-edge-node-20 (172.16.160.20 and 172.16.161.20) are removed by BFD

NSX-T Edge Transport Node ny-nsxt-edge-node-21 is DOWN; all other NSX-T Edge Transport Nodes are UP

NY-N3K-LEAF-10# show ip route static

IP Route Table for VRF "default"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>

 

172.16.240.0/24, ubest/mbest: 1/0

    *via 172.16.160.20, Vlan160, [1/0], 00:02:40, static

172.16.241.0/24, ubest/mbest: 1/0

    *via 172.16.160.20, Vlan160, [1/0], 00:02:40, static

172.16.242.0/24, ubest/mbest: 1/0

    *via 172.16.160.24, Vlan160, [1/0], 01:12:13, static

172.16.243.0/24, ubest/mbest: 1/0

    *via 172.16.160.24, Vlan160, [1/0], 01:12:13, static

 

NY-N3K-LEAF-10#

NY-N3K-LEAF-11# show ip route static

IP Route Table for VRF "default"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>

 

172.16.240.0/24, ubest/mbest: 1/0

    *via 172.16.161.20, Vlan161, [1/0], 00:03:04, static

172.16.241.0/24, ubest/mbest: 1/0

    *via 172.16.161.20, Vlan161, [1/0], 00:03:04, static

172.16.242.0/24, ubest/mbest: 1/0

    *via 172.16.160.24, Vlan160, [1/0], 01:10:28, static

172.16.243.0/24, ubest/mbest: 1/0

    *via 172.16.160.24, Vlan160, [1/0], 01:10:28, static

 

NY-N3K-LEAF-11#

Route entries with ny-nsxt-edge-node-21 (172.16.160.21 and 172.16.161.21) are removed by BFD

NSX-T Edge Transport Node ny-nsxt-edge-node-22 is DOWN; all other NSX-T Edge Transport Nodes are UP

NY-N3K-LEAF-10# show ip route static

IP Route Table for VRF "default"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>

 

172.16.240.0/24, ubest/mbest: 2/0

    *via 172.16.160.20, Vlan160, [1/0], 00:06:55, static

    *via 172.16.160.21, Vlan160, [1/0], 00:00:09, static

172.16.241.0/24, ubest/mbest: 2/0

    *via 172.16.160.20, Vlan160, [1/0], 00:06:55, static

    *via 172.16.160.21, Vlan160, [1/0], 00:00:09, static

172.16.242.0/24, ubest/mbest: 1/0

    *via 172.16.160.24, Vlan160, [1/0], 01:16:28, static

172.16.243.0/24, ubest/mbest: 1/0

    *via 172.16.160.24, Vlan160, [1/0], 01:16:28, static

 

NY-N3K-LEAF-10#

NY-N3K-LEAF-11# show ip route static

IP Route Table for VRF "default"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>

 

172.16.240.0/24, ubest/mbest: 2/0

    *via 172.16.161.20, Vlan161, [1/0], 00:07:01, static

    *via 172.16.161.21, Vlan161, [1/0], 00:00:16, static

172.16.241.0/24, ubest/mbest: 2/0

    *via 172.16.161.20, Vlan161, [1/0], 00:07:01, static

    *via 172.16.161.21, Vlan161, [1/0], 00:00:16, static

172.16.242.0/24, ubest/mbest: 1/0

    *via 172.16.160.24, Vlan160, [1/0], 01:14:25, static

172.16.243.0/24, ubest/mbest: 1/0

    *via 172.16.160.24, Vlan160, [1/0], 01:14:25, static

 

NY-N3K-LEAF-11#

A single NSX-T Edge Transport Node used for HA VIP being down does not change the routing table.

NSX-T Edge Transport Node ny-nsxt-edge-node-23 is DOWN, all other NSX-T Edge Transport Nodes are UP:

NY-N3K-LEAF-10# show ip route static

IP Route Table for VRF "default"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>

 

172.16.240.0/24, ubest/mbest: 2/0

    *via 172.16.160.20, Vlan160, [1/0], 00:10:58, static

    *via 172.16.160.21, Vlan160, [1/0], 00:04:12, static

172.16.241.0/24, ubest/mbest: 2/0

    *via 172.16.160.20, Vlan160, [1/0], 00:10:58, static

    *via 172.16.160.21, Vlan160, [1/0], 00:04:12, static

172.16.242.0/24, ubest/mbest: 1/0

    *via 172.16.160.24, Vlan160, [1/0], 01:20:31, static

172.16.243.0/24, ubest/mbest: 1/0

    *via 172.16.160.24, Vlan160, [1/0], 01:20:31, static

 

NY-N3K-LEAF-10#

NY-N3K-LEAF-11# show ip route static

IP Route Table for VRF "default"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>

 

172.16.240.0/24, ubest/mbest: 2/0

    *via 172.16.161.20, Vlan161, [1/0], 00:11:30, static

    *via 172.16.161.21, Vlan161, [1/0], 00:04:45, static

172.16.241.0/24, ubest/mbest: 2/0

    *via 172.16.161.20, Vlan161, [1/0], 00:11:30, static

    *via 172.16.161.21, Vlan161, [1/0], 00:04:45, static

172.16.242.0/24, ubest/mbest: 1/0

    *via 172.16.160.24, Vlan160, [1/0], 01:18:54, static

172.16.243.0/24, ubest/mbest: 1/0

    *via 172.16.160.24, Vlan160, [1/0], 01:18:54, static

 

NY-N3K-LEAF-11#

A single NSX-T Edge Transport Node used for HA VIP being down does not change the routing table.

NSX-T Edge Transport Nodes ny-nsxt-edge-node-20 and ny-nsxt-edge-node-21 are DOWN, all other NSX-T Edge Transport Nodes are UP:

NY-N3K-LEAF-10# show ip route static

IP Route Table for VRF "default"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>

 

172.16.242.0/24, ubest/mbest: 1/0

    *via 172.16.160.24, Vlan160, [1/0], 01:24:06, static

172.16.243.0/24, ubest/mbest: 1/0

    *via 172.16.160.24, Vlan160, [1/0], 01:24:06, static

 

NY-N3K-LEAF-10#

NY-N3K-LEAF-11# show ip route static

IP Route Table for VRF "default"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>

 

172.16.242.0/24, ubest/mbest: 1/0

    *via 172.16.160.24, Vlan160, [1/0], 01:22:54, static

172.16.243.0/24, ubest/mbest: 1/0

    *via 172.16.160.24, Vlan160, [1/0], 01:22:54, static

 

NY-N3K-LEAF-11#

All route entries related to design option 1 are removed by BFD.
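For reference, a minimal, hedged sketch of what the leaf-side static routing with BFD tracking behind these tests could look like on NY-N3K-LEAF-10 for design option 1 (the next-hop addresses follow the outputs above; the BFD timers and the exact command set are assumptions, not the captured lab configuration):

feature bfd
!
interface Vlan160
  no ip redirects
  ! BFD timers are an assumption
  bfd interval 250 min_rx 250 multiplier 3
!
! static routes towards the two Active/Active Tier-0 uplink IPs
ip route 172.16.240.0/24 172.16.160.20
ip route 172.16.240.0/24 172.16.160.21
ip route 172.16.241.0/24 172.16.160.20
ip route 172.16.241.0/24 172.16.160.21
!
! monitor the static next hops with BFD so that routes via a failed Edge Node are withdrawn
ip route static bfd Vlan160 172.16.160.20
ip route static bfd Vlan160 172.16.160.21

A matching BFD peer configuration on the NSX-T Tier-0 Gateway is required for these BFD sessions to come up.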

 

I hope you had a little bit of fun reading this blog post about static routing with NSX-T. Now, with the knowledge of how to achieve ECMP with static routing, you might have a new and interesting design option for your customers' NSX-T deployments.

 

Software Inventory:

vSphere version: VMware ESXi, 6.5.0, 15256549

vCenter version: 6.5.0, 10964411

NSX-T version: 3.0.0.0.0.15946738 (GA)

Cisco Nexus 3048 NX-OS version: 7.0(3)I7(6)

 

 

Blog history

Version 1.0 - 08.07.2020 - first published version

Version 1.1 - 09.07.2020 - minor changes

Version 1.2 - 30.07.2020 - grammar updates - thanks to James Lepthien :-)

Dear readers

Welcome to a new blog post talking about a specific NSX-T Edge Node VM deployment with only a single Edge Node N-VDS. You may have seen the 2019 VMworld session "Next-Generation Reference Design with NSX-T: Part 1" (CNET2061BU or CNET2061BE) from Nimish Desai. On one of his slides he mentions how we could deploy a single NSX-T Edge Node N-VDS instead of the usual three Edge Node N-VDS. This new approach (available since NSX-T 2.5 for the Edge Node VM) with a single Edge Node N-VDS has the following advantages:

  • Multiple TEPs to load balance overlay traffic for different overlay segments
  • Same NSX-T Edge Node N-VDS design for VM-based and Bare Metal (with 2 pNIC)
  • Only two Transport Zones (Overlay & VLAN) assigned to a single N-VDS

The diagram below shows the slide with a single Edge Node N-VDS from one of the VMware sessions (CNET2061BU):

Edge Support with Multi-TEP-Nimish-Desai-VM.png

However, the single N-VDS NSX-T Edge Node design comes with additional requirements and recommendations:

  • vDS port group Trunks configuration to carry multiple VLANs (requirement)
  • VLAN pinning for deterministic North/South flows (recommendation)

This blog talks mainly about the second bullet point and how we can achieve correct VLAN pinning. Correct VLAN pinning requires multiple individual configuration steps at different levels, for example the vDS trunk port group teaming or the N-VDS Named Teaming Policy configuration. The goal behind this VLAN pinning is a deterministic end-to-end path.

When configured correctly, the BGP session is forced to align with the data forwarding path, and hence the MAC addresses of the Tier-0 Gateway Layer 3 interfaces (LIF) are only learnt on the expected ToR/Leaf switch trunk interfaces.

 

In this blog post the NSX-T Edge Node VMs are deployed on ESXi hosts which are NOT prepared for NSX-T. The two ESXi hosts belong to a single vSphere Cluster exclusively used for NSX-T Edge Node VMs. There are a few good reasons NOT to prepare these ESXi hosts with NSX-T when they host only NSX-T Edge Node VMs:

  • It is not required
  • Better NSX-T upgrade-ability (you don't need to evacuate the NSX-T Edge Node VMs with vMotion to enter maintenance mode during a host NSX-T software upgrade; every vMotion of an NSX-T Edge Node VM causes a short, unnecessary data plane glitch)
  • Shorter NSX-T upgrade cycles (for every NSX-T upgrade you only need to upgrade the ESXi hosts which carry the payload VMs plus the NSX-T Edge Node VMs themselves, but not the ESXi hosts where the Edge Nodes are deployed)
  • vSphere HA can be turned off (do we want to move a highly loaded packet forwarding node with vMotion in a vSphere HA event? No, I don't think so - the routing HA model is much quicker)
  • Simplified DRS settings (do we want to move an NSX-T Edge Node with vMotion to balance resources?)
  • Typically a resource pool is not required

We should never underestimate how important smooth upgrade cycles are. Upgrade cycles are time consuming events and are typically required multiple times per year.

Leaving the ESXi hosts NOT prepared for NSX-T is considered best practice and should be the default in any NSX-T deployment which can afford a dedicated vSphere Cluster only for NSX-T Edge Node VMs. Installing NSX-T on the ESXi hosts where you have deployed your NSX-T Edge Node VMs (the so-called collapsed design) is appropriate for customers who have a low number of ESXi hosts and want to keep the CAPEX costs low.

 

The diagram below shows the lab test bed of a single ESXi host with a single Edge Node appliance which uses only a single N-VDS. The relevant configuration steps are marked with 1 to 4.

Networking – NSX-T Edge Topology-NEW.png

 

The NSX-T Edge Node VM is configured with two transport zones. The same overlay transport zone is used for the compute ESXi hosts where I host the payload VMs. Both transport zones are assigned to a single N-VDS, called NY-HOST-NVDS. The name of the N-VDS might confuse you a little bit, but the same NY-HOST-NVDS is used for all compute ESXi hosts prepared with NSX-T; it indicates that only a single N-VDS is required, independent of whether it runs on an Edge Node or a compute ESXi host. However, you might select a different name for the N-VDS.

Screen Shot 2020-04-11 at 11.40.18.png

The single N-VDS (NY-HOST-NVDS) on the Edge Node is configured with an Uplink Profile (please see more details below) with two static TEP IP addresses, which allows us to load balance the Geneve encapsulated overlay traffic for North/South. Both Edge Node FastPath interfaces (fp-eth0 & fp-eth1) are mapped to a labelled Active Uplink name as part of the default teaming policy.

Screen Shot 2020-04-11 at 11.40.26.png

There are 4 areas where we need to take care of the correct settings.

<1> - At the physical ToR/Leaf Switch Level

The trunk ports will allow only the required VLANs

  • VLAN 60 - NSX-T Edge Node management interface
  • VLAN 151 - Geneve TEP (Edge Nodes) VLAN
  • VLAN 160 - Northbound Uplink VLAN for NY-N3K-LEAF-10
  • VLAN 161 - Northbound Uplink VLAN for NY-N3K-LEAF-11

The resulting interface configuration, along with the relevant BGP configuration, is shown in the table below. Please note that for redundancy reasons both Northbound Uplink VLANs 160 and 161 are allowed in the trunk configuration. Under normal conditions, NY-N3K-LEAF-10 will learn only MAC addresses from VLANs 60, 151 and 160 and NY-N3K-LEAF-11 will learn only MAC addresses from VLANs 60, 151 and 161.

Table 1 - Nexus ToR/LEAF Switch Configuration

NY-N3K-LEAF-10 Interface Configuration
NY-N3K-LEAF-11 Interface Configuration

NY-N3K-LEAF-10:

interface Ethernet1/2

  description *NY-ESX50A-VMNIC2*

  switchport mode trunk

  switchport trunk allowed vlan 60,151,160-161

  spanning-tree port type edge trunk

NY-N3K-LEAF-11:

interface Ethernet1/2

  description *NY-ESX50A-VMNIC3*

  switchport mode trunk

  switchport trunk allowed vlan 60,151,160-161

  spanning-tree port type edge trunk

NY-N3K-LEAF-10:

interface Ethernet1/4

  description *NY-ESX51A-VMNIC2*

  switchport mode trunk

  switchport trunk allowed vlan 60,151,160-161

  spanning-tree port type edge trunk

NY-N3K-LEAF-11:

interface Ethernet1/4

  description *NY-ESX51A-VMNIC3*

  switchport mode trunk

  switchport trunk allowed vlan 60,151,160-161

  spanning-tree port type edge trunk

NY-N3K-LEAF-10:

router bgp 64512

  router-id 172.16.3.10

  log-neighbor-changes

  ---snip---

  neighbor 172.16.160.20 remote-as 64513

    update-source Vlan160

    timers 4 12

    address-family ipv4 unicast

  neighbor 172.16.160.21 remote-as 64513

   update-source Vlan160

    timers 4 12

    address-family ipv4 unicast

NY-N3K-LEAF-11:

router bgp 64512

  router-id 172.16.3.11

  log-neighbor-changes

  ---snip---

  neighbor 172.16.161.20 remote-as 64513

    update-source Vlan161

    timers 4 12

    address-family ipv4 unicast

  neighbor 172.16.161.21 remote-as 64513

    update-source Vlan161

    timers 4 12

    address-family ipv4 unicast

As part of the Cisco Nexus 3048 BGP configuration we see that only NY-N3K-LEAF-10 terminates the BGP session on VLAN 160 and only NY-N3K-LEAF-11 terminates the BGP session on VLAN 161.

 

<2> - At the vDS Port Group Level

The vDS is configured with four vDS port groups in total:

  • Port Group (Type VLAN): NY-VDS-PG-ESX5x-NSXT-EDGE-MGMT60: carries only VLAN 60 and has an active/standby teaming policy
  • Port Group (Type VLAN): NY-vDS-PG-ESX5x-EDGE2-Dummy999: this dummy port group is used for the remaining unused Edge Node FastPath interface (fp-eth2) to prevent NSX-T from reporting it as admin status down
  • Port Group (Type VLAN trunking): NY-vDS-PG-ESX5x-EDGE2-EDGE-TrunkA: Carries the Edge Node TEP VLAN 151 and Uplink VLAN 160
  • Port Group (Type VLAN trunking): NY-vDS-PG-ESX5x-EDGE2-EDGE-TrunkB: Carries the Edge Node TEP VLAN 151 and Uplink VLAN 161

The two trunk port groups have only one vDS-Uplink active; the other vDS-Uplink is set to standby. This is required so that the Uplink VLAN traffic, along with the BGP session, is only forwarded on the specific vDS-Uplink (each vDS-Uplink is mapped to the corresponding pNIC) under normal conditions. With these settings we achieve the following:

  • A deterministic failover order
  • Symmetric bandwidth for both overlay and North/South traffic
  • The BGP session between the Tier-0 Gateway and the ToR/Leaf switches should stay UP even if one or both physical links between the ToR/Leaf switches and the ESXi hosts go down (the BGP session is then carried over the trunk link between the ToR/Leaf switches).

 

The table below highlights the relevant VLAN and Teaming settings:

Table 2 - vDS Port Group Configuration

NY-vDS-PG-ESX5x-EDGE2-EDGE-TrunkA Configuration
NY-vDS-PG-ESX5x-EDGE2-EDGE-TrunkB Configuration
Trunka-vlan-Screen Shot 2020-04-11 at 10.38.25.png
Trunkb-vlan-Screen Shot 2020-04-11 at 10.39.49.png
Trunka-teaming-Screen Shot 2020-04-11 at 10.38.06.png
Trunkb-teaming-Screen Shot 2020-04-11 at 10.39.58.png

 

<3> - At the NSX-T Edge Uplink Profile Level

The NSX-T Uplink Profile is a global construct that defines how traffic leaves a Transport Node or an Edge Transport Node.

The single Uplink Profile used for the two Edge Node FastPath interfaces (fp-eth0 & fp-eth1) needs to be extended with two additional Named Teaming Policies to steer the North/South uplink traffic to the corresponding ToR/Leaf switch.

  • The default teaming policy needs to be configured as Source Port ID with the two Active Uplinks (I am using the labels EDGE-UPLINK1 & EDGE-UPLINK2)
  • An additional teaming policy called NY-Named-Teaming-N3K-LEAF-10 is configured as a failover teaming policy with a single Active Uplink (label EDGE-UPLINK1)
  • An additional teaming policy called NY-Named-Teaming-N3K-LEAF-11 is configured as a failover teaming policy with a single Active Uplink (label EDGE-UPLINK2)

Please note, the Active Uplink labels for the default and the additional Named Teaming Policies need to be the same.

Screen Shot 2020-04-11 at 10.58.49.png
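The same Uplink Profile, including the two Named Teaming Policies, could also be created via the NSX-T Manager API instead of the UI. Below is a hedged sketch of such a request (the display name is hypothetical and the exact payload schema may differ between NSX-T versions):

POST https://<nsx-manager>/api/v1/host-switch-profiles
{
  "resource_type": "UplinkHostSwitchProfile",
  "display_name": "NY-EDGE-SINGLE-NVDS-UPLINK-PROFILE",
  "transport_vlan": 151,
  "teaming": {
    "policy": "LOADBALANCE_SRCID",
    "active_list": [
      { "uplink_name": "EDGE-UPLINK1", "uplink_type": "PNIC" },
      { "uplink_name": "EDGE-UPLINK2", "uplink_type": "PNIC" }
    ]
  },
  "named_teamings": [
    {
      "name": "NY-Named-Teaming-N3K-LEAF-10",
      "policy": "FAILOVER_ORDER",
      "active_list": [ { "uplink_name": "EDGE-UPLINK1", "uplink_type": "PNIC" } ]
    },
    {
      "name": "NY-Named-Teaming-N3K-LEAF-11",
      "policy": "FAILOVER_ORDER",
      "active_list": [ { "uplink_name": "EDGE-UPLINK2", "uplink_type": "PNIC" } ]
    }
  ]
}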

 

<4> - At the NSX-T Uplink VLAN Segment Level

To activate the previously configured Named Teaming Policies for the specific Tier-0 VLAN segments 160 and 161, we first need to assign the Named Teaming Policies to the VLAN transport zone.

Screen Shot 2020-04-11 at 11.07.12.png

The last step involves the configuration of each of the two Uplink VLAN segments (160 & 161) with the corresponding Named Teaming Policy. NSX-T 2.5.1 requires configuring the VLAN segment with the Named Teaming Policy in the "legacy" Advanced Networking & Security UI. The recently released NSX-T 3.0 supports this in the Policy UI.

Table 3 - NSX-T VLAN Segment Configuration

VLAN Segment NY-T0-EDGE-UPLINK-SEGMENT-160
VLAN Segment NY-T0-EDGE-UPLINK-SEGMENT-161

Screen Shot 2020-04-11 at 11.09.50.png

Screen Shot 2020-04-11 at 11.09.37.png
Screen Shot 2020-04-11 at 11.29.17.png
Screen Shot 2020-04-11 at 11.29.25.png

 

Verification

The resulting topology with both NSX-T Edge Nodes and the previously shown steps is depicted below. It shows how the Tier-0 VLAN segments 160 and 161 are "routed" through the different levels from the Tier-0 Gateway towards the Nexus Leaf switches via the vDS trunk port groups.

Networking – NSX-T Edge Pinned VLAN.png

The best option to verify that all your settings are correct is to validate on which ToR/Leaf trunk port the MAC addresses of the Tier-0 Gateway Layer 3 interfaces are learnt. These Layer 3 interfaces belong to the Tier-0 Service Router (SR). You can get the MAC addresses via the CLI.

Table 4 - NSX-T Tier-0 Layer 3 Interface Configuration

ny-edge-transport-node-20(tier0_sr)> get interfaces
ny-edge-transport-node-21(tier0_sr)> get interfaces

Interface: 2f83fda5-0da5-4764-87ea-63c0989bf059

Ifuid: 276

Name: NY-T0-LIF160-EDGE-20

Internal name: uplink-276

Mode: lif

IP/Mask: 172.16.160.20/24

MAC: 00:50:56:97:51:65

LS port: 40102113-c8af-4d4e-a94d-ca44f9efe9a5

Urpf-mode: STRICT_MODE

DAD-mode: LOOSE

RA-mode: SLAAC_DNS_TRHOUGH_RA(M=0, O=0)

Admin: up

Op_state: up

MTU: 9000

Interface: a3d7669a-e81c-43ea-81c0-dd60438284bc

Ifuid: 289

Name: NY-T0-LIF160-EDGE-21

Internal name: uplink-289

Mode: lif

IP/Mask: 172.16.160.21/24

MAC: 00:50:56:97:84:c3

LS port: 045cd486-d8c5-4df5-8784-2e49862771f4

Urpf-mode: STRICT_MODE

DAD-mode: LOOSE

RA-mode: SLAAC_DNS_TRHOUGH_RA(M=0, O=0)

Admin: up

Op_state: up

MTU: 9000

Interface: a1f0d5d0-3883-4e04-b985-e391ec1d9711

Ifuid: 281

Name: NY-T0-LIF161-EDGE-20

Internal name: uplink-281

Mode: lif

IP/Mask: 172.16.161.20/24

MAC: 00:50:56:97:a7:33

LS port: d180ee9a-8e82-4c59-8195-ea65660ea71a

Urpf-mode: STRICT_MODE

DAD-mode: LOOSE

RA-mode: SLAAC_DNS_TRHOUGH_RA(M=0, O=0)

Admin: up

Op_state: up

MTU: 9000

Interface: 2de46a54-3dba-4ddc-abe7-5b713260e7d4

Ifuid: 296

Name: NY-T0-LIF161-EDGE-21

Internal name: uplink-296

Mode: lif

IP/Mask: 172.16.161.21/24

MAC: 00:50:56:97:ec:1b

LS port: c32e2109-32d0-4c0f-a916-bfba01fdd6ac

Urpf-mode: STRICT_MODE

DAD-mode: LOOSE

RA-mode: SLAAC_DNS_TRHOUGH_RA(M=0, O=0)

Admin: up

Op_state: up

MTU: 9000

 

The MAC address tables show that ToR/Leaf switch NY-N3K-LEAF-10 learns the Tier-0 Layer 3 MAC addresses from VLAN 160 locally and from VLAN 161 via Portchannel 1 (Po1).

And the MAC address tables show that ToR/Leaf switch NY-N3K-LEAF-11 learns the Tier-0 Layer 3 MAC addresses from VLAN 161 locally and from VLAN 160 via Portchannel 1 (Po1).

Table 5 - ToR/Leaf Switch MAC Address Table for Northbound Uplink VLAN 160 and 161

ToR/Leaf Switch NY-N3K-LEAF-10
ToR/Leaf Switch NY-N3K-LEAF-11

NY-N3K-LEAF-10# show mac address-table dynamic vlan 160

Legend:

        * - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC

        age - seconds since last seen,+ - primary entry using vPC Peer-Link,

        (T) - True, (F) - False, C - ControlPlane MAC, ~ - vsan

   VLAN     MAC Address      Type      age     Secure NTFY Ports

---------+-----------------+--------+---------+------+----+------------------

*  160     0050.5697.5165   dynamic  0         F      F    Eth1/2

*  160     0050.5697.84c3   dynamic  0         F      F    Eth1/4

NY-N3K-LEAF-11# show mac address-table dynamic vlan 160

Legend:

        * - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC

        age - seconds since last seen,+ - primary entry using vPC Peer-Link,

        (T) - True, (F) - False, C - ControlPlane MAC, ~ - vsan

   VLAN     MAC Address      Type      age     Secure NTFY Ports

---------+-----------------+--------+---------+------+----+------------------

*  160     0050.5697.5165   dynamic  0         F      F    Po1

*  160     0050.5697.84c3   dynamic  0         F      F    Po1

*  160     780c.f049.0c81   dynamic  0         F      F    Po1

NY-N3K-LEAF-10# show mac address-table dynamic vlan 161

Legend:

        * - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC

        age - seconds since last seen,+ - primary entry using vPC Peer-Link,

        (T) - True, (F) - False, C - ControlPlane MAC, ~ - vsan

   VLAN     MAC Address      Type      age     Secure NTFY Ports

---------+-----------------+--------+---------+------+----+------------------

*  161     0050.5697.a733   dynamic  0         F      F    Po1

*  161     0050.5697.ec1b   dynamic  0         F      F    Po1

*  161     502f.a8a8.717c   dynamic  0         F      F    Po1

NY-N3K-LEAF-11# show mac address-table dynamic vlan 161

Legend:

        * - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC

        age - seconds since last seen,+ - primary entry using vPC Peer-Link,

        (T) - True, (F) - False, C - ControlPlane MAC, ~ - vsan

   VLAN     MAC Address      Type      age     Secure NTFY Ports

---------+-----------------+--------+---------+------+----+------------------

*  161     0050.5697.a733   dynamic  0         F      F    Eth1/2

*  161     0050.5697.ec1b   dynamic  0         F      F    Eth1/4

*  161     780c.f049.0c81   dynamic  0         F      F    Po1

 

As we have seen in the Edge Transport Node configuration, each Edge Node has two TEP IP addresses statically configured. Both FastPath interfaces load balance the Geneve encapsulated overlay traffic. Table 8 lists the Edge Node TEP MAC addresses so that the entries above can be verified.

Table 7 - ToR/Leaf Switch MAC Address Table for Edge Node TEP VLAN 151

ToR/Leaf Switch NY-N3K-LEAF-10
ToR/Leaf Switch NY-N3K-LEAF-11

NY-N3K-LEAF-10# show mac address-table dynamic vlan 151

Legend:

        * - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC

        age - seconds since last seen,+ - primary entry using vPC Peer-Link,

        (T) - True, (F) - False, C - ControlPlane MAC, ~ - vsan

   VLAN     MAC Address      Type      age     Secure NTFY Ports

---------+-----------------+--------+---------+------+----+------------------

*  151     0050.5697.5165   dynamic  0         F      F    Eth1/2

*  151     0050.5697.84c3   dynamic  0         F      F    Eth1/4

*  151     0050.5697.a733   dynamic  0         F      F    Po1

*  151     0050.5697.ec1b   dynamic  0         F      F    Po1

*  151     502f.a8a8.717c   dynamic  0         F      F    Po1

NY-N3K-LEAF-11# show mac address-table dynamic vlan 151

Legend:

        * - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC

        age - seconds since last seen,+ - primary entry using vPC Peer-Link,

        (T) - True, (F) - False, C - ControlPlane MAC, ~ - vsan

   VLAN     MAC Address      Type      age     Secure NTFY Ports

---------+-----------------+--------+---------+------+----+------------------

*  151     0000.0c9f.f097   dynamic  0         F      F    Po1

*  151     0050.5697.5165   dynamic  0         F      F    Po1

*  151     0050.5697.84c3   dynamic  0         F      F    Po1

*  151     0050.5697.a733   dynamic  0         F      F    Eth1/2

*  151     0050.5697.ec1b   dynamic  0         F      F    Eth1/4

*  151     780c.f049.0c81   dynamic  0         F      F    Po1

 

Table 8 - NSX-T Edge Node TEP MAC Addresses

ny-edge-transport-node-20>
ny-edge-transport-node-21>

ny-edge-transport-node-20> get interface fp-eth0 | find MAC

  MAC address: 00:50:56:97:51:65

 

ny-edge-transport-node-20> get interface fp-eth1 | find MAC

  MAC address: 00:50:56:97:a7:33

ny-edge-transport-node-21> get interface fp-eth0 | find MAC

  MAC address: 00:50:56:97:84:c3

 

ny-edge-transport-node-21> get interface fp-eth1 | find MAC

MAC address: 00:50:56:97:ec:1b

 

For the sake of completeness, the table below shows that only ToR/Leaf switch NY-N3K-LEAF-10 learns the two Edge Node management MAC addresses from VLAN 60 locally; ToR/Leaf switch NY-N3K-LEAF-11 learns them only via Portchannel 1 (Po1). This is expected, as we have configured an active/standby teaming policy on the vDS port group. The Edge Node N-VDS is not relevant for the Edge Node management interface.

Table 9 - ToR/Leaf Switch MAC Address Table for Edge Node Management VLAN 60

ToR/Leaf Switch NY-N3K-LEAF-10
ToR/Leaf Switch NY-N3K-LEAF-11

NY-N3K-LEAF-10# show mac address-table dynamic vlan 60

Legend:

        * - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC

        age - seconds since last seen,+ - primary entry using vPC Peer-Link,

        (T) - True, (F) - False, C - ControlPlane MAC, ~ - vsan

   VLAN     MAC Address      Type      age     Secure NTFY Ports

---------+-----------------+--------+---------+------+----+------------------

*   60     0050.5697.1e49   dynamic  0         F      F    Eth1/4

*   60     0050.5697.4555   dynamic  0         F      F    Eth1/2

*   60     502f.a8a8.717c   dynamic  0         F      F    Po1

NY-N3K-LEAF-11# show mac address-table dynamic vlan 60

Legend:

        * - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC

        age - seconds since last seen,+ - primary entry using vPC Peer-Link,

        (T) - True, (F) - False, C - ControlPlane MAC, ~ - vsan

   VLAN     MAC Address      Type      age     Secure NTFY Ports

---------+-----------------+--------+---------+------+----+------------------

*   60     0000.0c9f.f03c   dynamic  0         F      F    Po1

*   60     0050.5697.1e49   dynamic  0         F      F    Po1

*   60     0050.5697.4555   dynamic  0         F      F    Po1

 

Please note, I highly recommend always running a few failover tests to confirm that the NSX-T Edge Node deployment works as expected.

 

I hope you had a little bit of fun reading this blog post about a single N-VDS on the Edge Node with VLAN pinning.

 

Software Inventory:

vSphere version: VMware ESXi, 6.5.0, 15256549

vCenter version: 6.5.0, 10964411

NSX-T version: 2.5.1.0.0.15314288 (GA)

Cisco Nexus 3048 NX-OS version: 7.0(3)I7(6)

 

 

Blog history

Version 1.0 - 13.04.2020 - first published version

Version 1.1 - 14.04.2020 - minor changes (license)

Version 1.2 - 25.04.2020 - minor changes (typos)

Version 1.3 - 04.06.2020 - adding the 2nd Edge TEP in the second diagram and minor changes (typos)

Dear readers

Welcome to a new series of blog posts talking about network readiness. As you might already be aware, NSX-T mainly requires two things from the physical underlay network:

  • IP Connectivity – IP connectivity between all components of NSX-T and the compute hosts. This includes the Geneve Tunnel Endpoint (TEP) interfaces as well as the management interfaces (typically vmk0) on the hosts and on the NSX-T Edge Nodes - both Bare Metal and VM-based NSX-T Edge Nodes.
  • Jumbo Frame Support – The minimum required MTU is 1600 bytes; however, an MTU of 1700 bytes is recommended to address the full variety of functions and to future-proof the environment for an expanding Geneve header. To get the most out of your VMware SDDC, your physical underlay network should support at least an MTU of 9000 bytes.

This blog has a focus on the MTU readiness for NSX-T. There are other VMkernel interfaces besides the one used for the overlay encapsulation with Geneve, like vSAN or vMotion, which also perform better with a higher MTU, so we keep this MTU discussion more general. Physical network gear vendors, like Cisco with the Nexus Data Center switch family, typically support an MTU of 9216 bytes. Other vendors might have the same upper MTU limit.

 

This blog is about the correct MTU configuration and its verification within a Data Center spine-leaf architecture with Nexus 3K switches running NX-OS. Let's have a look at a very basic and simple lab spine-leaf topology with only three Nexus N3K-C3048TP-1GE switches:

Lab Spine Leaf Topology.png

Out of the box, the Nexus 3048 switches are configured with an MTU of 1500 bytes only. For an MTU of 9216 bytes we need to configure three pieces:

  • Layer 3 Interfaces MTU Configuration – This type of interface is used between the Leaf-10 and the Borderspine-12 switch and between the Leaf-11 and the Borderspine-12 switch. We run OSPF on this interface to announce the Loopback0 interface for the iBGP peering connectivity. As an example, the MTU Layer 3 interface configuration on interface e1/49 of Leaf-10 is shown below:
Nexus 3048 Layer 3 Interface MTU Configuration

NY-N3K-LEAF-10# show run inter e1/49

---snip---

interface Ethernet1/49

  description **L3 to NY-N3K-BORDERSPINE-12**

  no switchport

  mtu 9216

  no ip redirects

  ip address 172.16.3.18/30

  ip ospf network point-to-point

  no ip ospf passive-interface

  ip router ospf 1 area 0.0.0.0

NY-N3K-LEAF-10#

 

  • Layer 3 Switch Virtual Interfaces (SVI) MTU Configuration – This type of interface is required, for example, to establish IP connectivity between the Leaf-10 and Leaf-11 switches when the interfaces between the Leaf switches are configured as Layer 2 interfaces. We are using a dedicated SVI for VLAN 3 for the OSPF neighborship and the iBGP peering connectivity between Leaf-10 and Leaf-11. In this lab topology the interfaces e1/51 and e1/52 are configured as a dot1q trunk to carry multiple VLANs (including VLAN 3), and these two interfaces are combined into a port-channel running LACP for redundancy reasons. As an example, the MTU configuration of the SVI for VLAN 3 on Leaf-10 is shown below:
Nexus 3048 Switch Virtual Interface (SVI) MTU Configuration

NY-N3K-LEAF-10# show run inter vlan 3

---snip---

interface Vlan3

  description *iBGP-OSPF-Peering*

  no shutdown

  mtu 9216

  no ip redirects

  ip address 172.16.3.1/30

  ip ospf network point-to-point

  no ip ospf passive-interface

  ip router ospf 1 area 0.0.0.0

NY-N3K-LEAF-10#

 

  • Global Layer 2 Interface MTU Configuration – This global configuration is required for this type of Nexus switch and a few other Nexus models (please see footnote 1 for more details). This Nexus 3000 does not support individual Layer 2 interface MTU configuration; the MTU for Layer 2 interfaces must be configured via a network-qos policy. All interfaces configured as access or trunk ports for host connectivity, as well as the dot1q trunk between the Leaf switches (e1/51 and e1/52), require the network-qos configuration shown below:
Nexus 3048 Global MTU QoS Policy Configuration

NY-N3K-LEAF-10#show run

---snip---

policy-map type network-qos POLICY-MAP-JUMBO

  class type network-qos class-default

   mtu 9216

system qos

  service-policy type network-qos POLICY-MAP-JUMBO

NY-N3K-LEAF-10#

 

The network-qos global MTU configuration needs to be verified with the command as shown below:

Nexus 3048 Global MTU QoS Policy Verification

NY-N3K-LEAF-10# show queuing interface ethernet 1/51-52 | include MTU

HW MTU of Ethernet1/51 : 9216 bytes

HW MTU of Ethernet1/52 : 9216 bytes

NY-N3K-LEAF-10#

 

The verification of the end-to-end MTU of 9216 bytes within the physical network should typically be done before you attach your first ESXi hosts. Please keep in mind that the virtual Distributed Switch (vDS) and the NSX-T N-VDS (e.g. the uplink profile MTU configuration) today support up to 9000 bytes. This MTU includes the overhead for the Geneve encapsulation. As you can see in the output below from an ESXi host, the MTU is set to the maximum of 9000 bytes for the VMkernel interfaces used for Geneve (unfortunately still labelled vxlan) as well as for vMotion and IP storage.

ESXi Host MTU VMkernel Interface Verification

[root@NY-ESX50A:~] esxcfg-vmknic -l

Interface  Port Group/DVPort/Opaque Network        IP Family IP Address      Netmask         Broadcast       MAC Address       MTU     TSO MSS   Enabled Type     NetStack           

vmk0       2                                       IPv4      172.16.50.10    255.255.255.0   172.16.50.255   b4:b5:2f:64:f9:48 1500    65535     true    STATIC   defaultTcpipStack  

vmk2       17                                      IPv4      172.16.52.10    255.255.255.0   172.16.52.255   00:50:56:63:4c:85 9000    65535     true    STATIC   defaultTcpipStack  

vmk10      10                                      IPv4      172.16.150.12   255.255.255.0   172.16.150.255  00:50:56:67:d5:b4 9000    65535     true    STATIC   vxlan              

vmk50      910dba45-2f63-40aa-9ce5-85c51a138a7d    IPv4      169.254.1.1     255.255.0.0     169.254.255.255 00:50:56:69:68:74 1500    65535     true    STATIC   hyperbus           

vmk1       8                                       IPv4      172.16.51.10    255.255.255.0   172.16.51.255   00:50:56:6c:7c:f9 9000    65535     true    STATIC   vmotion            

[root@NY-ESX50A:~]

 

For the verification of the end-to-end MTU between two ESXi hosts, I still highly recommend sending VMkernel pings with the don't-fragment bit set (e.g. vmkping ++netstack=vxlan -d -c 3 -s 8972 -I vmk10 172.16.150.13).
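Spelled out, with the reasoning behind the 8972-byte payload (vmk10 is the vxlan netstack VMkernel interface shown above; 172.16.150.13 is assumed to be the TEP of the second compute host):

[root@NY-ESX50A:~] vmkping ++netstack=vxlan -d -c 3 -s 8972 -I vmk10 172.16.150.13

The -d flag sets the don't-fragment bit, and -s 8972 equals the 9000-byte vmk MTU minus 20 bytes IPv4 header minus 8 bytes ICMP header.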

 

But for a serious end-to-end MTU 9216 verification of the physical network we need to look at another tool than the VMkernel ping. In my case I am simply using BGP running on the Nexus 3048 switches. BGP runs on top of TCP, and TCP supports the "Maximum Segment Size" option to maximize the TCP datagrams.

 

The TCP Maximum Segment Size (MSS) is a parameter in the options field of the TCP header that specifies the largest amount of data, in bytes, that an endpoint is willing to receive in a single TCP segment. This information is exchanged during the TCP three-way handshake (SYN), as the diagram below from a Wireshark sniffer trace shows.

Wireshark-MTU9216-MSS-TCP.png

The TCP MSS defines the maximum amount of data that an IPv4 endpoint is willing to accept in a single TCP/IPv4 datagram. RFC 879 explicitly mentions that the MSS counts only data octets in the segment; it does not count the TCP header or the IP header. In the Wireshark trace example the two IPv4 endpoints (Loopbacks 172.16.3.10 and 172.16.3.12) have accepted an MSS of 9176 bytes on a physical Layer 3 link with MTU 9216 during the TCP three-way handshake. The difference of 40 bytes is based on the default TCP header of 20 bytes and the IP header of again 20 bytes.

Please keep in mind, a small MSS value will reduce or eliminate IP fragmentation for any TCP based application, but will result in higher overhead. This is also true for BGP messages.

BGP update messages carry all the BGP prefixes as part of the Network Layer Reachability Information (NLRI) Path Attribute. For optimal BGP performance in a spine-leaf architecture running BGP, it is advisable to set the MSS for BGP to the maximum value that still avoids fragmentation. As defined in RFC 879, all IPv4 endpoints are required to handle an MSS of 536 bytes (= MTU of 576 bytes minus 20 bytes TCP header** minus 20 bytes IP header).

But are these Nexus switches using MSS of 536 bytes only? Nope!

These Nexus 3048 switches running NX-OS 7.0(3)I7(6) are by default configured to discover the maximal path MTU between the two IPv4 endpoints leveraging the Path MTU Discovery (PMTUD) feature. Other Nexus switches may require the configuration of the global command "ip tcp path-mtu-discovery" to enable PMTUD.
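For such platforms, a hedged sketch of that global configuration (not needed on the lab switches used here, where PMTUD is already on by default):

configure terminal
  ! enable TCP Path MTU Discovery globally
  ip tcp path-mtu-discovery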

 

MSS is sometimes mistaken for PMTUD. MSS is a concept used by TCP at the Transport Layer and specifies the largest amount of data that a computer or communications device can receive in a single TCP segment, while PMTUD is used to discover the largest packet size that can be sent over a path without suffering fragmentation.

 

But how can we verify the MSS used for the BGP peering sessions between the Nexus 3048 switches?

Nexus 3048 switches running NX-OS allow the administrator to check the MSS of the BGP TCP session with the following command: show sockets connection tcp detail.

Below we see two BGP TCP sessions between the IPv4 endpoints (the switch Loopback interfaces), and each of the sessions shows an MSS of 9164 bytes.

BGP TCP Session Maximum Segment Size Verification

NY-N3K-LEAF-10# show sockets connection tcp local 172.16.3.10 detail

 

---snip---

 

Kernel Socket Connection:

State      Recv-Q Send-Q        Local Address:Port          Peer Address:Port

 

ESTAB      0      0               172.16.3.10:24415          172.16.3.11:179    ino:78187 sk:ffff88011f352700

 

     skmem:(r0,rb262144,t0,tb262144,f0,w0,o0) ts sack cubic wscale:2,2 rto:210 rtt:12.916/14.166 ato:40 mss:9164 cwnd:10 send 56.8Mbps rcv_space:18352

 

 

ESTAB      0      0               172.16.3.10:45719          172.16.3.12:179    ino:79218 sk:ffff880115de6800

 

     skmem:(r0,rb262144,t0,tb262144,f0,w0,o0) ts sack cubic wscale:2,2 rto:203.333 rtt:3.333/1.666 ato:40 mss:9164 cwnd:10 send 220.0Mbps rcv_space:18352

 

 

NY-N3K-LEAF-10#

Please always reset the BGP session when you change the MTU, as the MSS is only discovered during the initial TCP three-way handshake.
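A hedged example of such a reset on the leaf switch (this drops and re-establishes all BGP sessions, so run it in a maintenance window):

NY-N3K-LEAF-10# clear ip bgp *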

 

The MSS value of 9164 bytes confirms that the underlay physical network is ready with an end-to-end MTU of 9216 bytes. But why is the MSS value of the BGP session (9164) 12 bytes smaller than the TCP MSS value (9176) negotiated during the TCP three-way handshake?

Again, in many TCP/IP stack implementations we see an MSS of 1460 bytes with an interface MTU of 1500 bytes, and an MSS of 9176 bytes with an interface MTU of 9216 bytes (a 40-byte difference), but there are other factors that can change this. For example, if both sides support RFC 1323/7323 (enhanced timestamps, window scaling, PAWS***), this adds 12 bytes to the TCP header, reducing the payload to 1448 bytes and 9164 bytes respectively.

And indeed, the Nexus NX-OS TCP/IP stack used for BGP supports the TCP enhanced timestamps option by default and leverages the PMTUD (RFC 1191) feature to account for the extra 12 bytes, and hence reduces the maximal payload (the payload in our case being BGP) to an MSS of 9164 bytes.
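Summarized as a quick calculation (assuming an IPv4 header without options):

  9216 bytes interface MTU
-   20 bytes IPv4 header
-   20 bytes TCP base header
-   12 bytes TCP options (timestamps)
= 9164 bytes TCP MSS for the BGP session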

 

The diagram below from a Wireshark sniffer trace confirms the extra 12 bytes used for the TCP timestamps option.

Wireshark-TCP-12bytes-Option-timestamps.png

Hope you had a little bit of fun reading this small Network Readiness write-up.

 

Footnote 1: Configure and Verify Maximum Transmission Unit on Cisco Nexus Platforms - Cisco

** A TCP header of 20 bytes is only correct when no TCP header options are used; RFC 1323 - TCP Extensions for High Performance (replaced by RFC 7323 - TCP Extensions for High Performance) defines TCP extensions which require up to 12 additional bytes.

*** PAWS = Protect Against Wrapped Sequences

 

Software Inventory:

vSphere version: VMware ESXi, 6.5.0, 15256549

vCenter version: 6.5.0, 10964411

NSX-T version: 2.5.1.0.0.15314288 (GA)

Cisco Nexus 3048 NX-OS version: 7.0(3)I7(6)

 

Blog history:

Version 1.0 - 23.03.2020 - first published version


NSX-T N-VDS VLAN Pinning

Posted by oziltener Aug 19, 2019

Dear readers

As you are probably aware, NSX-T uses its own vSwitch called N-VDS. The N-VDS is primarily used to encapsulate and decapsulate GENEVE overlay traffic between NSX-T transport nodes, along with supporting the distributed Firewall (dFW) for micro-segmentation. The N-VDS requires its own dedicated pNIC interfaces. These pNICs cannot be shared with vSphere vSwitches (vDS or vSS). In a typical NSX-T deployment, each NSX-T transport node has one or two Tunnel End Points (TEPs) to terminate the GENEVE overlay traffic. The number of TEPs is directly related to the attached Uplink Profile. In case you use an uplink teaming policy "failover", only a single TEP is used. In case of the teaming policy "Load Balance Source", a TEP is assigned for each physical NIC. Such a "Load Balance Source" Uplink Profile is shown below and will be used for this lab exercise.

Screen Shot 2019-08-19 at 20.07.00.png

The mapping of the "Uplinks" is as follows:

  • ActiveUplink1 is the pNIC (vmnic2) connected to ToR switch NY-CAT3750G-A
  • ActiveUplink2 is the pNIC (vmnic3) connected to ToR switch NY-CAT3750G-B

 

Additionally, you can see VLAN 150, which carries the GENEVE encapsulated traffic.

 

However, the N-VDS can also be used for VLAN-based segments. VLAN-based segments are very similar to vDS port groups. In deployments where your hosts have only two pNICs and both pNICs are used for the N-VDS (yes, for redundancy reasons), you have to use VLAN-based segments to carry the VMkernel interfaces (e.g. mgmt, vMotion or vSAN). When your VLAN-based segments are used to carry VMkernel interface traffic and you use an Uplink Profile as shown above, it is difficult to figure out on which pNIC the VMkernel traffic is carried, as this traffic follows the default teaming policy, in our case "Load Balance Source". Please note, VLAN-based segments are not limited to VMkernel traffic; such segments can also carry regular virtual machine traffic.

 

There are often good reasons to do traffic steering to get a predictable traffic flow behavior; for example, you would like to transport Management and vMotion VMkernel traffic under normal conditions (all physical links are up) on pNIC_A and vSAN on pNIC_B. The top two reasons are:

1.) predict the forwarding traffic pattern under normal conditions (all links are up) and align, for example, the VMkernel traffic with the active First Hop Gateway Protocol (e.g. HSRP)

2.) reduce ISL traffic between the two ToR switches or ToR-to-Spine traffic for high-load traffic (e.g. vSAN or vMotion), along with predictable and low-latency traffic forwarding (assume, for example, you have 20 hosts in a single rack and all hosts use the left ToR switch for vSAN; in such a situation the ISL does not carry vSAN traffic)

 

This is where NSX-T "VLAN Pinning" comes into play. The term "VLAN Pinning" is referred to in our NSX-T public documentation as "Named Teaming Policy". Actually I like the term "VLAN Pinning". In the lab exercise for this blog, I would like to show you how you can configure "VLAN Pinning". The physical lab setup looks like the diagram below:

Physical Host Representation-Version1.png

For this exercise only host NY-ESX72A is relevant. This host is attached to two Top of Rack (ToR) Layer 3 switches, called NY-CAT3750G-A and NY-CAT3750G-B. As you can see, this host has four pNICs (vmnic0...3), but only the pNICs vmnic2 and vmnic3 assigned to the N-VDS are relevant for this lab exercise. On the host NY-ESX72A, I have created three additional "artificial/dummy" VMkernel interfaces (vmk3, vmk4, vmk5). Each of the three VMkernel interfaces is assigned to a dedicated NSX-T VLAN-based segment. The diagram below shows the three VMkernel interfaces, all attached to a dedicated VLAN-based segment owned by the N-VDS (NY-NVDS), and the MAC address of vmk3 as an example.

Screen Shot 2019-08-19 at 21.00.26.png
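If you prefer the CLI over the vSphere Client, the VMkernel interfaces and their MAC addresses can also be listed directly on the host; a hedged example (the prompt follows the lab host naming):

[root@NY-ESX72A:~] esxcfg-vmknic -l

The output lists every VMkernel NIC of the host together with its port group or segment, IP address, MAC address and MTU.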

 

The simplified logical setup is shown below:

Logical Representation-default-teaming-Version1.png

 

 

From the NSX-T perspective we actually have configured three VLAN-based segments. These VLAN-based segments are created with the new policy UI/API.

NSX-T-VLAN-Segments-red-marked.png

The Policy UI/API is the new interface since NSX-T 2.4.0 and is the preferred interface for the majority of NSX-T deployments. The "legacy" UI/API is still available and is visible in the UI under the tab "Advanced Networking & Security".

 

As already mentioned, the three VLAN-based segments use the default teaming policy (Load Balance Source), so the VMkernel traffic is distributed over the two pNICs (vmnic2 or vmnic3). Hence, we typically cannot predict which of the ToR switches will learn the associated MAC address of the three individual VMkernel interfaces. Before we move forward and configure "VLAN Pinning", let's see how the traffic of the three VMkernel interfaces is distributed. One of the easiest ways is to check the MAC address table of the two ToR switches for interface Gi1/0/10.

Screen Shot 2019-08-19 at 20.53.59.png
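For reference, a hedged example of the IOS commands behind this check:

NY-CAT3750G-A# show mac address-table dynamic interface GigabitEthernet1/0/10
NY-CAT3750G-B# show mac address-table dynamic interface GigabitEthernet1/0/10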

As you can see, NY-CAT3750G-A is learning only the MAC address of vmk3 (0050.5663.f4eb), whereas NY-CAT3750G-B is learning the MAC addresses of vmk4 (0050.5667.50eb) and vmk5 (0050.566d.410d). With the default teaming option "Load Balance Source", the administrator has no option to steer the traffic. Please ignore the two learned MAC addresses from VLAN 150; these are TEP MAC addresses.

 

Before we now configure VLAN Pinning, let's assume we would like vmk3 and vmk4 to be learnt on NY-CAT3750G-A and vmk5 on NY-CAT3750G-B (when all links are up). We would like to use two new "Named Teaming Policies" with failover. The traffic flows should look like the diagram below --> a dotted line means "standby link".

Logical Representation-vlan-pinning-teaming-Version1.png

The first step is to create two additional "Named Teaming Policies". Please compare this diagram with the very first diagram above. Please be sure you use identical names for the uplinks (ActiveUplink1 and ActiveUplink2) as for the default teaming policy.

Edit-Uplink-Profile.png

 

The second step is to make these two new "Named Teaming Policies" available for the associated VLAN transport zone (TZ).

Edit-TZ-for-vlan-pinning.png

The third and last step is to edit the three VLAN-based segments according to your traffic steering policy. As you can see, we unfortunately need to edit the VLAN-based segments in the "legacy" "Advanced Networking & Security" UI section. We plan to make this editing option available in the new Policy UI/API in one of the future NSX-T releases.

NY-VLAN-SEGMENT-90.png

NY-VLAN-SEGMENT-91.png

NY-VLAN-SEGMENT-92.png

As soon as you edit the VLAN-based segments with the new "Named Teaming Policy", the ToR switches will immediately learn the MAC addresses on the associated physical interfaces.

After applying "VLAN Pinning" through the two new "Named Teaming Policies", the two ToR switches learn the MAC addresses in the following way:

Catalyst-MAC-table-with-vlan-pinning.png

As you can see, NY-CAT3750G-A is now learning the MAC addresses of vmk3 and vmk4, whereas NY-CAT3750G-B is learning only the MAC address of vmk5.

Hope you had a little bit of fun reading this NSX-T VLAN Pinning write-up.

 

 

Software Inventory:

vSphere version: 6.5.0, build 13635690

vCenter version: 6.5.0, build 10964411

NSX-T version: 2.4.1.0.0.13716575

 

Blog history

Version 1.0 - 19.08.2019 - first published version

Dear readers

I was recently at a customer site where we discussed the details of NSX-T north/south connectivity with active/active Edge Node virtual machines to maximize throughput and resiliency. Achieving the highest north-to-south (and vice versa) bandwidth requires the installation of multiple edge nodes in active/active mode leveraging ECMP routing.

But let's first have a basic view of an NSX-T ECMP deployment.

In a typical deployment the physical router is a Layer 3 leaf switch acting as Top of Rack (ToR) device. Two of them are required to provide redundancy. NSX-T basically supports two edge node deployment options: active/standby and active/active. For maximizing throughput and the highest level of resiliency, the active/active deployment option is the right choice. NSX-T is able to install up to eight paths leveraging ECMP routing. As you are most likely already familiar with NSX-T, you know that NSX-T requires the Service Router (SR) component on each individual edge node (VM or Bare Metal) to set up the BGP peering with the physical router. But have you ever thought about the details of what eight ECMP path entries really mean? Are these eight paths counted on the Tier0 logical router, on the edge node itself, or where?

 

Before we talk about the eight ECMP paths, let us have a closer look at the physical setup. For this exercise I have only 4 ESXi hosts available in my lab. Each host is equipped with four 1Gbit/s pNICs. Two of these ESXi hosts are purely used to provide CPU and memory resources to the edge node VMs, and the other two ESXi hosts are prepared with NSX-T (NSX-T VIBs installed). The two "Edge" ESXi hosts have two vDS, each with 2 pNICs configured. The first vDS is used for vmk0 management, vMotion and IP storage; the second vDS is used for the Tunnel End Point (TEP) encapsulated GENEVE traffic and the routed uplink traffic towards the ToR switches. The edge node VMs act as NSX-T transport nodes; they typically have two or three N-VDS embedded (a future release will support a single N-VDS per edge node). The two compute hosts are prepared with NSX-T, they also act as transport nodes, and they have a slightly different setup regarding vSwitches. The first vSwitch is again a vDS with two pNICs and is used for vmk0 management, vMotion and IP storage. The other two pNICs are assigned to the NSX-T N-VDS and are responsible for the TEP traffic. The diagram below shows the simplified physical setup.

Physical Host Representation-Version1.png

As you can easily see, the two "Edge" vSphere hosts have in total eight edge node VMs installed. This is a purpose-built "Edge" vSphere cluster to serve edge node VMs only. Is this kind of deployment recommended in a real customer deployment? It depends :-)

To have 4 pNICs is probably a good choice, but most likely 10Gbit/s or 25Gbit/s interfaces instead of 1Gbit/s interfaces are preferred or even required for high bandwidth throughput. When you host more than one edge node VM per ESXi host, I recommend using at least 25Gbit/s interfaces. As our focus is on maximizing throughput and resiliency, a customer deployment would likely have 4 or more ESXi hosts for the "Edge" vSphere cluster. Other aspects should be considered as well, like the storage system used (e.g. vSAN), operational aspects (e.g. maintenance mode) or vSphere cluster settings. For this lab "small" sized edge node VMs are used; real deployments should use "large" sized edge node VMs where maximal throughput is required. A dedicated purpose-built "Edge" vSphere cluster can be considered best practice when maximal throughput and highest resiliency along with operational simplification is required. Here are two additional diagrams of the edge node VM deployment in my lab.

Screen Shot 2019-08-06 at 06.06.20.png

Screen Shot 2019-08-06 at 06.14.38.png

 

As we now have an idea of how the physical environment looks, it is time to move forward and dig into the logical routing design.

 

Multi Tier Logical Routing-Version1.png

To simplify the picture, the diagram shows only a single compute transport node (NY-ESX70A) and only six of the eight edge node VMs. All eight edge node VMs are assigned to a single NSX-T edge cluster, and this edge cluster is assigned to the Tier0 logical router. The logical design shows a two-tier architecture with a Tier0 logical router and two Tier1 logical routers. This is a very common design. Centralized services are not deployed at the Tier1 level in this exercise. A Tier0 logical router consists in almost all cases (as you normally want to use static or dynamic routing to reach the physical world) of a Service Router (SR) and a Distributed Router (DR). Only the edge node VM can host the Service Router (SR). As already said, the Tier1 logical routers have in this exercise only the DR component instantiated; a Service Router (SR) is not required, as centralized services (e.g. Load Balancer) are not configured. Each SR has two eBGP peerings with the physical routers. Please keep in mind, only the two overlay segments green-240 and blue-241 are user-configured segments. Workload VMs are attached to these overlay segments. These overlay segments provide VM mobility across physical boundaries. The segment between the Tier0 SR and DR and the segments between the Tier0 DR and the Tier1 DRs are overlay segments automatically configured by NSX-T, including the IP address assignment.

Meanwhile, you might have already recognized that eight edge nodes might equal eight ECMP paths. Yes, this is true... but where are these eight ECMP paths installed in the routing respectively forwarding table? These eight paths are installed neither on the logical construct Tier0 logical router nor on a single edge node. The eight ECMP paths are installed on the Tier0 DR component of each individual compute transport node, in our case on the NY-ESX70A Tier0 DR and the NY-ESX71A Tier0 DR. The CLI output below shows the forwarding table on the compute transport node NY-ESX70A.
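If you first need to look up the UUID of the Tier0 DR used in the command below, it can be listed directly on the transport node; a hedged example, which lists the logical router instances known to this transport node together with their UUIDs:

NY-ESX70A> get logical-routers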

 

IPv4 Forwarding Table NY-ESX70A Tier0 DR

NY-ESX70A> get logical-router e4a0be38-e1b6-458a-8fad-d47222d04875 forwarding ipv4

                                   Logical Routers Forwarding Table - IPv4                            

--------------------------------------------------------------------------------------------------------------

Flags Legend: [U: Up], [G: Gateway], [C: Connected], [I: Interface]

[H: Host], [R: Reject], [B: Blackhole], [F: Soft Flush], [E: ECMP]

 

                   Network                               Gateway                Type               Interface UUID   

==============================================================================================================

0.0.0.0/0                                              169.254.0.2              UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f

0.0.0.0/0                                              169.254.0.3              UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f

0.0.0.0/0                                              169.254.0.4              UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f

0.0.0.0/0                                              169.254.0.5              UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f

0.0.0.0/0                                              169.254.0.6              UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f

0.0.0.0/0                                              169.254.0.7              UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f

0.0.0.0/0                                              169.254.0.8              UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f

0.0.0.0/0                                              169.254.0.9              UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f

100.64.48.0/31                                           0.0.0.0                UCI     03ae946a-bef4-45f5-a807-8e74fea878b6

100.64.48.2/31                                           0.0.0.0                UCI     923cbdaf-ad8a-45ce-9d9f-81d984c426e4

169.254.0.0/25                                           0.0.0.0                UCI     48d83fc7-1117-4a28-92c0-7cd7597e525f

--snip--

Each compute transport node can distribute the traffic sourced by the attached workload VMs from south to north across these eight paths (as we have eight different next hops), with a single path per Service Router. With such an active/active ECMP deployment we maximize the forwarding bandwidth from south to north. This is shown in the diagram below.

Multi Tier Logical Routing-South-to-North-Version1.png

On the other hand, from north to south, each ToR switch has eight paths installed (indicated with "multipath") to reach the destination networks green-240 and blue-241. The ToR switch will distribute the traffic from the physical world across all of the eight next hops. Here we achieve the maximum throughput from north to south as well. Let's have a look at the BGP table of the two ToR switches for the destination network green-240.

 

BGP Table for "green" prefix 172.16.240.0/24 on RouterA and RouterB

NY-CAT3750G-A#show ip bgp 172.16.240.0/0

BGP routing table entry for 172.16.240.0/24, version 189

Paths: (9 available, best #8, table Default-IP-Routing-Table)

Multipath: eBGP

Flag: 0x1800

  Advertised to update-groups:

     1          2  

  64513

    172.16.160.20 from 172.16.160.20 (172.16.160.20)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath

  64513

    172.16.160.22 from 172.16.160.22 (172.16.160.22)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath

  64513

    172.16.160.23 from 172.16.160.23 (172.16.160.23)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath

  64513

    172.16.160.21 from 172.16.160.21 (172.16.160.21)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath

  64513

    172.16.160.27 from 172.16.160.27 (172.16.160.27)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath

  64513

    172.16.160.26 from 172.16.160.26 (172.16.160.26)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath

  64513

    172.16.160.25 from 172.16.160.25 (172.16.160.25)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath

  64513

    172.16.160.24 from 172.16.160.24 (172.16.160.24)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath, best

  64513

    172.16.3.11 (metric 11) from 172.16.3.11 (172.16.3.11)

      Origin incomplete, metric 0, localpref 100, valid, internal

NY-CAT3750G-A#

NY-CAT3750G-B#show ip bgp 172.16.240.0/0

BGP routing table entry for 172.16.240.0/24, version 201

Paths: (9 available, best #9, table Default-IP-Routing-Table)

Multipath: eBGP

Flag: 0x1800

  Advertised to update-groups:

     1          2  

  64513

    172.16.161.20 from 172.16.161.20 (172.16.160.20)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath

  64513

    172.16.161.23 from 172.16.161.23 (172.16.160.23)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath

  64513

    172.16.161.21 from 172.16.161.21 (172.16.160.21)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath

  64513

    172.16.161.26 from 172.16.161.26 (172.16.160.26)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath

  64513

    172.16.161.22 from 172.16.161.22 (172.16.160.22)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath

  64513

    172.16.161.27 from 172.16.161.27 (172.16.160.27)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath

  64513

    172.16.161.25 from 172.16.161.25 (172.16.160.25)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath

  64513

    172.16.3.10 (metric 11) from 172.16.3.10 (172.16.3.10)

      Origin incomplete, metric 0, localpref 100, valid, internal

  64513

    172.16.161.24 from 172.16.161.24 (172.16.160.24)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath, best

NY-CAT3750G-B#
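The "multipath" flag in the BGP entries above relies on eBGP multipath being enabled on the ToR switches. A minimal, hedged IOS sketch of that piece of configuration:

router bgp 64512
 ! install up to eight equal eBGP paths instead of only the single best path
 maximum-paths 8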

 

Traffic arriving at the Service Router (SR) from the ToR switches is kept locally on the edge node before the traffic is forwarded to the destination VM (GENEVE encapsulated). This is shown in the next diagram below.

Multi Tier Logical Routing-North-to-South-Version1.png

And what is the final conclusion of this little lab exercise?

Each Service Router on an edge node provides exactly one next hop to each individual compute transport node. The number of BGP peerings per edge node VM is not what determines the eight ECMP paths; the number of edge nodes is. Theoretically, a single eBGP peering from each edge node would result in the same number of ECMP paths. But please keep in mind that two BGP sessions per edge node provide better resiliency; a short sketch of how to check the sessions from the edge node CLI follows below. Hope you had a little bit of fun reading this NSX-T ECMP edge node write-up.
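
If you would like to verify this from the NSX-T side, the edge node CLI is a good starting point. The snippet below is only a rough sketch from memory of the NSX-T 2.4 edge CLI; the hostname, the prompts, the VRF ID and the trailing "#" annotations are illustrative, so please verify the commands against your own release.

ny-edge01> get logical-routers                   # list the DR/SR instances and note the VRF ID of the Tier-0 SR
ny-edge01> vrf 2                                 # the VRF ID 2 is only an example
ny-edge01(tier0_sr)> get bgp neighbor summary    # both eBGP sessions towards the ToR switches should be Established
ny-edge01(tier0_sr)> get route                   # shows the prefixes the SR learns and originates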

 

Software Inventory:

vSphere version: 6.5.0, build 13635690

vCenter version: 6.5.0, build 10964411

NSX-T version: 2.4.1.0.0.13716575

 

Blog history

Version 1.0 - 06.08.2019 - first published version

Version 1.1 - 19.08.2019 - minor changes

Dear readers

This is the second blog of a series related to NSX-T. This second part provides you with the relevant information required to better understand the implications of centralized services in NSX-T. While the first blog provided an introduction to the lab setup, this second blog discusses the impact of adding a Tier-1 Edge Firewall for the tenant BLUE. The diagram below shows the logical representation of the lab setup with the Edge Firewall attached to the Tier-1 uplink interface of the Logical Router for tenant BLUE.

Blog-Diagram-2.1.png

 

For this blog I have selected an Edge Firewall on a Tier-1 Logical Router, but I could also have selected a Load Balancer, a VPN service or a NAT service. The implications for the "internal" NSX-T networking are similar. However, please keep in mind that with NSX-T 2.3 not all centralized services are supported at the Tier-1 level (for example VPN) or at the Tier-0 level (for example Load Balancer), and not all services (for example DHCP or Metadata Proxy) will instantiate a Service Router.

 

Before I move forward and try to explain what happens under the hood when you enable an Edge Firewall, I would like to provide some additional information on the diagram below.

Blog-Diagram-2.2.png

I am sure you are already familiar with the diagram above, as we talked about it in my first blog. Each of the four Transport Nodes (TN) has the two tenant Tier-1 Logical Routers instantiated. Inside each Transport Node, two Logical Switches with VNI 17295 and 17296 are used between the Tier-1 tenant DRs and the Tier-0 DR. These two automatically instantiated (sometimes referred to as auto-plumbing) transit overlay Logical Switches get the subnets 100.64.144.18/31 and 100.64.144.20/31 assigned automatically. Internal filtering avoids duplicate IP address challenges, in the same way NSX-T already does for the gateway IPs (.254) on the Logical Switches 17289 and 17294 where the VMs are attached. Each of these Tier-1 to Tier-0 transit Logical Switches (17295 and 17296) could be shown as linked together in the diagram, but as internal filtering takes place, this is irrelevant for the moment.

The intra Tier-0 Logical Switch with VNI 17292 is used to forward traffic between the Tier-0 DRs and northbound via the Service Routers (SR). This Logical Switch 17292 again has an automatically assigned IP subnet (169.254.0.0/28). Each Tier-0 DR is assigned the same IP address (.1), but the two Service Routers use different IPs (.2 and .3); otherwise the Tier-0 DR would not be able to forward based on equal cost to two different next hops.
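
If you would like to see this equal-cost behaviour yourself, you can look at the routing table of the Tier-0 DR instance directly on a transport node. The lines below are an illustrative sketch only; the prompt, the VRF ID and the "#" annotations are examples, and the command set may differ slightly between NSX-T releases.

en1> get logical-routers        # note the VRF ID of the DISTRIBUTED_ROUTER_TIER0 instance
en1> vrf 1                      # the VRF ID 1 is only an example
en1(tier0_dr)> get route        # the default route should show two next hops, 169.254.0.2 and 169.254.0.3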

 

Before the network administrator is able to configure an Edge Firewall for tenant BLUE at the Tier-1 level, he has to assign an edge-cluster to the Tier-1 Logical Router along with the edge-cluster members. This is shown in the diagram below.

Blog2-add-edge-nodes-to-Tier1-BLUE.png

Please be aware that as soon as you assign an edge-cluster to a Tier-1 Logical Router, a Service Router is automatically instantiated, independent of the Edge Firewall.

 

These two new Service Routers run on the edge-nodes in active/standby mode, as shown in the next diagram below.

Blog2-routing-tier1-blue-overview.png
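
This can also be observed on the edge nodes themselves. Treat the lines below as an illustrative sketch only; the command names are from memory of the NSX-T 2.3 edge CLI, the "#" annotations are not part of the commands, and both may need adjustment for your release.

en1> get logical-routers            # after the edge-cluster assignment a SERVICE_ROUTER_TIER1 instance appears
en1> get high-availability status   # indicates whether this edge node hosts the active or the standby instance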

 

The configuration of the tenant BLUE Edge Firewall itself is shown in the next diagram. In this lab we use the default firewall policy.

Blog2-enable-edge-firewall.png

This simple configuration step of adding the two edge-nodes to the Tier-1 Logical Router for tenant BLUE causes NSX-T to "re-organize" the internal auto-plumbing network. To understand what is happening under the hood, I have divided these internal network changes into four steps instead of showing only the final result.

 

In step 1, NSX-T internally disconnects the Tier-0 DR from the Tier-1 DR for the BLUE tenant, as the northbound traffic needs to be redirected to the two edge-nodes where the Tier-1 Service Routers are running. The internal Logical Switch with VNI 17295 is now explicitly linked together across the four Transport Nodes (TN).

Blog-Diagram-2.3.png

 

In step 2, NSX-T automatically instantiates on each edge-node a new Service Router at the Tier-1 level for tenant BLUE with an Edge Firewall. The Service Routers run in active/standby mode. In this example, the Service Router running on the Transport Node EN1-TN is active, while the Service Router running on EN2-TN is standby. The Tier-1 Service Router uplink interface with the IP address 100.64.144.19 is accordingly either UP or DOWN.

Blog-Diagram-2.4.png

 

In step 3, NSX-T connects the Tier-1 Service Router and the Distributed Router for the BLUE tenant together. For this connection, a new Logical Switch with VNI 17288 is added. Again, the Service Router running on EN1-TN has the active interface with the IP address 169.254.0.2 up, and the corresponding interface of the Service Router on EN2-TN is down. This ensures that only the active Service Router can forward traffic.

Blog-Diagram-2.5.png

 

In the final step 4, NSX-T extends the Logical Switch with VNI 17288 to the two compute Transport Nodes ESX70A and ESX71A. This extension is required to route traffic from, for example, vm1 on the local host before the traffic is forwarded to the Edge Transport Nodes. Finally, NSX-T adds the required static routing between the different Distributed and Service Routers. NSX-T performs all these steps under the hood automatically.

Blog-Diagram-2.6.png
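
The auto-plumbed static routes mentioned above can be inspected on the edge node as well. The following is a hypothetical sketch only; the VRF ID is an example, the IP addresses are taken from the diagrams above, and the "#" lines are annotations rather than commands.

en1> get logical-routers      # find the VRF ID of the new tenant BLUE SERVICE_ROUTER_TIER1
en1> vrf 5                    # the VRF ID 5 is only an example
en1(tier1_sr)> get route      # expect a static default route via 100.64.144.18 towards the Tier-0,
                              # while the Tier-0 side reaches the BLUE segments via 100.64.144.19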

 

The next diagram below shows a traffic flow between vm1 and vm3. The traffic sourced from vm1 first hits the local DR of the BLUE tenant on ESX70A-TN. The traffic then needs to be forwarded to the active Tier-1 Service Router (SR) with the Edge Firewall running on Edge Transport Node EN1-TN. The traffic then reaches the Tier-0 DR on EN1-TN, is forwarded to the RED Tier-1 DR and finally arrives at vm3. The return traffic first hits the local DR of the RED tenant on ESX71A-TN before it reaches the Tier-0 DR on the same host. The next hop is the BLUE Tier-1 Service Router (SR). The Edge Firewall inspects the return traffic and forwards it locally to the BLUE Tier-1 DR before the traffic finally arrives back at vm1. The majority of the traffic is handled locally on EN1-TN. The bandwidth used between the physical hosts, and therefore the GENEVE-encapsulated traffic, is the same as without the Edge Firewall. But as everybody can imagine, an edge-node that hosts multiple Edge Firewalls for multiple tenants or any other centralized services should be sized accordingly.

Blog-Diagram-2.7.png

 

Hope you had a little bit of fun reading these two blogs. Feel free to share this blog!

 

Lab Software Details:

NSX-T: 2.3.0.0

vSphere: 6.5.0 Update 1 Build 5969303

vCenter:  6.5 Update 1d Build 2143838

 

Version 1.0 - 10.12.2018

Dear readers

This is the first blog of a series related to NSX-T. This first part provides you with a simple introduction to the most relevant information required to better understand the implications of centralized services in NSX-T. A centralized service could be, for example, a Load Balancer or an Edge Firewall.

 

NSX-T has the ability to do distributed routing and supports a distributed firewall. Distributed routing means that each host which is prepared for NSX-T can do local routing. From the logical view, this part is called a Distributed Router (DR). The DR is part of a Logical Router (LR), and this LR can be configured at the Tier-0 or at the Tier-1 level. Distributed routing is perfect for scale and can reduce the bandwidth utilization of each physical NIC on the host, as the routing decision is made on the local host. For example, when the source and destination VMs are located on the same host but connected to different IP subnets, and therefore attached to different overlay Logical Switches, the traffic never leaves the host. All traffic forwarding is processed on the host itself instead of in the physical network, for example on the ToR switch.

Each host which is prepared with NSX-T and attached to an NSX-T Transport Zone is called a Transport Node (TN). Transport Nodes implicitly have an N-VDS configured, which, for example, provides the GENEVE Tunnel Endpoint and is responsible for the distributed firewall processing. However, there are services like load balancing or edge firewalling which are not distributed services. VMware calls these services "centralized services". Centralized services instantiate a Service Router (SR), and this SR runs on an NSX-T edge-node (EN). An edge-node can be a VM or a bare metal server. Each edge-node is also a Transport Node (TN).

 

Let's now have a look at a simple two-tier NSX-T topology with a tenant BLUE and a tenant RED. Both have, for now, no centralized services enabled at the Tier-1 level. For the North-South connectivity to the physical world, there is already a centralized service instantiated at Tier-0. We don't want to focus on this North-South routing part, but as we later would like to understand what it means to have a centralized service configured on a Tier-1 Logical Router, it is important to understand this part as well, because North-South routing is also a centralized service. The diagram below shows the logical representation of a simple lab setup. This lab setup will later be used to instantiate a centralized service at a Tier-1 Logical Router.

Blog-Diagram-1.png

For those who would like to get a better understanding of the topology, I have included a diagram of the physical view below. In this lab, we actually use four ESXi hosts. For simplification, we focus in this blog on the ESXi hypervisor instead of KVM, even though we could build a similar lab with KVM too. On each of the two Transport Nodes ESX70A-TN and ESX71A-TN a VM is installed. The two other hosts ESX50A and ESX51A are NOT* prepared for NSX-T, but each of them hosts a single edge-node VM (EN1 and EN2). These two edge-nodes don't have to run on two different ESXi hosts, but it is recommended for redundancy reasons.

Blog-Diagram-2.png

As shown in the next diagram, we now combine the physical and logical views. The two Transport Nodes ESX70A-TN and ESX71A-TN have only DRs instantiated at the Tier-1 and Tier-0 levels, but no Service Router. That means the Logical Router consists of only a DR. These DRs at the Tier-1 level provide the gateway (.254) for the attached Logical Switch. The tenant BLUE uses VNI 17289 and the tenant RED uses VNI 17294. NSX-T assigns these VNIs out of a VNI pool (default pool: 5000 - 65535). The edge-node VMs, now shown as Edge Transport Nodes (EN1-TN and EN2-TN), have the same Tier-1 and Tier-0 DRs instantiated, but only the Tier-0 includes a Service Router (SR).

Blog-Diagram-1.3.png

The two Tier-1 Logical Routers, respectively their DRs, can only talk to each other via the green Tier-0 DR. But before you are able to attach the two Tier-1 DRs to a Tier-0 DR, a Tier-0 Logical Router is required. And a Tier-0 Logical Router mandates the assignment of an edge-cluster during its configuration. Let's assume at this point that we have already configured two edge-node VMs and that these edge-node VMs are assigned to an edge-cluster. A Tier-0 Logical Router always consists of a Distributed Router (DR) and, depending on the node type, also of a Service Router. A Service Router is always required for the Tier-0 Logical Router, as the Service Router is responsible for the routing connectivity to the physical world. But the Service Router is only instantiated on the edge-nodes. In this lab both Service Routers are configured on the two edge-nodes, respectively Edge Transport Nodes, in active/active mode to provide ECMP to the physical world.
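
A quick way to make this distinction visible is to list the logical router instances on the two node types. The sketch below is illustrative only; the prompts refer to the lab names used in this blog, the "#" parts are annotations, and the exact output depends on the NSX-T version.

en1> get logical-routers        # on an edge node: the Tier-0 DR and SR plus the two Tier-1 DRs
esx70a> get logical-routers     # on a compute host (via nsxcli): only DR instances, no SERVICE_ROUTER entries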

All the internal transit links, as shown in the diagram below, are automatically configured by NSX-T. The only task for the network administrator is to connect the Tier-1 DRs to the Tier-0 DR.

The northbound connection to the physical world further requires the configuration of an additional VLAN-based Transport Zone (or better, two Transport Zones for routing redundancy) plus the routing peering (typically eBGP). Below is the resulting logical network topology.

One probably asks: why does NSX-T instantiate the two Tier-1 DRs on each edge-node too? Well, this is required for optimized forwarding. As already mentioned, routing decisions are always made on the host where the traffic is sourced. Assume vm1 in tenant BLUE would like to talk to a server in the physical world. Traffic sourced at vm1 is forwarded to its local gateway on the Tier-1 DR and then to the Tier-0 DR on the same host. From the Tier-0 DR the traffic is then forwarded to the left Tier-0 SR on EN1-TN (let's assume the traffic is hashed accordingly) and the flow then reaches the external destination. The return traffic first reaches the Tier-0 SR on EN2-TN (let's assume again based on the hash), then the traffic is forwarded locally to the Tier-0 DR on the same Edge Transport Node and then to the Tier-1 DR in tenant BLUE. The traffic never leaves EN2-TN until it locally reaches the Logical Switch where vm1 is attached. This is what is called optimized forwarding, which is possible due to the distributed NSX-T architecture. The traffic needs to be forwarded only once over the physical data center infrastructure, and is therefore GENEVE-encapsulated only once per direction!

Blog-Diagram-1.4.png

For now we close this first blog. In the second blog we will dive into the instantiation of a centralized service at Tier-1. Hope you had a little bit of fun reading this first write-up.

 

 

 

 

*Today, NSX-T also supports running edge-node VMs on NSX-T prepared hosts. This capability is important for combining compute and edge-node services on the same host.

Version 1.0 - 19.11.2018

Version 1.1 - 27.11.2018 (minor changes)

Version 1.2 - 04.12.2018 (cosmetic changes)

Version 1.3 - 10.12.2018 (link for second blog added)