Dear readers

I was recently at a customer site, where we discussed the details of NSX-T north/south connectivity with active/active edge node virtual machines to maximize throughput and resiliency. Achieving the highest bandwidth from north to south, and vice versa, requires installing multiple edge nodes in active/active mode leveraging ECMP routing.

But let's first have a basic look at an NSX-T ECMP deployment.

In a typical deployment, the physical router is a Layer 3 leaf switch acting as the Top of Rack (ToR) device. Two of them are required to provide redundancy. NSX-T basically supports two edge node deployment options: active/standby and active/active. For maximum throughput and the highest level of resiliency, the active/active deployment option is the right choice. NSX-T is able to install up to eight paths leveraging ECMP routing. If you are already familiar with NSX-T, then you know that NSX-T requires the Service Router (SR) component on each individual edge node (VM or bare metal) to set up the BGP peering with the physical router. But have you ever thought about what eight ECMP path entries really mean? Are these eight paths counted on the Tier0 logical router, on the edge node itself, or somewhere else?

 

Before we talk about the eight ECMP paths, let us have a closer look at the physical setup. For this exercise I have only four ESXi hosts available in my lab. Each host is equipped with four 1Gbit/s pNICs. Two of these ESXi hosts are used purely to provide CPU and memory resources to the edge node VMs, and the other two ESXi hosts are prepared with NSX-T (NSX-T VIBs installed). The two "Edge" ESXi hosts have two vDS, each configured with two pNICs. The first vDS is used for vmk0 management, vMotion and IP storage; the second vDS is used for the Tunnel End Point (TEP) GENEVE-encapsulated traffic and the routed uplink traffic towards the ToR switches. The edge node VMs act as NSX-T transport nodes; they typically have two or three N-VDS embedded (a future release will support a single N-VDS per edge node). The two compute hosts are prepared with NSX-T; they also act as transport nodes, and they have a slightly different vSwitch setup. The first vSwitch is again a vDS with two pNICs and is used for vmk0 management, vMotion and IP storage. The other two pNICs are assigned to the NSX-T N-VDS, which is responsible for the TEP traffic. The diagram below shows the simplified physical setup.

Physical Host Representation-Version1.png

As you can easily see, the two "Edge" vSphere hosts have a total of eight edge node VMs installed. This is a purpose-built "Edge" vSphere cluster that serves edge node VMs only. Is this kind of deployment recommended in a real customer deployment? It depends :-)

Having four pNICs is probably a good choice, but 10Gbit/s or 25Gbit/s interfaces are most likely preferred, or even required, over 1Gbit/s interfaces for high-bandwidth throughput. When you host more than one edge node VM per ESXi host, I recommend using at least 25Gbit/s interfaces. As our focus is on maximizing throughput and resiliency, a customer deployment would likely have four or more ESXi hosts in the "Edge" vSphere cluster. Other aspects should be considered as well, such as the storage system used (e.g. vSAN), operational aspects (e.g. maintenance mode) or vSphere cluster settings. This lab uses "small" sized edge node VMs; a real deployment should use "large" sized edge node VMs where maximum throughput is required. A dedicated, purpose-built "Edge" vSphere cluster can be considered best practice when maximum throughput and the highest resiliency, along with operational simplicity, are required. Here are two additional screenshots of the edge node VM deployment in my lab.

Screen Shot 2019-08-06 at 06.06.20.png

Screen Shot 2019-08-06 at 06.14.38.png

 

Now that we have an idea of how the physical environment looks, it is time to move forward and dig into the logical routing design.

 

Multi Tier Logical Routing-Version1.png

To keep things simple, the diagram shows only a single compute transport node (NY-ESX70A) and only six of the eight edge node VMs. All eight edge node VMs are assigned to a single NSX-T edge cluster, and this edge cluster is assigned to the Tier0 logical router. The logical design shows a two-tier architecture with a Tier0 logical router and two Tier1 logical routers. This is a very common design. Centralized services are not deployed at the Tier1 level in this exercise. A Tier0 logical router consists in almost all cases (as you normally want to use static or dynamic routing to reach the physical world) of a Service Router (SR) and a Distributed Router (DR). Only an edge node can host the Service Router (SR). As already said, the Tier1 logical routers in this exercise have only the DR component instantiated; a Service Router (SR) is not required, as centralized services (e.g. a load balancer) are not configured. Each SR has two eBGP peerings with the physical routers. Please keep in mind that only the two overlay segments green-240 and blue-241 are user-configured segments. Workload VMs are attached to these overlay segments, which provide VM mobility across physical boundaries. The segment between the Tier0 SR and DR and the segments between the Tier0 DR and the Tier1 DRs are overlay segments configured automatically by NSX-T, including the IP address assignment.

By now you might have recognized that eight edge nodes equate to eight ECMP paths. Yes, this is true... but where are these eight ECMP paths installed in the routing table, or rather the forwarding table? These eight paths are installed neither on the logical construct of the Tier0 logical router nor on a single edge node. The eight ECMP paths are installed on the Tier0 DR component of each individual compute transport node, in our case on the NY-70A Tier0 DR and the NY-71A Tier0 DR. The CLI output below shows the forwarding table on the compute transport node NY-ESX70A.

 

IPv4 Forwarding Table NY-ESX70A Tier0 DR

NY-ESX70A> get logical-router e4a0be38-e1b6-458a-8fad-d47222d04875 forwarding ipv4
                                   Logical Routers Forwarding Table - IPv4
--------------------------------------------------------------------------------------------------------------
Flags Legend: [U: Up], [G: Gateway], [C: Connected], [I: Interface]
[H: Host], [R: Reject], [B: Blackhole], [F: Soft Flush], [E: ECMP]

                   Network                               Gateway                Type               Interface UUID
==============================================================================================================
0.0.0.0/0                                              169.254.0.2              UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f
0.0.0.0/0                                              169.254.0.3              UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f
0.0.0.0/0                                              169.254.0.4              UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f
0.0.0.0/0                                              169.254.0.5              UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f
0.0.0.0/0                                              169.254.0.6              UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f
0.0.0.0/0                                              169.254.0.7              UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f
0.0.0.0/0                                              169.254.0.8              UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f
0.0.0.0/0                                              169.254.0.9              UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f
100.64.48.0/31                                           0.0.0.0                UCI     03ae946a-bef4-45f5-a807-8e74fea878b6
100.64.48.2/31                                           0.0.0.0                UCI     923cbdaf-ad8a-45ce-9d9f-81d984c426e4
169.254.0.0/25                                           0.0.0.0                UCI     48d83fc7-1117-4a28-92c0-7cd7597e525f
--snip--
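If you want to verify the number of ECMP next hops without counting rows by hand, a small script can do it. The following is only a minimal sketch: the function name ecmp_next_hops is my own invention, the rows are copied from the output above with the column spacing condensed, and the parsing assumes the four-column layout shown here (it is not an official NSX-T tool or API). It also checks that all next hops fall into the connected 169.254.0.0/25 segment listed in the same table.

```python
import ipaddress

# Rows copied from the forwarding table above (column spacing condensed).
FORWARDING_TABLE = """\
0.0.0.0/0       169.254.0.2  UGE  48d83fc7-1117-4a28-92c0-7cd7597e525f
0.0.0.0/0       169.254.0.3  UGE  48d83fc7-1117-4a28-92c0-7cd7597e525f
0.0.0.0/0       169.254.0.4  UGE  48d83fc7-1117-4a28-92c0-7cd7597e525f
0.0.0.0/0       169.254.0.5  UGE  48d83fc7-1117-4a28-92c0-7cd7597e525f
0.0.0.0/0       169.254.0.6  UGE  48d83fc7-1117-4a28-92c0-7cd7597e525f
0.0.0.0/0       169.254.0.7  UGE  48d83fc7-1117-4a28-92c0-7cd7597e525f
0.0.0.0/0       169.254.0.8  UGE  48d83fc7-1117-4a28-92c0-7cd7597e525f
0.0.0.0/0       169.254.0.9  UGE  48d83fc7-1117-4a28-92c0-7cd7597e525f
100.64.48.0/31  0.0.0.0      UCI  03ae946a-bef4-45f5-a807-8e74fea878b6
100.64.48.2/31  0.0.0.0      UCI  923cbdaf-ad8a-45ce-9d9f-81d984c426e4
169.254.0.0/25  0.0.0.0      UCI  48d83fc7-1117-4a28-92c0-7cd7597e525f
"""


def ecmp_next_hops(table_text):
    """Return {prefix: [gateway, ...]} for rows whose flags contain E (ECMP)."""
    result = {}
    for line in table_text.splitlines():
        fields = line.split()
        # A data row has exactly four columns:
        # network, gateway, flags, interface UUID.
        if len(fields) != 4 or "/" not in fields[0]:
            continue
        prefix, gateway, flags, _uuid = fields
        if "E" in flags:
            result.setdefault(prefix, []).append(gateway)
    return result


paths = ecmp_next_hops(FORWARDING_TABLE)
default_hops = paths["0.0.0.0/0"]
transit = ipaddress.ip_network("169.254.0.0/25")
all_in_transit = all(ipaddress.ip_address(hop) in transit for hop in default_hops)

print(len(default_hops), "ECMP next hops for the default route")
print("all next hops inside", transit, ":", all_in_transit)
```

Only the default route carries the E flag, so it is the only prefix the function returns; the three connected (UCI) entries are ignored.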

Each compute transport node can distribute the south-to-north traffic sourced from the attached workload VMs across these eight paths (as we have eight different next hops), a single path per Service Router. With such an active/active ECMP deployment we can maximize the forwarding bandwidth from south to north. This is shown in the diagram below.

Multi Tier Logical Routing-South-to-North-Version1.png
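To get a feeling for how flows spread across the eight next hops, here is a small illustrative sketch. Please note the assumptions: this post does not describe the hash algorithm NSX-T actually uses, so the sketch simply hashes a flow 5-tuple with SHA-256 as a stand-in. The function name pick_next_hop, the workload VM address 172.16.240.11 and the destination 192.0.2.10 are hypothetical; only the next-hop addresses come from the forwarding table above.

```python
import hashlib
from collections import Counter

# The eight next hops from the forwarding table (169.254.0.2 to 169.254.0.9).
NEXT_HOPS = [f"169.254.0.{i}" for i in range(2, 10)]


def pick_next_hop(src_ip, dst_ip, proto, src_port, dst_port, next_hops=NEXT_HOPS):
    """Pick a next hop from a flow 5-tuple.

    Assumption: SHA-256 modulo the number of paths is only a stand-in for
    whatever hash the real data path uses.
    """
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


# 2000 hypothetical flows from one workload VM, differing only in source port.
flows = [("172.16.240.11", "192.0.2.10", "tcp", 40000 + i, 443) for i in range(2000)]
usage = Counter(pick_next_hop(*flow) for flow in flows)

for hop in NEXT_HOPS:
    print(hop, usage[hop])
```

The point of the sketch is the per-flow behaviour: packets of the same flow always take the same next hop (so a single flow never exceeds the capacity of one edge node), while many different flows are spread over all eight Service Routers.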

In the other direction, from north to south, each ToR switch has eight paths installed (indicated with "multipath") to reach the destination networks green-240 and blue-241. The ToR switch will distribute the traffic from the physical world across all eight next hops. Here, too, we achieve maximum throughput from north to south. Let's have a look at the BGP table of the two ToR switches for the destination network green-240.

 

BGP Table for "green" prefix 172.16.240.0/24 on RouterA and RouterB

NY-CAT3750G-A#show ip bgp 172.16.240.0/24

BGP routing table entry for 172.16.240.0/24, version 189
Paths: (9 available, best #8, table Default-IP-Routing-Table)
Multipath: eBGP
Flag: 0x1800
  Advertised to update-groups:
     1          2
  64513
    172.16.160.20 from 172.16.160.20 (172.16.160.20)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.160.22 from 172.16.160.22 (172.16.160.22)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.160.23 from 172.16.160.23 (172.16.160.23)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.160.21 from 172.16.160.21 (172.16.160.21)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.160.27 from 172.16.160.27 (172.16.160.27)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.160.26 from 172.16.160.26 (172.16.160.26)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.160.25 from 172.16.160.25 (172.16.160.25)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.160.24 from 172.16.160.24 (172.16.160.24)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath, best
  64513
    172.16.3.11 (metric 11) from 172.16.3.11 (172.16.3.11)
      Origin incomplete, metric 0, localpref 100, valid, internal
NY-CAT3750G-A#

NY-CAT3750G-B#show ip bgp 172.16.240.0/24

BGP routing table entry for 172.16.240.0/24, version 201
Paths: (9 available, best #9, table Default-IP-Routing-Table)
Multipath: eBGP
Flag: 0x1800
  Advertised to update-groups:
     1          2
  64513
    172.16.161.20 from 172.16.161.20 (172.16.160.20)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.161.23 from 172.16.161.23 (172.16.160.23)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.161.21 from 172.16.161.21 (172.16.160.21)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.161.26 from 172.16.161.26 (172.16.160.26)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.161.22 from 172.16.161.22 (172.16.160.22)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.161.27 from 172.16.161.27 (172.16.160.27)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.161.25 from 172.16.161.25 (172.16.160.25)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.3.10 (metric 11) from 172.16.3.10 (172.16.3.10)
      Origin incomplete, metric 0, localpref 100, valid, internal
  64513
    172.16.161.24 from 172.16.161.24 (172.16.160.24)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath, best
NY-CAT3750G-B#
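Both ToR switches report nine available paths, but only eight of them carry the "multipath" attribute; the ninth is an internal (iBGP) path, presumably learned from the other ToR switch. If you prefer to check this with a script, the following minimal sketch parses the RouterA output shown above. The function name parse_paths and the regular expression are my own and only assume the text layout of this particular output; this is not a general-purpose IOS parser.

```python
import re

# Path entries copied from the RouterA output above.
BGP_OUTPUT = """\
BGP routing table entry for 172.16.240.0/24, version 189
Paths: (9 available, best #8, table Default-IP-Routing-Table)
  64513
    172.16.160.20 from 172.16.160.20 (172.16.160.20)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.160.22 from 172.16.160.22 (172.16.160.22)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.160.23 from 172.16.160.23 (172.16.160.23)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.160.21 from 172.16.160.21 (172.16.160.21)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.160.27 from 172.16.160.27 (172.16.160.27)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.160.26 from 172.16.160.26 (172.16.160.26)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.160.25 from 172.16.160.25 (172.16.160.25)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.160.24 from 172.16.160.24 (172.16.160.24)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath, best
  64513
    172.16.3.11 (metric 11) from 172.16.3.11 (172.16.3.11)
      Origin incomplete, metric 0, localpref 100, valid, internal
"""

IP = r"\d+\.\d+\.\d+\.\d+"
# Matches "<next hop> [(metric N)] from <neighbor> (<router id>)".
NEIGHBOR = re.compile(rf"^({IP})(?: \(metric \d+\))? from ({IP}) \(({IP})\)$")


def parse_paths(text):
    """Return one dict per BGP path with its next hop and attribute list."""
    paths = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = NEIGHBOR.match(line)
        if match:
            current = {"next_hop": match.group(1), "attributes": []}
            paths.append(current)
        elif line.startswith("Origin") and current is not None:
            current["attributes"] = [item.strip() for item in line.split(",")]
    return paths


paths = parse_paths(BGP_OUTPUT)
multipath = [p["next_hop"] for p in paths if "multipath" in p["attributes"]]
internal = [p["next_hop"] for p in paths if "internal" in p["attributes"]]

print("paths:", len(paths))
print("multipath next hops:", sorted(multipath))
print("internal next hops:", internal)
```

The eight multipath next hops are 172.16.160.20 through 172.16.160.27, one per edge node, which matches the eight Service Routers on the NSX-T side.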

 

Traffic arriving at the Service Router (SR) from the ToR switches is kept local to the edge node until it is forwarded (GENEVE-encapsulated) to the destination VM. This is shown in the diagram below.

Multi Tier Logical Routing-North-to-South-Version1.png

And what is the final conclusion of this little lab exercise?

Each Service Router on an edge node provides exactly one next hop to each individual compute transport node. The number of BGP peerings per edge node VM is not relevant for the eight ECMP paths; the number of edge nodes is. Theoretically, a single eBGP peering from each edge node would achieve the same number of ECMP paths. But please keep in mind that two BGP sessions per edge node provide better resiliency. I hope you had a little bit of fun reading this NSX-T ECMP edge node write-up.

 

Software Inventory:

vSphere version: 6.5.0, build 13635690

vCenter version: 6.5.0, build 10964411

NSX-T version: 2.4.1.0.0.13716575

 

Blog history

Version 1.0 - 06.08.2019 - first published version

Version 1.1 - 19.08.2019 - minor changes