NSX-T N-VDS VLAN Pinning

Posted by oziltener Aug 19, 2019

Dear readers

As you are probably aware, NSX-T uses its own vSwitch called the N-VDS. The N-VDS is primarily used to encapsulate and decapsulate GENEVE overlay traffic between NSX-T transport nodes, along with supporting the distributed Firewall (dFW) for micro-segmentation. The N-VDS requires its own dedicated pNIC interfaces; these pNICs cannot be shared with vSphere vSwitches (vDS or vSS). In a typical NSX-T deployment, each transport node has one or two Tunnel End Points (TEPs) to terminate the GENEVE overlay traffic. The number of TEPs is directly related to the attached Uplink Profile. If you use an uplink teaming policy of "Failover", only a single TEP is used. With a teaming policy of "Load Balance Source", each physical NIC gets a TEP assigned. Such a "Load Balance Source" Uplink Profile is shown below and will be used for this lab exercise.

Screen Shot 2019-08-19 at 20.07.00.png

The mapping of the "Uplinks" is as follows:

  • ActiveUplink1 is the pNIC (vmnic2) connected to ToR switch NY-CAT3750G-A
  • ActiveUplink2 is the pNIC (vmnic3) connected to ToR switch NY-CAT3750G-B

 

Additionally, you can see that VLAN 150 is used to carry the GENEVE-encapsulated traffic.
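For reference, here is a rough sketch of how such an uplink profile could also be created through the NSX-T manager API with Python. This is a minimal sketch only: the manager address, credentials and profile name are placeholders, and the payload fields reflect my reading of the UplinkHostSwitchProfile schema, not a capture from this lab.

# Hedged sketch: "Load Balance Source" uplink profile via POST /api/v1/host-switch-profiles.
# NSX, AUTH and the display_name are placeholders, not the lab's real values.
import requests

NSX = "https://nsx-manager.lab.local"
AUTH = ("admin", "password")                        # replace with real credentials

uplink_profile = {
    "resource_type": "UplinkHostSwitchProfile",
    "display_name": "NY-UPLINK-PROFILE",            # hypothetical name
    "transport_vlan": 150,                          # VLAN 150 carries the GENEVE traffic
    "teaming": {                                    # default teaming policy
        "policy": "LOADBALANCE_SRCID",              # "Load Balance Source"
        "active_list": [
            {"uplink_name": "ActiveUplink1", "uplink_type": "PNIC"},
            {"uplink_name": "ActiveUplink2", "uplink_type": "PNIC"},
        ],
    },
}

resp = requests.post(f"{NSX}/api/v1/host-switch-profiles",
                     json=uplink_profile, auth=AUTH, verify=False)
resp.raise_for_status()
print("created uplink profile", resp.json()["id"])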

 

However, the N-VDS can also be used for VLAN-based segments. VLAN-based segments are very similar to vDS portgroups. In deployments where your hosts have only two pNICs and both pNICs are used for the N-VDS (yes, for redundancy reasons), you have to use VLAN-based segments to carry the VMkernel interfaces (e.g. mgmt, vMotion or vSAN). When your VLAN-based segments carry VMkernel interface traffic and you use an Uplink Profile as shown above, it is difficult to figure out on which pNIC the VMkernel traffic is carried, as this traffic follows the default teaming policy, in our case "Load Balance Source". Please note, VLAN-based segments are not limited to VMkernel traffic; such segments can also carry regular virtual machine traffic.

 

There are often good reasons to do traffic steering and get a predictable traffic flow behavior; for example, you would like to transport Management and vMotion VMkernel traffic under normal conditions (all physical links are up) on pNIC_A and vSAN on pNIC_B. The top two reasons are:

1.) predict the forwarding traffic pattern under normal conditions (all links are up) and align, for example, the VMkernel traffic with the active first hop gateway (e.g. the HSRP active router)

2.) reduce ISL traffic between the two ToR switches or ToR-to-spine traffic for high-load traffic (e.g. vSAN or vMotion), along with predictable and low-latency traffic forwarding (assume, for example, you have 20 hosts in a single rack and all hosts use the left ToR switch for vSAN; in such a situation the ISL does not carry vSAN traffic)

 

This is where NSX-T "VLAN Pinning" comes into the game. The term "VLAN Pinning" is referred to in the NSX-T public documentation as "Named Teaming Policy"; personally, I like the term "VLAN Pinning". In the lab exercise for this blog, I would like to show you how you can configure "VLAN Pinning". The physical lab setup looks like the diagram below:

Physical Host Representation-Version1.png

Only host NY-ESX72A is relevant for this exercise. This host is attached to two Top of Rack (ToR) Layer 3 switches, called NY-CAT3750G-A and NY-CAT3750G-B. As you see, this host has four pNICs (vmnic0...3), but only vmnic2 and vmnic3, which are assigned to the N-VDS, are relevant for this lab exercise. On host NY-ESX72A, I have created three additional "artificial/dummy" VMkernel interfaces (vmk3, vmk4, vmk5). Each of the three VMkernel interfaces is assigned to a dedicated NSX-T VLAN-based segment. The diagram below shows the three VMkernel interfaces, all attached to a dedicated VLAN-based segment owned by the N-VDS (NY-NVDS), and the MAC address of vmk3 as an example.

Screen Shot 2019-08-19 at 21.00.26.png

 

The simplified logical setup is shown below:

Logical Representation-default-teaming-Version1.png

 

 

From the NSX-T perspective, we have configured three VLAN-based segments. These VLAN-based segments are created with the new policy UI/API.

NSX-T-VLAN-Segments-red-marked.png

The policy UI/API is the new interface since NSX-T 2.4.0 and is the preferred interface for the majority of NSX-T deployments. The "legacy" UI/API is still available and is visible in the UI under the tab "Advanced Networking & Security".
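Such a VLAN-based segment can also be created programmatically. Below is a minimal, hedged sketch using the policy API's PATCH /policy/api/v1/infra/segments/<segment-id> call; the segment name is taken from the screenshots later in this post, while the VLAN ID and the transport zone UUID are assumptions you would replace with the lab's real values.

# Hedged sketch: create/update one VLAN-based segment via the policy API.
import requests

NSX = "https://nsx-manager.lab.local"
AUTH = ("admin", "password")
TZ_PATH = "/infra/sites/default/enforcement-points/default/transport-zones/<vlan-tz-uuid>"

segment = {
    "display_name": "NY-VLAN-SEGMENT-90",
    "vlan_ids": ["90"],                      # assumed VLAN ID for this segment
    "transport_zone_path": TZ_PATH,          # VLAN transport zone of the NY-NVDS
}

resp = requests.patch(f"{NSX}/policy/api/v1/infra/segments/NY-VLAN-SEGMENT-90",
                      json=segment, auth=AUTH, verify=False)
resp.raise_for_status()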

 

As already mentioned, the three VLAN-based segments use the default teaming policy (Load Balance Source), so the VMkernel traffic is distributed over the two pNICs (vmnic2 or vmnic3). Hence, we typically cannot predict which of the ToR switches will learn the MAC address of each of the three individual VMkernel interfaces. Before we move forward and configure "VLAN Pinning", let's see how the traffic of the three VMkernel interfaces is distributed. One of the easiest ways is to check the MAC address table of the two ToR switches for interface Gi1/0/10.

Screen Shot 2019-08-19 at 20.53.59.png

As you can see, NY-CAT3750G-A is learning only the MAC address of vmk3 (0050.5663.f4eb), whereas NY-CAT3750G-B is learning the MAC addresses of vmk4 (0050.5667.50eb) and vmk5 (0050.566d.410d). With the default teaming option "Load Balance Source", the administrator has no option to steer the traffic. Please ignore the two learned MAC addresses in VLAN 150; these are TEP MAC addresses.
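To illustrate why the placement is opaque: "Load Balance Source" pins each source (a VMkernel interface or a vNIC) to one uplink based on its virtual port ID. The toy Python sketch below is not the real ESXi hash, and the port IDs are made up; it only shows that the mapping is deterministic but outside the administrator's control.

# Toy illustration of source-port-ID based uplink selection (not the real ESXi hash).
uplinks = ["vmnic2", "vmnic3"]                 # ActiveUplink1 / ActiveUplink2

def chosen_uplink(virtual_port_id: int) -> str:
    """Hypothetical stand-in for the vSwitch's source-port hash."""
    return uplinks[virtual_port_id % len(uplinks)]

# Hypothetical port IDs -- the point is that you do not choose them.
for vmk, port_id in {"vmk3": 416, "vmk4": 417, "vmk5": 419}.items():
    print(vmk, "->", chosen_uplink(port_id))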

 

Before we configure VLAN Pinning, let's assume we would like vmk3 and vmk4 to be learned on NY-CAT3750G-A and vmk5 on NY-CAT3750G-B (when all links are up). We will use two new "Named Teaming Policies" with failover. The traffic flows should look like the diagram below --> a dotted line means "standby link".

Logical Representation-vlan-pinning-teaming-Version1.png

The first step is to create two additional "Named Teaming Policies". Please compare this diagram with the very first diagram above, and be sure you use the identical uplink names (ActiveUplink1 and ActiveUplink2) as for the default teaming policy.

Edit-Uplink-Profile.png
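If you prefer the API, the same step could look roughly like the sketch below: the two policies are appended to the uplink profile's named_teamings list and the profile is written back. The policy names ("Pin-To-Uplink1"/"Pin-To-Uplink2") and the profile UUID are placeholders, not necessarily what I used in the UI.

# Hedged sketch: add two failover-order named teaming policies to the uplink profile.
import requests

NSX = "https://nsx-manager.lab.local"
AUTH = ("admin", "password")
PROFILE_ID = "<uplink-profile-uuid>"           # placeholder

profile = requests.get(f"{NSX}/api/v1/host-switch-profiles/{PROFILE_ID}",
                       auth=AUTH, verify=False).json()

profile["named_teamings"] = [
    {"name": "Pin-To-Uplink1",                 # hypothetical policy name
     "policy": "FAILOVER_ORDER",
     "active_list":  [{"uplink_name": "ActiveUplink1", "uplink_type": "PNIC"}],
     "standby_list": [{"uplink_name": "ActiveUplink2", "uplink_type": "PNIC"}]},
    {"name": "Pin-To-Uplink2",                 # hypothetical policy name
     "policy": "FAILOVER_ORDER",
     "active_list":  [{"uplink_name": "ActiveUplink2", "uplink_type": "PNIC"}],
     "standby_list": [{"uplink_name": "ActiveUplink1", "uplink_type": "PNIC"}]},
]

resp = requests.put(f"{NSX}/api/v1/host-switch-profiles/{PROFILE_ID}",
                    json=profile, auth=AUTH, verify=False)   # full object incl. _revision
resp.raise_for_status()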

 

The second step is to make these two new "Named Teaming Policies" available for the associated VLAN transport zone (TZ).

Edit-TZ-for-vlan-pinning.png
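Again as an API sketch, and assuming the transport zone exposes the named teaming policies through an uplink_teaming_policy_names attribute, this step could look like the following (the TZ UUID is a placeholder):

# Hedged sketch: publish the named teaming policies on the VLAN transport zone.
import requests

NSX = "https://nsx-manager.lab.local"
AUTH = ("admin", "password")
TZ_ID = "<vlan-transport-zone-uuid>"           # placeholder

tz = requests.get(f"{NSX}/api/v1/transport-zones/{TZ_ID}", auth=AUTH, verify=False).json()
tz["uplink_teaming_policy_names"] = ["Pin-To-Uplink1", "Pin-To-Uplink2"]   # names from the uplink profile

resp = requests.put(f"{NSX}/api/v1/transport-zones/{TZ_ID}", json=tz, auth=AUTH, verify=False)
resp.raise_for_status()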

The third and last step is to edit the three VLAN-based segments according to your traffic steering policy. As you can see, we unfortunately need to edit the VLAN-based segments in the "legacy" "Advanced Networking & Security" UI section. We plan to make this editing option available in the new policy UI/API in one of the future NSX-T releases.

NY-VLAN-SEGMENT-90.png

NY-VLAN-SEGMENT-91.png

NY-VLAN-SEGMENT-92.png
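The same edit can be sketched against the "legacy" manager API, assuming the logical switch object carries the teaming policy in an uplink_teaming_policy_name attribute; the logical switch UUID below is a placeholder:

# Hedged sketch: pin one VLAN-based segment to a named teaming policy.
import requests

NSX = "https://nsx-manager.lab.local"
AUTH = ("admin", "password")
LS_ID = "<logical-switch-uuid>"                # placeholder for e.g. NY-VLAN-SEGMENT-90

ls = requests.get(f"{NSX}/api/v1/logical-switches/{LS_ID}", auth=AUTH, verify=False).json()
ls["uplink_teaming_policy_name"] = "Pin-To-Uplink1"    # steer this segment to ActiveUplink1 (vmnic2)

resp = requests.put(f"{NSX}/api/v1/logical-switches/{LS_ID}", json=ls, auth=AUTH, verify=False)
resp.raise_for_status()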

As soon as you edit the VLAN-based segments with the new "Named Teaming Policies", the ToR switches will immediately learn the MAC addresses on the associated physical interfaces.

After applying "VLAN Pinning" through the two new "Named Teaming Policies", the two ToR switches learn the MAC addresses in the following way:

Catalyst-MAC-table-with-vlan-pinning.png

As you can see, NY-CAT3750G-A is now learning the MAC addresses of vmk3 and vmk4, whereas NY-CAT3750G-B is learning only the MAC address of vmk5.

I hope you had a bit of fun reading this NSX-T VLAN Pinning write-up.

 

 

Software Inventory:

vSphere version: 6.5.0, build 13635690

vCenter version: 6.5.0, build 10964411

NSX-T version: 2.4.1.0.0.13716575

 

Blog history

Version 1.0 - 19.08.2019 - first published version

NSX-T ECMP Edge Nodes

Dear readers

I was recently at a customer site, where we discussed the details of NSX-T north/south connectivity with active/active edge node virtual machines to maximize throughput and resiliency. Achieving the highest north-to-south (and vice versa) bandwidth requires the installation of multiple edge nodes in active/active mode leveraging ECMP routing.

But let's first have a basic view of an NSX-T ECMP deployment.

In a typical deployment, the physical router is a Layer 3 leaf switch acting as Top of Rack (ToR) device; two of them are required to provide redundancy. NSX-T basically supports two edge node deployment options: active/standby and active/active. For maximizing throughput and the highest level of resiliency, the active/active deployment option is the right choice. NSX-T is able to install up to eight paths leveraging ECMP routing. As you are most likely already familiar with NSX-T, you know that NSX-T requires the Service Router (SR) component on each individual edge node (VM or Bare Metal) to set up the BGP peering with the physical router. But have you ever thought about what eight ECMP path entries really mean? Are these eight paths counted on the Tier0 logical router, on the edge node itself, or somewhere else?

 

Before we talk about the eight ECMP paths, let us have a closer look at the physical setup. For this exercise I have only four ESXi hosts available in my lab. Each host is equipped with four 1Gbit/s pNICs. Two of these ESXi hosts are purely used to provide CPU and memory resources to the edge node VMs, and the other two ESXi hosts are prepared with NSX-T (NSX-T VIBs installed). The two "Edge" ESXi hosts have two vDS, each configured with two pNICs. The first vDS is used for vmk0 management, vMotion and IP storage; the second vDS is used for the Tunnel End Point (TEP) GENEVE-encapsulated traffic and the routed uplink traffic towards the ToR switches. The edge node VMs act as NSX-T transport nodes; they typically have two or three N-VDS embedded (a future release will support a single N-VDS per edge node). The two compute hosts are prepared with NSX-T, they also act as transport nodes, and they have a slightly different vSwitch setup. The first vSwitch is again a vDS with two pNICs and is used for vmk0 management, vMotion and IP storage. The other two pNICs are assigned to the NSX-T N-VDS and are responsible for the TEP traffic. The diagram below shows the simplified physical setup.

Physical Host Representation-Version1.png

As you can easily see, the two "Edge" vSphere hosts have a total of eight edge node VMs installed. This is a purpose-built "Edge" vSphere cluster that serves edge node VMs only. Is this kind of deployment recommended in a real customer deployment? It depends :-)

Having four pNICs is probably a good choice, but 10Gbit/s or 25Gbit/s interfaces are most likely preferred, or even required, instead of 1Gbit/s interfaces for high bandwidth throughput. When you host more than one edge node VM per ESXi host, I recommend using at least 25Gbit/s interfaces. As our focus is on maximizing throughput and resiliency, a customer deployment would likely have four or more ESXi hosts in the "Edge" vSphere cluster. Other aspects should be considered as well, such as the storage system used (e.g. vSAN), operational aspects (e.g. maintenance mode) or vSphere cluster settings. For this lab, "small" sized edge node VMs are used; real deployments should use "large" sized edge node VMs where maximal throughput is required. A dedicated purpose-built "Edge" vSphere cluster can be considered a best practice when maximal throughput and the highest resiliency along with operational simplification are required. Here are two additional diagrams of the edge node VM deployment in my lab.

Screen Shot 2019-08-06 at 06.06.20.png

Screen Shot 2019-08-06 at 06.14.38.png

 

Now that we have an idea of how the physical environment looks, it is time to move forward and dig into the logical routing design.

 

Multi Tier Logical Routing-Version1.png

To keep things simple, the diagram shows only a single compute transport node (NY-ESX70A) and only six of the eight edge node VMs. All eight edge node VMs are assigned to a single NSX-T edge cluster, and this edge cluster is assigned to the Tier0 logical router. The logical design shows a two-tier architecture with a Tier0 logical router and two Tier1 logical routers; this is a very common design. Centralized services are not deployed at the Tier1 level in this exercise. A Tier0 logical router consists in almost all cases (as you normally want to use static or dynamic routing to reach the physical world) of a Service Router (SR) and a Distributed Router (DR). Only the edge node VM can host the Service Router (SR). As already said, the Tier1 logical routers have in this exercise only the DR component instantiated; a Service Router (SR) is not required, as centralized services (e.g. Load Balancer) are not configured. Each SR has two eBGP peerings with the physical routers. Please keep in mind, only the two overlay segments green-240 and blue-241 are user-configured segments. Workload VMs are attached to these overlay segments, and these overlay segments provide VM mobility across physical boundaries. The segments between the Tier0 SRs and the DR and the segments between the Tier0 DR and the Tier1 DRs are automatically configured overlay segments through NSX-T, including the IP addressing assignment.

Meanwhile, you might have already recognized that eight edge nodes might equal eight ECMP paths. Yes, this is true... but where are these eight ECMP paths installed in the routing respectively forwarding table? These eight paths are installed neither on the Tier0 logical router as a logical construct nor on a single edge node. The eight ECMP paths are installed on the Tier0 DR component of each individual compute transport node, in our case on the NY-ESX70A Tier0 DR and the NY-ESX71A Tier0 DR. The CLI output below shows the forwarding table on the compute transport node NY-ESX70A.

 

IPv4 Forwarding Table NY-ESX70A Tier0 DR

NY-ESX70A> get logical-router e4a0be38-e1b6-458a-8fad-d47222d04875 forwarding ipv4

                     Logical Routers Forwarding Table - IPv4
--------------------------------------------------------------------------------------------------------------
Flags Legend: [U: Up], [G: Gateway], [C: Connected], [I: Interface]
[H: Host], [R: Reject], [B: Blackhole], [F: Soft Flush], [E: ECMP]

Network              Gateway          Type    Interface UUID
==============================================================================================================
0.0.0.0/0            169.254.0.2      UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f
0.0.0.0/0            169.254.0.3      UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f
0.0.0.0/0            169.254.0.4      UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f
0.0.0.0/0            169.254.0.5      UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f
0.0.0.0/0            169.254.0.6      UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f
0.0.0.0/0            169.254.0.7      UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f
0.0.0.0/0            169.254.0.8      UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f
0.0.0.0/0            169.254.0.9      UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f
100.64.48.0/31       0.0.0.0          UCI     03ae946a-bef4-45f5-a807-8e74fea878b6
100.64.48.2/31       0.0.0.0          UCI     923cbdaf-ad8a-45ce-9d9f-81d984c426e4
169.254.0.0/25       0.0.0.0          UCI     48d83fc7-1117-4a28-92c0-7cd7597e525f
--snip--

Each compute transport node can distribute the traffic sourced from the attached workload VMs from south to north across these eight paths (as we have eight different next hops), a single path per Service Router. With such an active/active ECMP deployment we can maximize the forwarding bandwidth from south to north. This is shown in the diagram below.

Multi Tier Logical Routing-South-to-North-Version1.png
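To make the behaviour concrete, here is a toy Python sketch of the per-flow ECMP decision the Tier0 DR has to take: with eight equal default routes, each flow is hashed onto one of the eight SR next hops from the forwarding table above. The hash function is purely illustrative, not the actual ESXi implementation.

# Toy illustration of per-flow ECMP next-hop selection over the eight SR next hops.
import hashlib

NEXT_HOPS = [f"169.254.0.{i}" for i in range(2, 10)]   # one next hop per Service Router

def ecmp_next_hop(src_ip, dst_ip, proto, src_port, dst_port):
    key = f"{src_ip}/{dst_ip}/{proto}/{src_port}/{dst_port}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return NEXT_HOPS[digest % len(NEXT_HOPS)]

# Packets of the same flow always hit the same edge node; different flows spread out.
print(ecmp_next_hop("172.16.240.11", "192.0.2.10", 6, 51000, 443))
print(ecmp_next_hop("172.16.241.12", "192.0.2.10", 6, 52000, 443))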

On the other hand, from north to south, each ToR switch has eight paths installed (indicated with "multipath", which requires eBGP multipath to be enabled on the ToR switches) to reach the destination networks green-240 and blue-241. The ToR switch distributes the traffic from the physical world over all of the eight next hops, so here as well we achieve maximum throughput from north to south. Let's have a look at the routing table of the two ToR switches for the destination network green-240.

 

BGP Table for "green" prefix 172.16.240.0/24 on RouterA and RouterB

NY-CAT3750G-A#show ip bgp 172.16.240.0/0
BGP routing table entry for 172.16.240.0/24, version 189
Paths: (9 available, best #8, table Default-IP-Routing-Table)
Multipath: eBGP
Flag: 0x1800
  Advertised to update-groups:
     1          2
  64513
    172.16.160.20 from 172.16.160.20 (172.16.160.20)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.160.22 from 172.16.160.22 (172.16.160.22)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.160.23 from 172.16.160.23 (172.16.160.23)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.160.21 from 172.16.160.21 (172.16.160.21)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.160.27 from 172.16.160.27 (172.16.160.27)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.160.26 from 172.16.160.26 (172.16.160.26)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.160.25 from 172.16.160.25 (172.16.160.25)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.160.24 from 172.16.160.24 (172.16.160.24)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath, best
  64513
    172.16.3.11 (metric 11) from 172.16.3.11 (172.16.3.11)
      Origin incomplete, metric 0, localpref 100, valid, internal
NY-CAT3750G-A#

NY-CAT3750G-B#show ip bgp 172.16.240.0/0
BGP routing table entry for 172.16.240.0/24, version 201
Paths: (9 available, best #9, table Default-IP-Routing-Table)
Multipath: eBGP
Flag: 0x1800
  Advertised to update-groups:
     1          2
  64513
    172.16.161.20 from 172.16.161.20 (172.16.160.20)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.161.23 from 172.16.161.23 (172.16.160.23)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.161.21 from 172.16.161.21 (172.16.160.21)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.161.26 from 172.16.161.26 (172.16.160.26)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.161.22 from 172.16.161.22 (172.16.160.22)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.161.27 from 172.16.161.27 (172.16.160.27)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.161.25 from 172.16.161.25 (172.16.160.25)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath
  64513
    172.16.3.10 (metric 11) from 172.16.3.10 (172.16.3.10)
      Origin incomplete, metric 0, localpref 100, valid, internal
  64513
    172.16.161.24 from 172.16.161.24 (172.16.160.24)
      Origin incomplete, metric 0, localpref 100, valid, external, multipath, best
NY-CAT3750G-B#

 

Traffic arriving at the Service Router (SR) from the ToR switches is kept locally on the edge node before it is forwarded, GENEVE-encapsulated, to the destination VM. This is shown in the next diagram below.

Multi Tier Logical Routing-North-to-South-Version1.png

And what is the final conclusion of this little lab exercise?

Each single Service Router on an edge node provides exactly one next hop to each individual compute transport node. The number of BGP peerings per edge node VM is not relevant for the eight ECMP paths; the number of edge nodes is what matters. Theoretically, a single eBGP peering from each edge node would achieve the same number of ECMP paths, but please keep in mind that two BGP sessions per edge node provide better resiliency. I hope you had a bit of fun reading this NSX-T ECMP edge node write-up.

 

Software Inventory:

vSphere version: 6.5.0, build 13635690

vCenter version: 6.5.0, build 10964411

NSX-T version: 2.4.1.0.0.13716575

 

Blog history

Version 1.0 - 06.08.2019 - first published version

Version 1.1 - 19.08.2019 - minor changes