
Dear readers

Welcome to a new blog post about a specific NSX-T Edge Node VM deployment with only a single Edge Node N-VDS. You may have seen the 2019 VMworld session "Next-Generation Reference Design with NSX-T: Part 1" (CNET2061BU or CNET2061BE) from Nimish Desai. On one of his slides he mentions how we can deploy a single NSX-T Edge Node N-VDS instead of three Edge Node N-VDS switches. This new approach (available since NSX-T 2.5 for the Edge Node VM) with a single Edge Node N-VDS has the following advantages:

  • Multiple TEPs to load balance overlay traffic for different overlay segments
  • Same NSX-T Edge Node N-VDS design for VM-based and Bare Metal Edge Nodes (with 2 pNICs)
  • Only two Transport Zones (Overlay & VLAN) assigned to a single N-VDS

The diagram below shows the slide with a single Edge Node N-VDS from one of the VMware sessions (CNET2061BU):

Edge Support with Multi-TEP-Nimish-Desai-VM.png

However, the single NSX-T Edge Node N-VDS design comes with additional requirements and recommendations:

  • vDS trunk port group configuration to carry multiple VLANs (requirement)
  • VLAN pinning for deterministic North/South flows (recommendation)

This blog talks mainly about the second bullet point and how we can achieve correct VLAN pinning. Correct VLAN pinning requires multiple individual configuration steps at different levels, for example the vDS trunk port group teaming or the N-VDS named teaming policy configuration. The goal behind this VLAN pinning is a deterministic end-to-end path.

When configured correctly, the BGP session is aligned with the data forwarding path, and hence the MAC addresses of the Tier-0 Gateway Layer 3 interfaces (LIFs) are learnt only on the expected ToR/Leaf switch trunk interfaces.

 

In this blog post the NSX-T Edge Node VMs are deployed on ESXi hosts which are NOT prepared for NSX-T. The two ESXi hosts belong to a single vSphere Cluster used exclusively for NSX-T Edge Node VMs. There are a few good reasons NOT to prepare these ESXi hosts with NSX-T when they host only NSX-T Edge Node VMs:

  • It is not required
  • Better NSX-T upgradability (you don't need to evacuate the NSX-T Edge Node VMs with vMotion when a host enters maintenance mode for an NSX-T software upgrade; every vMotion of an NSX-T Edge Node VM causes a short, unnecessary data plane glitch)
  • Shorter NSX-T upgrade cycles (for every NSX-T upgrade you only need to upgrade the ESXi hosts used for the payload VMs and the NSX-T Edge Node VMs themselves, but not the ESXi hosts where the Edge Nodes are deployed)
  • vSphere HA can be turned off (do we want to move a highly loaded packet forwarding node with vMotion in a vSphere HA event? No, I don't think so; the NSX-T routing HA model is much quicker)
  • Simplified DRS settings (do we want to move an NSX-T Edge Node with vMotion to balance the resources?)
  • Typically a resource pool is not required

We should never underestimate how important smooth upgrade cycles are. Upgrade cycles are time-consuming events and are typically required multiple times per year.

Leaving the ESXi hosts unprepared for NSX-T is considered best practice and should be used in any NSX-T deployment that can afford a dedicated vSphere Cluster exclusively for NSX-T Edge Node VMs. Installing NSX-T on the ESXi hosts where the NSX-T Edge Node VMs are deployed (a so-called collapsed design) is appropriate for customers with a low number of ESXi hosts who want to keep CAPEX low.

 

The diagram below shows the lab test bed of a single ESXi host with a single Edge Node appliance which uses only a single N-VDS. The relevant configuration steps are marked with 1 to 4.

Networking – NSX-T Edge Topology.png

 

The NSX-T Edge Node VM is configured with two transport zones. The same overlay transport zone is used for the compute ESXi hosts where I host the payload VMs. Both transport zones are assigned to a single N-VDS, called NY-HOST-NVDS. The name of the N-VDS might be a little confusing, but the same NY-HOST-NVDS is used for all compute ESXi hosts prepared with NSX-T; it indicates that only a single N-VDS is required, independent of whether it runs on an Edge Node or a compute ESXi host. You may of course select a different name for the N-VDS.

Screen Shot 2020-04-11 at 11.40.18.png

The single N-VDS (NY-HOST-NVDS) on the Edge Node is configured with an Uplink Profile (please see more details below) with two static TEP IP addresses, which allows us to load balance the Geneve-encapsulated overlay traffic North/South. Both Edge Node FastPath interfaces (fp-eth0 & fp-eth1) are mapped to a labelled Active Uplink as part of the default teaming policy.

Screen Shot 2020-04-11 at 11.40.26.png

There are four areas where we need to take care of the correct settings.

<1> - At the physical ToR/Leaf Switch Level

The trunk ports must allow only the required VLANs:

  • VLAN 60 - NSX-T Edge Node management interface
  • VLAN 151 - Geneve TEP VLAN
  • VLAN 160 - Northbound Uplink VLAN for NY-N3K-LEAF-10
  • VLAN 161 - Northbound Uplink VLAN for NY-N3K-LEAF-11

The resulting interface configuration along with the relevant BGP configuration is shown in the table below. Please note that for redundancy reasons both Northbound Uplink VLANs 160 and 161 are allowed in the trunk configuration. Under normal conditions, NY-N3K-LEAF-10 will learn MAC addresses only from VLANs 60, 151 and 160, and NY-N3K-LEAF-11 will learn MAC addresses only from VLANs 60, 151 and 161.

Table 1 - Nexus ToR/LEAF Switch Configuration

NY-N3K-LEAF-10 Interface and BGP Configuration

interface Ethernet1/2
  description *NY-ESX50A-VMNIC2*
  switchport mode trunk
  switchport trunk allowed vlan 60,151,160-161
  spanning-tree port type edge trunk

interface Ethernet1/4
  description *NY-ESX51A-VMNIC2*
  switchport mode trunk
  switchport trunk allowed vlan 60,151,160-161
  spanning-tree port type edge trunk

router bgp 64512
  router-id 172.16.3.10
  log-neighbor-changes
  ---snip---
  neighbor 172.16.160.20 remote-as 64513
    update-source Vlan160
    timers 4 12
    address-family ipv4 unicast
  neighbor 172.16.160.21 remote-as 64513
    update-source Vlan160
    timers 4 12
    address-family ipv4 unicast

NY-N3K-LEAF-11 Interface and BGP Configuration

interface Ethernet1/2
  description *NY-ESX50A-VMNIC3*
  switchport mode trunk
  switchport trunk allowed vlan 60,151,160-161
  spanning-tree port type edge trunk

interface Ethernet1/4
  description *NY-ESX51A-VMNIC3*
  switchport mode trunk
  switchport trunk allowed vlan 60,151,160-161
  spanning-tree port type edge trunk

router bgp 64512
  router-id 172.16.3.11
  log-neighbor-changes
  ---snip---
  neighbor 172.16.161.20 remote-as 64513
    update-source Vlan161
    timers 4 12
    address-family ipv4 unicast
  neighbor 172.16.161.21 remote-as 64513
    update-source Vlan161
    timers 4 12
    address-family ipv4 unicast

In the Cisco Nexus 3048 BGP configuration we see that only NY-N3K-LEAF-10 terminates the BGP sessions on VLAN 160 and only NY-N3K-LEAF-11 terminates the BGP sessions on VLAN 161.
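For reference, terminating these BGP sessions requires a Layer 3 SVI for the corresponding uplink VLAN on each ToR/Leaf switch. A minimal sketch for NY-N3K-LEAF-10 is shown below; the SVI IP address is an assumption for illustration (it is not shown in this post), only the VLAN number is taken from the lab:

Nexus Uplink SVI Sketch (IP address assumed)

feature interface-vlan

interface Vlan160
  description *Tier-0 Uplink VLAN 160*
  no shutdown
  ip address 172.16.160.1/24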

 

<2> - At the vDS Port Group Level

The vDS is configured with four vDS port groups in total:

  • Port Group (Type VLAN): NY-VDS-PG-ESX5x-NSXT-EDGE-MGMT60: carries only VLAN 60 and has an active/standby teaming policy
  • Port Group (Type VLAN): NY-vDS-PG-ESX5x-EDGE2-Dummy999: this dummy port group is used for the remaining unused Edge Node FastPath interface (fp-eth2) to prevent NSX-T from reporting it as admin status down
  • Port Group (Type VLAN trunking): NY-vDS-PG-ESX5x-EDGE2-EDGE-TrunkA: Carries the Edge Node TEP VLAN 151 and Uplink VLAN 160
  • Port Group (Type VLAN trunking): NY-vDS-PG-ESX5x-EDGE2-EDGE-TrunkB: Carries the Edge Node TEP VLAN 151 and Uplink VLAN 161

The two trunk port groups have only one vDS-Uplink active; the other vDS-Uplink is set to standby. This is required so that the Uplink VLAN traffic, along with the BGP session, is forwarded only on the specific vDS-Uplink during normal conditions (each vDS-Uplink is mapped to the corresponding pNIC). With these settings we achieve the following:

  • A deterministic failover order
  • Symmetric bandwidth for both overlay and North/South traffic
  • The BGP session between the Tier-0 Gateway and the ToR/Leaf switches should stay UP even if one or both physical links between the ToR/Leaf switches and the ESXi hosts go down (the BGP session is then carried over the inter-switch trunk link between the ToR/Leaf switches; a sketch of that trunk follows right after this list)
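For completeness, a minimal sketch of what that inter-switch trunk could look like on the Nexus switches is shown below. The port-channel number (Po1) is taken from the MAC address tables later in this post, while the member interfaces and the exact allowed VLAN list are assumptions for illustration:

Inter-Switch Trunk Sketch (member interfaces and VLAN list assumed)

interface port-channel1
  description *Inter-Switch Trunk NY-N3K-LEAF-10 <-> NY-N3K-LEAF-11*
  switchport mode trunk
  switchport trunk allowed vlan 60,151,160-161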

 

The table below highlights the relevant VLAN and Teaming settings:

Table 2 - vDS Port Group Configuration

NY-vDS-PG-ESX5x-EDGE2-EDGE-TrunkA Configuration

Trunka-vlan-Screen Shot 2020-04-11 at 10.38.25.png
Trunka-teaming-Screen Shot 2020-04-11 at 10.38.06.png

NY-vDS-PG-ESX5x-EDGE2-EDGE-TrunkB Configuration

Trunkb-vlan-Screen Shot 2020-04-11 at 10.39.49.png
Trunkb-teaming-Screen Shot 2020-04-11 at 10.39.58.png

 

<3> - At the NSX-T Uplink Profile Level

The NSX-T Uplink Profile is a global construct that defines how traffic leaves a Transport Node or Edge Transport Node.

The single Uplink Profile used for the two Edge Node FastPath interfaces (fp-eth0 & fp-eth1) needs to be extended with two additional Named Teaming Policies to steer the North/South uplink traffic to the corresponding ToR/Leaf switch.

  • The default teaming policy needs to be configured as Load Balance Source (source port ID) with the two Active Uplinks (I am using the labels EDGE-UPLINK1 & EDGE-UPLINK2)
  • An additional teaming policy called NY-Named-Teaming-N3K-LEAF-10 is configured with a Failover Order teaming policy and a single Active Uplink (label EDGE-UPLINK1)
  • An additional teaming policy called NY-Named-Teaming-N3K-LEAF-11 is configured with a Failover Order teaming policy and a single Active Uplink (label EDGE-UPLINK2)

Please note that the Active Uplink labels used in the default teaming policy and in the additional Named Teaming Policies need to be identical.

Screen Shot 2020-04-11 at 10.58.49.png
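For reference, the same Uplink Profile can also be expressed through the NSX-T Manager API. The sketch below is a hedged example of what such an UplinkHostSwitchProfile payload roughly looks like; the display name is an assumption, the transport VLAN is the lab TEP VLAN 151, and the field names are from my recollection of the 2.5 API, so please verify them against the API guide of your NSX-T version:

{
  "resource_type": "UplinkHostSwitchProfile",
  "display_name": "NY-EDGE-UPLINK-PROFILE",
  "transport_vlan": 151,
  "teaming": {
    "policy": "LOADBALANCE_SRCID",
    "active_list": [
      { "uplink_name": "EDGE-UPLINK1", "uplink_type": "PNIC" },
      { "uplink_name": "EDGE-UPLINK2", "uplink_type": "PNIC" }
    ]
  },
  "named_teamings": [
    {
      "name": "NY-Named-Teaming-N3K-LEAF-10",
      "policy": "FAILOVER_ORDER",
      "active_list": [ { "uplink_name": "EDGE-UPLINK1", "uplink_type": "PNIC" } ]
    },
    {
      "name": "NY-Named-Teaming-N3K-LEAF-11",
      "policy": "FAILOVER_ORDER",
      "active_list": [ { "uplink_name": "EDGE-UPLINK2", "uplink_type": "PNIC" } ]
    }
  ]
}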

 

<4> - At the NSX-T Uplink VLAN Segment Level

To activate the previously configured Named Teaming Policies for the Tier-0 VLAN segments 160 and 161, we first need to add the Named Teaming Policies to the VLAN transport zone.

Screen Shot 2020-04-11 at 11.07.12.png

The last step is to configure each of the two Uplink VLAN segments (160 & 161) with the corresponding Named Teaming Policy. NSX-T 2.5.1 requires the VLAN segment to be configured with the Named Teaming Policy in the "legacy" Advanced Networking & Security UI; the recently released NSX-T 3.0 supports this in the Policy UI.
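As far as I recall, behind the Advanced UI this maps to a single property on the VLAN logical switch in the Manager API. A hedged sketch of the relevant payload for segment 160 (field names from memory, transport zone ID shown as a placeholder):

{
  "resource_type": "LogicalSwitch",
  "display_name": "NY-T0-EDGE-UPLINK-SEGMENT-160",
  "transport_zone_id": "<VLAN transport zone UUID>",
  "vlan": 160,
  "admin_state": "UP",
  "uplink_teaming_policy_name": "NY-Named-Teaming-N3K-LEAF-10"
}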

Table 3 - NSX-T VLAN Segment Configuration

VLAN Segment NY-T0-EDGE-UPLINK-SEGMENT-160
VLAN Segment NY-T0-EDGE-UPLINK-SEGMENT-161

Screen Shot 2020-04-11 at 11.09.50.png

Screen Shot 2020-04-11 at 11.09.37.png
Screen Shot 2020-04-11 at 11.29.17.png

Screen Shot 2020-04-11 at 11.29.25.png

 

Verification

The resulting topology with both NSX-T Edge Nodes and the previously shown steps is depicted below. It shows how the Tier-0 VLAN Segments 160 and 161 are "routed" through the different levels from the Tier-0 Gateway towards the Nexus Leaf switches via the vDS trunk port groups.

Networking – NSX-T Edge Pinned VLAN.png

The best way to verify that all your settings are correct is to check on which ToR/Leaf trunk port the MAC addresses of the Tier-0 Gateway Layer 3 interfaces are learnt. These Layer 3 interfaces belong to the Tier-0 Service Router (SR). You can get the MAC addresses via the Edge Node CLI.

Table 4 - NSX-T Tier-0 Layer 3 Interface Configuration

ny-edge-transport-node-20(tier0_sr)> get interfaces

Interface: 2f83fda5-0da5-4764-87ea-63c0989bf059
Ifuid: 276
Name: NY-T0-LIF160-EDGE-20
Internal name: uplink-276
Mode: lif
IP/Mask: 172.16.160.20/24
MAC: 00:50:56:97:51:65
LS port: 40102113-c8af-4d4e-a94d-ca44f9efe9a5
Urpf-mode: STRICT_MODE
DAD-mode: LOOSE
RA-mode: SLAAC_DNS_TRHOUGH_RA(M=0, O=0)
Admin: up
Op_state: up
MTU: 9000

Interface: a1f0d5d0-3883-4e04-b985-e391ec1d9711
Ifuid: 281
Name: NY-T0-LIF161-EDGE-20
Internal name: uplink-281
Mode: lif
IP/Mask: 172.16.161.20/24
MAC: 00:50:56:97:a7:33
LS port: d180ee9a-8e82-4c59-8195-ea65660ea71a
Urpf-mode: STRICT_MODE
DAD-mode: LOOSE
RA-mode: SLAAC_DNS_TRHOUGH_RA(M=0, O=0)
Admin: up
Op_state: up
MTU: 9000

ny-edge-transport-node-21(tier0_sr)> get interfaces

Interface: a3d7669a-e81c-43ea-81c0-dd60438284bc
Ifuid: 289
Name: NY-T0-LIF160-EDGE-21
Internal name: uplink-289
Mode: lif
IP/Mask: 172.16.160.21/24
MAC: 00:50:56:97:84:c3
LS port: 045cd486-d8c5-4df5-8784-2e49862771f4
Urpf-mode: STRICT_MODE
DAD-mode: LOOSE
RA-mode: SLAAC_DNS_TRHOUGH_RA(M=0, O=0)
Admin: up
Op_state: up
MTU: 9000

Interface: 2de46a54-3dba-4ddc-abe7-5b713260e7d4
Ifuid: 296
Name: NY-T0-LIF161-EDGE-21
Internal name: uplink-296
Mode: lif
IP/Mask: 172.16.161.21/24
MAC: 00:50:56:97:ec:1b
LS port: c32e2109-32d0-4c0f-a916-bfba01fdd6ac
Urpf-mode: STRICT_MODE
DAD-mode: LOOSE
RA-mode: SLAAC_DNS_TRHOUGH_RA(M=0, O=0)
Admin: up
Op_state: up
MTU: 9000

 

The MAC address tables show that ToR/Leaf switch NY-N3K-LEAF-10 learns the Tier-0 Layer 3 MAC addresses from VLAN 160 locally and from VLAN 161 via Portchannel 1 (Po1).

And the MAC address tables show that ToR/Leaf switch NY-N3K-LEAF-11 learns the Tier-0 Layer 3 MAC addresses from VLAN 161 locally and from VLAN 160 via Portchannel 1 (Po1).

Table 5 - ToR/Leaf Switch MAC Address Table for Northbound Uplink VLAN 160 and 161

ToR/Leaf Switch NY-N3K-LEAF-10
ToR/Leaf Switch NY-N3K-LEAF-11

NY-N3K-LEAF-10# show mac address-table dynamic vlan 160

Legend:

        * - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC

        age - seconds since last seen,+ - primary entry using vPC Peer-Link,

        (T) - True, (F) - False, C - ControlPlane MAC, ~ - vsan

   VLAN     MAC Address      Type      age     Secure NTFY Ports

---------+-----------------+--------+---------+------+----+------------------

*  160     0050.5697.5165   dynamic  0         F      F    Eth1/2

*  160     0050.5697.84c3   dynamic  0         F      F    Eth1/4

NY-N3K-LEAF-11# show mac address-table dynamic vlan 160

Legend:

        * - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC

        age - seconds since last seen,+ - primary entry using vPC Peer-Link,

        (T) - True, (F) - False, C - ControlPlane MAC, ~ - vsan

   VLAN     MAC Address      Type      age     Secure NTFY Ports

---------+-----------------+--------+---------+------+----+------------------

*  160     0050.5697.5165   dynamic  0         F      F    Po1

*  160     0050.5697.84c3   dynamic  0         F      F    Po1

*  160     780c.f049.0c81   dynamic  0         F      F    Po1

NY-N3K-LEAF-10# show mac address-table dynamic vlan 161

Legend:

        * - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC

        age - seconds since last seen,+ - primary entry using vPC Peer-Link,

        (T) - True, (F) - False, C - ControlPlane MAC, ~ - vsan

   VLAN     MAC Address      Type      age     Secure NTFY Ports

---------+-----------------+--------+---------+------+----+------------------

*  161     0050.5697.a733   dynamic  0         F      F    Po1

*  161     0050.5697.ec1b   dynamic  0         F      F    Po1

*  161     502f.a8a8.717c   dynamic  0         F      F    Po1

NY-N3K-LEAF-11# show mac address-table dynamic vlan 161

Legend:

        * - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC

        age - seconds since last seen,+ - primary entry using vPC Peer-Link,

        (T) - True, (F) - False, C - ControlPlane MAC, ~ - vsan

   VLAN     MAC Address      Type      age     Secure NTFY Ports

---------+-----------------+--------+---------+------+----+------------------

*  161     0050.5697.a733   dynamic  0         F      F    Eth1/2

*  161     0050.5697.ec1b   dynamic  0         F      F    Eth1/4

*  161     780c.f049.0c81   dynamic  0         F      F    Po1

 

As we have seen in the Edge Transport Node configuration, each Edge Node has two statically configured TEP IP addresses, and both FastPath interfaces load balance the Geneve encapsulated overlay traffic. Table 8 lists the Edge Node TEP MAC addresses so they can be matched against the entries learnt on TEP VLAN 151 in Table 7.

Table 7 - ToR/Leaf Switch MAC Address Table for Edge Node TEP VLAN 151

ToR/Leaf Switch NY-N3K-LEAF-10
ToR/Leaf Switch NY-N3K-LEAF-11

NY-N3K-LEAF-10# show mac address-table dynamic vlan 151

Legend:

        * - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC

        age - seconds since last seen,+ - primary entry using vPC Peer-Link,

        (T) - True, (F) - False, C - ControlPlane MAC, ~ - vsan

   VLAN     MAC Address      Type      age     Secure NTFY Ports

---------+-----------------+--------+---------+------+----+------------------

*  151     0050.5697.5165   dynamic  0         F      F    Eth1/2

*  151     0050.5697.84c3   dynamic  0         F      F    Eth1/4

*  151     0050.5697.a733   dynamic  0         F      F    Po1

*  151     0050.5697.ec1b   dynamic  0         F      F    Po1

*  151     502f.a8a8.717c   dynamic  0         F      F    Po1

NY-N3K-LEAF-11# show mac address-table dynamic vlan 151

Legend:

        * - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC

        age - seconds since last seen,+ - primary entry using vPC Peer-Link,

        (T) - True, (F) - False, C - ControlPlane MAC, ~ - vsan

   VLAN     MAC Address      Type      age     Secure NTFY Ports

---------+-----------------+--------+---------+------+----+------------------

*  151     0000.0c9f.f097   dynamic  0         F      F    Po1

*  151     0050.5697.5165   dynamic  0         F      F    Po1

*  151     0050.5697.84c3   dynamic  0         F      F    Po1

*  151     0050.5697.a733   dynamic  0         F      F    Eth1/2

*  151     0050.5697.ec1b   dynamic  0         F      F    Eth1/4

*  151     780c.f049.0c81   dynamic  0         F      F    Po1

 

Table 8 - NSX-T Edge Node TEP MAC Addresses

ny-edge-transport-node-20> get interface fp-eth0 | find MAC
  MAC address: 00:50:56:97:51:65

ny-edge-transport-node-20> get interface fp-eth1 | find MAC
  MAC address: 00:50:56:97:a7:33

ny-edge-transport-node-21> get interface fp-eth0 | find MAC
  MAC address: 00:50:56:97:84:c3

ny-edge-transport-node-21> get interface fp-eth1 | find MAC
  MAC address: 00:50:56:97:ec:1b

 

For the sake of completeness, the table below shows that only ToR/Leaf Switch NY-N3K-LEAF-10 learns the two Edge Node management MAC addresses from VLAN 60 locally, while ToR/Leaf Switch NY-N3K-LEAF-11 learns them only via Portchannel 1 (Po1). This is expected, as we have configured an active/standby teaming policy on the management vDS port group. The Edge Node N-VDS is not involved for the Edge Node management interface.

Table 9 - ToR/Leaf Switch MAC Address Table for Edge Node Management VLAN 60

ToR/Leaf Switch NY-N3K-LEAF-10
ToR/Leaf Switch NY-N3K-LEAF-11

NY-N3K-LEAF-10# show mac address-table dynamic vlan 60

Legend:

        * - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC

        age - seconds since last seen,+ - primary entry using vPC Peer-Link,

        (T) - True, (F) - False, C - ControlPlane MAC, ~ - vsan

   VLAN     MAC Address      Type      age     Secure NTFY Ports

---------+-----------------+--------+---------+------+----+------------------

*   60     0050.5697.1e49   dynamic  0         F      F    Eth1/4

*   60     0050.5697.4555   dynamic  0         F      F    Eth1/2

*   60     502f.a8a8.717c   dynamic  0         F      F    Po1

NY-N3K-LEAF-11# show mac address-table dynamic vlan 60

Legend:

        * - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC

        age - seconds since last seen,+ - primary entry using vPC Peer-Link,

        (T) - True, (F) - False, C - ControlPlane MAC, ~ - vsan

   VLAN     MAC Address      Type      age     Secure NTFY Ports

---------+-----------------+--------+---------+------+----+------------------

*   60     0000.0c9f.f03c   dynamic  0         F      F    Po1

*   60     0050.5697.1e49   dynamic  0         F      F    Po1

*   60     0050.5697.4555   dynamic  0         F      F    Po1

 

Please note that I highly recommend always running a few failover tests to confirm that the NSX-T Edge Node deployment works as expected.
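A minimal sketch of such a failover test, assuming out-of-band access to the switches: shut down one of the ESXi-facing trunk ports on a ToR/Leaf switch and observe where the BGP sessions and MAC addresses move (the interface below is just an example from this lab):

Failover Test Sketch

NY-N3K-LEAF-10# configure terminal
NY-N3K-LEAF-10(config)# interface Ethernet1/2
NY-N3K-LEAF-10(config-if)# shutdown
NY-N3K-LEAF-10(config-if)# end

NY-N3K-LEAF-10# show ip bgp summary
NY-N3K-LEAF-10# show mac address-table dynamic vlan 160
NY-N3K-LEAF-11# show mac address-table dynamic vlan 160

The expectation in this lab would be that the BGP sessions towards 172.16.160.20/.21 stay up (now carried over the inter-switch trunk), the Tier-0 LIF and TEP MAC addresses disappear from Eth1/2, and traffic continues over the remaining uplink. Don't forget to re-enable the interface with "no shutdown" afterwards.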

 

I hope you had a little bit of fun reading this blog post about a single N-VDS on the Edge Node with VLAN pinning.

 

Software Inventory:

vSphere version: VMware ESXi, 6.5.0, 15256549

vCenter version: 6.5.0, 10964411

NSX-T version: 2.5.1.0.0.15314288 (GA)

Cisco Nexus 3048 NX-OS version: 7.0(3)I7(6)

 

 

Blog history

Version 1.0 - 13.04.2020 - first published version

Version 1.1 - 14.04.2020 - minor changes (license)

Version 1.2 - 25.04.2020 - minor changes (typos)

Dear readers

Welcome to a new series of blog posts about network readiness. As you might already be aware, NSX-T mainly requires two things from the physical underlay network:

  • IP Connectivity – IP connectivity between all NSX-T components and compute hosts. This includes the Geneve Tunnel Endpoint (TEP) interfaces as well as the management interfaces (typically vmk0) on the hosts and on the NSX-T Edge nodes - both bare metal and virtual NSX-T Edge nodes.
  • Jumbo Frame Support – The minimum required MTU is 1600 bytes; however, an MTU of 1700 bytes is recommended to cover the full variety of functions and to future-proof the environment for an expanding Geneve header. To get the most out of your VMware SDDC, your physical underlay network should support at least an MTU of 9000 bytes.

This blog focuses on MTU readiness for NSX-T. Beyond the overlay encapsulation with Geneve, other VMkernel traffic types, like vSAN or vMotion, also perform better with a higher MTU, so we keep this MTU discussion more general. Physical network vendors, like Cisco with the Nexus Data Center switch family, typically support an MTU of 9216 bytes; other vendors may have a similar upper limit.

 

This blog is about the correct MTU configuration and its verification within a Data Center spine-leaf architecture with Nexus 3K switches running NX-OS. Let's have a look at a very basic lab spine-leaf topology with only three Nexus N3K-C3048TP-1GE switches:

Lab Spine Leaf Topology.png

Out of the box, the Nexus 3048 switches are configured with an MTU of 1500 bytes only. For an MTU of 9216 bytes we need to configure three pieces:

  • Layer 3 Interface MTU Configuration – This type of interface is used between the Leaf-10 and the Borderspine-12 switch and between the Leaf-11 and the Borderspine-12 switch. On these interfaces we run OSPF to announce the Loopback0 interface used for the iBGP peering connectivity. As an example, the Layer 3 MTU configuration on interface e1/49 of Leaf-10 is shown below:
Nexus 3048 Layer 3 Interface MTU Configuration

NY-N3K-LEAF-10# show run inter e1/49

---snip---

interface Ethernet1/49

  description **L3 to NY-N3K-BORDERSPINE-12**

  no switchport

  mtu 9216

  no ip redirects

  ip address 172.16.3.18/30

  ip ospf network point-to-point

  no ip ospf passive-interface

  ip router ospf 1 area 0.0.0.0

NY-N3K-LEAF-10#

 

  • Layer 3 Switch Virtual Interface (SVI) MTU Configuration – This type of interface is required, for example, to establish IP connectivity between the Leaf-10 and Leaf-11 switches when the interfaces between the Leaf switches are configured as Layer 2 interfaces. We are using a dedicated SVI on VLAN 3 for the OSPF neighborship and the iBGP peering connectivity between Leaf-10 and Leaf-11. In this lab topology the interfaces e1/51 and e1/52 are configured as a dot1q trunk to carry multiple VLANs (including VLAN 3), and these two interfaces are combined into a port-channel running LACP for redundancy reasons. As an example, the MTU configuration of the SVI for VLAN 3 on Leaf-10 is shown below:
Nexus 3048 Switch Virtual Interface (SVI) MTU Configuration

NY-N3K-LEAF-10# show run inter vlan 3

---snip---

interface Vlan3

  description *iBGP-OSPF-Peering*

  no shutdown

  mtu 9216

  no ip redirects

  ip address 172.16.3.1/30

  ip ospf network point-to-point

  no ip ospf passive-interface

  ip router ospf 1 area 0.0.0.0

NY-N3K-LEAF-10#

 

  • Global Layer 2 Interface MTU Configuration – This global configuration is required for this type of Nexus switch and a few other Nexus models (please see footnote 1 for more details). The Nexus 3000 does not support per-interface Layer 2 MTU configuration; the MTU for Layer 2 interfaces must be configured via a network-qos policy. All interfaces configured as access or trunk ports for host connectivity, as well as the dot1q trunk between the Leaf switches (e1/51 and e1/52), require the network-qos configuration shown below:
Nexus 3048 Global MTU QoS Policy Configuration

NY-N3K-LEAF-10#show run

---snip---

policy-map type network-qos POLICY-MAP-JUMBO

  class type network-qos class-default

   mtu 9216

system qos

  service-policy type network-qos POLICY-MAP-JUMBO

NY-N3K-LEAF-10#

 

The network-qos global MTU configuration can be verified with the command shown below:

Nexus 3048 Global MTU QoS Policy Verification

NY-N3K-LEAF-10# show queuing interface ethernet 1/51-52 | include MTU

HW MTU of Ethernet1/51 : 9216 bytes

HW MTU of Ethernet1/52 : 9216 bytes

NY-N3K-LEAF-10#

 

The verification of the end-to-end MTU of 9216 bytes within the physical network should typically be done before you attach your first ESXi hypervisor hosts. Please keep in mind that the vSphere Distributed Switch (vDS) and the NSX-T N-VDS (e.g. the uplink profile MTU configuration) today support up to 9000 bytes. This MTU includes the overhead for the Geneve encapsulation. As you can see in the output below from an ESXi host, the MTU is set to the maximum of 9000 bytes for the VMkernel interfaces used for Geneve (unfortunately still labelled vxlan) as well as for vMotion and IP storage.

ESXi Host MTU VMkernel Interface Verification

[root@NY-ESX50A:~] esxcfg-vmknic -l

Interface  Port Group/DVPort/Opaque Network        IP Family IP Address      Netmask         Broadcast       MAC Address       MTU     TSO MSS   Enabled Type     NetStack           

vmk0       2                                       IPv4      172.16.50.10    255.255.255.0   172.16.50.255   b4:b5:2f:64:f9:48 1500    65535     true    STATIC   defaultTcpipStack  

vmk2       17                                      IPv4      172.16.52.10    255.255.255.0   172.16.52.255   00:50:56:63:4c:85 9000    65535     true    STATIC   defaultTcpipStack  

vmk10      10                                      IPv4      172.16.150.12   255.255.255.0   172.16.150.255  00:50:56:67:d5:b4 9000    65535     true    STATIC   vxlan              

vmk50      910dba45-2f63-40aa-9ce5-85c51a138a7d    IPv4      169.254.1.1     255.255.0.0     169.254.255.255 00:50:56:69:68:74 1500    65535     true    STATIC   hyperbus           

vmk1       8                                       IPv4      172.16.51.10    255.255.255.0   172.16.51.255   00:50:56:6c:7c:f9 9000    65535     true    STATIC   vmotion            

[root@NY-ESX50A:~]

 

Of course, I still highly recommend verifying the end-to-end MTU between two ESXi hosts by sending VMkernel pings with the don't-fragment bit set (e.g. vmkping ++netstack=vxlan -d -c 3 -s 8972 -I vmk10 172.16.150.13); see the example below.
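The payload size of 8972 bytes is not arbitrary: a 9000-byte VMkernel MTU minus 20 bytes for the IPv4 header and 8 bytes for the ICMP header leaves 8972 bytes of ICMP payload. A minimal usage sketch from the ESXi shell of NY-ESX50A towards the TEP IP of the second host (the peer IP is the one used in the command above):

ESXi VMkernel MTU Path Verification

[root@NY-ESX50A:~] vmkping ++netstack=vxlan -d -c 3 -s 8972 -I vmk10 172.16.150.13

If all three pings are answered, the TEP path carries 9000-byte frames without fragmentation.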

 

But for a serious end-to-end MTU 9216 verification of the physical network we need a tool other than the VMkernel ping. In my case I simply use BGP running on the Nexus 3048 switches. BGP runs on top of TCP, and TCP supports the "Maximum Segment Size" option to maximize the size of the TCP datagrams.

 

The TCP Maximum Segment Size (MSS) is a parameter in the options field of the TCP header that specifies the largest amount of data, in bytes, that a receiver is willing to accept in a single TCP segment. This information is exchanged during the TCP three-way handshake (SYN packets), as the Wireshark sniffer trace below shows.

Wireshark-MTU9216-MSS-TCP.png

The TCP MSS defines the maximum amount of data that an IPv4 endpoint is willing to accept in a single TCP/IPv4 datagram. RFC 879 explicitly mentions that the MSS counts only data octets in the segment; it does not count the TCP header or the IP header. In the Wireshark trace example, the two IPv4 endpoints (Loopbacks 172.16.3.10 and 172.16.3.12) have accepted an MSS of 9176 bytes on a physical Layer 3 link with MTU 9216 during the TCP three-way handshake. The difference of 40 bytes corresponds to the default TCP header of 20 bytes plus the IP header of another 20 bytes.

Please keep in mind that a small MSS value will reduce or eliminate IP fragmentation for any TCP-based application, but will result in higher overhead. This is also true for BGP messages.

BGP update messages carry the BGP prefixes as part of the Network Layer Reachability Information (NLRI). For optimal BGP performance in a spine-leaf architecture running BGP, it is advisable to set the MSS for BGP to the maximum value that still avoids fragmentation. As defined in RFC 879, all IPv4 endpoints are required to handle an MSS of at least 536 bytes (= MTU of 576 bytes minus 20 bytes TCP header** minus 20 bytes IP header).

But are these Nexus switches using an MSS of only 536 bytes? Nope!

These Nexus 3048 switches running NX-OS 7.0(3)I7(6) are by default configured to discover the maximum path MTU between the two IPv4 endpoints leveraging the Path MTU Discovery (PMTUD) feature. Other Nexus switches may require the global command "ip tcp path-mtu-discovery" to enable PMTUD.

 

MSS is sometimes mistaken for PMTUD. MSS is a concept used by TCP at the Transport Layer and specifies the largest amount of data that a device can receive in a single TCP segment, while PMTUD determines the largest packet size that can be sent over a path without suffering fragmentation.

 

But how can we verify the MSS used for the BGP peering sessions between the Nexus 3048 switches?

Nexus 3048 switches running NX-OS allow the administrator to check the MSS of the BGP TCP sessions with the following command: show sockets connection tcp detail.

Below we see two BGP TCP sessions between the IPv4 endpoints (switch loopback interfaces), and each session shows an MSS of 9164 bytes.

BGP TCP Session Maximum Segment Size Verification

NY-N3K-LEAF-10# show sockets connection tcp local 172.16.3.10 detail

 

---snip---

 

Kernel Socket Connection:

State      Recv-Q Send-Q        Local Address:Port          Peer Address:Port

 

ESTAB      0      0               172.16.3.10:24415          172.16.3.11:179    ino:78187 sk:ffff88011f352700

 

     skmem:(r0,rb262144,t0,tb262144,f0,w0,o0) ts sack cubic wscale:2,2 rto:210 rtt:12.916/14.166 ato:40 mss:9164 cwnd:10 send 56.8Mbps rcv_space:18352

 

 

ESTAB      0      0               172.16.3.10:45719          172.16.3.12:179    ino:79218 sk:ffff880115de6800

 

     skmem:(r0,rb262144,t0,tb262144,f0,w0,o0) ts sack cubic wscale:2,2 rto:203.333 rtt:3.333/1.666 ato:40 mss:9164 cwnd:10 send 220.0Mbps rcv_space:18352

 

 

NY-N3K-LEAF-10#

Please always reset the BGP session when you change the MTU, as the MSS is only negotiated during the initial TCP three-way handshake.

 

The MSS value of 9164 bytes confirms that the physical underlay network is ready with an end-to-end MTU of 9216 bytes. But why is the BGP MSS value (9164) 12 bytes smaller than the TCP MSS value (9176) negotiated during the TCP three-way handshake?

Again, in many TCP/IP stack implementations we see an MSS of 1460 bytes with an interface MTU of 1500 bytes, or an MSS of 9176 bytes with an interface MTU of 9216 bytes (a difference of 40 bytes), but there are other factors that can change this. For example, if both sides support RFC 1323/7323 (enhanced timestamps, window scaling, PAWS***), this adds 12 bytes to the TCP header, reducing the payload to 1448 bytes or 9164 bytes respectively.

And indeed, the Nexus NX-OS TCP/IP stack used for BGP supports the TCP timestamps option by default and leverages the PMTUD (RFC 1191) feature to account for the extra 12 bytes, hence reducing the maximum payload (in our case BGP) to an MSS of 9164 bytes.
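Summarizing the arithmetic from the values observed above:

MSS advertised in the SYN:  9216 (interface MTU) - 20 (IP header) - 20 (TCP header) = 9176 bytes
Effective MSS for BGP:      9176 - 12 (TCP timestamps option, RFC 1323/7323)        = 9164 bytes
Classic 1500-byte MTU:      1500 - 20 - 20 = 1460 bytes, or 1448 bytes with the timestamps option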

 

The Wireshark sniffer trace below confirms the extra 12 bytes used for the TCP timestamps option.

Wireshark-TCP-12bytes-Option-timestamps.png

Hope you had a little bit of fun reading this small Network Readiness write-up.

 

Footnote 1: Configure and Verify Maximum Transmission Unit on Cisco Nexus Platforms - Cisco

** A 20-byte TCP header is only correct when the default TCP header without options is used; RFC 1323 - TCP Extensions for High Performance (replaced by RFC 7323 - TCP Extensions for High Performance) defines TCP extensions which require up to 12 bytes more.

*** PAWS = Protect Against Wrapped Sequences

 

Software Inventory:

vSphere version: VMware ESXi, 6.5.0, 15256549

vCenter version: 6.5.0, 10964411

NSX-T version: 2.5.1.0.0.15314288 (GA)

Cisco Nexus 3048 NX-OS version: 7.0(3)I7(6)

 

Blog history:

Version 1.0 - 23.03.2020 - first published version


NSX-T N-VDS VLAN Pinning

Posted by oziltener Aug 19, 2019

Dear readers

As you are probably aware, NSX-T uses its own vSwitch called the N-VDS. The N-VDS is primarily used to encapsulate and decapsulate GENEVE overlay traffic between NSX-T transport nodes, and it also supports the distributed firewall (dFW) for micro-segmentation. The N-VDS requires its own dedicated pNIC interfaces; these pNICs cannot be shared with vSphere vSwitches (vDS or vSS). In a typical NSX-T deployment, each NSX-T transport node has one or two Tunnel End Points (TEPs) to terminate the GENEVE overlay traffic. The number of TEPs is directly related to the attached Uplink Profile: if you use a "Failover Order" teaming policy, only a single TEP is used; with a "Load Balance Source" teaming policy, a TEP is assigned for each physical NIC. Such a "Load Balance Source" Uplink Profile is shown below and will be used for this lab exercise.

Screen Shot 2019-08-19 at 20.07.00.png

The mapping of the "Uplinks" is as follows:

  • ActiveUplink1 is the pNIC (vmnic2) connected to ToR switch NY-CAT3750G-A
  • ActiveUplink2 is the pNIC (vmnic3) connected to ToR switch NY-CAT3750G-B

 

Additionally, you can see that VLAN 150 is used to carry the GENEVE encapsulated traffic.

 

However, the N-VDS can also be used for VLAN-based segments. VLAN-based segments are very similar to vDS port groups. In deployments where your hosts have only two pNICs and both pNICs are used for the N-VDS (yes, for redundancy reasons), you have to use VLAN-based segments to carry the VMkernel interfaces (e.g. management, vMotion or vSAN). When your VLAN-based segments carry VMkernel interface traffic and you use an Uplink Profile as shown above, it is difficult to figure out on which pNIC the VMkernel traffic is carried, as this traffic follows the default teaming policy, in our case "Load Balance Source". Please note that VLAN-based segments are not limited to VMkernel traffic; such segments can also carry regular virtual machine traffic.

 

There are often good reasons to steer traffic for a predictable traffic flow behavior; for example, you would like to transport management and vMotion VMkernel traffic under normal conditions (all physical links up) on pNIC_A and vSAN on pNIC_B. The top two reasons are:

1.) Predict the traffic forwarding pattern under normal conditions (all links up) and align, for example, the VMkernel traffic with the active First Hop Redundancy Protocol gateway (e.g. HSRP)

2.) Reduce ISL traffic between the two ToR switches or ToR-to-spine traffic for high-volume traffic (e.g. vSAN or vMotion), along with predictable and low-latency traffic forwarding (assume, for example, you have 20 hosts in a single rack and all hosts use the left ToR switch for vSAN; in such a situation the ISL is not carrying vSAN traffic)

 

This is where NSX-T "VLAN Pinning" comes into play. What I call "VLAN Pinning" is referred to in the NSX-T public documentation as a "Named Teaming Policy"; I simply like the term "VLAN Pinning". In the lab exercise for this blog, I would like to show you how to configure "VLAN Pinning". The physical lab setup looks like the diagram below:

Physical Host Representation-Version1.png

Only host NY-ESX72A is relevant for this exercise. This host is attached to two Top of Rack (ToR) Layer 3 switches, called NY-CAT3750G-A and NY-CAT3750G-B. As you can see, this host has four pNICs (vmnic0...3), but only the pNICs vmnic2 and vmnic3 assigned to the N-VDS are relevant for this lab exercise. On host NY-ESX72A, I have created three additional "artificial/dummy" VMkernel interfaces (vmk3, vmk4, vmk5). Each of the three VMkernel interfaces is assigned to a dedicated NSX-T VLAN-based segment. The diagram below shows the three VMkernel interfaces, all attached to a dedicated VLAN-based segment owned by the N-VDS (NY-NVDS), and the MAC address of vmk3 as an example.

Screen Shot 2019-08-19 at 21.00.26.png

 

The simplified logical setup is shown below:

Logical Representation-default-teaming-Version1.png

 

 

From the NSX-T perspective we have configured three VLAN-based segments. These VLAN-based segments are created with the new Policy UI/API.

NSX-T-VLAN-Segments-red-marked.png

The Policy UI/API, introduced with NSX-T 2.4.0, is the preferred interface for the majority of NSX-T deployments. The "legacy" UI/API is still available and is visible in the UI under the "Advanced Networking & Security" tab.

 

As already mentioned, the three VLAN-based segments use the default teaming policy (Load Balance Source), so the VMkernel traffic is distributed over the two pNICs (vmnic2 or vmnic3). Hence, we typically cannot predict which of the ToR switches will learn the MAC addresses of the three individual VMkernel interfaces. Before we move forward and configure "VLAN Pinning", let's see how the traffic of the three VMkernel interfaces is distributed. One of the easiest ways is to check the MAC address table of the two ToR switches for interface Gi1/0/10.

Screen Shot 2019-08-19 at 20.53.59.png

As you can see, NY-CAT3750G-A is learning only the MAC address of vmk3 (0050.5663.f4eb), whereas NY-CAT3750G-B is learning the MAC addresses of vmk4 (0050.5667.50eb) and vmk5 (0050.566d.410d). With the default teaming option "Load Balance Source", the administrator has no option to steer the traffic. Please ignore the two learned MAC addresses in VLAN 150; these are TEP MAC addresses.
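The screenshot above shows the kind of output you get from a Catalyst IOS command along these lines (a sketch; Gi1/0/10 is the host-facing port from the diagram):

NY-CAT3750G-A# show mac address-table dynamic interface GigabitEthernet1/0/10
NY-CAT3750G-B# show mac address-table dynamic interface GigabitEthernet1/0/10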

 

Before we now configure VLAN Pinning, let's assume we would like vmk3 and vmk4 to be learnt on NY-CAT3750G-A and vmk5 on NY-CAT3750G-B (when all links are up). We would like to use two new "Named Teaming Policies" with failover order. The traffic flows should look like the diagram below (a dotted line means "standby link").

Logical Representation-vlan-pinning-teaming-Version1.png

The first step is to create two additional "Named Teaming Policies". Please compare this screenshot with the very first screenshot above, and be sure to use the identical uplink names (ActiveUplink1 and ActiveUplink2) as in the default teaming policy.

Edit-Uplink-Profile.png

 

The second step is to make these two new "Named Teaming Policies" available on the associated VLAN transport zone (TZ).

Edit-TZ-for-vlan-pinning.png

The third and last step is to edit the three VLAN-based segments according to your traffic steering policy. As you can see, we unfortunately need to edit the VLAN-based segments in the "legacy" "Advanced Networking & Security" UI section. We plan to make this editing option available in the new Policy UI/API in a future NSX-T release.

NY-VLAN-SEGMENT-90.png

NY-VLAN-SEGMENT-91.png

NY-VLAN-SEGMENT-92.png

As soon as you edit the VLAN-based segments with the new "Named Teaming Policies", the ToR switches will immediately learn the MAC addresses on the associated physical interfaces.

After applying "VLAN Pinning" through the two new "Named Teaming Policies", the two ToR switches learn the MAC addresses in the following way:

Catalyst-MAC-table-with-vlan-pinning.png

As you can see, NY-CAT3750G-A is now learning the MAC addresses of vmk3 and vmk4, whereas NY-CAT3750G-B is learning only the MAC address of vmk5.

Hope you had a little bit of fun reading this NSX-T VLAN Pinning write-up.

 

 

Software Inventory:

vSphere version: 6.5.0, build 13635690

vCenter version: 6.5.0, build 10964411

NSX-T version: 2.4.1.0.0.13716575

 

Blog history

Version 1.0 - 19.08.2019 - first published version

Dear readers

I was recently at a customer site where we discussed the details of NSX-T north/south connectivity with active/active edge node virtual machines to maximize throughput and resiliency. Achieving the highest north-to-south (and vice versa) bandwidth requires the installation of multiple edge nodes in active/active mode leveraging ECMP routing.

But let's first have a basic view of an NSX-T ECMP deployment.

In a typical deployment the physical router is a Layer 3 leaf switch acting as Top of Rack (ToR) device; two of them are required to provide redundancy. NSX-T basically supports two edge node deployment options: active/standby and active/active. For maximizing throughput and the highest level of resiliency, the active/active deployment option is the right choice. NSX-T is able to install up to eight paths leveraging ECMP routing. As you are most likely already familiar with NSX-T, you know that NSX-T requires the Service Router (SR) component on each individual edge node (VM or Bare Metal) to set up the BGP peering with the physical router. But have you ever thought about what eight ECMP path entries really mean? Are these eight paths counted on the Tier-0 logical router, on the edge node itself, or somewhere else?

 

Before we talk about the eight ECMP paths, let us have a closer look at the physical setup. For this exercise I have only four ESXi hosts available in my lab. Each host is equipped with four 1Gbit/s pNICs. Two of these ESXi hosts are purely used to provide CPU and memory resources to the edge node VMs, and the other two ESXi hosts are prepared with NSX-T (NSX-T VIBs installed). The two "Edge" ESXi hosts have two vDS, each configured with two pNICs. The first vDS is used for vmk0 management, vMotion and IP storage; the second vDS is used for the Tunnel End Point (TEP) GENEVE-encapsulated traffic and the routed uplink traffic towards the ToR switches. The edge node VMs act as NSX-T transport nodes; they typically have two or three N-VDS embedded (a future release will support a single N-VDS per edge node). The two compute hosts are prepared with NSX-T, act also as transport nodes, and have a slightly different vSwitch setup. The first vSwitch is again a vDS with two pNICs and is used for vmk0 management, vMotion and IP storage. The other two pNICs are assigned to the NSX-T N-VDS and are responsible for the TEP traffic. The diagram below shows the simplified physical setup.

Physical Host Representation-Version1.png

As you can easily see, the two "Edge" vSphere hosts have a total of eight edge node VMs installed. This is a purpose-built "Edge" vSphere cluster serving edge node VMs only. Is this kind of deployment recommended for a real customer deployment? It depends :-)

Having four pNICs is probably a good choice, but most likely 10Gbit/s or 25Gbit/s interfaces are preferred or even required instead of 1Gbit/s interfaces for high bandwidth throughput. When you host more than one edge node VM per ESXi host, I recommend using at least 25Gbit/s interfaces. As our focus is on maximizing throughput and resiliency, a customer deployment would likely have four or more ESXi hosts in the "Edge" vSphere cluster. Other aspects should be considered as well, like the storage system used (e.g. vSAN), operational aspects (e.g. maintenance mode) or vSphere cluster settings. For this lab, "small" sized edge node VMs are used; a real deployment should use "large" sized edge node VMs where maximum throughput is required. A dedicated purpose-built "Edge" vSphere cluster can be considered best practice when maximum throughput and the highest resiliency along with operational simplification are required. Here are two additional diagrams of the edge node VM deployment in my lab.

Screen Shot 2019-08-06 at 06.06.20.png

Screen Shot 2019-08-06 at 06.14.38.png

 

Now that we have an idea of how the physical environment looks, it is time to move forward and dig into the logical routing design.

 

Multi Tier Logical Routing-Version1.png

To simplify things, the diagram shows only a single compute transport node (NY-ESX70A) and only six of the eight edge node VMs. All eight edge node VMs are assigned to a single NSX-T edge cluster, and this edge cluster is assigned to the Tier-0 logical router. The logical design shows a two-tier architecture with a Tier-0 logical router and two Tier-1 logical routers. This is a very common design. Centralized services are not deployed at the Tier-1 level in this exercise. A Tier-0 logical router consists in almost all cases (as you normally want to use static or dynamic routing to reach the physical world) of a Service Router (SR) and a Distributed Router (DR). Only the edge node VM can host the Service Router (SR). As already said, the Tier-1 logical routers have in this exercise only the DR component instantiated; a Service Router (SR) is not required, as centralized services (e.g. Load Balancer) are not configured. Each SR has two eBGP peerings with the physical routers. Please keep in mind that only the two overlay segments green-240 and blue-241 are user-configured segments. Workload VMs are attached to these overlay segments, and these overlay segments provide VM mobility across physical boundaries. The segment between the Tier-0 SR and DR and the segments between the Tier-0 DR and the Tier-1 DRs are overlay segments automatically configured by NSX-T, including the IP address assignment.

Meanwhile, you might have already recognized that eight edge nodes equate to eight ECMP paths. Yes, this is true... but where are these eight ECMP paths installed in the routing respectively forwarding table? These eight paths are installed neither on the logical construct Tier-0 logical router nor on a single edge node. The eight ECMP paths are installed on the Tier-0 DR component of each individual compute transport node, in our case on the NY-ESX70A Tier-0 DR and the NY-ESX71A Tier-0 DR. The CLI output below shows the forwarding table on the compute transport node NY-ESX70A.

 

IPv4 Forwarding Table NY-ESX70A Tier0 DR

NY-ESX70A> get logical-router e4a0be38-e1b6-458a-8fad-d47222d04875 forwarding ipv4

                                   Logical Routers Forwarding Table - IPv4                            

--------------------------------------------------------------------------------------------------------------

Flags Legend: [U: Up], [G: Gateway], [C: Connected], [I: Interface]

[H: Host], [R: Reject], [B: Blackhole], [F: Soft Flush], [E: ECMP]

 

                   Network                               Gateway                Type               Interface UUID   

==============================================================================================================

0.0.0.0/0                                              169.254.0.2              UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f

0.0.0.0/0                                              169.254.0.3              UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f

0.0.0.0/0                                              169.254.0.4              UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f

0.0.0.0/0                                              169.254.0.5              UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f

0.0.0.0/0                                              169.254.0.6              UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f

0.0.0.0/0                                              169.254.0.7              UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f

0.0.0.0/0                                              169.254.0.8              UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f

0.0.0.0/0                                              169.254.0.9              UGE     48d83fc7-1117-4a28-92c0-7cd7597e525f

100.64.48.0/31                                           0.0.0.0                UCI     03ae946a-bef4-45f5-a807-8e74fea878b6

100.64.48.2/31                                           0.0.0.0                UCI     923cbdaf-ad8a-45ce-9d9f-81d984c426e4

169.254.0.0/25                                           0.0.0.0                UCI     48d83fc7-1117-4a28-92c0-7cd7597e525f

--snip--

Each compute transport node can distribute traffic sourced from the attached workload VMs from south to north over these eight paths (as we have eight different next hops), a single path per Service Router. With such an active/active ECMP deployment we can maximize the forwarding bandwidth from south to north. This is shown in the diagram below.

Multi Tier Logical Routing-South-to-North-Version1.png

On the other hand, from north to south, each ToR switch has eight paths installed (indicated with "multipath") to reach the destination networks green-240 and blue-241. The ToR switch distributes the traffic from the physical world across all eight next hops. Here we achieve the maximum throughput from north to south as well. Let's have a look at the BGP table of the two ToR switches for the destination network green-240.

 

BGP Table for "green" prefix 172.16.240.0/24 on RouterA and RouterB

NY-CAT3750G-A#show ip bgp 172.16.240.0/0

BGP routing table entry for 172.16.240.0/24, version 189

Paths: (9 available, best #8, table Default-IP-Routing-Table)

Multipath: eBGP

Flag: 0x1800

  Advertised to update-groups:

     1          2  

  64513

    172.16.160.20 from 172.16.160.20 (172.16.160.20)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath

  64513

    172.16.160.22 from 172.16.160.22 (172.16.160.22)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath

  64513

    172.16.160.23 from 172.16.160.23 (172.16.160.23)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath

  64513

    172.16.160.21 from 172.16.160.21 (172.16.160.21)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath

  64513

    172.16.160.27 from 172.16.160.27 (172.16.160.27)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath

  64513

    172.16.160.26 from 172.16.160.26 (172.16.160.26)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath

  64513

    172.16.160.25 from 172.16.160.25 (172.16.160.25)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath

  64513

    172.16.160.24 from 172.16.160.24 (172.16.160.24)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath, best

  64513

    172.16.3.11 (metric 11) from 172.16.3.11 (172.16.3.11)

      Origin incomplete, metric 0, localpref 100, valid, internal

NY-CAT3750G-A#

NY-CAT3750G-B#show ip bgp 172.16.240.0/0

BGP routing table entry for 172.16.240.0/24, version 201

Paths: (9 available, best #9, table Default-IP-Routing-Table)

Multipath: eBGP

Flag: 0x1800

  Advertised to update-groups:

     1          2  

  64513

    172.16.161.20 from 172.16.161.20 (172.16.160.20)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath

  64513

    172.16.161.23 from 172.16.161.23 (172.16.160.23)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath

  64513

    172.16.161.21 from 172.16.161.21 (172.16.160.21)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath

  64513

    172.16.161.26 from 172.16.161.26 (172.16.160.26)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath

  64513

    172.16.161.22 from 172.16.161.22 (172.16.160.22)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath

  64513

    172.16.161.27 from 172.16.161.27 (172.16.160.27)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath

  64513

    172.16.161.25 from 172.16.161.25 (172.16.160.25)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath

  64513

    172.16.3.10 (metric 11) from 172.16.3.10 (172.16.3.10)

      Origin incomplete, metric 0, localpref 100, valid, internal

  64513

    172.16.161.24 from 172.16.161.24 (172.16.160.24)

      Origin incomplete, metric 0, localpref 100, valid, external, multipath, best

NY-CAT3750G-B#

 

Traffic arriving at the Service Router (SR) from the ToR switches is kept locally on the edge node before the traffic is forwarded to the destination VM (GENEVE encapsulated). This is shown in the next diagram below.

Multi Tier Logical Routing-North-to-South-Version1.png

And what is the final conclusion of this little lab exercise?

Each Service Router on an edge node provides exactly one next hop to each individual compute transport node. The number of BGP peerings per edge node VM is not relevant for the eight ECMP paths; the number of edge nodes is relevant. Theoretically, a single eBGP peering from each edge node would achieve the same number of ECMP paths, but please keep in mind that two BGP sessions per edge node provide better resiliency. Hope you had a little bit of fun reading this NSX-T ECMP edge node write-up.
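If you would like to check those BGP sessions from the edge node side as well, the NSX-T edge CLI offers commands along these lines (a sketch; the prompt name is an example and the VRF ID of the Tier-0 SR has to be taken from the get logical-routers output):

ny-edge-node> get logical-routers
ny-edge-node> vrf 1
ny-edge-node(tier0_sr)> get bgp neighbor summary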

 

Software Inventory:

vSphere version: 6.5.0, build 13635690

vCenter version: 6.5.0, build 10964411

NSX-T version: 2.4.1.0.0.13716575

 

Blog history

Version 1.0 - 06.08.2019 - first published version

Version 1.1 - 19.08.2019 - minor changes

Dear readers

This is the second blog of a series related to NSX-T. This second part provides you with the information required to better understand the implications of a centralized service in NSX-T. While the first blog provided an introduction to the lab setup, this second blog discusses the impact of adding a Tier-1 Edge Firewall for the tenant BLUE. The diagram below shows the logical representation of the lab setup with the Edge Firewall attached to the Tier-1 uplink interface of the Logical Router for tenant BLUE.

Blog-Diagram-2.1.png

 

For this blog I have chosen to add an Edge Firewall on a Tier-1 Logical Router, but I could also have chosen a Load Balancer, a VPN service or a NAT service. The implications for the "internal" NSX-T networking are similar. However, please keep in mind that with NSX-T 2.3 not all NSX-T centralized services are supported at the Tier-1 level (for example VPN) or at the Tier-0 level (for example Load Balancer), and not all services (for example DHCP or Metadata Proxy) instantiate a Service Router.

 

Before I move forward and try to explain what happens under the hood when you enable an Edge Firewall, I would like to give you some additional information about the diagram below.

Blog-Diagram-2.2.png

I am sure you are already familiar with the diagram above, as we talked about it in my first blog. Each of the four Transport Nodes (TN) has the two tenant Tier-1 Logical Routers instantiated. Inside each Transport Node, two Logical Switches with VNIs 17295 and 17296 are used between the tenant Tier-1 DRs and the Tier-0 DR. These two automatically instantiated (sometimes referred to as auto-plumbing) transit overlay Logical Switches have the subnets 100.64.144.18/31 and 100.64.144.20/31 automatically assigned. Internal filtering avoids duplicate IP address challenges, in the same way as NSX-T already does for the gateway IP (.254) on the Logical Switches 17289 and 17294 where the VMs are attached. Each of these Tier-1 to Tier-0 transit Logical Switches (17295 and 17296) could be shown as linked together in the diagram, but as internal filtering takes place, this is for the moment irrelevant.

The intra Tier-0 Logical Switch with VNI 17292 is used to forward traffic between the Tier-0 DRs and northbound via the Service Routers (SR). This Logical Switch 17292 again has an automatically assigned IP subnet (169.254.0.0/28). Each Tier-0 DR has the same IP address assigned (.1), but the two Service Routers use different IPs (.2 and .3); otherwise the Tier-0 DR would not be able to forward based on equal cost with two different next hops.

 

Before the network administrator is able to configure an Edge Firewall for tenant BLUE at the Tier-1 level, an edge cluster has to be assigned to the Tier-1 Logical Router, along with the edge cluster members. This is shown in the diagram below.

Blog2-add-edge-nodes-to-Tier1-BLUE.png

Please be aware that as soon as you assign an edge-cluster to a Tier-1 Logical Router, a Service Router is automatically instantiated, independent of the Edge Firewall.
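The same assignment can also be done through the management API instead of the UI. Below is a minimal sketch; the manager address, the credentials and the display names "Tier1-BLUE" and "edge-cluster-1" are assumptions from my lab. It reads the Tier-1 Logical Router and writes it back with the edge_cluster_id set, which is exactly the step that triggers the Service Router instantiation.

```python
# Sketch: assign an edge-cluster to a Tier-1 Logical Router via the
# NSX-T management API (equivalent to the UI step shown above).
# "Tier1-BLUE" and "edge-cluster-1" are lab placeholder names.
import requests
from requests.auth import HTTPBasicAuth

NSX = "https://nsxt-mgr.lab.local"
AUTH = HTTPBasicAuth("admin", "VMware1!VMware1!")
S = requests.Session()
S.auth, S.verify = AUTH, False  # lab only: self-signed certificate

def get(path):
    r = S.get(NSX + path)
    r.raise_for_status()
    return r.json()

tier1 = next(lr for lr in get("/api/v1/logical-routers")["results"]
             if lr["display_name"] == "Tier1-BLUE")
cluster = next(ec for ec in get("/api/v1/edge-clusters")["results"]
               if ec["display_name"] == "edge-cluster-1")

# Setting edge_cluster_id on the Tier-1 LR is what instantiates the SRs
tier1["edge_cluster_id"] = cluster["id"]
r = S.put(NSX + "/api/v1/logical-routers/" + tier1["id"], json=tier1)
r.raise_for_status()
print("Edge cluster assigned, revision:", r.json()["_revision"])
```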

 

These two new Service Routers run on the edge-nodes in active/standby mode, as shown in the next diagram below. A small API sketch to check the active/standby placement follows after the diagram.

Blog2-routing-tier1-blue-overview.png
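To check which edge-node currently hosts the active Tier-1 Service Router, the per-logical-router status call of the management API is useful. The sketch below simply dumps the raw status JSON; I deliberately don't hard-code the field names here, so look for the per-node high-availability state in the output and compare it with the API guide of your NSX-T version. Manager address, credentials and the router name are lab placeholders.

```python
# Sketch: dump the runtime status of the Tier-1 Logical Router to see
# which edge transport node is active and which one is standby.
# Names and credentials are lab placeholders.
import json
import requests
from requests.auth import HTTPBasicAuth

NSX = "https://nsxt-mgr.lab.local"
AUTH = HTTPBasicAuth("admin", "VMware1!VMware1!")

def get(path):
    r = requests.get(NSX + path, auth=AUTH, verify=False)  # lab only
    r.raise_for_status()
    return r.json()

tier1 = next(lr for lr in get("/api/v1/logical-routers")["results"]
             if lr["display_name"] == "Tier1-BLUE")

# The status output contains the per-edge-node HA state (active/standby)
print(json.dumps(get("/api/v1/logical-routers/%s/status" % tier1["id"]), indent=2))
```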

 

The configuration of the tenant BLUE Edge Firewall itself is shown in the next diagram. For this lab we use the default firewall policy.

Blog2-enable-edge-firewall.png

This simple configuration step of adding the two edge-nodes to the Tier-1 Logical Router for tenant BLUE causes NSX-T to "re-organize" the internal auto-plumbing network. To understand what is happening under the hood, I have divided these internal network changes into four steps instead of showing only the final result.

 

In step 1, NSX-T internally disconnects the Tier-0 DR from the Tier-1 DR for the BLUE tenant, as the northbound traffic needs to be redirected to the two edge-nodes where the Tier-1 Service Routers are running. The internal Logical Switch with VNI 17295 is now explicitly linked together across the four Transport Nodes (TN).

Blog-Diagram-2.3.png

 

In step 2, NSX-T automatically instantiates on each edge-node a new Service Router at the Tier-1 level for the tenant BLUE with an Edge Firewall. The Service Routers run in active/standby mode. In this example, the Service Router running on the Transport Node EN1-TN is active, while the Service Router running on EN2-TN is standby. The Tier-1 Service Router uplink interface with the IP address 100.64.144.19 is accordingly either UP or DOWN.

Blog-Diagram-2.4.png

 

In step 3, NSX-T connects the Tier-1 Service Router and the Distributed Router for the BLUE tenant together. For this connection a new Logical Switch with VNI 17288 is added. Again, the Service Router running on EN1-TN has the active interface with the IP address 169.254.0.2 up, while the corresponding interface of the Service Router on EN2-TN is down. This ensures that only the active Service Router can forward traffic.

Blog-Diagram-2.5.png

 

In the final step 4, NSX-T extends the Logical Switch with VNI 17288 to the two compute Transport Nodes ESX70A and ESX71A. This extension is required to route traffic, for example from vm1, on the local host before the traffic is forwarded to the Edge Transport Nodes. Finally, NSX-T adds the required static routing between the different Distributed and Service Routers. NSX-T performs all these steps under the hood automatically. A small sketch to verify this routing on an edge transport node follows after the diagram below.

Blog-Diagram-2.6.png
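The auto-plumbed routes are visible in the routing information of the Tier-1 Logical Router as seen from each edge transport node. The sketch below uses the routing-table read of the NSX-T 2.x management API, which takes the transport node as a query parameter; treat it as an illustration only (names and credentials are lab placeholders, and the exact endpoint behaviour should be checked against the API guide of your NSX-T version). On the edge-node CLI you would typically look at the same information with "get logical-routers" and then "get route" inside the respective VRF.

```python
# Sketch: read the routing table of the Tier-1 Logical Router as seen on
# a specific edge transport node, to inspect the auto-plumbed routes.
# Names/credentials are lab placeholders; the routing-table endpoint takes
# the transport node as a query parameter in the 2.x management API.
import requests
from requests.auth import HTTPBasicAuth

NSX = "https://nsxt-mgr.lab.local"
AUTH = HTTPBasicAuth("admin", "VMware1!VMware1!")

def get(path, **params):
    r = requests.get(NSX + path, auth=AUTH, params=params, verify=False)  # lab only
    r.raise_for_status()
    return r.json()

tier1 = next(lr for lr in get("/api/v1/logical-routers")["results"]
             if lr["display_name"] == "Tier1-BLUE")
edge = next(tn for tn in get("/api/v1/transport-nodes")["results"]
            if tn["display_name"] == "EN1-TN")

table = get("/api/v1/logical-routers/%s/routing/routing-table" % tier1["id"],
            transport_node_id=edge["id"])
for entry in table.get("results", []):
    print(entry)
```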

 

The next diagram below shows a traffic flow between vm1 and vm3. Traffic sourced from vm1 first hits the local DR of the BLUE tenant on ESX70A-TN. The traffic then needs to be forwarded to the active Tier-1 Service Router (SR) with the Edge Firewall, running on Edge Transport Node EN1-TN. The traffic then reaches the Tier-0 DR on EN1-TN, is forwarded to the RED Tier-1 DR, and finally arrives at vm3. The return traffic first hits the local DR of the RED tenant on ESX71A-TN before it reaches the Tier-0 DR on the same host. The next hop is the BLUE Tier-1 Service Router (SR). The Edge Firewall inspects the return traffic and forwards it locally to the BLUE Tier-1 DR before the traffic finally arrives back at vm1. The majority of the traffic handling happens locally on EN1-TN. The bandwidth used between the physical hosts, and therefore the GENEVE encapsulated traffic, is the same as without the Edge Firewall. But as everybody can imagine, an edge-node which might host multiple Edge Firewalls for multiple tenants or other centralized services should be sized accordingly.

Blog-Diagram-2.7.png

 

I hope you had a little bit of fun reading these two blogs. Feel free to share this blog!

 

Lab Software Details:

NSX-T: 2.3.0.0

vSphere: 6.5.0 Update 1 Build 5969303

vCenter:  6.5 Update 1d Build 2143838

 

Version 1.0 - 10.12.2018

Dear readers

This is the first blog of a series related to NSX-T. It provides a simple introduction to the most relevant information required to better understand the implications of centralized services in NSX-T. A centralized service could be, for example, a Load Balancer or an Edge Firewall.

 

NSX-T has the ability to do distributed routing and supports a distributed firewall. Distributed routing means that each host which is prepared for NSX-T can do local routing. From the logical view this part is called the Distributed Router (DR). The DR is part of a Logical Router (LR), and this LR can be configured at the Tier-0 or at the Tier-1 level. Distributed routing is perfect for scale and can reduce the bandwidth utilization of each physical NIC on the host, as the routing decision is made on the local host. For example, when the source and the destination VM are located on the same host but connected to different IP subnets, and therefore attached to different overlay Logical Switches, the traffic never leaves the host. All traffic forwarding is processed on the host itself instead of on the physical network, for example on the ToR switch.

Each host which is prepared with NSX-T and attached to an NSX-T Transport Zone is called a Transport Node (TN). Transport Nodes implicitly have an N-VDS configured, which for example provides the GENEVE Tunnel Endpoint and is responsible for the distributed firewall processing. However, there are services like load balancing or edge firewalling which are not distributed services. VMware calls these services "centralized services". Centralized services instantiate a Service Router (SR), and this SR runs on an NSX-T edge-node (EN). An edge-node can be a VM or a bare metal server. Each edge-node is also a Transport Node (TN).
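A quick way to see which fabric nodes are hypervisor hosts and which ones are edge-nodes is the fabric nodes listing of the management API. The sketch below is just an illustration with lab placeholder credentials; it prints every fabric node with its resource type.

```python
# Sketch: list the fabric nodes and show whether they are hypervisor
# hosts or edge-nodes. Manager address and credentials are lab placeholders.
import requests
from requests.auth import HTTPBasicAuth

NSX = "https://nsxt-mgr.lab.local"
AUTH = HTTPBasicAuth("admin", "VMware1!VMware1!")

r = requests.get(NSX + "/api/v1/fabric/nodes", auth=AUTH, verify=False)  # lab only
r.raise_for_status()

for node in r.json()["results"]:
    # resource_type is e.g. "HostNode" for ESXi/KVM or "EdgeNode" for edge VMs
    print(node["display_name"], node["resource_type"])
```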

 

Let's now have a look at a simple two-tier NSX-T topology with a tenant BLUE and a tenant RED. Both have, for now, no centralized services enabled at the Tier-1 level. For the North-South connectivity to the physical world, there is already a centralized service instantiated at the Tier-0 level. We don't want to focus on this North-South routing part, but since we later want to understand what it means to have a centralized service configured on a Tier-1 Logical Router, it is important to understand this part as well, because North-South routing is also a centralized service. The diagram below shows the logical representation of a simple lab setup. This lab setup will later be used to instantiate a centralized service at a Tier-1 Logical Router.

Blog-Diagram-1.png

For those who would like to get a better understanding of the topology, I have included a diagram of the physical view below. In this lab we actually use 4 ESXi hosts. For simplification we focus in this blog on the ESXi hypervisor instead of KVM, even though we could build a similar lab with KVM too. On each of the two Transport Nodes ESX70A-TN and ESX71A-TN a VM is installed. The two other hosts ESX50A and ESX51A are NOT* prepared for NSX-T, but they each host a single edge-node VM (EN1 and EN2). These two edge-nodes don't have to run on two different ESXi hosts, but it is recommended for redundancy reasons.

Blog-Diagram-2.png

In the next diagram we now combine the physical and the logical view. The two Transport Nodes ESX70A-TN and ESX71A-TN have only DRs instantiated at the Tier-1 and Tier-0 level, but no Service Router. That means the Logical Router consists only of a DR. These DRs at the Tier-1 level provide the gateway (.254) for the attached Logical Switch. The tenant BLUE uses VNI 17289 and the tenant RED uses VNI 17294. NSX-T assigns these VNIs out of a VNI pool (default pool: 5000 - 65535). The edge-node VMs, now shown as Edge Transport Nodes (EN1-TN and EN2-TN), have the same Tier-1 and Tier-0 DRs instantiated, but only the Tier-0 includes a Service Router (SR).

Blog-Diagram-1.3.png

The two Tier-1 Logical Routers, respectively their DRs, can only talk to each other via the green Tier-0 DR. But before you are able to attach the two Tier-1 DRs to a Tier-0 DR, a Tier-0 Logical Router is required. And a Tier-0 Logical Router mandates the assignment of an edge-cluster during its configuration. Let's assume at this point that we have already configured two edge-node VMs and that these edge-node VMs are assigned to an edge-cluster. A Tier-0 Logical Router always consists of a Distributed Router (DR) and, depending on the node type, a Service Router as well. A Service Router is always required for the Tier-0 Logical Router, as the Service Router is responsible for the routing connectivity to the physical world. But the Service Router is only instantiated on the edge-nodes. In this lab both Service Routers are configured on the two edge-nodes, respectively Edge Transport Nodes, in active/active mode to provide ECMP to the physical world.
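For completeness, the creation of such a Tier-0 Logical Router can also be scripted. Below is a minimal sketch against the management API (display names and credentials are lab placeholders): it creates a Tier-0 Logical Router in ACTIVE_ACTIVE mode with the edge-cluster assigned, which is what gives you the two ECMP-capable Tier-0 Service Routers.

```python
# Sketch: create a Tier-0 Logical Router in active/active mode with an
# edge-cluster assigned (the prerequisite for the two ECMP Tier-0 SRs).
# Display names and credentials are lab placeholders.
import requests
from requests.auth import HTTPBasicAuth

NSX = "https://nsxt-mgr.lab.local"
AUTH = HTTPBasicAuth("admin", "VMware1!VMware1!")
S = requests.Session()
S.auth, S.verify = AUTH, False  # lab only: self-signed certificate

clusters = S.get(NSX + "/api/v1/edge-clusters").json()["results"]
cluster = next(ec for ec in clusters if ec["display_name"] == "edge-cluster-1")

body = {
    "display_name": "Tier0-LR",
    "router_type": "TIER0",
    "high_availability_mode": "ACTIVE_ACTIVE",   # active/active for ECMP northbound
    "edge_cluster_id": cluster["id"],
}
r = S.post(NSX + "/api/v1/logical-routers", json=body)
r.raise_for_status()
print("Created Tier-0 LR:", r.json()["id"])
```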

All the internal transit links, as shown in the diagram below, are automatically configured by NSX-T. The only task for the network administrator is to connect the Tier-0 DR to the Tier-1 DRs.
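This Tier-1 to Tier-0 connection can also be made through the management API with a pair of linked router ports. The sketch below is only an illustration: router names and credentials are lab placeholders, and the exact shape of the linked_logical_router_port_id reference should be verified against the API guide of your NSX-T version. A link port is created on the Tier-0 side first, and the Tier-1 link port then references it.

```python
# Sketch: connect a Tier-1 Logical Router to the Tier-0 Logical Router by
# creating a pair of linked logical router ports.
# Router display names and credentials are lab placeholders; verify the
# reference format of linked_logical_router_port_id for your NSX-T version.
import requests
from requests.auth import HTTPBasicAuth

NSX = "https://nsxt-mgr.lab.local"
AUTH = HTTPBasicAuth("admin", "VMware1!VMware1!")
S = requests.Session()
S.auth, S.verify = AUTH, False  # lab only

routers = S.get(NSX + "/api/v1/logical-routers").json()["results"]
tier0 = next(lr for lr in routers if lr["display_name"] == "Tier0-LR")
tier1 = next(lr for lr in routers if lr["display_name"] == "Tier1-BLUE")

# Link port on the Tier-0 side
t0_port = S.post(NSX + "/api/v1/logical-router-ports", json={
    "resource_type": "LogicalRouterLinkPortOnTIER0",
    "logical_router_id": tier0["id"],
    "display_name": "LinkPort-to-Tier1-BLUE",
}).json()

# Link port on the Tier-1 side, referencing the Tier-0 link port
t1_port = S.post(NSX + "/api/v1/logical-router-ports", json={
    "resource_type": "LogicalRouterLinkPortOnTIER1",
    "logical_router_id": tier1["id"],
    "linked_logical_router_port_id": {"target_id": t0_port["id"]},
    "display_name": "LinkPort-to-Tier0",
}).json()
print("Linked:", t0_port["id"], "<->", t1_port["id"])
```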

The northbound connection to the physical world further requires the configuration of an additional VLAN-based Transport Zone (or better two Transport Zones for routing redundancy) plus the routing peering (typically eBGP). Below is the resulting logical network topology.

One might ask why NSX-T instantiates the two Tier-1 DRs on each edge-node too. Well, this is required for optimized forwarding. As already mentioned, routing decisions are always made on the host where the traffic is sourced. Assume vm1 in tenant BLUE would like to talk to a server in the physical world. Traffic sourced at vm1 is forwarded to its local gateway on the Tier-1 DR and then on to the Tier-0 DR on the same host. From the Tier-0 DR the traffic is forwarded to the left Tier-0 SR on EN1-TN (let's assume the traffic is hashed accordingly), and then the flow reaches the external destination. The return traffic first reaches the Tier-0 SR on EN2-TN (let's assume again based on the hash), is then forwarded locally to the Tier-0 DR on the same Edge Transport Node, and then to the Tier-1 DR of tenant BLUE. The traffic does not leave EN2-TN until it is finally forwarded onto the Logical Switch where vm1 is attached. This is what is called optimized forwarding, which is possible due to the distributed NSX-T architecture. The traffic needs to be forwarded only once over the physical data center infrastructure, and is therefore GENEVE encapsulated only once per direction!

Blog-Diagram-1.4.png

This closes the first blog. In the second blog we will dive into the instantiation of a centralized service at the Tier-1 level. I hope you had a little bit of fun reading this first write-up.

 


*Today, NSX-T also supports running edge-node VMs on NSX-T prepared hosts. This capability is important for combining compute and edge-node services on the same host.

Version 1.0 - 19.11.2018

Version 1.1 - 27.11.2018 (minor changes)

Version 1.2 - 04.12.2018 (cosmetic changes)

Version 1.3 - 10.12.2018 (link for second blog added)