Welcome to a new series of blog posts about network readiness. As you may already know, NSX-T mainly requires two things from the physical underlay network:
- IP Connectivity – IP connectivity between all NSX-T components and the compute hosts. This includes the Geneve Tunnel Endpoint (TEP) interfaces and the management interfaces (typically vmk0) on the hosts, as well as the management interfaces of the NSX-T Edge nodes, both bare metal and virtual.
- Jumbo Frame Support – The minimum required MTU is 1600 bytes; however, an MTU of 1700 bytes is recommended to cover the full range of functions and to future-proof the environment for an expanding Geneve header. To get the most out of your VMware SDDC, your physical underlay network should support an MTU of at least 9000 bytes.
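To make the 1600 and 1700 byte figures a bit more tangible, here is a minimal Python sketch of the Geneve overhead arithmetic. It assumes an IPv4 underlay, an untagged inner Ethernet frame and a guest MTU of 1500 bytes; the function names are my own and purely illustrative.

```python
# Per-packet overhead of Geneve over IPv4, in bytes.
# Assumptions: untagged inner frame, outer IPv4 header without IP options.
INNER_ETHERNET = 14  # inner Ethernet header carried inside the tunnel
GENEVE_BASE = 8      # fixed Geneve header, options not included
UDP_HEADER = 8
OUTER_IPV4 = 20


def required_underlay_mtu(guest_mtu: int, geneve_options: int = 0) -> int:
    """IP MTU the underlay needs so an encapsulated guest packet is not fragmented."""
    return (guest_mtu + INNER_ETHERNET + GENEVE_BASE
            + geneve_options + UDP_HEADER + OUTER_IPV4)


def geneve_option_headroom(underlay_mtu: int, guest_mtu: int = 1500) -> int:
    """Bytes left for Geneve options at a given underlay MTU."""
    return underlay_mtu - required_underlay_mtu(guest_mtu)


for underlay in (1600, 1700):
    print(f"underlay MTU {underlay}: "
          f"{geneve_option_headroom(underlay)} bytes left for Geneve options")
```

Under these assumptions, a 1500-byte guest packet needs 1550 bytes without any Geneve options, so the rest of a 1600 or 1700 byte MTU is headroom for the variable-length option part of the header.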
This blog focuses on MTU readiness for NSX-T. Besides the VMkernel interface used for the Geneve overlay encapsulation, there are other VMkernel interfaces, such as vSAN or vMotion, which perform better with a higher MTU, so we keep the MTU discussion more general. Physical network vendors, like Cisco with the Nexus data center switch family, typically support an MTU of 9216 bytes. Other vendors might have the same upper MTU limit.
This blog covers the correct MTU configuration and its verification within a data center spine-leaf architecture with Nexus 3K switches running NX-OS. Let's have a look at a very basic lab spine-leaf topology with only three Nexus N3K-C3048TP-1GE switches:
Out of the box, the Nexus 3048 switches are configured with an MTU of only 1500 bytes. For an MTU of 9216 bytes, we need to configure three pieces.
- Layer 3 Interfaces MTU Configuration – This type of interface is used between the Leaf-10 and the Borderspine-12 switch and between the Leaf-11 and the Borderspine-12 switch. On these interfaces we run OSPF to announce the Loopback0 interface for the iBGP peering connectivity. As an example, the Layer 3 interface MTU configuration of interface e1/49 on Leaf-10 is shown below:
|Nexus 3048 Layer 3 Interface MTU Configuration|
NY-N3K-LEAF-10# show run inter e1/49
interface Ethernet1/49
  description **L3 to NY-N3K-BORDERSPINE-12**
  mtu 9216
  no ip redirects
  ip address 172.16.3.18/30
  ip ospf network point-to-point
  no ip ospf passive-interface
  ip router ospf 1 area 0.0.0.0
- Layer 3 Switch Virtual Interfaces (SVI) MTU Configuration – This type of interface is required, for example, to establish IP connectivity between the Leaf-10 and Leaf-11 switches when the interfaces between the Leaf switches are configured as Layer 2 interfaces. We use a dedicated SVI for VLAN 3 for the OSPF neighborship and the iBGP peering connectivity between Leaf-10 and Leaf-11. In this lab topology, the interfaces e1/51 and e1/52 are configured as dot1q trunks to carry multiple VLANs (including VLAN 3), and these two interfaces are combined into a port channel running LACP for redundancy. As an example, the MTU configuration of the SVI for VLAN 3 on Leaf-10 is shown below:
|Nexus 3048 Switch Virtual Interface (SVI) MTU Configuration|
NY-N3K-LEAF-10# show run inter vlan 3
interface Vlan3
  mtu 9216
  no ip redirects
  ip address 172.16.3.1/30
  ip ospf network point-to-point
  no ip ospf passive-interface
  ip router ospf 1 area 0.0.0.0
- Global Layer 2 Interface MTU Configuration – This global configuration is required for this type of Nexus switch and a few other Nexus switches (please see footnote 1 for more details). The Nexus 3000 does not support an individual Layer 2 interface MTU configuration; the MTU for Layer 2 interfaces must be configured via a network-qos policy. All interfaces configured as access or trunk ports for host connectivity, as well as the dot1q trunk between the Leaf switches (e1/51 and e1/52), require the network-qos configuration shown below:
|Nexus 3048 Global MTU QoS Policy Configuration|
policy-map type network-qos POLICY-MAP-JUMBO
  class type network-qos class-default
    mtu 9216
system qos
  service-policy type network-qos POLICY-MAP-JUMBO
The global network-qos MTU configuration can be verified with the command shown below:
|Nexus 3048 Global MTU QoS Policy Verification|
NY-N3K-LEAF-10# show queuing interface ethernet 1/51-52 | include MTU
HW MTU of Ethernet1/51 : 9216 bytes
HW MTU of Ethernet1/52 : 9216 bytes
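If you have to check more than a handful of ports, the output above is easy to evaluate with a script. The following Python sketch only assumes the line format shown in the listing ("HW MTU of <port> : <value> bytes"); how you collect the CLI output (SSH, NX-API, copy and paste) is left open, and the function names are illustrative.

```python
import re

# Matches lines such as "HW MTU of Ethernet1/51 : 9216 bytes".
HW_MTU_LINE = re.compile(r"HW MTU of (?P<port>\S+)\s*:\s*(?P<mtu>\d+) bytes")


def parse_hw_mtu(cli_output: str) -> dict[str, int]:
    """Map each port found in the CLI output to its hardware MTU."""
    return {m["port"]: int(m["mtu"]) for m in HW_MTU_LINE.finditer(cli_output)}


def ports_below(cli_output: str, expected_mtu: int = 9216) -> list[str]:
    """Return the ports whose hardware MTU is smaller than expected."""
    return [port for port, mtu in parse_hw_mtu(cli_output).items()
            if mtu < expected_mtu]


SAMPLE = """\
HW MTU of Ethernet1/51 : 9216 bytes
HW MTU of Ethernet1/52 : 9216 bytes
"""

print(parse_hw_mtu(SAMPLE))
print("ports below 9216:", ports_below(SAMPLE))
```

An empty list from ports_below means every listed Layer 2 port has picked up the jumbo network-qos policy.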
The end-to-end MTU of 9216 bytes within the physical network should typically be verified before you attach your first ESXi hypervisor hosts. Please keep in mind that the virtual distributed switch (vDS) and the NSX-T N-VDS (e.g. the uplink profile MTU configuration) currently support up to 9000 bytes. This MTU includes the overhead for the Geneve encapsulation. As you can see in the ESXi host output below, the MTU is set to the maximum of 9000 bytes for the VMkernel interfaces used for Geneve (unfortunately still labeled vxlan), for vMotion and for IP storage.
|ESXi Host MTU VMkernel Interface Verification|
[root@NY-ESX50A:~] esxcfg-vmknic -l
Interface Port Group/DVPort/Opaque Network IP Family IP Address Netmask Broadcast MAC Address MTU TSO MSS Enabled Type NetStack
vmk0 2 IPv4 172.16.50.10 255.255.255.0 172.16.50.255 b4:b5:2f:64:f9:48 1500 65535 true STATIC defaultTcpipStack
vmk2 17 IPv4 172.16.52.10 255.255.255.0 172.16.52.255 00:50:56:63:4c:85 9000 65535 true STATIC defaultTcpipStack
vmk10 10 IPv4 172.16.150.12 255.255.255.0 172.16.150.255 00:50:56:67:d5:b4 9000 65535 true STATIC vxlan
vmk50 910dba45-2f63-40aa-9ce5-85c51a138a7d IPv4 169.254.1.1 255.255.0.0 169.254.255.255 00:50:56:69:68:74 1500 65535 true STATIC hyperbus
vmk1 8 IPv4 172.16.51.10 255.255.255.0 172.16.51.255 00:50:56:6c:7c:f9 9000 65535 true STATIC vmotion
I still highly recommend verifying the end-to-end MTU between two ESXi hosts by sending VMkernel pings with the don't-fragment bit set (e.g. vmkping ++netstack=vxlan -d -c 3 -s 8972 -I vmk10 172.16.150.13).
But for a serious end-to-end verification of the 9216-byte MTU in the physical network, we need a tool other than the VMkernel ping. In my case, I simply use BGP running on the Nexus 3048 switches. BGP runs on top of TCP, and TCP supports the "Maximum Segment Size" option to maximize the size of TCP segments.
The TCP Maximum Segment Size (MSS) is a parameter in the options field of the TCP header that specifies, in bytes, the largest amount of data an endpoint can receive in a single TCP segment. This information is exchanged in the SYN packets of the TCP three-way handshake, as the diagram below shows from a Wireshark sniffer trace.
The TCP MSS defines the maximum amount of data that an IPv4 endpoint is willing to accept in a single TCP/IPv4 datagram. RFC 879 explicitly mentions that the MSS counts only the data octets in the segment; it does not count the TCP header or the IP header. In the Wireshark trace example, the two IPv4 endpoints (Loopback 172.16.3.10 and 172.16.3.12) have exchanged an MSS of 9176 bytes on a physical Layer 3 link with MTU 9216 during the TCP three-way handshake. The difference of 40 bytes results from the default TCP header of 20 bytes and the IP header of another 20 bytes.
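As a rough illustration of what the sniffer trace shows, the Python sketch below derives the MSS from the interface MTU and encodes it the way the MSS option is carried in a SYN segment (option kind 2, length 4, followed by a 16-bit value). The function names are my own; the header sizes assume IPv4 and TCP headers without options.

```python
import struct

IPV4_HEADER = 20     # bytes, no IP options
TCP_HEADER = 20      # bytes, no TCP options
TCP_OPT_MSS = 2      # option kind for Maximum Segment Size
TCP_OPT_MSS_LEN = 4  # kind + length + 16-bit MSS value


def mss_for_mtu(ip_mtu: int) -> int:
    """MSS an endpoint would announce for a given interface MTU."""
    return ip_mtu - IPV4_HEADER - TCP_HEADER


def encode_mss_option(mss: int) -> bytes:
    """Encode the MSS option as it appears in the TCP options of a SYN."""
    return struct.pack("!BBH", TCP_OPT_MSS, TCP_OPT_MSS_LEN, mss)


def decode_mss_option(raw: bytes) -> int:
    """Decode an MSS option and return the announced value in bytes."""
    kind, length, mss = struct.unpack("!BBH", raw[:4])
    if kind != TCP_OPT_MSS or length != TCP_OPT_MSS_LEN:
        raise ValueError("not a TCP MSS option")
    return mss


for mtu in (1500, 9216):
    option = encode_mss_option(mss_for_mtu(mtu))
    print(f"MTU {mtu}: MSS {decode_mss_option(option)}, option bytes {option.hex()}")
```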
Please keep in mind that a small MSS value will reduce or eliminate IP fragmentation for any TCP-based application but will result in higher overhead. This is also true for BGP messages.
BGP update messages carry the BGP prefixes as Network Layer Reachability Information (NLRI). For optimal BGP performance in a spine-leaf architecture running BGP, it is advisable to set the MSS for BGP to the maximum value that still avoids fragmentation. As defined in RFC 879, all IPv4 endpoints are required to handle an MSS of 536 bytes (= MTU of 576 bytes minus 20 bytes for the TCP header** minus 20 bytes for the IP header).
But are these Nexus switches using an MSS of only 536 bytes? Nope!
These Nexus 3048 switches running NX-OS 7.0(3)I7(6) are configured by default to discover the maximum path MTU between the two IPv4 endpoints using the Path MTU Discovery (PMTUD) feature. Other Nexus switches may require the global configuration command "ip tcp path-mtu-discovery" to enable PMTUD.
MSS is sometimes mistaken for PMTUD. MSS is a concept used by TCP in the transport layer; it specifies the largest amount of data that a computer or communications device can receive in a single TCP segment. PMTUD, in contrast, determines the largest packet size that can be sent over a path without fragmentation.
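The difference between the two can be sketched in a few lines of Python. This is a simplified model, not how a TCP stack is implemented: real PMTUD learns the path MTU from ICMP "fragmentation needed" messages, whereas the sketch assumes the link MTUs along the path are already known. The function names are illustrative.

```python
def path_mtu(link_mtus: list[int]) -> int:
    """The path MTU is limited by the smallest link MTU along the path."""
    return min(link_mtus)


def effective_send_mss(peer_announced_mss: int, link_mtus: list[int],
                       ip_header: int = 20, tcp_header: int = 20) -> int:
    """Segment size a sender can use: capped by the peer's MSS and the path MTU."""
    return min(peer_announced_mss, path_mtu(link_mtus) - ip_header - tcp_header)


# All links support jumbo frames: the announced MSS can be used in full.
print(effective_send_mss(9176, [9216, 9216]))

# One hop left at MTU 1500: the path, not the announced MSS, is the limit.
print(effective_send_mss(9176, [9216, 1500, 9216]))
```

The second call shows why a single forgotten link in the underlay matters: both endpoints still announce 9176 bytes, but the usable segment size drops to 1460 bytes.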
But how can we verify the MSS used for the BGP peering session between the Nexus 3048 switches?
NX-OS on the Nexus 3048 switches allows the administrator to check the MSS of the BGP TCP session with the following command: show sockets connection tcp detail.
Below we see two BGP TCP sessions between the IPv4 endpoints (switch Loopback interfaces), and each session shows an MSS of 9164 bytes.
|BGP TCP Session Maximum Segment Size Verification|
NY-N3K-LEAF-10# show sockets connection tcp local 172.16.3.10 detail
Kernel Socket Connection:
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 172.16.3.10:24415 172.16.3.11:179 ino:78187 sk:ffff88011f352700
skmem:(r0,rb262144,t0,tb262144,f0,w0,o0) ts sack cubic wscale:2,2 rto:210 rtt:12.916/14.166 ato:40 mss:9164 cwnd:10 send 56.8Mbps rcv_space:18352
ESTAB 0 0 172.16.3.10:45719 172.16.3.12:179 ino:79218 sk:ffff880115de6800
skmem:(r0,rb262144,t0,tb262144,f0,w0,o0) ts sack cubic wscale:2,2 rto:203.333 rtt:3.333/1.666 ato:40 mss:9164 cwnd:10 send 220.0Mbps rcv_space:18352
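To run this check against several switches, the listing can be parsed with a short script. The Python sketch below relies only on the line format shown above (an ESTAB line followed by a detail line containing mss:<value>); the function names and the way the output is collected are assumptions on my side.

```python
import re

MSS_FIELD = re.compile(r"\bmss:(\d+)")


def expected_mss(ip_mtu: int, tcp_options: int = 12) -> int:
    """MTU minus IPv4 header (20), TCP header (20) and TCP option bytes."""
    return ip_mtu - 20 - 20 - tcp_options


def bgp_session_mss(cli_output: str, bgp_port: int = 179) -> dict[str, int]:
    """Map each BGP peer address to the MSS shown for its TCP session."""
    sessions: dict[str, int] = {}
    current_peer = None
    for line in cli_output.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[0] == "ESTAB":
            local, peer = fields[3], fields[4]
            is_bgp = local.endswith(f":{bgp_port}") or peer.endswith(f":{bgp_port}")
            current_peer = peer.rsplit(":", 1)[0] if is_bgp else None
            continue
        match = MSS_FIELD.search(line)
        if match and current_peer:
            sessions[current_peer] = int(match.group(1))
            current_peer = None
    return sessions


SAMPLE = """\
ESTAB 0 0 172.16.3.10:24415 172.16.3.11:179 ino:78187 sk:ffff88011f352700
skmem:(r0,rb262144,t0,tb262144,f0,w0,o0) ts sack cubic wscale:2,2 rto:210 rtt:12.916/14.166 ato:40 mss:9164 cwnd:10 send 56.8Mbps rcv_space:18352
ESTAB 0 0 172.16.3.10:45719 172.16.3.12:179 ino:79218 sk:ffff880115de6800
skmem:(r0,rb262144,t0,tb262144,f0,w0,o0) ts sack cubic wscale:2,2 rto:203.333 rtt:3.333/1.666 ato:40 mss:9164 cwnd:10 send 220.0Mbps rcv_space:18352
"""

for peer, mss in bgp_session_mss(SAMPLE).items():
    status = "ok" if mss == expected_mss(9216) else "check MTU on the path"
    print(f"{peer}: mss {mss} -> {status}")
```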
Please always reset the BGP session when you change the MTU, as the MSS is only determined during the initial TCP three-way handshake.
The MSS value of 9164 bytes confirms that the underlay physical network is ready with an end-to-end MTU of 9216 bytes. But why is the MSS value of the BGP session (9164) 12 bytes smaller than the TCP MSS value (9176) exchanged during the TCP three-way handshake?
Again, in many TCP/IP stack implementations we see an MSS of 1460 bytes with an interface MTU of 1500 bytes, and an MSS of 9176 bytes with an interface MTU of 9216 bytes (40 bytes difference), but there are other factors that can change this. For example, if both sides support RFC 1323/7323 (enhanced timestamps, window scaling, PAWS***), the timestamps option adds 12 bytes to the TCP header, reducing the payload to 1448 bytes and 9164 bytes respectively.
And indeed, the NX-OS TCP/IP stack used for BGP supports the TCP timestamps option by default and leverages PMTUD (RFC 1191); it accounts for the 12 extra bytes and hence reduces the maximum payload (in our case, BGP) to an MSS of 9164 bytes.
The diagram below from a Wireshark sniffer trace confirms the extra 12 bytes used for the TCP timestamps option.
I hope you had a little bit of fun reading this small network readiness write-up.
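Where do the 12 bytes come from? The timestamps option itself is 10 bytes long (kind 8, length 10, two 32-bit timestamp values), and it is commonly preceded by two NOP padding bytes to keep the header 32-bit aligned. The Python sketch below packs that layout; it assumes this common NOP, NOP, Timestamps arrangement, and the function names are illustrative.

```python
import struct

TCP_OPT_NOP = 1        # one-byte padding option
TCP_OPT_TIMESTAMP = 8  # timestamps option kind
TCP_OPT_TIMESTAMP_LEN = 10


def timestamp_option_block(ts_val: int, ts_ecr: int) -> bytes:
    """Pack NOP, NOP, Timestamps(kind=8, len=10, TSval, TSecr)."""
    return struct.pack("!BBBBII", TCP_OPT_NOP, TCP_OPT_NOP,
                       TCP_OPT_TIMESTAMP, TCP_OPT_TIMESTAMP_LEN,
                       ts_val, ts_ecr)


def payload_per_segment(ip_mtu: int, tcp_options_len: int) -> int:
    """Data bytes per segment after IPv4 header, TCP header and TCP options."""
    return ip_mtu - 20 - 20 - tcp_options_len


options = timestamp_option_block(ts_val=1, ts_ecr=0)
print("option bytes per segment:", len(options))
for mtu in (1500, 9216):
    print(f"MTU {mtu}: payload {payload_per_segment(mtu, len(options))} bytes")
```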
** The 20-byte TCP header is only correct when no TCP header options are used. RFC 1323 (TCP Extensions for High Performance), which has been replaced by RFC 7323 (TCP Extensions for High Performance), defines TCP extensions that require up to 12 additional bytes.
*** PAWS = Protection Against Wrapped Sequences
vSphere version: VMware ESXi, 6.5.0, 15256549
vCenter version: 6.5.0, 10964411
NSX-T version: 2.5.1.0.0.15314288 (GA)
Cisco Nexus 3048 NX-OS version: 7.0(3)I7(6)
Version 1.0 - 23.03.2020 - first published version