- have to repost this, since yesterday the great new forum software did not seem to save the thread... -
I am managing a small NSX-T test environment, consisting of 3 ESXi hosts as transport nodes and an overlay transport zone, where a few overlay networks are configured.
The hosts have 4 x 1 Gbps physical uplinks (no 10 Gbps switches available atm...): two are connected to the VDS, the other two to the N-VDS.
The uplink profile applied to the ESXi hosts uses both N-VDS uplinks in a load-balancing teaming configuration, with the global MTU setting of 1600.
Two edge transport nodes are grouped in an edge cluster. On the edges, uplink-1 is connected to the transport VLAN for the overlay networks, while uplinks 2 and 3 are used for communication with the physical switches.
There is almost no load on the environment currently, be it compute, storage or network. The edge nodes are deployed in the Medium sizing configuration.
The problem I am facing is that transfer rates between VMs placed on the VDS in a physical VLAN and VMs placed in the overlay network are extremely bad. Only low-throughput traffic such as an interactive SSH session works; during file transfers over SSH or HTTP the rate drops as low as 30 kB/s before the connection is lost. Using the performance views in vCenter I did not notice any packet drops, neither on the hosts' vmnics nor on the VMs' NICs. Ping between components returns round-trip times of <1 ms.
Transfer rates between VMs inside the overlay network are fine, even when they are placed on different ESXi hosts in the cluster. I've also tried different source systems from the physical VLAN to do the transfer tests.
All VMs placed in the overlay, regardless of the ESXi host, seem to be affected. Transfers between systems placed in VLANs on the VDS are not negatively affected either.
All switch ports connecting the transport nodes are configured consistently with regard to VLAN trunks, with an MTU of 9216 bytes.
I've used https://spillthensxt.com/how-to-validate-mtu-in-an-nsx-t-environment/ as a guideline to check for MTU consistency across all components and could not find an issue. I use 1600 on the VDS and as the default value in the NSX-T uplink profiles applied to the transport nodes as well as the edge nodes.
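For anyone following along, the core of that guide's check is a don't-fragment ping through the overlay netstack at near-full Geneve size. A minimal sketch, where vmk10 and the target IP are placeholders for a local TEP vmkernel interface and a remote TEP:

```shell
# List the vmkernel interfaces to identify the TEP vmks and their netstack
esxcfg-vmknic -l

# Don't-fragment ping at near-full Geneve size through the overlay netstack.
# 1572 = 1600 (transport MTU) - 20 (outer IP header) - 8 (ICMP header).
# vmk10 and 192.168.130.12 are placeholders for a local TEP vmk and a
# remote TEP IP.
vmkping ++netstack=vxlan -d -s 1572 -I vmk10 192.168.130.12
```

If this succeeds while a larger payload (e.g. -s 1600) fails, the transport path honours the 1600-byte MTU as expected.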
I'm kind of at a loss as to what to troubleshoot next, and any tips are most welcome :)
Are the edge node TEPs and the ESXi TEPs in different subnets?
An NSX Edge VM can be deployed using VLAN-backed logical switches on the N-VDS of the host transport node; in that case, the host TEP and the NSX Edge TEP must be in different subnets.
Here are the 3 profiles applied on the transport nodes (ignore _VDS, it is not applied anywhere).
The first one is applied on the transport node n-vds uplinks (vmnic2 and vmnic3).
The second one on the TEP interface of the edge node (fp-eth0). The third on the uplink ports fp-eth1 and fp-eth2.
I've also just checked, and the transfer rate between overlay networks attached to different edge clusters is affected as well. The other edge cluster is configured identically to the one described above, except that its logical switch uses another subnet for addressing.
Also - AFAIK the transport node TEPs as well as the edge TEPs can use IPs from the same subnet; I could not find contradictory information. I know there is a restriction in NSX-T versions below 3.1 regarding the VLANs in the transport network, but that only applies when you use VDS 7 exclusively to channel all traffic. This is not the case here.
What sort of hardware are you running it on?
You are correct about the TEPs in 3.1, details on how that works can be found here https://www.lab2prod.com.au/2020/11/nsx-t-inter-TEP.html#more
How are you testing - have you run any iPerf tests, or just copied files back and forth?
Have you done any packet captures on the edge appliances and hosts whilst doing the transfers? Have you checked esxtop to see if anything looks off? What switches are you using - is your routing MTU set correctly?
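To make those suggestions concrete, here's roughly what I'd run; all IPs, interface names, and file paths below are placeholders:

```shell
# iperf3 isolates raw network throughput from the disk and protocol
# overhead of a file copy. 172.16.10.5 stands in for the overlay VM's IP.
iperf3 -s                        # on the overlay VM (server side)
iperf3 -c 172.16.10.5 -t 30     # on the VLAN-side VM: 30 s TCP test
iperf3 -c 172.16.10.5 -t 30 -R  # same pair, reverse direction

# Capture on a host uplink while the transfer runs (vmnic2 and the
# output file are placeholders; --dir 2 captures both directions on
# recent ESXi builds, use --dir 0/1 on older ones):
pktcap-uw --uplink vmnic2 --dir 2 -o /tmp/vmnic2.pcap
```

Comparing the TCP results against a UDP run (iperf3 -u) can also show whether loss and retransmits are behind the collapse.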
I used file copies to identify the problem. What would be the benefit of using iperf in this case?
The MTU is set correctly end to end, 1600 as per the documentation. Wouldn't I get dropped packets if the MTU were set incorrectly somewhere along the way, since Geneve frames cannot be fragmented?
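For what it's worth, the arithmetic backs that up; a quick check, assuming the base 8-byte Geneve header with no options:

```shell
# Size of a Geneve packet carrying a full 1500-byte inner MTU
inner_frame=$((1500 + 14))        # inner payload + inner Ethernet header
overhead=$((20 + 8 + 8))          # outer IPv4 + UDP + Geneve base header
echo $((inner_frame + overhead))  # prints 1550, within the 1600 transport MTU
```

The headroom above 1550 is there for optional Geneve headers, which is why 1600 is the usual minimum recommendation.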
Also, it does not seem to be an issue with the networking hardware, since the problem occurs even when the source and target systems are placed on the same ESXi host. Packets would not leave the host in that case.
Just trying to cover all bases; it can sometimes be difficult to pinpoint without logs and eyes on the environment. There is obviously something wrong somewhere.