SergeyRusak
Contributor

Hyperflex + VMware: a lot of retransmissions

Hi everyone!

Perhaps someone has faced the same issue. I found that some types of packets cause a huge number of retransmissions in our VMware environment.

1. SMB: we have a tunnel between offices. In one office we have Cisco Hyperflex with the latest firmware and a VMware 6.7 cluster. In the other office there is a dedicated server with the latest firmware and free ESXi 8. If I copy a file from a physical device, such as a laptop in the office where Hyperflex is located, through the tunnel to the remote office, it takes 3-7 seconds without hangs or speed drops. I took all captures on the core switch in the office with HPX.

laptop (Windows) -> WiFi AP -> access switch -> core switch -> router -> tunnel -> router -> distribution switch -> server -> ESXi 8 -> VM (Windows).

There are no retransmissions, just normal copying.

However, if I do the same thing from a VM in Hyperflex to the VM on ESXi:

VM (Windows) -> Hyperflex server -> UCS (2 UCSs with LACP per each) -> core switch -> router -> tunnel -> router -> distribution switch -> server -> ESXi -> VM (Windows).

I get so many retransmissions that copying the same file can take up to 30-40 minutes. We have MTU 9000 on all network devices (except the tunnels, of course). We also use TrustSec (which adds SGT overhead) and IPv4 only.

I tried using VMXNET3 and E1000 adapters, updating VMware Tools in Windows, changing physical connections (leaving only one UCS connected), changing the MTU on the vSwitches, and disabling TSO/LSO in the VM. Nothing helped. I also tried setting EnableBandwidthThrottling to 0 on the client VM, and this helped a little: fewer retransmissions, but still a lot. I also checked for errors on the physical ports on the UCS and the core switch, and they look clean.

From laptop:

 

42406 2023-05-04 16:35:13,813605 58.160514 10.0.242.122 10.3.100.5 TCP 1354 51590 → 445 [ACK] Seq=45218536 Ack=7033 Win=511 Len=1300 [TCP segment of a reassembled PDU]

42407 2023-05-04 16:35:13,813605 58.160514 10.0.242.122 10.3.100.5 TCP 1354 51590 → 445 [ACK] Seq=45219836 Ack=7033 Win=511 Len=1300 [TCP segment of a reassembled PDU]

42408 2023-05-04 16:35:13,813605 58.160514 10.0.242.122 10.3.100.5 TCP 1354 51590 → 445 [ACK] Seq=45221136 Ack=7033 Win=511 Len=1300 [TCP segment of a reassembled PDU]

 

From VM:

 

  20056 2023-05-04 17:35:30,786043    55.365382      10.3.100.5            10.0.100.55           TCP      66     [TCP Dup ACK 20055#1] 445 → 53430 [ACK] Seq=1715 Ack=4152732 Win=8196 Len=0 SLE=4239832 SRE=4245032

  20057 2023-05-04 17:35:30,786094    55.365433      10.0.100.55           10.3.100.5            TCP      1354   [TCP Retransmission] 53430 → 445 [ACK] Seq=4152732 Ack=1715 Win=6244 Len=1300

  20058 2023-05-04 17:35:30,786114    55.365453      10.0.100.55           10.3.100.5            TCP      1354   [TCP Retransmission] 53430 → 445 [ACK] Seq=4154032 Ack=1715 Win=6244 Len=1300
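For anyone comparing captures like the two above, here is a minimal Python sketch that counts Wireshark's bracketed TCP analysis notes in a plain-text export. The function name `count_tcp_flags` and the assumption that the export keeps the Info column on each line are mine. The sample lines are shortened copies of the ones in this post.

```python
import re
from collections import Counter

# Wireshark marks analysis results in the Info column in brackets,
# e.g. "[TCP Retransmission]" or "[TCP Dup ACK 20055#1]".
FLAG_RE = re.compile(
    r"\[TCP (Fast Retransmission|Spurious Retransmission|Retransmission|Dup ACK)\b"
)

def count_tcp_flags(lines):
    """Count retransmission and duplicate-ACK notes in exported capture lines."""
    counts = Counter()
    for line in lines:
        match = FLAG_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Shortened lines from the laptop and VM captures in this post.
sample = [
    "42406 10.0.242.122 10.3.100.5 TCP 1354 51590 → 445 [ACK] Len=1300 [TCP segment of a reassembled PDU]",
    "20056 10.3.100.5 10.0.100.55 TCP 66 [TCP Dup ACK 20055#1] 445 → 53430 [ACK] Len=0",
    "20057 10.0.100.55 10.3.100.5 TCP 1354 [TCP Retransmission] 53430 → 445 [ACK] Len=1300",
    "20058 10.0.100.55 10.3.100.5 TCP 1354 [TCP Retransmission] 53430 → 445 [ACK] Len=1300",
]

if __name__ == "__main__":
    for flag, n in count_tcp_flags(sample).items():
        print(flag, n)
```

Running the same count over the full laptop export and the full VM export gives a number for each path instead of an impression.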

 

I also tested the connection with iPerf using TCP and UDP packets with different MTU values from the HPX VM. I found that if I set the MTU to 1200 in the app, I get maximum speed and no problems. We do not block ICMP, so it shouldn't be a black hole issue with MTU adjustment.

Copying from the laptop to HPX inside the office, or from VM to VM inside the HPX cluster, works without problems.
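Since a size of 1200 works while larger packets struggle, it may help to find the exact largest size that still passes end to end with the Don't Fragment bit set. Below is a minimal sketch of a binary search for that size. The `probe` callable is hypothetical: it should return True if a packet of the given size gets through unfragmented. The `fake_probe` function and its limit of 1372 are invented purely so the example runs, and are not measurements from this network.

```python
def largest_passing_size(probe, low, high):
    """Return the largest size in [low, high] for which probe(size) is True.

    Assumes that if a size passes, every smaller size passes too.
    Returns None if even `low` fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid       # mid passes, the answer is mid or larger
        else:
            high = mid - 1  # mid fails, the answer is smaller
    return low

def fake_probe(size, limit=1372):
    """Stand-in for a real DF-bit test; the limit is an invented value."""
    return size <= limit

if __name__ == "__main__":
    print(largest_passing_size(fake_probe, 500, 8972))
```

A real probe could wrap a DF-bit ping from the HPX VM, for example `ping -f -l <size> <host>` on Windows. In that case the size is the ICMP payload, so the IP packet is 28 bytes larger (20 bytes of IPv4 header plus 8 bytes of ICMP header).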

 

2. The second issue is a lot of retransmissions for packets in the Hyperflex cluster between Cisco VMs (like ISE, FMC, WLC) and also between network devices.

For example:

a. WLC-VM (WiFi Controller) and ISE-VM are placed on different servers inside one HPX cluster. They connect to each other through the Core switch:

VM -> HPX server -> UCS -> Core -> UCS -> HPX server -> VM

I captured traffic on the core and found a lot of Duplicate Response (Access-Request) RADIUS messages. A LOT of them. I guess this is why our users have to wait some time for dot1x authorization. These messages are no larger than 1200 bytes. I tried connecting from the WLC to ISE over ssh and the traffic was clean: no retransmissions or DUPs. So I guess only particular types of packets (probably only UDP traffic) suffer.

The situation is the same when ISE sends CoA/RADIUS/TCP (REST) packets to network devices: a lot of DUPs and retransmissions, for example during SGT propagation. This causes deployment issues when devices start dropping incoming packets of that type and stop responding to ISE.

The third issue is with Cisco Firepower Management Center. It uses TCP for device connections, and again the situation is the same: a lot of retransmissions and DUPs. However, when I tried opening ssh from ISE to the Firepower devices, there were no issues.

There are no errors on the interfaces. I tried setting a lower MTU on the vSwitches and VM interfaces, and switching from VMXNET to E1000 and vice versa, with no result.
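To put a number on the RADIUS duplicates, one option is to group Access-Requests by source IP, destination IP, source UDP port and RADIUS Identifier, since a retransmitted request reuses the same Identifier from the same source address and port. The sketch below assumes the capture has already been reduced to tuples of those four fields. That input format, the function name and the sample addresses are hypothetical.

```python
from collections import Counter

def count_duplicate_requests(packets):
    """Return {(src, dst, src_port, identifier): extra_copies} for repeated requests.

    `packets` is an iterable of (src_ip, dst_ip, src_port, radius_identifier)
    tuples, one per Access-Request seen in the capture.
    """
    seen = Counter(packets)
    # The first copy is the original request; every further copy is a duplicate.
    return {key: n - 1 for key, n in seen.items() if n > 1}

# Hypothetical sample using documentation addresses, not data from this network.
sample = [
    ("192.0.2.10", "192.0.2.20", 32770, 17),
    ("192.0.2.10", "192.0.2.20", 32770, 17),  # retransmission of Identifier 17
    ("192.0.2.10", "192.0.2.20", 32770, 17),  # second retransmission
    ("192.0.2.10", "192.0.2.20", 32770, 18),
]

if __name__ == "__main__":
    for key, extra in count_duplicate_requests(sample).items():
        print(key, extra)
```

The Identifier is a single byte and wraps around, so on a long capture the grouping should also be limited to a short time window. Otherwise unrelated requests that reuse an Identifier would be counted as duplicates.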

I would be grateful for any suggestions and ideas.

Thanks!
