VMware Cloud Community
miszcz
Contributor

Significant network performance drop for FT enabled VM

Hey people,

I'm currently running some tests to determine whether a certain setup is viable: a VM should be protected with FT, but it will also carry quite a bit of network traffic, as it is used as a firewall. The expected maximum throughput for this firewall VM is ~1.5 Gbps.

While doing some performance testing, I noticed a major drop in network performance when the VM was FT enabled.

The test setup is basically like this:

Client(s) >---data stream---> Firewall VM >---data stream---> Server(s)

The ESXi hosts run vSphere 6.5 U1 and are connected to the network switches with 2x 10 Gbps NICs. The FT link is currently not a dedicated link but a port group on the same vSwitch that also handles the "live" (test) traffic. (In an operational environment we would use a dedicated FT link between the hosts, but I didn't set one up for the tests.)

The data stream between client and server which was going through the firewall was a simple netcat of /dev/zero:

  • on the client: nc -nv <server ip> <port> </dev/zero >/dev/null
  • on the server: nc -nv -l <port> >/dev/null
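(In case anyone wants to reproduce this: on Linux you can read the achieved rate directly by putting dd in front of the client-side nc; dd prints bytes copied and the rate on stderr when it finishes. Placeholders are the same as in the commands above.)

```shell
# Same data stream as above, but with a fixed volume so dd reports the rate at the end
# (<server ip> and <port> are placeholders, as in the original commands):
dd if=/dev/zero bs=1M count=10000 | nc -nv <server ip> <port>
```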

Without FT, the network throughput from Client to Server was ~3.5 Gbps.

With FT enabled for the Firewall VM, the throughput dropped to about 650 Mbps!

Out of curiosity, I added another client and another server with the same setup.

This time, without FT I was seeing ~7 Gbps throughput on the Firewall VM.

With FT enabled, the drop was even more severe: the throughput did not go beyond 720 Mbps.

That's a performance drop of roughly 90%!
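Just to make the arithmetic explicit (figures copied from the measurements above, in Mbps):

```shell
# Single client/server pair: 3500 Mbps without FT, 650 Mbps with FT
no_ft=3500; ft=650
drop=$(( (no_ft - ft) * 100 / no_ft ))
echo "single pair: ${drop}% drop"     # ~81%

# Two pairs: 7000 Mbps without FT, 720 Mbps with FT
no_ft2=7000; ft2=720
drop2=$(( (no_ft2 - ft2) * 100 / no_ft2 ))
echo "two pairs: ${drop2}% drop"      # ~90%
```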

According to the white paper "vSphere 6 Fault Tolerance architecture and performance" there is indeed a serious network performance drop when a large amount of data is transmitted (especially over 10 Gbps), but in Figure 4 the white paper lists 2.4 Gbps for an FT-enabled VM (compared to 9.5 Gbps with FT disabled). That's still a factor of ~3.5 above the maximum I was seeing.

Any idea why FT is causing such a massive performance drop? The throughput on the FT vmkernel interface itself was ~1800 Mbps (compared to the 650-700 Mbps on the live interfaces), so in total there was never more than ~2.5 Gbps going over the 10 Gbps NICs at any time with FT enabled. Disable or suspend FT and the throughput instantly jumps back to 3.5 Gbps (for one client/server pair). I guess I could saturate the 10 Gbps link with three client/server pairs, but that would only make the relative drop with FT worse.
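For what it's worth, here is my back-of-envelope accounting of the link load with FT enabled (numbers from my measurements above):

```shell
# Approximate traffic with FT enabled, in Mbps:
ft_vmk=1800    # FT logging traffic on the FT vmkernel interface
live=700       # live (test) traffic through the firewall VM
link=10000     # 10 Gbps NIC

total=$(( ft_vmk + live ))
echo "total: ${total} Mbps of ${link} Mbps"   # ~2.5 Gbps, link far from saturated
```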

In a real deployment we would use a dedicated FT link between the ESXi hosts, but I cannot imagine that bringing much improvement to my test setup, since the 10 Gbps link is clearly far from saturated.

BTW, Jumbo frames are enabled on the vSwitch and the underlying network hardware (I also verified that jumbo frames are transmitted over the FT link).
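(If anyone wants to repeat the jumbo frame check: a vmkping with the don't-fragment flag from the ESXi shell is the usual way; vmk1 and the target IP below are placeholders for your FT vmkernel interface and the remote FT vmk address. 8972 = 9000-byte MTU minus 28 bytes of IP/ICMP headers.)

```shell
# Verify that 9000-byte frames pass the FT link without fragmentation:
vmkping -d -s 8972 -I vmk1 <remote FT vmk IP>
```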

No dropped packets were observed with esxtop or in the VMs themselves.

Maybe someone has a good idea where the bottleneck might be and how to get past it.

Thanks in advance for any help / hints.

Michael
