VMware Cloud Community
Ejnan
Contributor

Isolated network between two VMs with random latency spikes

Environment:

  • ESXi 7.0.0 U3
  • CPU: 2 × 18 cores @ 2.60 GHz
  • 32 Linux VMs, each with:
    • 1 vCPU (2600 MHz CPU reservation)
    • 1 GB RAM (1 GB memory reservation)
    • 2 vNICs (VMXNET3)
    • Latency Sensitivity set to High
  • Followed the Best Practices for Performance Tuning of Latency-Sensitive Workloads in vSphere VMs (the per-VM part is sketched as .vmx entries below):
    • Static High/Max-Performance power policy
    • Disabled processor C-states, including C1E
    • Disabled chipset power management
    • Disabled NUMA
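
The per-VM part of this setup corresponds roughly to the following .vmx entries (a sketch using the key names from the vSphere latency-tuning paper and the values listed above; the annotations after # are notes for this post, not .vmx syntax, so verify the keys against your build):

    sched.cpu.latencySensitivity = "high"    # Latency Sensitivity: High
    sched.cpu.min = "2600"                   # full CPU reservation in MHz
    sched.mem.min = "1024"                   # full memory reservation in MB
    ethernet1.virtualDev = "vmxnet3"         # the 2nd (time-sensitive) vNIC
    ethernet1.coalescingScheme = "disabled"  # per the paper: no virtual interrupt coalescing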

 

Usage:

There are always 2 VMs connected via a virtual switch (2nd vNIC). This virtual network connection is used only for communication between the two VMs. 500 packets with a size of 780 bytes are sent per second (one every 2 ms). If 4 frames remain unanswered by the other VM, the connection is terminated, which means the other VM has to respond within 8 ms at the latest.
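
A minimal sketch of that timing logic (hypothetical send/poll callables standing in for the real application; the point is that 4 unanswered frames × the 2 ms send interval give the 8 ms deadline):

    import time

    SEND_INTERVAL = 0.002   # one 780-byte packet every 2 ms (500 packets/s)
    MAX_UNANSWERED = 4      # 4 unanswered frames -> 4 x 2 ms = 8 ms deadline

    def run_link(send_frame, response_arrived):
        # send_frame() transmits one frame; response_arrived() reports whether
        # a reply came in since the last send. Both are hypothetical stand-ins.
        unanswered = 0
        while True:
            send_frame()
            time.sleep(SEND_INTERVAL)
            if response_arrived():
                unanswered = 0
            else:
                unanswered += 1
                if unanswered >= MAX_UNANSWERED:
                    raise ConnectionError("peer silent for 8 ms, terminating link")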

 

Issue:

In a 195-hour test, 8 connection losses were recorded because the limit of 4 unanswered packets was exceeded.

Looking at a Wireshark capture (Figure 1) taken with tcpdump, you can see that the messages are sent, but incoming messages are not received (the other VM received and sent the messages correctly).

Nos. 124367 to 124374 show the normal behavior. From No. 124375 on, the controller (09:11:ff) gets no response, so an alarm is raised. After approximately 17 ms (No. 124380), the 5 messages from the device (df:98:e3) come in within 4 µs.

Ejnan_0-1655794170338.png
Figure 1: Wireshark capture 
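
To find such stalls without scrolling through the whole capture, a script along these lines can flag large inter-arrival gaps from the device (a sketch in Python with scapy, which is an assumption, as any pcap library would do; the MAC below is a placeholder since the capture only shows the abbreviated address):

    from scapy.all import rdpcap, Ether

    DEVICE_MAC = "00:00:00:df:98:e3"  # placeholder for the device's full MAC
    GAP_THRESHOLD = 0.008             # flag anything beyond the 8 ms deadline

    frames = [p for p in rdpcap("capture.pcap")
              if p.haslayer(Ether) and p[Ether].src == DEVICE_MAC]
    for prev, cur in zip(frames, frames[1:]):
        gap = float(cur.time - prev.time)
        if gap > GAP_THRESHOLD:
            print(f"stall: {gap * 1000:.1f} ms gap before frame at t={float(cur.time):.6f}")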

 

Following the Best Practices for Performance Tuning reduced the connection losses (especially the power/performance settings and disabling NUMA).

If I look at the worlds of the VMs via esxtop, I can see that the %RDY time of every VM increases with the number of active VMs. In detail, the NetWorld %RDY time increases with each connected VM.
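
esxtop's batch mode can log these counters to CSV for offline filtering, e.g. with esxtop -b -d 2 -n 300 > rdy.csv on the host and then a small script like this (a sketch; the exact column naming in the batch CSV varies between ESXi builds):

    import csv

    # Pull the "% Ready" columns that belong to NetWorld worlds out of the log.
    with open("rdy.csv", newline="") as f:
        rows = list(csv.reader(f))

    header = rows[0]
    cols = [i for i, name in enumerate(header)
            if "NetWorld" in name and "% Ready" in name]
    for row in rows[1:]:
        print(", ".join(row[i] for i in cols))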

 

Are there any network settings to improve the network performance for an isolated network between two VMs?

Is there a way to get this behavior without the latency-sensitive mode?

6 Replies
DavoudTeimouri
Virtuoso

Hi,

Please send information about your server hardware and let us know how many VMs are running on the host.

-------------------------------------------------------------------------------------
Davoud Teimouri - https://www.teimouri.net - Twitter: @davoud_teimouri Facebook: https://www.facebook.com/teimouri.net/
Ejnan
Contributor

- 2 × Intel(R) CPUs, 18 cores each @ 2.60 GHz

- 512 GB RAM (DDR4-2933)

- 32 Linux VMs (1 vCPU per VM) running on the host. 1st vNIC: all on the same vSwitch; 2nd vNIC: 2 VMs connected to each other, i.e. 16 connections on 16 separate vSwitches.

(Note: only the traffic on the 2nd vNIC is time-sensitive)

 

DavoudTeimouri
Virtuoso

Make sure the host is not overloaded, especially on CPU. CPU waits can have a major impact on virtual machine and ESXi host network performance.

Also remove any reservations and limits and then test it again.

Ejnan
Contributor

The CPU load is under 10%. There is not even pCPU overcommitment.

Removing all reservations and limits makes the "latency sensitive" mode unavailable, and doing so increases the number of connection losses.

The wait value of the vCPU is at 0.00%, but the wait value of the NetWorld world is at 99-100%.

But in my opinion, the ready time is the relevant metric.
Is it possible to increase the number of VMkernel I/O threads?
Is it possible to increase the number of network queues?
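
For the I/O thread question, what I have in mind is something like the per-vNIC transmit thread option from VMware's tuning paper (a sketch for the time-sensitive 2nd vNIC; the # line is an annotation, not .vmx syntax, and I have not verified that it helps the receive path):

    # give this vNIC a dedicated transmit world instead of a shared one
    ethernet1.ctxPerDev = "1"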

DavoudTeimouri
Virtuoso

This document will be useful, if you haven't read it yet: https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/vmw-tuning-latency-sensi...

Ejnan
Contributor

I have already read this document. That is why I set the power management, NUMA, and latency sensitivity settings.

These settings helped a lot to reduce the connection losses (I started with connection losses every 60 seconds).
Now there are only 8 connection losses in 200 hours.

These connection losses are random. I know that I get more connection losses when more VMs are powered on, but in total I only used 32 vCPUs out of 72 vCPUs (36 pCPUs).

 

Is there a way to prioritize traffic, so that the 2nd vNIC is independent of the traffic on the 1st vNIC?

Is it possible to increase the number of VMkernel I/O threads?
