VMware Cloud Community
sandhuramandeep
Enthusiast

Max CPU Reservation with Latency-sensitivity High

Hello Experts,

Machine details:

Physical cores: 32 cores x 2.9 GHz

Hyperthreading: Enabled, 64 logical cores

Guest VM1: 32 vCPU machine

We are trying to run a real-time application on a VM created using VMWare. Based on some recommendations, we enabled Latency Sensitivity = High. This also requires a 100% CPU reservation for the guest VM, i.e. 32 x 2.9 GHz. However, this gives an error saying that we cannot reserve this much CPU for the guest VM.

On further analysis, we saw that the Host capacity was shown as 92.8 GHz (32 x 2.9 GHz). Now, since we have enabled hyperthreading on the host, shouldn't the capacity of the host be 64 x 2.9 GHz? Are we missing anything here?

Thanks in advance.

Ramandeep Sandhu

 

vbondzio
VMware Employee

HT doesn't count towards capacity (nor does Turbo Boost), so the maximum is cores x nominal frequency. ESXi reserves some CPU for itself too, so you won't be able to reserve everything for VMs; you should be able to reserve 28-29 vCPUs. Latency Sensitivity disables HT for the CPUs it schedules the vCPUs on anyhow, so even if HT did add reservable capacity (on top of the roughly 15%-30% extra maximum throughput it provides), that would be of no use: LS=High is about delivering deterministic performance / reducing jitter, and that doesn't mix well with HT.
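To put rough numbers on your host (the exact overhead varies with build and with services like vSAN or NSX, so treat this as a back-of-the-envelope sketch rather than a guarantee):

    host capacity        = 32 cores x 2.9 GHz   = 92.8 GHz
    ESXi / system worlds ~ 3-4 cores' worth     ~ 8.7-11.6 GHz
    reservable for VMs   ~ 81-84 GHz            => 28-29 vCPUs fully reserved at 2.9 GHz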

Check out https://www.vmworld.com/en/video-library/video-landing.html?sessionid=1527791508875001ekbt&region=EU from around the 19:00 minute mark if you want to learn more about the feature.

P.S.
VMware, lower case w please.

sandhuramandeep
Enthusiast

Thanks a lot for the reply!!

I was also referring to a tech paper at https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/performance/media-worklo... which recommends enabling LS = High while at the same time keeping Hyperthreading enabled in the host BIOS.

So are you saying that even with 64 HT cores in the host, I should not create a VM with LS = High and vCPUs > 30? Or, in other words, that the sum total of vCPUs across all my VMs on a 64 HT core host should not be greater than 30?

Thanks for your reply.

P.S 

VMware, lower case w please. - I will remember this now 🙂

vbondzio
VMware Employee

In that case, the recommendation to enable HT is given because you want to increase the "runnable" (not reservable) capacity of the CPUs that aren't blocked for LS=High. You not only shouldn't but also _can't_ create LS=High VMs larger than the number of cores you can reserve; you won't be able to power one on without a full vCPU reservation. So 28-29 is the maximum, assuming you don't use vSAN or NSX. Reserving CPU isn't the same as using it: while most of the 2-3 core reservation is taken by user worlds (e.g. vpxa / hostd / logger), it also keeps headroom for IO, which isn't free. If I were you, I'd create a LS=High VM with 28 vCPUs (14 cores per socket).
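Purely as an illustration, that configuration ends up looking roughly like the .vmx entries below (you'd normally set all of this through the vSphere Client; sched.cpu.min is the reservation in MHz, i.e. 28 x 2900 = 81200, sched.mem.min is the memory reservation in MB, and the 64 GB memory size is just an example - LS=High needs the full memory reservation as well):

    numvcpus = "28"
    cpuid.coresPerSocket = "14"
    sched.cpu.latencySensitivity = "high"
    sched.cpu.min = "81200"
    memsize = "65536"
    sched.mem.min = "65536"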

sandhuramandeep
Enthusiast

Thank you @vbondzio .

The video was very helpful!!

We are actually porting our real-time MPEG2-TS multicast packet monitoring system to run on VMware. We are using Intel DPDK for packet capture at line rates of around 8 Gbps. We have previously faced packet loss issues with VMXNET3, so currently we are trying to do this using Direct Pass-through. SR-IOV is another option we are exploring.
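For context, inside the guest we hand the passed-through NIC to DPDK roughly like this (the PCI address is just a placeholder, and depending on whether a virtual IOMMU is exposed to the guest we use vfio-pci or igb_uio):

    # list NICs and the drivers they are currently bound to
    dpdk-devbind.py --status
    # unbind from the kernel driver and bind the device for DPDK use
    dpdk-devbind.py --bind=vfio-pci 0000:0b:00.0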

Any resources (videos or tech papers) that you can point us to will be of great help!

Thanks once again.

vbondzio
VMware Employee

You might want to check if vmxnet3 works well enough with LS=High; if not, passthrough and DPDK certainly will (or, if you want to go NSX-T, use ENS). Since 6.5 at the latest, you won't be able to differentiate passthrough from bare metal: https://www.vmware.com/files/pdf/techpaper/vmware-fdr-ib-vsphere-hpc.pdf

Given the workload, you might also want to check out: https://docs.vmware.com/en/VMware-vCloud-NFV/2.0/vmware-tuning-vcloud-nfv-for-data-plane-intensive-w...

sandhuramandeep
Enthusiast

@vbondzio any clue how much line rate the ESXi network stack can handle with an MTU of 1500 bytes? We are seeing packet loss @ 4 Gbps line rate. Our vNIC (vmxnet3 with DPDK) is capturing all packets it is getting, but we suspect that ESXi is unable to forward all packets to our vNIC. Any way to identify these packet drops at the ESXi level? I tried esxtop with 'n' but saw 0% DRPRX. We are using ESXi 7.0 U2.

Will sched.cpu.latencySensitivity.sysContexts be of any help here? It looks like this reserves a pCPU for the Rx and Tx threads at the ESXi level for a particular VM. Is this outside the CPU reservation set for the guest VM?

vbondzio
VMware Employee

Are you sure you are using RSS? 4 Gbps is what you get with about one core ... Yes, sysContexts (a stop-gap solution until we introduced ENS with NSX-T) needs its own full core reservation; it's basically latency sensitivity = high for the IO networlds. I don't think you need that just yet though ... check RSS.
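To see what the driver is currently doing on the x710, something along these lines from the ESXi shell (parameter names differ between driver versions, so take the exact strings with a grain of salt, and vmnic0 is just a placeholder for the uplink carrying the stream):

    # module parameters currently set for the NIC driver (look for RSS / DRSS)
    esxcli system module parameters list -m i40en
    # current and maximum supported ring sizes for the uplink
    esxcli network nic ring current get -n vmnic0
    esxcli network nic ring preset get -n vmnic0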

sandhuramandeep
Enthusiast

You mean RSS at the ESXi host level, right? I am unable to figure this out. As per my understanding, we do not need RSS at the guest OS level, as we are using DPDK, which runs on a single pCPU in polling mode.

sandhuramandeep
Enthusiast

Btw, I was able to enable DRSS and also increased the ring buffer size for the physical NIC Intel x710.
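For reference, this is roughly what we ran on the host (the DRSS parameter name and syntax came from the docs for our i40en driver version, so it may differ elsewhere; the module parameter change needed a reboot to take effect):

    # enable 4 device RSS (DRSS) queues on the i40en driver
    esxcli system module parameters set -m i40en -p "DRSS=4"
    # bump the rx ring of the x710 uplink to 4096 descriptors
    esxcli network nic ring current set -n vmnic0 -r 4096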

Unfortunately, we are still seeing occasional packet loss.

Then, based on your pointers to ENS, I was going through this document from Intel and VMware - https://www.dpdk.org/wp-content/uploads/sites/35/2018/06/Accelerating-NFV-with-VMwares-Enhanced-Netw...

We only have VMware ESXi 7.0 installed - do you think we can set up an ENS vSwitch on the host without any other software from VMware? I understand we will also need to use the i40en_ens driver for the Intel x710 pNIC.

Thanks in advance. Really appreciate your help. With all these resources shared above, we are at least able to achieve lossless packet capture with Direct Pass-through 🙂

vbondzio
VMware Employee

At what rate are you seeing the packet loss with RSS and DPDK? How many rx queues / CPUs doing DPCs / bottom halves?

You can always open an SR so that we can look at where a potential bottleneck lies ... the answer might still be direct passthrough or ENS (and NSX-T for the latter). If you don't have NSX-T, LS sysContext might do the job for you though.

sandhuramandeep
Enthusiast

We have set DRSS to 4 and the ring buffer to 4096. Still observing the network. Will file an SR.

Thank you for your guidance so far !!

vbondzio
VMware Employee

Ping the SR# to my username at vmware.com and I'll try to have a look.

sandhuramandeep
Enthusiast

Hello @vbondzio, we are facing some issues with filing a ticket on the support portal. While we get that resolved, I wanted to share some updates with you.

Scenario 1:

32-physical-core host, VM-1 with 16 cores and LS = High, ring buffers and RSS configured at the physical NIC level

  • No packet loss observed in 16-core VM-1; packet rate: 4 Gbps

Scenario 2:

32-physical-core host, VM-1 with 16 cores and LS = High, VM-2 with 10 cores and LS = Normal, ring buffers and RSS configured at the physical NIC level

  • VM-2 with 10 cores is run with stress --cpu 10 to hog all CPUs in this VM
  • Intermittent packet loss observed in VM-1; packet rate: 4 Gbps

Scenario 3:

32-physical-core host, VM-1 with 16 cores and LS = High, VM-2 with 10 cores and LS = Normal, ring buffers and RSS configured at the physical NIC level, plus additional settings.

  • Additional settings - SplitRx mode is explicitly enabled on the vNIC of VM-1 (ethernet1.emuRxMode = 1), and sched.cpu.latencySensitivity.sysContexts is set to 4.
  • VM-2 with 10 cores is run with stress --cpu 10 to hog all CPUs in this VM
  • In this case, we do see an additional process in VM-1: NetWorld-Devxxx-Rx, which is consuming CPU cycles. This is introduced with SplitRx enabled. Further, with the sysContexts setting, we are able to set exclusive affinity for this process.
  • No packet loss observed in VM-1, at least in overnight tests.

I could not find much documentation explaining the NetWorld-Devxxx-Rx process. The benefit we are seeing here seems logical, as an exclusive CPU thread is being used for Rx. Is there a way to increase the count of such processes? As I understand it, we can use ctxPerDev for transmit threads, but I could not find anything for Rx threads. Do you think we are moving in the right direction?
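For reference, the two Scenario 3 tweaks boil down to these .vmx / advanced configuration entries, with the values exactly as described above:

    ethernet1.emuRxMode = "1"
    sched.cpu.latencySensitivity.sysContexts = "4"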

Thanks in advance.

 

vbondzio
VMware Employee

Ok, so what is the exact goal here, and what would you ultimately be happy with? I assume it isn't Scenario 1? Given you mention Scenario 2, are you asking how to protect Scenario 1 from noisy neighbors? Ignoring SplitRx for a second, that protection in Scenario 3 comes from the sysContexts setting; it sets exclusive affinity for IO worlds (above 60% utilization).

Are you saying that you don't see the same result (no packet loss) when SplitRx isn't set? IIRC, emuRxMode defaults to 2, which means "look at the host default", i.e. enabled automatically at a certain multi/broadcast packet rate. It really should only matter if you had multiple VMs receiving the same stream; can you force-disable it by setting it to 0 and re-test?
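I.e. for the re-test, something along these lines on VM-1's vNIC (same advanced-setting mechanism you used in Scenario 3, adjust the ethernetX index if needed):

    ethernet1.emuRxMode = "0"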

sandhuramandeep
Enthusiast

You are correct - our goal is to run this VM unaffected by other VMs running on the host. That will be the ideal deployment scenario for our customers who want to run our App on VMs.

Running with only the sysContexts setting results in packet loss. Moreover, since sysContexts is a best-effort setting, we do not even get to know whether the exclusive affinity was set or not. But with SplitRx mode ON, the NetWorld-Dev-Rx thread shows affinity set to a particular pCPU (again, whether it is exclusive or not I am not sure). However, one time I also observed that NetWorld-Dev-Rx had its affinity set as 0-31; this was the case when the 16-vCPU VM was assigned all pCPUs from a single NUMA node. Some documentation says: "If virtual machines with full reservations are deployed on the same NUMA node, then in order to bring up that VM the sched.cpu.latencySensitivity.sysContexts setting will be disregarded and aborted allowing the cores associated with this setting will be made available for other virtual machines."

Btw, when you say "it will set exclusive affinity for IO worlds (above 60% utilization)":

  • What does 60% utilization refer to? Is it CPU utilization of the sys context thread?
  • How can I identify the IO worlds for a VM? Any command that can help?
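(For context, the outputs attached below come from sched-stats and esxtop on the host; in esxtop I expand the VM's scheduling group to see its individual worlds, roughly like this - keystrokes from memory, so they may be slightly off:)

    # interactive esxtop on the ESXi host
    esxtop
    #   c  -> CPU view
    #   e  -> expand a group, then enter the VM's GID
    #   the expanded group lists the vCPU worlds plus the NetWorld-* / vmx helper worlds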

Attaching some screenshots for your reference:

[Attachment: sandhuramandeep_0-1619015943160.png - sched-stats output for the 16-core VM]

[Attachment: sandhuramandeep_1-1619016034945.png - esxtop output for the 16-core VM]

sandhuramandeep
Enthusiast

@vbondzio here is the Support Request # 21216216104
