ESXi 'CtxPerDev' - maximal throughput and affinity

nio1 · ‎11-02-2021

Hello Experts,

In ESXi, there is an option to assign some advanced parameters to a VM.

I am interested in a particular parameter called "CtxPerDev".
To my understanding, this parameter controls the number of 'vmkernel networking threads' associated with the guest: 1 for the whole guest, 1 for each virtual NIC, or 2-8 for each virtual NIC.

To my knowledge, 'vmkernel networking threads' refers to the host's queues that are responsible for transmitting packets from the guest's virtual NIC to the host's physical NIC.

I have two questions about those queues that I couldn't find answers to.

1. How can I determine the maximal throughput that each queue can handle?
I would like to know this so that I can set the "CtxPerDev" parameter according to the expected VM throughput.

2. Is there a way to control those queues' affinity so that they are associated with specific NUMA node(s)?
Is there a way to see the current affinity of those queues?
In 'esxtop' they appear as 'NetWorld-Dev-**', but there isn't any information about their affinity.

Thanks for the help!

vbondzio · ‎11-02-2021

1. That depends on the kind of IO, the underlying CPU, etc. Test and check the world's utilization; once it reaches 100%, it is maxed out.
2. That is taken care of by the relationship scheduling ESXi does; you can see the CPU it is on in esxtop or with sched-stats. The default affinity will be a so-called "soft affinity" to a NUMA node. If you want to set a manual affinity (not really recommended), you'd have to use vsi.
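
As a rough illustration of point 1, the check can be sketched in Python. This is only a sketch under stated assumptions: the world names and readings are made-up values typed in by hand (for example, read off the esxtop screen), and the 95% threshold is an arbitrary stand-in for "close to 100%", not a VMware-defined limit.

```python
from statistics import mean


def saturated_worlds(samples, threshold=95.0):
    """Return the names of worlds whose average utilization is >= threshold.

    samples: dict mapping a world name to a list of utilization readings
             in percent (hypothetical, hand-collected data).
    threshold: assumed cut-off for "effectively maxed out".
    """
    return sorted(
        name
        for name, readings in samples.items()
        if readings and mean(readings) >= threshold
    )


# Hypothetical readings for two networking worlds.
samples = {
    "NetWorld-Dev-A": [99.1, 98.7, 99.5],
    "NetWorld-Dev-B": [41.0, 38.2, 45.9],
}

print(saturated_worlds(samples))
```

A world that shows up in this list under the test load is the one limiting throughput; the packet rate measured at that point is the practical per-thread ceiling for that traffic type and CPU.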

What is the purpose here? If you are a VMware (Telco?) partner, you might want to get in touch with the relevant team for enablement.

nio1 · ‎11-03-2021

Hi vbondzio,

Thank you for your reply.

I have a VM deployed on an ESXi host.
The VM functions as a router and receives its traffic from outside the host: from a physical NIC to the VM, and from the VM to a second physical NIC.
Assuming that we are not using SR-IOV, I want to find a set of host and guest configurations
that gives the guest the optimal infrastructure.

As for CtxPerDev, I noticed that its default value is 2, meaning the whole VM will receive only one vmkernel thread.
This can be a bottleneck, since the throughput will be limited by a single vmkernel thread, that is, by the maximum packet rate that one physical CPU core (the one the thread runs on) can handle.

Therefore, from what I read, I can resolve this by configuring multiple vmkernel threads:
ethernetX.ctxPerDev = 1 - Means the VM will receive a thread per vNIC.
ethernetX.ctxPerDev = 3 - Means the VM will receive 2 to 8 threads per vNIC.
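
To make the intended configuration concrete, here is a small Python sketch that renders the per-vNIC line for a .vmx file. The value meanings in the table are the ones stated in this post and are not verified against VMware documentation; the helper name and the quoted `key = "value"` output style are illustrative assumptions.

```python
# Meanings as described in this post (unverified assumption).
CTX_PER_DEV_AS_DESCRIBED = {
    1: "one thread per vNIC",
    2: "one thread for the whole VM",
    3: "2 to 8 threads per vNIC",
}


def ctx_per_dev_line(vnic_index, value):
    """Render the ctxPerDev setting for vNIC number vnic_index."""
    if value not in CTX_PER_DEV_AS_DESCRIBED:
        raise ValueError(f"unexpected ctxPerDev value: {value}")
    return f'ethernet{vnic_index}.ctxPerDev = "{value}"'


# Example: request multiple threads on the first two vNICs.
for index in (0, 1):
    print(ctx_per_dev_line(index, 3), "#", CTX_PER_DEV_AS_DESCRIBED[3])
```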

For performance reasons, I want to make sure that those new threads have affinity to a specific NUMA node, namely the same NUMA node on which the VM's vCPUs are configured (according to the parameter 'numa.nodeAffinity').

Unfortunately, I did not find any reference, including the VMware docs, that explains what default behavior to expect
(maybe they are on the same NUMA node by default?) or how to check those affinities manually.
I am not very familiar with the 'esxtop' tool and not familiar at all with the 'sched-stats' or 'vsi' tools, and given the ID of a 'NetWorld-Dev' world,
I do not know how to check its affinity explicitly, using those tools or any other tool.

In addition, for latency reasons, I am also interested in finding a way to make each NetWorld-Dev (vmkernel thread)
consume its entire allocated physical core, so that it won't share the core with other processes.
For example, setting the parameter 'monitor_control.halt_desched' to FALSE achieves the same thing, but only for the VM's vCPUs.

Thank you for your help!

vbondzio · ‎11-17-2021

I'm pretty sure the default is 0, but 2 (or the bitmap equivalent) isn't doing anything; it's been a while since I checked, though. 3 means one TX thread per vNIC queue. ESXi has a relationship scheduler, which means it would place the VM and its IO worlds on the NUMA node the physical IO device is attached to anyhow. That only happens at a certain load and isn't guaranteed in a highly dynamic and committed scenario, but it is usually reliable enough.

You cannot assign a direct affinity for just those worlds via a vmx option, but they are affinitized together with the vCPUs and other VM worlds to a specific NUMA node (numa.nodeAffinity = 0 or 1 etc.). The only (unsupported) option to assign a specific affinity at runtime is via vsi(sh), and it would be ephemeral, i.e. it would not survive a reboot / suspend / resume etc.

You can see the CPU a world is scheduled on in esxtop: in the default CPU view (c), select summary stats (i) in the field selector (f). You need to expand (e) groups (GID) to see the individual worlds.

monitor_control.halt_desched = false only means the vCPU won't be descheduled when it halts; it doesn't imply any guarantee / entitlement. Latency Sensitivity = High does a lot more. What you want is exclusive affinity, which LS=High does automatically for vCPUs. If you want that for networlds, you need to run the VM with LS=High and "sched.cpu.latencySensitivity.sysContexts = <max number of IO worlds>". Note that once a world utilizes a CPU for more than 60%, it will try to reserve a core's worth from the system pool; if that succeeds, the world will have exclusive affinity (nothing else can run on that core except that world).
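
The options mentioned above can be collected into a short Python sketch that renders the corresponding .vmx lines. Treat it as an illustration only: the spelling `sched.cpu.latencySensitivity = "high"` is assumed to be the vmx form of "LS=High", the function names are hypothetical, and the 60% figure is simply the one quoted above.

```python
EXCLUSIVE_TRIGGER_PCT = 60.0  # utilization figure quoted in the reply above


def latency_vmx_lines(max_io_worlds, numa_node=None):
    """Render vmx lines for LS=High plus exclusive affinity for IO worlds.

    max_io_worlds: maximum number of IO worlds that may get their own core.
    numa_node: optional NUMA node for numa.nodeAffinity (omitted if None).
    """
    if max_io_worlds < 1:
        raise ValueError("max_io_worlds must be at least 1")
    lines = [
        # Assumed vmx spelling of "Latency Sensitivity = High".
        'sched.cpu.latencySensitivity = "high"',
        f'sched.cpu.latencySensitivity.sysContexts = "{max_io_worlds}"',
    ]
    if numa_node is not None:
        lines.append(f'numa.nodeAffinity = "{numa_node}"')
    return lines


def would_request_exclusive_core(util_pct):
    """True if a world's utilization is above the quoted 60% trigger."""
    return util_pct > EXCLUSIVE_TRIGGER_PCT


for line in latency_vmx_lines(max_io_worlds=4, numa_node=0):
    print(line)
```

Since each world that gains exclusive affinity takes a whole core out of the shared pool, the value chosen for sysContexts should be weighed against the number of cores left on that NUMA node for everything else.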
