DCaradec56
Contributor

CPU allocation


Hello, I have seen strange behavior on my ESXi hosts concerning vCPU allocation for 2 years now. I opened tickets with VMware support but with no real result to fix my issue. Our infrastructure is vSphere 7.0 U2.

Some of our 72-vCPU ESXi hosts run the same type of Asterisk SIP VM. Only half of the host's vCPUs are used, corresponding to just one of the host's sockets.
These VMs are sized at 2 vCPUs (no NUMA spanning, therefore).
When additional machine time is needed for extra load on the VMs, half of the vCPUs remain inactive. The other vCPUs, although available, are never called upon.
This has the effect of generating load on our VMs even though host resources are still available.
Below, only the 36 vCPUs of socket 2 are solicited. An increase in CPU demand on our VMs does not cause the available vCPUs to be solicited; they remain inactive.

DCaradec56_0-1618841826963.png

We have done tests on the BIOS settings of the ESXi hosts and tests with the high latency-sensitivity feature. Unsuccessful: the idle vCPUs remain unsolicited.

Why do these CPUs remain unsolicited while our VMs are requesting CPU resources?

13 Replies
a_p_
Leadership

Moderator: Moved to ESXi Discussions

 

vbondzio
VMware Employee

I'm having a hard time parsing your description and making out the numbers in the tiny screenshot, but let me give this a shot. Let me first clarify some terminology before I give you advice based on what I _think_ you mean.

vCPU = a "thread" (world in ESXi terms) that is scheduled by the vmkernel
PCPU = the physical execution context a vCPU runs on whether SMT is enabled or not
core = if SMT / HT is enabled, two PCPUs are backed by one core

I'm just going to assume you have a 72 PCPU, SMT enabled host, so 18 cores per socket (x2).
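As a sanity check on that assumption, here is the topology arithmetic as a quick Python sketch (the counts are inferred from the description, not read from the host):

```python
# Assumed host topology: 2 sockets x 18 cores, SMT/HT enabled.
SOCKETS = 2
CORES_PER_SOCKET = 18
THREADS_PER_CORE = 2  # SMT on: two PCPUs per core

pcpus_per_socket = CORES_PER_SOCKET * THREADS_PER_CORE
total_pcpus = SOCKETS * pcpus_per_socket

print(pcpus_per_socket)  # 36 PCPUs per NUMA node
print(total_pcpus)       # 72 PCPUs total

# A 2-vCPU VM fits comfortably inside one 36-PCPU NUMA node,
# so the NUMA scheduler may place each such VM on either socket.
```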

I think you are saying, based on the right side of the image, that (at least most of) the PCPUs on the first socket seem to be only minimally utilized. Given that this is a per-PCPU line graph overview, it also seems that there is quite a bit of headroom on the other socket, with individual PCPU utilization barely averaging ~45% and peaking not much higher.

This is most likely explained by https://kb.vmware.com/s/article/2097369 , unless you have a NUMA node affinity set in your template.

P.S.
Could you send me your previous SR# via email to myusername at vmware dot com?

DCaradec56
Contributor

Thank you very much for your response.

"I'm just going to assume you have a 72 PCPU, SMT enabled host, so 18 cores per socket (x2)." You assume correctly.

cpu.jpg

Exactly: the PCPUs on one socket are not used; only the 36 PCPUs of the other socket are used. I agree there is no reason to activate the PCPUs of the other socket when there is no need. But every day, for one hour, we have a CPU consumption peak on our ESXi hosts due to a very significant increase in our SIP traffic. During that hour, the PCPUs of the other socket should wake up to offer us the available power of the entire host. This is not the case: during this peak period, the PCPUs of the active socket increase in consumption and generate load on the VMs without ever waking the inactive PCPUs of the other socket.

We carried out stress tests during this peak period by adding VMs to the host and concentrating SIP traffic on it. The active PCPUs get pushed over 90% CPU, and the idle socket's PCPUs are never awakened.

All our VMs have 2 vCPUs, so no NUMA spanning here. I don't think we have set a NUMA affinity; where is this parameter set, so that I can check it?

Another strange behaviour: for no apparent reason, the load switched to the other socket in week 13.

DCaradec56_0-1618903599625.png

Another strange thing: we have one ESXi host that distributes load correctly across both sockets. The hosts are strictly identical.

DCaradec56_1-1618903889062.png

 

I'm ready to do some tests to investigate these behaviours.

 

DCaradec56
Contributor

I checked: in Advanced System Settings, Numa.LocalityWeightActionAffinity is set to 130 on all ESXi hosts.

vbondzio
VMware Employee

Yeah, that is the default. Set it to 0 and vMotion all VMs off, then back on. If that solves it, can you post the following to pastebin:

https://github.com/vbondzio/sowasvonunsupported/blob/master/vcpu_affinity_info.sh
https://github.com/vbondzio/sowasvonunsupported/blob/master/pci2numa.sh
# cat /etc/vmware/config

You might want to censor the VM names for the output of the first script.


P.S.
For the scripts: SSH into ESXi, copy each script's contents, run "# cat > /tmp/foo.sh", press enter, paste, press enter, end the cat (ctrl-c), then "# chmod +x /tmp/foo.sh". Same for the other script; no need to chmod again if you keep the same filename, of course.

DCaradec56
Contributor

Sorry, I don't understand "vMotion all VMs off / then back on". Do you mean vMotion the VMs to another ESXi host and then vMotion them back again?

I have not set the parameter Numa.LocalityWeightActionAffinity to 0 for the moment.

Here is what I have in the NUMA client config on an ESXi host that uses both sockets:

CID=2102435 GID=33685 LWID=2102788 Name=tast15

Group CPU Affinity:
guest worlds:0-71
non-guest worlds:0-71

Latency Sensitivity:
-3

NUMA client 0:
affinity: 0x00000003
home: 0x00000001

vcpuId vcpu# pcpu# affinityMode softAffinity Affinity ExAff
2102788 0 51 1 -> numa 36-71 0-71 no
2102790 1 52 1 -> numa 36-71 0-71 no


CID=2102442 GID=33721 LWID=2102804 Name=tast01

Group CPU Affinity:
guest worlds:0-71
non-guest worlds:0-71

Latency Sensitivity:
-3

NUMA client 0:
affinity: 0x00000003
home: 0x00000000

vcpuId vcpu# pcpu# affinityMode softAffinity Affinity ExAff
2102804 0 33 1 -> numa 0-35 0-71 no
2102806 1 20 1 -> numa 0-35 0-71 no

And on the ESXi host that uses only one socket, all the VMs' NUMA clients look like this:

CID=2103711 GID=41316 LWID=2103833 Name=tast29

Group CPU Affinity:
guest worlds:0-71
non-guest worlds:0-71

Latency Sensitivity:
-3

NUMA client 0:
affinity: 0x00000003
home: 0x00000000

vcpuId vcpu# pcpu# affinityMode softAffinity Affinity ExAff
2103833 0 18 1 -> numa 0-35 0-71 no
2103835 1 9 1 -> numa 0-35 0-71 no

I have never changed anything in the NUMA configuration of my VMs!

 

vbondzio
VMware Employee

Ok, those VMs don't have any NUMA node affinity set. After changing Numa.LocalityWeightActionAffinity to 0, vMotion the VMs off the host, then vMotion them back on. The distribution will then be round robin, and they shouldn't migrate to the same NUMA node because of their communication relationship any longer.
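To read the output you posted: my reading is that `affinity: 0x00000003` is a node bitmask (bit 0 = node 0, bit 1 = node 1, i.e. both nodes allowed), while `home` reports the client's current home node. A rough Python decoder for the mask (my own sketch, just to illustrate the bit layout, not VMware tooling):

```python
def decode_node_mask(mask: int) -> list[int]:
    """List the NUMA node IDs whose bits are set in an affinity mask."""
    nodes = []
    bit = 0
    while mask:
        if mask & 1:
            nodes.append(bit)
        mask >>= 1
        bit += 1
    return nodes

# affinity 0x00000003 -> the client is allowed on node 0 or node 1
print(decode_node_mask(0x00000003))  # [0, 1]
```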

DCaradec56
Contributor

I have done this on my 4 hosts; the VMs are now distributed in round-robin fashion across both sockets.

Is there any explanation of why this default NUMA parameter generates this strange behavior of keeping the VMs on only one socket?

 

I want to thank you very much for your greatly appreciated help on this topic.

Kind regards.

vbondzio
VMware Employee

TL;DR
This is done because it often increases throughput and decreases CPU cost and latency by increasing locality of threads that might hit the same cache line.
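A toy illustration of the idea (invented numbers, nothing like the real scheduler internals): the action-affinity weight acts as a bonus for placing a world on the node where worlds it communicates with already run, which can outweigh the load-balancing term; setting the option to 0 removes that pull.

```python
def pick_node(node_load, related_node, locality_weight):
    """Toy placement: the node with the lowest score wins.
    Score = current node load minus a locality bonus for the node
    hosting a related (frequently communicating) world."""
    scores = {
        node: load - (locality_weight if node == related_node else 0)
        for node, load in enumerate(node_load)
    }
    return min(scores, key=scores.get)

# Node 0 is busy (load 80) and hosts the related VMs; node 1 is nearly idle.
print(pick_node([80, 10], related_node=0, locality_weight=130))  # 0: locality wins
print(pick_node([80, 10], related_node=0, locality_weight=0))    # 1: load wins
```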

The explanation in the KB is a bit more detailed and I would encourage you to read it. If you want to make it easier for others with similar problems to find the right answer, you might want to change the "accepted solution" from your reply to my first one.

DCaradec56
Contributor

Thank you again Valentin for your help.

DCaradec56
Contributor

Hello,

Can I come back with several questions?

I would first like to explain our use case.
We dedicate ESXi hosts to SIP traffic, with VMs hosting Asterisk. These VMs require real-time CPU availability. We have distributed our hosts across 2 data centers and use half of the resources in each data center. In the event of a crash of one datacenter, the VMs switch to the other datacenter without CPU load problems. All our SIP VMs are configured with 2 vCPUs.
When I extract with your vcpu_affinity_info.sh script, I get a list of the PCPUs used.
Here is the list for a SIP host: 2 3 17 18 22 27 28 32 33 36 46 49 51 53 56 60 61 64 65 67 68 69
So the other PCPUs of the host, not listed here, are not being used, because they are not assigned to any VM?
Then why are my Cacti graphs showing activity on PCPUs not listed above? Shouldn't these unassigned PCPUs be inactive?

We can see activity on PCPU0 and PCPU1, which are not assigned to any VM. Is there an explanation?

DCaradec56_0-1618998545732.png

 

 

vbondzio
VMware Employee

The vCPU-to-PCPU scheduling is dynamic, not statically assigned, so a VM's vCPUs will freely move across the PCPUs in the same NUMA node (and all vCPUs in a NUMA client / PPD will move across NUMA nodes if they have to).
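For example (a toy sketch, not the actual vmkernel logic): if each scheduling decision can land a vCPU on any available PCPU of its home node, then over time a single vCPU touches many different PCPUs, which is why graphs show activity on PCPUs with no static assignment.

```python
import random

HOME_NODE_PCPUS = list(range(0, 36))  # node 0 on a 72-PCPU host

def next_pcpu(rng):
    """Toy model: pick any PCPU in the home node for the next timeslice."""
    return rng.choice(HOME_NODE_PCPUS)

rng = random.Random(42)
visited = {next_pcpu(rng) for _ in range(500)}
print(len(visited))  # the single vCPU ran on many different PCPUs
```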

You should only be concerned with giving those VMs a CPU reservation, or, if you think it improves performance and is necessary, also setting Latency Sensitivity to High.

If you want to know more about the ESXi CPU Scheduler, definitely start here: https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/vmware-vsphere-cpu-sched... (It's for ESXi 5.1 but the basics haven't changed dramatically)

If you want to know more about "LS=High", watch: https://www.vmworld.com/en/video-library/video-landing.html?sessionid=1527791508875001ekbt&region=EU (about 19 minutes in)

DCaradec56
Contributor

Thank you again for the explanations and resources.

Best Regards
