VMware Cloud Community
AIT-AT
Contributor

ESXi only uses one CPU

Hello All,

I have a strange problem with my HPE server. I am running VMware vSphere 6 Essentials. However, the server only ever uses one CPU at a time: sometimes the first one, sometimes the second one. I have never seen both CPUs at 100% at the same time.

Is there anything I have misunderstood? Or does anyone have the same problem?

vbondzio
VMware Employee

It isn't necessarily indicative of a problem; it just means that whichever VMs (NUMA clients, to be more precise) are running ended up on one or the other NUMA node, and sometimes workloads migrate together for locality reasons too.

Can you run the following on the host and post a pastebin link with the results?

for numaOption in $(sched-stats -h | sed -n 's/.*:    \(n.*\)$/\1/p'); do echo -e "\nsched-stats -t ${numaOption}"; sched-stats -t ${numaOption}; done

 
The output doesn't contain any sensitive information.

AIT-AT
Contributor

Hello, thank you very much for your reply. I am amazed at how people write these scripts. They are easy to read and hard (for me) to write.

The Pastebin is here: https://pastebin.com/7i4qrSth

vbondzio
VMware Employee

Glad it is easy to read! Full disclosure, I'm no shell wizard either; it is just a handful of (b)ash principles and sed constructs that can be put together in different ways, and everything beyond that is googled and "stackexchanged" 🙂

This basically just iterates through all the sched-stats options that start with an "n", which gives you most of what you care about when looking at NUMA. I used a pattern match instead of a hard-coded list of options because we removed / added some options going from 6.7 to 7, so this way it works across all versions.
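
For readability, here is the same loop spread over multiple lines (functionally identical to the one-liner above):

# Pull every option starting with "n" out of the sched-stats help text,
# then run "sched-stats -t <option>" for each of them, printing the command first.
for numaOption in $(sched-stats -h | sed -n 's/.*:    \(n.*\)$/\1/p'); do
    echo -e "\nsched-stats -t ${numaOption}"
    sched-stats -t ${numaOption}
done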

In your case, you have 16-core pNUMA nodes and two VMs that each fit into one, so you have 2 x 16 vCPU NUMA clients. The NUMA client is the "atomic" unit the NUMA scheduler deals with, so unless a VM grows beyond that size, it runs entirely on one of the physical nodes.
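
If you want to check the placement yourself after moving things around, the numa-clients table is the quickest way; the exact column names vary a bit between ESXi versions, but you are looking for each VM's home node:

# Show every NUMA client and the physical node it is currently homed on.
sched-stats -t numa-clients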

Both of those large VMs are currently on the same node, maybe because of a device / IO relation (i.e. both use an IO device that is attached to the 2nd socket) or because there is IO between the VMs. Usually that makes VMs run more efficiently, but especially at that size (when VMs fit tightly into pNUMA nodes), the "locality" scheduling might be a bit overeager. Try: https://kb.vmware.com/s/article/2097369 (after changing the setting, you need to migrate the VMs off and back on, or alternatively power-cycle them; a guest OS reboot is not enough).
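
From memory, the KB boils down to disabling action-affinity via the Numa.LocalityWeightActionAffinity advanced option, so the change should look roughly like the below; please verify against the article before applying anything:

# Disable NUMA action-affinity so the two VMs stop being pulled onto the same node
# (option name taken from memory of the KB above, double-check it there).
esxcli system settings advanced set -o /Numa/LocalityWeightActionAffinity -i 0
# Confirm the current value:
esxcli system settings advanced list -o /Numa/LocalityWeightActionAffinity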

AIT-AT
Contributor

Thank you very much for this KB. I will try it out during the next maintenance window, but that will take some time.

AIT-AT
Contributor

Thank you very much for this tip. It solved the problem. I can see that I am not the only person who has run into this 😁
