Highlighted
Hot Shot
Hot Shot

numa.autosize.vcpu.maxPerVirtualNode lack of info

Jump to solution

I have a VM with 64 vCPUs and 512GB RAM, it’s a massive db. ESXi 6.5 are HP DL560 with 4 sockets (22 cores each + HT) and 1.5 TB RAM. According to Frank Denneman's book vSphere 6.5 Host Resources Deep Dive I aimed at keeping the VPD onto as little psockets as possible. By disabling the Hot add CPU feature and by adding the preferHT set to True I increased performance quite a lot and I expected to see cores spread onto two physical sockets. However while the VMWare KB 2003582 states how to implement the preferHT setting it does not mention something Frank Denneman did say in his book:

Quote:

“Please remember to adjust the numa.autosize.vcpu.maxPerVirtualNode setting in the VM if it is already been powered-on once. This setting overrides the numa.vcpu.preferHT=TRUE setting”

End quote

I read the above after I did the initial changes to the VM and I have now noticed that its numa.autosize.vcpu.maxPerVirtualNode value is 11. According to Virtual NUMA Controls I should get 6 virtual nodes by dividing 64 by 11,but I see the VM has 7. This is another thing I don't understand.

pastedImage_2.png

following which criteria do I adjust numa.autosize.vcpu.maxPerVirtualNode value?

Shall I set it to 44 as it is the max number of logical cores in a physical socket? Or shall I disable it and let the system do its best decision? If yes how do I disable it? This is the current layout of the cpu resources of the vm:

pastedImage_0.png

although performances have improved I’m not happy with the distribution of the cores. Specially considering that homeNode 3 is not used at all.

pastedImage_1.png

So, to recap my question to the experienced admins are the following:

1. following which criteria do I adjust the numa.autosize.vcpu.maxPerVirtualNode value so that the preferHT setting is enforced correctly?

2. I knew that in 6.5 the coresPerSocket setting was decoupled from the Socket setting, so it does not really matter anymore if you set 12 sockets x 1 core or 1 Socket x 12 cores (unless licenses rectrictions are in place). However in Frank Denneman's book I read:

quote

"If preferHT is used, we recommend aligning the cores per socket to the physical CPU package layout. This leverages the OS and application LLC optimizations the most "

end quote

So, in this case the use of CoresPerSocket is effective? Then I should set 2 Sockets x 32 coresPerSocket? Option that frankly I haven't seen available in the VM Settings window

3. Why the VM has 7 virtual nodes instead of 6?

1 Solution

Accepted Solutions
Highlighted
VMware Employee
VMware Employee

PreferHT is used to consolidate the vCPUs as much as possible and create the fewest number of NUMA clients. In your situation, there are four NUMA nodes, 22 cores with each 384 GB of memory (assuming the DIMMs are equally distributed across sockets and offer the same capacity). With PreferHT, the NUMA scheduler takes the SMT capabilities into account, and therefore, each NUMA node can now accept NUMA clients similar to the HT thread count. In your situation, that is 44.

With this theory, your VM of 64 vCPUs should be distributed across two NUMA clients, each NUMA client grouping 32 vCPUs.

The advice of setting numa.autosize.vcpu.maxPerVirtualNode in the book is to propagate the virtual NUMA topology to the guest OS. This is typically recommended for VMs that exceed the memory capacity of a NUMA node while being able to all the vCPUs in a single NUMA node. In your scenario, setting it to 11 restricts the scheduler to prefer HT threads for sizing the NUMA client. 64/11 = 5.8; thus, it should round up to 6, but it doesn't do this, so I expect other advanced settings are influencing the NUMA client configuration.

Instead of diving into and wasting much time on figuring out this anomaly, my recommendation is to remove the setting numa.autosize.vcpu.maxPerVirtualNode, and allow the NUMA scheduler to align the NUMA client configuration to the PreferHT setting. Typically most customers do not use PreferHT on a virtual machine that has that many CPUs. PreferHT is an advanced setting designed to help workloads that are cache-intensive, but not CPU intensive. By scheduling all the vCPUs into the same NUMA node, it can leverage the same L1/L2/L3 cache and thus take advantage of the spatial locality of memory access. I suspect, you assigning 64 vCPUs is to have a lot of CPU resources available for the application.

We decoupled the NUMA scheduling constructs from the user setting Cores per Socket (CPS). Thus the setting provides customers to align with licensing requirements while the NUMA scheduler can manage its constructs to optimize memory access. When setting it, it provides a socket topology, and the Guest OS translates that as cache scheduling domains. In your situation, with 64 vCPUs with PreferHT enabled and maxPerVirtualNode disabled, you should get two NUMA clients with each containing 32 vCPUs. Your CPS setting should be 32 cores per Socket. That will present a dual-socket configuration, and your Guest OS will create a Cache map, where 32 CPUs share the same cache. The application and Guest OS can now determine where to place the threads when they expect these threads to share and access the same memory blocks.

Hope this clarifies it for you, if you decide to apply my recommendation, please test it first in a non-production environment and post the results as others might experience a similar situation and are helped by your experience.

View solution in original post

5 Replies
Highlighted
Hot Shot
Hot Shot

on vmware docs pages it says:

numa.vcpu.maxPerVirtualNode

Determines the number of virtual NUMA nodes by splitting the total vCPU count evenly with this value as its divisor.

so does it mean I need to adjust its value to the number of pcores in a psocket? Or logical cores in a psocket?

0 Kudos
Highlighted
VMware Employee
VMware Employee

PreferHT is used to consolidate the vCPUs as much as possible and create the fewest number of NUMA clients. In your situation, there are four NUMA nodes, 22 cores with each 384 GB of memory (assuming the DIMMs are equally distributed across sockets and offer the same capacity). With PreferHT, the NUMA scheduler takes the SMT capabilities into account, and therefore, each NUMA node can now accept NUMA clients similar to the HT thread count. In your situation, that is 44.

With this theory, your VM of 64 vCPUs should be distributed across two NUMA clients, each NUMA client grouping 32 vCPUs.

The advice of setting numa.autosize.vcpu.maxPerVirtualNode in the book is to propagate the virtual NUMA topology to the guest OS. This is typically recommended for VMs that exceed the memory capacity of a NUMA node while being able to all the vCPUs in a single NUMA node. In your scenario, setting it to 11 restricts the scheduler to prefer HT threads for sizing the NUMA client. 64/11 = 5.8; thus, it should round up to 6, but it doesn't do this, so I expect other advanced settings are influencing the NUMA client configuration.

Instead of diving into and wasting much time on figuring out this anomaly, my recommendation is to remove the setting numa.autosize.vcpu.maxPerVirtualNode, and allow the NUMA scheduler to align the NUMA client configuration to the PreferHT setting. Typically most customers do not use PreferHT on a virtual machine that has that many CPUs. PreferHT is an advanced setting designed to help workloads that are cache-intensive, but not CPU intensive. By scheduling all the vCPUs into the same NUMA node, it can leverage the same L1/L2/L3 cache and thus take advantage of the spatial locality of memory access. I suspect, you assigning 64 vCPUs is to have a lot of CPU resources available for the application.

We decoupled the NUMA scheduling constructs from the user setting Cores per Socket (CPS). Thus the setting provides customers to align with licensing requirements while the NUMA scheduler can manage its constructs to optimize memory access. When setting it, it provides a socket topology, and the Guest OS translates that as cache scheduling domains. In your situation, with 64 vCPUs with PreferHT enabled and maxPerVirtualNode disabled, you should get two NUMA clients with each containing 32 vCPUs. Your CPS setting should be 32 cores per Socket. That will present a dual-socket configuration, and your Guest OS will create a Cache map, where 32 CPUs share the same cache. The application and Guest OS can now determine where to place the threads when they expect these threads to share and access the same memory blocks.

Hope this clarifies it for you, if you decide to apply my recommendation, please test it first in a non-production environment and post the results as others might experience a similar situation and are helped by your experience.

View solution in original post

Highlighted
Hot Shot
Hot Shot

Frank many many thanks for your time on this! Really appreciated!

Now I know I can simply delete the numa.autosize.vcpu.maxPerVirtualNode in favour of the preferHT. Since it is not clear why the numa clients count does not reflect the actual settings, I agree with you that is a waste of time investigating further. So I will proceed with your suggestion and report back to the community as soon as I get the green light for the next shutdown. Unfortunately for this very application I don't avail of a test environment. Thanks also for the hint on the preferHT parameter. I chose it so vCPUs could stay on as less physical sockets as possible, else the other solution would have been to have the vCPUs spread over 4 physical sockets, like 4 vSockets x 16 cores and frankly I don't know if that is a good idea. Also I will reserve VM's RAM fully to further improve performance.

Big thanks again!

Highlighted
Hot Shot
Hot Shot

Hello again, I could not get the authorization to shutdown the VM but I did for another VM, and so I could test these settings. The VM at hand has 80 vCPUs/1.3TB RAM and the Hot Adds enabled (CPU/RAM). After having shutdown the VM I did the following:

1. disabled both Hot adds features

2. decreased RAM to 312 GB to suit its real consumption (roughly 150/200 GB)

2. added the numa.vcpu.preferHT=TRUE parameter

3. removed the numa.autosize.vcpu.maxPerVirtualNode parameter

4. adjusted the CPU sockets/Cores as per screenshot:

Sockets-Cores.PNG

I turned on the VM but I see that the numa.autosize.vcpu.maxPerVirtualNode parameter was recreated with a value of 20. I was expecting to see 40 vCPU on two physical sockets. Can anybody tell me if that is normal?

esxi.PNG

0 Kudos
Highlighted
Hot Shot
Hot Shot

FDenneman01 I finally had the green light to shutdown that VM for maintenance. Here's its cpu situation:

pastedImage_0.png

and after the shut down I removed the numa.autosize.vcpu.maxPerVirtualNode as the preferHT is in place. When powered on I got this situation:

pastedImage_1.png

once again the numa.autosize.vcpu.maxPerVirtualNode is back and its value is 16. So again the vCPUs are spread over all 4 physical cores, in spite of the preferHT and coresPerSocket. Is this something to address to VMware support? Or something I don't understand. I read about the preferHT:

"Please note that this setting does not force the CPU scheduler to only run vCPUs on HTs. It can still (and possible attempt to) schedule a vCPU on a full physical core. The scheduling decisions are up to the CPU scheduler discretion and typically depends on the over-commitment ratio and utilization of the system."

It is the scheduler decision I didn't understand. If anybody has a clue it will be of great help! Thanks again

0 Kudos