HendersonD
Hot Shot

NUMA node confusion

ESXi 6.7 Update 3 running on an HPE DL380 Gen10 server

The server has 2 sockets with 16 cores per socket. In the BIOS, node interleaving is disabled and sub-NUMA clustering is enabled per best practice.

With sub-NUMA clustering enabled, there are 4 NUMA nodes. This has been verified in ESXTOP as shown below.

pastedImage_6.png

The server has 512GB of RAM so each NUMA node is given 16 cores and about 130GB of RAM. You might think that each NUMA node would get just 8 cores (32 cores total/4 NUMA nodes) but that is not the case. I have verified that each NUMA node has 16 cores using: esxcli hardware cpu list | grep Node
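
A minimal sketch of the arithmetic behind these numbers, under one assumption the post does not confirm: that hyperthreading is enabled, so `esxcli hardware cpu list` reports one entry per logical CPU rather than per physical core. Under that assumption, 16 entries per node would correspond to 8 physical cores per node, which is the 32 cores / 4 nodes figure one would expect. The function name and parameters are illustrative only.

```python
def per_node_resources(sockets, cores_per_socket, threads_per_core, ram_gb, numa_nodes):
    """Split host CPU and memory evenly across NUMA nodes.

    Returns (physical_cores, logical_cpus, ram_gb) per NUMA node.
    threads_per_core=2 is an ASSUMPTION (hyperthreading on); the post does not state it.
    """
    physical_cores = sockets * cores_per_socket
    logical_cpus = physical_cores * threads_per_core
    return (physical_cores // numa_nodes,
            logical_cpus // numa_nodes,
            ram_gb / numa_nodes)


# Host from the post: 2 sockets x 16 cores, 512GB RAM, 4 NUMA nodes (sub-NUMA clustering on)
cores, threads, ram = per_node_resources(2, 16, 2, 512, 4)
print(f"{cores} physical cores, {threads} logical CPUs, {ram:.0f}GB RAM per node")
```

With the host described here, this works out to 8 physical cores, 16 logical CPUs, and 128GB of RAM per node. If hyperthreading were off, the logical CPU count per node would be 8 instead.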

I do not have any VMs that would exceed a single NUMA node's memory limit of 130GB. Also, the largest VM has only 8 vCPUs. For this reason, every one of my VMs is configured with 1 socket. I then vary the number of cores per socket to get the required number of vCPUs. For example, a VM that needs 8 vCPUs is configured with 1 socket and 8 cores per socket. Every VM should fit in a single NUMA node, which means that none of them should be using remote memory from another NUMA node. I gleaned a lot of this information from just a few articles:

vSphere Design for NUMA Architecture and Alignment - Exit the Fast Lane

https://www.altaro.com/vmware/vsphere-misconfigurations/

Virtual Machine vCPU and vNUMA Rightsizing - Rules of Thumb - VMware VROOM! Blog - VMware Blogs

When I look at the ESXTOP memory page with NUMA stats enabled, I still see several VMs that are using a lot of remote memory. I see other VMs that use mostly local memory and then, for a period of time, start using a lot of remote memory. Here is a screenshot from ESXTOP. I have a Windows print server (Printserv below) with 1 socket, 8 cores per socket, and 12GB of RAM. This should all fit within a single NUMA node and use local memory. In this screenshot it is using a very small amount of local memory and a huge complement of remote memory. Camera3 is in the same situation. Any idea why? Can this happen if the VM is overprovisioned (given too much RAM or too many vCPUs)?

pastedImage_1.png

19 Replies
Ardaneh
Enthusiast

Hi

When it comes to NUMA configuration, you need to consider a lot of things, so I will go straight to your configuration. When you create a VM with fewer vCPUs and more cores per socket, you actually create some small domains of resources (you can check this with the "Coreinfo" tool from Microsoft Sysinternals, or check your vmware.log file and look for the VPD-to-PPD mapping). From the guest OS view you have more than 2 NUMA nodes, but in your physical environment you only have 2 NUMA nodes, so some of your data will be served from remote memory DIMMs (depending on your application's behavior).

The only reasons to create a VM with this kind of configuration are licensing or certain specific situations. From the NUMA perspective, I suggest keeping cores per socket as low as you can (1 is best) and increasing the number of vCPUs instead, unless you have another reason. If you do have a reason for this configuration, you can use "numa.consolidate=false" for your VM.

I hope this could help you

HendersonD
Hot Shot

This article was written by the VMware performance team and was updated in July 2019:

https://blogs.vmware.com/performance/2017/03/virtual-machine-vcpu-and-vnuma-rightsizing-rules-of-thu...

Its recommendation is to give every VM one socket and then increase cores per socket to reach the required number of vCPUs. They list only two reasons to deviate from this recommendation:

  • The total number of vCPUs required exceeds the number of cores on the CPU. For example, the CPU has 10 cores and a VM needs 12 vCPUs. In this case you would need to assign 2 sockets and 6 cores per socket
  • The amount of memory needed by a VM exceeds the amount allocated to one NUMA node
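
The CPU half of that rule can be sketched as below. This is only an illustration of the rule as summarized in this post, not code from the article: the function name and parameters are hypothetical, and the memory criterion in the second bullet is deliberately not modeled.

```python
def vm_topology(vcpus, cores_per_cpu):
    """Return (sockets, cores_per_socket) using the fewest virtual sockets
    such that each virtual socket fits within one physical CPU's core count.

    Hypothetical helper: covers only the vCPU-count rule, not the memory rule.
    """
    if vcpus < 1 or cores_per_cpu < 1:
        raise ValueError("vcpus and cores_per_cpu must be positive")
    for sockets in range(1, vcpus + 1):
        # Only even splits are valid, since every socket gets the same core count
        if vcpus % sockets == 0 and vcpus // sockets <= cores_per_cpu:
            return sockets, vcpus // sockets


print(vm_topology(12, 10))  # the article's example: 12 vCPUs on a 10-core CPU
print(vm_topology(8, 16))   # my largest VM: 8 vCPUs on a 16-core CPU
```

The first call gives 2 sockets with 6 cores per socket, matching the example in the bullet above, and the second gives 1 socket with 8 cores per socket, which is how the VMs in this thread are configured.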

None of the VMs I am running fall into either situation, hence all of my VMs run with one socket. Is this article incorrect?

OsburnM
Hot Shot

I think your answer is actually a two-parter.

1) Q. How do you have 4 nodes showing in ESXCLI when it's a 2-socket host? You don't specify the processor type, but given it's a DL380 Gen10, it's very likely you're using an Intel Xeon Scalable processor (Skylake or later), which introduced a feature called sub-NUMA clustering. Think of it this way: what hyperthreading is to cores, sub-NUMA clustering is to NUMA nodes. Each socket is split into two NUMA domains, so a 2-socket host presents 4 NUMA nodes. It's a BIOS setting enabled by default on HPE Gen10 servers when you set PowerMode to Virtualization/HighPerformance. See here: Intel® Xeon® Processor Scalable Family Technical Overview | Intel® Software

2) Q. Why do some VMs behave differently from others regardless of the socket/core config? I suspect the newer auto-vNUMA feature is at play here. Starting in 6.5, vNUMA is no longer tied to the socket/core config in the vSphere Client. See here for more detail: Virtual Machine vCPU and vNUMA Rightsizing - Rules of Thumb - VMware VROOM! Blog - VMware Blogs

HendersonD
Hot Shot

The processor in this host is an Intel Xeon Gold 6142 @ 2.60GHz. As I stated in my original post, sub-NUMA clustering is enabled per best practice, giving us 4 NUMA nodes.

OsburnM - the article you linked to is the same one I linked to. This article gives best practice for configuring individual VMs in terms of sockets and cores per socket. All of my VMs are configured with 1 socket following the advice in this article

This still does not answer my original question. The whole reason to stay within a NUMA node/boundary is to make sure the VM uses local memory and not remote memory. All of my VMs are contained within a single NUMA node but I still have some VMs that are using a considerable amount of remote memory. Here is another screenshot from ESXTOP.

Why is my VM called Ruckus1 getting half of its memory remotely? Ruckus1 is a Linux based server

Why is Camera1 getting 1/3 of its memory remotely? Camera1 is a Windows server

Is there something going on at the OS level that explains this?

Perhaps ESXTOP is just not reporting this correctly? Some other factor I am not taking into account?

pastedImage_2.png

Ardaneh
Enthusiast

In my point of view, the only reason that you are using the remote memory is that there is no NUMA node exposed to your guest os and you are using more capacity than one NUMA node. If you are running Microsoft Windows, you can check your NUMA configuration with the "Coreinfo" tool from Sysinternals. At the VM level, you can check the vmware.log file inside your VM folder ("cat /vmfs/volumes/YOUR VOLUME NAME/YOUR VM NAME/vmware.log | grep -i vpd"). You should see more than one VPD there; otherwise, no NUMA node is exposed to your VM.

The recommendation is to create a VM with "N" vCPUs and 1 "Cores per socket" unless you have a specific reason not to (licensing, or exceeding the Windows limit of 64 CPUs).

"Cores per socket" is only for licensing purposes. If you increase that value, you may face performance issues (many domains with small resources), and your applications must be optimized for that kind of configuration. So I recommend that you consider and test the following:

- Disable sub-NUMA clustering

- Create a VM with N vCPUs and 1 "cores per socket" (no NUMA node will be exposed if fewer than 9 vCPUs are assigned to the VM or if the vCPU count does not exceed the number of physical cores per socket)

- If you have a VM with a large amount of memory (more than the memory capacity of a NUMA node) and your workload is not CPU intensive, you can use the "numa.PreferHT" setting to place your VM in one NUMA node (in this case, your VM will not use remote memory)

- From the socket perspective, on vSphere 6.5+ (you mentioned you are using 6.7), having more than 1 "cores per socket" will not affect the NUMA configuration of the guest OS (for example, 12 vCPUs with 2 cores per socket). However, if you have a cache-intensive workload or a smart application that can make use of the CPU cache, you should increase the number of cores per socket (for example, 16 vCPUs with 8 cores per socket)

I hope this could be helpful

HendersonD
Hot Shot

Here is what coreinfo is reporting on a VM that is using quite a bit of remote memory. The operating system is showing no NUMA nodes

pastedImage_0.png

Here is what I am focusing on: "In my point of view, the only reason that you are using the remote memory is that there is no NUMA node exposed to your guest os and you are using more capacity than one NUMA node."

I started looking at CPU and Memory usage inside the guest OS (rather than what is being reported by ESXi). Here is what I am seeing on this server which is running Windows Server 2012 R2

pastedImage_1.png

It does not appear that I am exhausting the resources devoted to this machine. This VM has 1 socket, 4 cores per socket, and 10GB of RAM. Again, the VMware performance team article I linked to recommends that nearly all VMs have 1 socket and then scale up using cores per socket. I could swap this and configure the VM with 4 sockets and 1 core per socket.

The oddest thing is that when I look at ESXTOP, I see VMs that are using 80% remote memory and then suddenly change to 2% remote memory. Wait a short period and they bounce back to 70 or 80% remote memory. I cannot account for this bouncing around. Other VMs stay rock solid with nearly all memory local.

HendersonD
Hot Shot

Any other ideas on this? Is this the type of issue for which I can open a ticket with VMware and have them explain what I am seeing and figure out a way for all of my VMs to use local memory?

Ardaneh
Enthusiast

Hi

I was interested in your scenario, so I tried to check on my own LAB and I found out some interesting things that I want to share with you. I hope this could be helpful:

There are 4 columns in the ESXTOP memory view that we should consider: NHN, NMIG, NRMEM, and NLMEM.

- NHN, or NUMA Home Node, is the NUMA node on which the NUMA scheduler has placed your VM.

- NMIG, or number of migrations, is the number of NUMA migrations performed by the NUMA scheduler, whether from its normal rebalancing or from Action Affinity.

- NRMEM, or NUMA remote memory, is the amount of remote memory (in MB) currently allocated to the VM.

- NLMEM, or NUMA local memory, is the amount of local memory (in MB) currently allocated to the VM.
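
To turn the last two counters into the percentages discussed in this thread, here is a minimal sketch. The function name is hypothetical and the sample values are made up for illustration, not taken from any screenshot here. ESXTOP also shows this ratio directly in its N%L column.

```python
def local_memory_percent(nlmem_mb, nrmem_mb):
    """Percentage of a VM's NUMA-tracked memory that is local.

    nlmem_mb / nrmem_mb are the NLMEM and NRMEM values read from ESXTOP (in MB).
    Returns None when both counters are zero (nothing to compare).
    """
    total = nlmem_mb + nrmem_mb
    if total == 0:
        return None
    return 100.0 * nlmem_mb / total


# Made-up example: 2048MB local and 8192MB remote
print(local_memory_percent(2048, 8192))
```

For those made-up values the result is 20.0, i.e. 20% local and 80% remote, which is the kind of split reported for the VMs in question.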

The behavior of NUMA Scheduler:

As soon as a VM powers up, the NUMA scheduler places it on one or more NUMA nodes (depending on the VM configuration). But this is not the end of the story: whenever another NUMA node becomes a better place for that VM (because of free memory or CPU metrics), the NUMA scheduler migrates the VM to that node. This happens all the time. However, vCPUs migrate far faster than memory does. As a result, right after a migration the remote memory shown under the NRMEM counter is much larger than the local memory under NLMEM. This is why you see, for example, 80% remote memory that suddenly changes to 20%.

The behavior of Memory Scheduler:

When the memory scheduler has to use remote memory (because of a lack of memory on the local node, or for other reasons), the NUMA scheduler decides whether or not to migrate the whole VM to that NUMA node. So in this case, even though your VM fits into a single NUMA node, you will see a mix of remote and local memory access, for example 20% remote and 80% local.

What is Action Affinity feature:

In your virtualization environment, there are 2 different places from which vCPUs can access data: the CPU L3 cache and memory. Because of the difference in latency between the two, VMware always considers CPU L3 Cache as a better place to access data for vCPUs. So when two different vCPUs (for example, one for VM1 and another for VM2) share the same data or are communicating with each other, the NUMA scheduler will decide to place them closely so that both can access the same L3 cache data. The memory for one of them, however, stays in another NUMA node, and this means remote memory access (even though your VM fits into one NUMA node). This may cause CPU contention, but as VMware says, the contention can be handled by the NUMA scheduler (you can check KB 2097369).

You can check this with the NMIG counter, but it happens fast, so you have to keep your eyes on the screen!

With Action Affinity, you may see that most of your VMs have the same NHN. That doesn't mean one of your NUMA nodes is overloaded while another is completely free. It is expected behavior and generally improves performance, even if such concentrated placement causes non-negligible ready time.

If the increased contention is negatively affecting performance, you can turn this behavior off by changing the value of NUMA.LocalityWeightActionAffinity to 0 in your host advanced configuration. Be aware that this affects all of the workloads on the host, so be careful.
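
A sketch of how that host-level change might be scripted. As far as I know, `esxcli system settings advanced set` with `-o` (option path) and `-i` (integer value) is the usual way to set an advanced option from the ESXi shell, but please confirm the exact procedure against KB 2097369 before using it. The Python wrapper is purely illustrative (its names are hypothetical) and only builds the command; it does not run anything.

```python
# Option path as it appears under the host's advanced settings
OPTION_PATH = "/Numa/LocalityWeightActionAffinity"


def action_affinity_command(value):
    """Build (but do not execute) the esxcli command that sets the option.

    value=0 disables Action Affinity weighting; this is host-wide and
    affects every VM on the host.
    """
    if not isinstance(value, int) or value < 0:
        raise ValueError("value must be a non-negative integer")
    return ["esxcli", "system", "settings", "advanced", "set",
            "-o", OPTION_PATH, "-i", str(value)]


# Print the command for review instead of running it
print(" ".join(action_affinity_command(0)))
```

Building the command as a list keeps it easy to review, log, or hand to `subprocess.run` later. Make a note of the current value before changing it so the setting can be put back if performance gets worse.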

Conclusion:

Accessing remote memory is normal behavior for the NUMA scheduler. I ran several different scenarios and saw the same behavior. The NUMA scheduler will try to fit your VM into the best NUMA node, which means VM migration between NUMA nodes. As I said before, vCPUs migrate much faster than memory, so you see a large amount of remote memory access right after a migration.

You can check out my different scenarios screenshots in this link.

HendersonD
Hot Shot

This is a great explanation, thanks for checking this out in your lab

It is interesting that I have most of my VMs that are steady state with all of their memory coming from local memory

I have a handful of VMs (just 4 or 5) that are getting some portion of their memory from remote memory or bouncing around indicating a NUMA migration as you mentioned

Now I need to figure out why. What is unique about these VMs that has them doing quite a few NUMA migrations where other VMs never migrate?

Ardaneh
Enthusiast

You're welcome. I hope you can figure it out, and please share your experience.

You can migrate them to another ESXi host if that is possible in your environment. For example, set some anti-affinity DRS rules to separate them from each other and then check the result.

HendersonD
Hot Shot

I only have two hosts

  • HPE DL380 Gen10 servers
  • 512GB each
  • Two Intel Xeon Gold 6142 CPUs at 2.60GHz. These are 16 cores each
  • 10GB connection to core switch
  • Nimble all flash array with 10GB connections

Below is our current load. I took this screenshot on a Saturday morning so things are quiet. Even during the week, we have enough capacity to put a host in maintenance mode if we need to

pastedImage_1.png

Two hosts are plenty to handle our 45 VMs in terms of compute, RAM, and storage. I do have one other storage array, a Nexsan, which has all spinning drives. We have 4 VMs that are dedicated to video surveillance. These are backed by the Nexsan array, which has huge capacity to store video files. All other VMs are backed by the Nimble array. The 4 camera servers are among the small handful that are using quite a bit of remote memory.

We have a Ruckus wireless network with two virtualized controllers and 411 access points. The two controllers are the other ones that tend to use quite a bit of remote memory

HendersonD
Hot Shot

This patch just came out

VMware ESXi 6.7, Patch Release ESXi670-202004001

This seems to describe what I am seeing. It appears that the patch is applied then a few advanced settings have to be changed. Am I reading this right?

Ardaneh
Enthusiast

The article is about non-volatile DRAM and PMEM (Persistent Memory). Do you have any PMEM installed in your environment?

jskaznik
Contributor

Hello

I've got a concern about this statement (which is actually in the VMware KB as well): "VMware always considers CPU L3 Cache as a better place to access data for vCPUs. so when there are two different vCPUs (for example one for VM1 and another for VM2) shared the same data or they are communicating with each other, NUMA scheduler will decide to place them closely..." 

How come a hypervisor knows that 2 separate VMs are communicating with each other? What kind of communication is this about? How come they are sharing the same data, when inter-VM TPS is disabled by default?

Any ideas?

regards,
Jacek

vbondzio
VMware Employee

Communication, or relation, here means that two worlds (threads) access the same data and potentially hit the same cache line. E.g. a vCPU world and its networld for TX / RX and the vCPU at the destination (if the two VMs are on the same PG / VLAN / broadcast domain). It could also be a vCPU and, let's say, the VM's pvscsi world with the IO world of the physical IO device. The VMs aren't sharing any memory; they reference the same data, which is (outside of the VMs) in memory and, more importantly, cache.

jskaznik
Contributor

Thank you Valentin!

I was concerned about a case I had a while back. We have a dual-socket ESXi server (20 cores per socket) with 4 MS SQL servers configured with 8 vCPUs each, running the DVDStore benchmark. With the default Numa.LocalityWeightActionAffinity of 130, I observed that 3 out of 4 VMs were scheduled on one NUMA node and 1 on the other, and this did not change during the benchmark run. I also rebooted the VMs and reran the test, but they were always scheduled 3:1 between NUMA nodes. This resulted in a performance imbalance, and the combined benchmark result (IOPS) from all 4 VMs was impacted as well. When the setting was changed to 0, the VMs were balanced between NUMA nodes (2 per node) without CPU contention: CPU ready was low, the benchmark (IOPS) results were within 1% between the VMs, and the combined IOPS value from all VMs was also higher.

 

vbondzio
VMware Employee

Yeah, turns out it's super hard to get the balance right (automatically). My rule of thumb is to disable Numa.LocalityWeightActionAffinity for hosts with multiple VMs that are ~half as large as, or larger than, the pNUMA node and expected to be busy. Is there something in the wording of https://kb.vmware.com/s/article/2097369 we should update? I'm pretty sure I wrote the current iteration of that KB and it could maybe be a bit more prescriptive? It's just very hard to not be "handwavy" ....

We are constantly improving the algorithm though, most recently in 7.0 U2. You can of course always use CPU reservations which everyone seems to forget about ...

For your specific case, I assume the VMs were doing some sort of (external) IO? That IO device was likely attached to the (over)crowded node, hence the locality "benefit". You could run https://github.com/vbondzio/sowasvonunsupported/blob/master/pci2numa.sh to check the device locality (6.7 and newer).

You can check the relationships of worlds via:

[root@esxi04:~] sched-stats -t vcpu-comminfo
 vcpu        leader     name                  isRel type         id     rate  isRel type         id     rate  isRel type         id     rate  (...)   
    1048601     1048601 fastslab                  n    2    1048765        1      n    2    1048763        1                                  (...)
    1048602     1048602 SVGAConsole               n    2    1048756        1                                                                  (...)
    1048606     1048606 tlbflushcount             n    2    1048873        1      n    2    1048781        1      n    2    1050268        1  (...)
    1048607     1048607 tlbflushcounttryflus      n    2    1048671        2      n    2    1048586        1      n    2    1048650        1  (...)
    1048614     1048614 ndiscWorld                n    2    1048751        1                                                                  (...)
    1048622     1048622 CmdCompl-4                n    2    1048819        1                                                                  (...)
    1048624     1048624 CmdCompl-6                n    2    1051028        1      n    2    1051415        1      n    2    1051080        1  (...)
    1048625     1048625 CmdCompl-7                n    2    1051418        1      n    2    1048873        1      n    2    1051024        1  (...)
    1048628     1048628 CmdCompl-10               n    2    1049045        1      n    2    1051418        1      n    2    1048819        1  (...)
    1048629     1048629 CmdCompl-11               n    2    1048873        1      n    2    1051065        1                                  (...)
    1048632     1048632 CmdCompl-14               n    2    1051030        1      n    2    1051416        1      n    2    1051417        1  (...)
    1048633     1048633 CmdCompl-15               n    2    1051028        1      n    2    1197403        1      n    2    1051030        1  (...)
    1048634     1048634 CmdCompl-16               n    2    1207887        1                                                                  (...)
    1048638     1048638 CmdCompl-20               n    2    1051066        1      n    2    1051416        1      n    2    1051024        1  (...)
    1048639     1048639 CmdCompl-21               n    2    1051023        1      n    2    1051065        1      n    2    1051024        1  (...)
    1048641     1048641 CmdCompl-23               n    2    1152278        1                                                                  (...)
    1048643     1048643 AsyncTimeout              n    2    1048764        1      n    2    1048758        1                                  (...)
    1048644     1048644 DeviceTaskmgmtWatchd      n    2    1048764        1                                                                  (...)
(...)

jskaznik
Contributor

Thank you so much!

It would be great to have some background information in the KB on what locality weight action affinity actually is and what the change from 130 to 0 means for the worlds - unless this is the 'secret sauce of VMware ESXi' 🙂

vbondzio
VMware Employee

Well, the setting description says:

"Benefit of improving action affinity by 1."

Internally, it is used together with the calculated benefit of increasing locality based on the communication rates / weights etc. I guess you could say it is "secret sauce" but it's also just very complicated and due to how it works, "tips" the scale from beneficial to detrimental with very little delta, so basically you might as well set it to 0 or 200. The internal algorithm is something we constantly work on and improve, the option to control it is just not very good (and should be a boolean instead of an int).

 
