HendersonD
Hot Shot

NUMA node confusion

I am running ESXi 6.7 Update 3 on an HPE DL380 Gen10 server.

The server has 2 sockets with 16 cores per socket. In the BIOS, node interleaving is disabled and sub-NUMA clustering is enabled, per best practice.

With sub-NUMA clustering enabled there are 4 NUMA nodes; I have verified this in ESXTOP, as shown below.

[Attachment: pastedImage_6.png — ESXTOP showing the NUMA nodes]

The server has 512GB of RAM, so each NUMA node gets 16 cores and about 130GB of RAM. You might think each NUMA node would get just 8 cores (32 cores total / 4 NUMA nodes), but that is not the case. I have verified that each NUMA node has 16 cores using: esxcli hardware cpu list | grep Node
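For anyone who wants to repeat that check, the per-node counts can be tallied by piping the same esxcli output through sort and uniq. The sample lines below are hypothetical stand-ins for real host output, since the live command only runs on the ESXi host itself:

```shell
# Hypothetical sample of `esxcli hardware cpu list | grep Node` output.
# On the ESXi host you would pipe the live command instead:
#   esxcli hardware cpu list | grep Node | sort | uniq -c
sample='Node: 0
Node: 0
Node: 1
Node: 1'

# Count how many logical CPUs report each NUMA node
printf '%s\n' "$sample" | sort | uniq -c
```

With the sample above this prints two CPUs on node 0 and two on node 1; on the real host, the count next to each node line is the number of logical CPUs that node owns.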

I do not have any VMs that would exceed a single NUMA node's memory limit of 130GB, and the largest VM has only 8 vCPUs. For this reason, every one of my VMs is configured with 1 socket, and I vary the number of cores per socket to get the desired vCPU count. For example, a VM that needs 8 CPUs is configured with 1 socket and 8 cores per socket. Every VM should fit in a single NUMA node, which means none of them should be using remote memory from another NUMA node. I gleaned a lot of this information from a few articles:

vSphere Design for NUMA Architecture and Alignment - | Exit | the | Fast | Lane |

https://www.altaro.com/vmware/vsphere-misconfigurations/

Virtual Machine vCPU and vNUMA Rightsizing - Rules of Thumb - VMware VROOM! Blog - VMware Blogs
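As a concrete sketch of the 1-socket layout described above, these are the two .vmx settings behind an 8-vCPU, 1-socket VM (the same values the vSphere client sets when you choose 8 CPUs with 8 cores per socket):

```
numvcpus = "8"
cpuid.coresPerSocket = "8"
```

With coresPerSocket equal to numvcpus, the guest sees a single socket, which is what lets the scheduler place the whole VM inside one NUMA node.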

When I look at the ESXTOP memory screen with NUMA stats enabled, I am still seeing several VMs that use a lot of remote memory. Other VMs use mostly local memory and then, for a period of time, start using a lot of remote memory. Here is a screenshot from ESXTOP. I have a Windows print server (Printserv below) with 1 socket, 8 cores per socket, and 12GB of RAM; this should all fit within a single NUMA node and use local memory. In this screenshot it is using a very small amount of local memory and a huge complement of remote memory, and Camera3 is in the same situation. Any idea why? Can this happen if the VM is overprovisioned (given too much RAM or too many vCPUs)?

[Attachment: pastedImage_1.png — ESXTOP NUMA stats for Printserv and Camera3]
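To put a number on "mostly remote": esxtop's N%L column is just local memory as a percentage of the VM's total allocated memory (NLMEM vs. NRMEM). A minimal sketch with hypothetical values for a 12GB VM in the state described above:

```shell
# Hypothetical NUMA memory stats in MB for a 12GB VM; not real esxtop output.
nlmem_mb=512      # NLMEM: memory placed on the VM's home node (local)
nrmem_mb=11776    # NRMEM: memory placed on other nodes (remote)

total_mb=$((nlmem_mb + nrmem_mb))

# N%L in esxtop = local memory as a percentage of total allocated memory
pct_local=$((100 * nlmem_mb / total_mb))
echo "N%L = ${pct_local}%"
```

A VM that fits entirely in one NUMA node should sit near 100 here; a low single-digit N%L like this sketch produces is the symptom being asked about.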