What's up with the ESXi NUMA scheduler? In my experience it doesn't seem to do a great job. I have what I thought was a simple design for our Exchange environment:
As you can see, each VM should fit inside a single NUMA node, and after initial power-on they all did. But I checked again today and now VM1 and VM2 are on the same NUMA node, and VM3 and VM4 are on the same NUMA node. So VM1 and VM2 are sharing cores when they don't need to, and likewise VM3 and VM4. On top of that, the percent local memory (N%L) for VM2 and VM4 is very low.
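For reference, I'm pulling these numbers out of esxtop's memory view (the exact field-selector letter for the NUMA stats can vary by ESXi version, but it goes roughly like this):

esxtop    (then press 'm' to switch to the memory view)
f         (open the field selector and enable the NUMA STATS field)

That adds per-VM columns for NHN (the NUMA home node), NMIG (NUMA migrations), and N%L (percent of the VM's memory that is node-local).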
Why on earth would the ESXi scheduler do this?
Wow, not a single reply at 50 views. I've decided to use node affinity even though I really don't want to:
VM1: numa.nodeAffinity = 0
VM2: numa.nodeAffinity = 1
VM3: numa.nodeAffinity = 0
VM4: numa.nodeAffinity = 1
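These go in under each VM's advanced configuration parameters; in the .vmx itself the values show up as quoted strings, e.g.:

numa.nodeAffinity = "0"

One caveat from the vSphere Resource Management Guide: setting node affinity constrains the NUMA scheduler's automatic rebalancing for that VM, so I'll have to keep an eye on the balance myself if the load ever changes.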
I'd still like to know why the ESXi scheduler would ever place the VMs on the same NUMA node the way it did.
I think the VMs themselves have to be configured properly to take advantage of the underlying NUMA configuration. Can you post your VMs' CPU configuration in sockets and cores per socket?
I am not concerned with the VMs themselves taking advantage of the underlying NUMA architecture (vNUMA). I am simply wondering why the ESXi NUMA scheduler would place two VMs on the same NUMA node when it would be more efficient to place one VM on NUMA node 0 and the other on NUMA node 1. By placing both VMs on the same NUMA node, they end up sharing cores; if they were placed on separate NUMA nodes, each VM would have exclusive access to its cores. Not to mention that by placing the VMs on the same NUMA node, some of each VM's memory becomes remote instead of all being local.
It is important to note that the NUMA architecture is not exposed to the guest OS (vNUMA is disabled), since the VM doesn't have more than 8 vCPUs. It is possible to expose the NUMA topology (enabling vNUMA) by editing the advanced parameter numa.vcpu.min (take a look at this blog post). Try editing this parameter and your VMs should perform better.
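For example, assuming the VMs have 8 vCPUs (the configuration wasn't posted): vNUMA is only exposed automatically at 9 or more vCPUs, because numa.vcpu.min defaults to 9, so lowering it in the .vmx would expose the topology to the guest:

numa.vcpu.min = "8"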
I understand that, but at this point I do not want to expose NUMA to the VMs (vNUMA). I simply would like to know why the ESXi scheduler placed the VMs on the NUMA nodes the way it did.
The CPU scheduler will attempt to run each of these VMs on the same NUMA node and use the same local memory each time it schedules them. That does not necessarily mean it will inherently separate two particular VMs. In this case, since the VMs seem to serve the same purpose, transparent page sharing may be an advantage, so the CPU scheduler decided to run these two VMs, which share many of the same memory pages, on the same NUMA node to take advantage of that and improve the performance of each. If you were to add additional load to the host, the CPU scheduler may ultimately separate the two.
You can disable page sharing as a test and see if the VMs wind up running on separate sockets.
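As a sketch of how to test that (I'd treat it as a test only, not a permanent change): per VM, you can add

sched.mem.pshare.enable = "FALSE"

to the .vmx, or host-wide you can stop the page-sharing scanner by zeroing Mem.ShareScanGHz:

esxcli system settings advanced set -o /Mem/ShareScanGHz -i 0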
That's a possibility and would make sense to me. I figured there had to be a reason the scheduler chose to migrate the VMs to the same NUMA node; for some reason it thinks scheduling them together is more efficient.
I think this might be what is happening, since these VMs communicate with each other heavily. From https://www.vmware.com/files/pdf/techpaper/VMware-vSphere-CPU-Sched-Perf.pdf:
Frequent communication between virtual machines may initiate NUMA migration. As when the CPU scheduler considers the relationship between scheduling contexts and places them closely, NUMA clients might be placed on the same NUMA node due to frequent communication between them. Such relation, or “action-affinity,” is not limited between NUMA clients and can also be established with I/O context. This policy may result in a situation where multiple virtual machines are placed on the same NUMA node, while there are very few on other NUMA nodes. While it might seem to be unbalanced and an indication of a problem, it is expected behavior and generally improves performance even if such a concentrated placement causes non-negligible ready time. Note that the algorithm is tuned in a way that high enough CPU contention will break such locality and trigger migration away to a less loaded NUMA node. If the default behavior of this policy is not desirable, this action-affinity-based migration can be turned off by overriding the following advanced host attribute. Refer to the vSphere Resource Management Guide for how to set advanced host attributes.
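If I'm reading the paper right, the attribute it is referring to is Numa.LocalityWeightActionAffinity (default 130); setting it to 0 disables action-affinity migrations. From the ESXi shell that would look like:

esxcli system settings advanced set -o /Numa/LocalityWeightActionAffinity -i 0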