VMware Cloud Community
xadamz23
Contributor

Numa Imbalance

What's up with the ESXi NUMA scheduler? In my experience it doesn't seem to do a great job. I have what I thought was a simple design for our Exchange environment:

Host 1

  • ESXi 5.5 U2
  • Intel Xeon E5-2680 v2 (2 sockets, 10-core/socket, HT on)
  • 256 GB RAM
  • So the host is divided into 2 NUMA nodes; each node has 10 cores (20 logical with HT) and 128 GB of RAM
  • VM1:  8 vCPU/32 GB
  • VM2:  8 vCPU/120 GB

Host 2

  • ESXi 5.5 U2
  • Intel Xeon E5-2680 v2 (2 sockets, 10-core/socket, HT on)
  • 256 GB RAM
  • So the host is divided into 2 NUMA nodes; each node has 10 cores (20 logical with HT) and 128 GB of RAM
  • VM3:  8 vCPU/32 GB
  • VM4:  8 vCPU/120 GB

As you can see, each VM should fit inside a single NUMA node, and right after initial power-on they all did. But when I checked again today, VM1 and VM2 were on the same NUMA node, and VM3 and VM4 were on the same NUMA node. So VM1 and VM2 are sharing cores when they really don't need to be, and the same goes for VM3 and VM4. The percent local memory (N%L) for VM2 and VM4 is also very low, which follows: 120 GB + 32 GB is more than the 128 GB in a single node, so some of the big VM's memory has to come from the remote node.

Why on earth would the ESXi scheduler do this?

Thanks

(Sorry for the double post)

11 Replies
xadamz23
Contributor

Wow, not a single reply with 50 views. I've decided to use node affinity even though I really don't want to.

  • VM1:  numa.nodeAffinity = 0
  • VM2:  numa.nodeAffinity = 1
  • VM3:  numa.nodeAffinity = 0
  • VM4:  numa.nodeAffinity = 1
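
In case it helps anyone else, below is a rough sketch of pushing these advanced settings through the vSphere API with pyVmomi rather than editing each .vmx by hand. The vCenter name, credentials, and the VM-to-node mapping are placeholders, and the setting only takes effect at the next power-on.

# Sketch: apply numa.nodeAffinity to each VM as an advanced (extraConfig) setting.
# vCenter name, credentials, and the affinity map below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ssl_ctx = ssl._create_unverified_context()   # lab shortcut; use a verified context in production
si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="secret",
                  sslContext=ssl_ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    affinity = {"VM1": "0", "VM2": "1", "VM3": "0", "VM4": "1"}
    for vm in view.view:
        if vm.name in affinity:
            spec = vim.vm.ConfigSpec(extraConfig=[
                vim.option.OptionValue(key="numa.nodeAffinity",
                                       value=affinity[vm.name])])
            vm.ReconfigVM_Task(spec=spec)   # queues the reconfigure; applies at next power-on
finally:
    Disconnect(si)

The end result is exactly the numa.nodeAffinity lines above landing in each VM's .vmx; setting them through the vSphere Client's advanced configuration parameters works just as well.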

I'd still like to know why the ESXi scheduler would ever place the VMs on the same NUMA node the way that it did.

vAMenezes
Enthusiast

I think the VMs themselves have to be configured properly for them to take advantage of the underlying NUMA configuration. Can you post your VMs' CPU configuration (sockets and cores per socket)?

vAMenezes
Enthusiast

Also, ESXi only exposes the NUMA topology (vNUMA) automatically for VMs with more than 8 vCPUs; with 8 or fewer you have to enable it manually, so perhaps what you're seeing is normal?

xadamz23
Contributor

I am not concerned with the VMs themselves taking advantage of the underlying NUMA architecture (vNUMA). I am simply wondering why the ESXi NUMA scheduler would place the two VMs on the same NUMA node, when to me it would be more efficient to place one VM on node 0 and the other on node 1. By placing both VMs on the same node they end up sharing cores, whereas on separate nodes each VM would have exclusive access to its cores. Not to mention that by placing both VMs on the same node, some of their memory becomes remote instead of all being local.
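
Just to put numbers on it, a quick back-of-the-envelope check using the host specs above (nothing here is vendor-specific):

# Rough arithmetic for "both VMs on one node" vs. one VM per node
cores_per_node, mem_per_node_gb = 10, 128        # per the host specs above
vcpus = {"VM1": 8, "VM2": 8}
mem_gb = {"VM1": 32, "VM2": 120}

print(sum(vcpus.values()), "vCPUs contending for", cores_per_node, "physical cores")  # 16 on 10
print(sum(mem_gb.values()), "GB wanted vs.", mem_per_node_gb, "GB in the node")       # 152 vs. 128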

Kauy
VMware Employee

Hey.

It is important to note that the NUMA architecture is not exposed to the guest OS (vNUMA disabled), since the VM doesn't have more than 8 vCPUs. It is possible to expose the NUMA topology (enabling vNUMA) by editing the advanced parameter numa.vcpu.min (take a look at this blog post). Try editing this parameter and your VMs should perform better.
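
A minimal sketch of setting that parameter through pyVmomi, reusing the lookup pattern from the affinity example earlier in the thread; the value 8 is an assumption chosen so that an 8-vCPU VM crosses the threshold (the default is 9), and it only takes effect at the next power-on.

# Sketch: lower the per-VM vNUMA threshold so vNUMA is exposed to an 8-vCPU guest.
# Assumes 'vm' is a vim.VirtualMachine already looked up as in the earlier sketch.
from pyVmomi import vim

spec = vim.vm.ConfigSpec(extraConfig=[
    vim.option.OptionValue(key="numa.vcpu.min", value="8")])
vm.ReconfigVM_Task(spec=spec)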

Kauy Souza
xadamz23
Contributor

I understand that. But at this point I do not want to expose NUMA to the VMs (vNUMA). I simply would like to know why the ESXi scheduler placed the VMs on the NUMA nodes the way that it did.

Thank you

vAMenezes
Enthusiast

I think that would be very hard to find out; it depends on what else you have going on on those hosts, I'm sure.

xadamz23
Contributor

There is nothing else going on with the hosts. Just the 2 VMs on each host, as I described. No other third-party stuff, no snapshots, nothing.

Thanks guys

greco827
Expert

The CPU scheduler will attempt to keep each of these VMs on the same NUMA node and use the same (local) memory each time it runs. That does not necessarily mean it will inherently separate two particular VMs. In this case, since the VMs seem to serve the same purpose, transparent page sharing may be a processing advantage, so the CPU scheduler decided to run these two VMs, which likely share many of the same memory pages, on the same NUMA node to take advantage of that and speed up each of them. If you add more load to the host, the CPU scheduler may ultimately separate the two.

You can disable page sharing as a test and see if the VMs wind up running on separate sockets.
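
For anyone who wants to run that test, here is a hedged sketch of the per-VM page-sharing knob (sched.mem.pshare.enable) using the same pyVmomi pattern as earlier in the thread; whether you would want this outside of a test is a separate question.

# Sketch: turn off transparent page sharing for one VM as an experiment.
# Assumes 'vm' is a vim.VirtualMachine already looked up as in the earlier sketch.
from pyVmomi import vim

spec = vim.vm.ConfigSpec(extraConfig=[
    vim.option.OptionValue(key="sched.mem.pshare.enable", value="FALSE")])
vm.ReconfigVM_Task(spec=spec)   # takes effect at the next power-on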

If you find this or any other answer useful please mark the answer as correct or helpful https://communities.vmware.com/people/greco827/blog
xadamz23
Contributor

That could be a possibility and would make sense to me. I figured there had to be a reason the scheduler chose to migrate the VMs onto the same NUMA node; for some reason it thinks scheduling them together is more efficient.

xadamz23
Contributor

I think this might be what is happening, since these VMs communicate with each other heavily. From https://www.vmware.com/files/pdf/techpaper/VMware-vSphere-CPU-Sched-Perf.pdf:

Frequent communication between virtual machines may initiate NUMA migration. As when the CPU scheduler considers the relationship between scheduling contexts and places them closely, NUMA clients might be placed on the same NUMA node due to frequent communication between them. Such relation, or "action-affinity," is not limited between NUMA clients and can also be established with I/O context. This policy may result in a situation where multiple virtual machines are placed on the same NUMA node, while there are very few on other NUMA nodes. While it might seem to be unbalanced and an indication of a problem, it is expected behavior and generally improves performance even if such a concentrated placement causes non-negligible ready time. Note that the algorithm is tuned in a way that high enough CPU contention will break such locality and trigger migration away to a less loaded NUMA node. If the default behavior of this policy is not desirable, this action-affinity-based migration can be turned off by overriding the following advanced host attribute. Refer to the vSphere Resource Management Guide [1] for how to set advanced host attributes.

/Numa/LocalityWeightActionAffinity
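
For completeness, a rough sketch of flipping that host attribute with pyVmomi; Numa.LocalityWeightActionAffinity is how the /Numa/LocalityWeightActionAffinity path shows up among the host's advanced options, and 'host' is assumed to be a vim.HostSystem looked up the same way the VMs were in the earlier sketch.

# Sketch: disable action-affinity-based NUMA migrations on one host.
# Assumes 'host' is a vim.HostSystem already retrieved via a container view.
from pyVmomi import vim

opt_mgr = host.configManager.advancedOption
opt_mgr.UpdateOptions(changedValue=[
    vim.option.OptionValue(key="Numa.LocalityWeightActionAffinity", value=0)])
# Note: if the call rejects the value, this option expects a long rather than a plain int.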
