MKguy
Virtuoso

Poor NUMA-locality and -distribution

I have a couple of HP DL380 G6 Servers with Xeon 5550 Nehalem CPUs running ESXi 4.1 U1.

When checking the memory view in esxtop with NUMA Stats enabled, I notice two possible problems:

a) Some VMs display very poor memory locality even though their home node has more than enough free memory available (see the second image, where one VM has only 13% of its memory on its local home node).

b) One NUMA node is usually much more heavily loaded than the other (see the first image, where node 1 holds far more VMs and total memory than node 0).

Another thing I noticed is that the NMIG value never exceeds zero, so apparently ESX doesn't even try to relocate VMs between NUMA nodes.
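
For reference, the N%L column in esxtop's NUMA stats is just the share of a VM's memory resident on its home node, computed from NLMEM (local MB) and NRMEM (remote MB). A minimal sketch of that calculation, with made-up values mirroring the 13%-local VM above:

```python
# N%L in esxtop: percentage of a VM's memory on its NUMA home node,
# derived from NLMEM (local memory) and NRMEM (remote memory).
def numa_locality_pct(nlmem_mb: float, nrmem_mb: float) -> float:
    total = nlmem_mb + nrmem_mb
    return 100.0 * nlmem_mb / total if total else 100.0

# Hypothetical values for a VM with ~13% locality:
print(round(numa_locality_pct(530, 3566), 1))  # 12.9
```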

Although I'm not directly experiencing poor performance on these systems, it still confuses me why ESX would schedule the VMs this way.

I know I'm not the only one with this issue; comments on this awesome article report the same behaviour.

And yes, I have read the infamous ESX CPU scheduler whitepaper (and other NUMA-related posts by Frank, like the one above).

-- http://alpacapowered.wordpress.com
Dev09
Enthusiast

> a) Some VMs display very poor memory locality even though their home node has more than enough free memory available (see the second image, where one VM has only 13% of its memory on its local home node).

The NUMA scheduler performs CPU load balancing first and only then considers memory locality. In a case where the home node has a high CPU load, the scheduler will not consider the VM for migration on the basis of memory locality.
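
A much-simplified sketch of that ordering, assuming a made-up CPU load threshold and migration rules (this illustrates the priority described above, not ESX's actual algorithm):

```python
from dataclasses import dataclass

@dataclass
class Node:
    cpu_load: float      # fraction of node CPU capacity in use, 0.0-1.0
    free_mem_mb: float

# CPU load balancing takes priority; a locality-driven migration is only
# considered when CPU load permits. The 0.8 load threshold and the 50%
# locality cutoff are assumptions for illustration.
def consider_locality_migration(home: Node, dest: Node,
                                vm_mem_mb: float, locality_pct: float) -> bool:
    if home.cpu_load >= 0.8:          # home node busy: CPU balancing wins
        return False
    if dest.free_mem_mb < vm_mem_mb:  # destination must fit the VM's memory
        return False
    return locality_pct < 50.0        # only migrate if locality is poor

# A VM with 13% locality stays put while its home node is CPU-saturated:
home = Node(cpu_load=0.9, free_mem_mb=8000)
dest = Node(cpu_load=0.4, free_mem_mb=16000)
print(consider_locality_migration(home, dest, 4096, 13.0))  # False
```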

> b) One NUMA node is usually much more heavily loaded than the other (see the first image, where node 1 holds far more VMs and total memory than node 0).

Load on a NUMA node is not based on the number of VMs; it is based on the amount of CPU cycles consumed by the running VMs. That in turn depends on the number of vCPUs per VM, its shares, the workload running inside the VM, and so on.

Let's take the example of a two-node NUMA system with 1000 GHz of CPU cycles available on each node. It is possible to have 3 VMs collectively consuming 800 GHz while 10 other VMs together consume only 700 GHz. In that case, it is expected that the first 3 VMs will run on one node (say node 0) and the other 10 on the remaining node (node 1), and no load-balancing migration will happen. A worked version of this arithmetic follows below.
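
To make the numbers concrete, here is a minimal Python sketch of that example. The per-VM demand split and the imbalance threshold are made up for illustration; only the 800/700 GHz totals come from the example above.

```python
# Capacities and demands in GHz. The 15% imbalance threshold is an
# assumption for illustration; the real scheduler's criteria are internal.
NODE_CAPACITY = 1000.0

node0_vms = [300.0, 300.0, 200.0]   # 3 VMs consuming 800 GHz in total
node1_vms = [70.0] * 10             # 10 VMs consuming 700 GHz in total

load0 = sum(node0_vms) / NODE_CAPACITY   # 0.8
load1 = sum(node1_vms) / NODE_CAPACITY   # 0.7

# Node 0 runs far fewer VMs yet is the *more* loaded node:
print(f"node 0: {len(node0_vms)} VMs, load {load0:.0%}")
print(f"node 1: {len(node1_vms)} VMs, load {load1:.0%}")

# The spread is only 10 points, so no rebalancing migration is triggered:
print("migrate?", abs(load0 - load1) > 0.15)   # False
```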

-Dev