RobWilkes1
Contributor

VMs balanced across NUMA nodes, however single CPU Package high utilisation

I have an application distributed across a cluster of VMs, each sized 4 vCPUs and 2GB vMEM.

The VMs are running on multiple hosts with 4x 15-core CPUs and 64GB RAM (it's a CPU heavy application, not memory).
(4x NUMA nodes, each with 15 pCPU and 16GB RAM)

Only 57 vCPUs are allocated across the 60 physical cores, so there should be zero CPU contention (there is also no memory oversubscription).

ESXTOP seems to indicate the VMs are evenly distributed across NUMA nodes:

[Attachment: RobWilkes1_0-1614142628319.png — esxtop NUMA statistics]

However, the ESXi GUI basically shows a single CPU package being utilised:

[Attachment: RobWilkes1_1-1614142916431.png — per-package CPU utilisation in the ESXi GUI]

 

This has resulted in high CPU ready times (up to 15%).

I assume it's basically trying to schedule 57 vCPUs onto a single 15-core socket?

I can work around this to a degree by setting CPU affinities, resulting in all packages being more evenly utilised and CPU ready times dropping to <1%.
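
For reference, a rough sketch of how that workaround could be scripted with pyVmomi (untested; the vCenter address, credentials and VM name are placeholders, and pinning to cores 0-14 assumes those map to a single 15-core package):

from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim
import ssl

ctx = ssl._create_unverified_context()
si = SmartConnect(host='vcenter.example.com', user='administrator@vsphere.local',
                  pwd='***', sslContext=ctx)
content = si.RetrieveContent()

# Locate the VM by name (placeholder name)
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == 'app-vm-01')

# Pin the VM's vCPUs to the cores of one physical package (cores 0-14 assumed)
spec = vim.vm.ConfigSpec()
spec.cpuAffinity = vim.vm.AffinityInfo(affinitySet=list(range(0, 15)))
WaitForTask(vm.ReconfigVM_Task(spec=spec))

view.Destroy()
Disconnect(si)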

However, I would prefer not to set affinities and instead have the NUMA scheduler balance the nodes more effectively, as there are (failure) scenarios where some VMs may get loaded up more heavily than others, creating NUMA imbalances that the scheduler should ideally be able to rebalance.

I have changed Numa.LocalityWeightActionAffinity in an attempt to stop them all executing on a single CPU package, but it has made no difference.
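
For anyone wanting to script that change rather than use the host Advanced System Settings UI, a minimal pyVmomi sketch (untested; the host address and credentials are placeholders, and 0 is the value commonly used to disable action affinity):

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim
import ssl

ctx = ssl._create_unverified_context()
si = SmartConnect(host='esxi-host-01.example.com', user='root', pwd='***', sslContext=ctx)
content = si.RetrieveContent()

# Connected directly to a single ESXi host, so take the only HostSystem in the inventory
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
host = view.view[0]

# Set the NUMA action-affinity weight; 0 disables relational scheduling
# (the value type must match the option's declared integer type)
host.configManager.advancedOption.UpdateOptions(
    changedValue=[vim.option.OptionValue(key='Numa.LocalityWeightActionAffinity', value=0)])

view.Destroy()
Disconnect(si)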

I understand there are benefits to having VMs executing on a single NUMA node, especially if there is a lot of IO between them and they fit on the node, but I can't understand why the scheduler would want to run all of these VMs on a single NUMA node, presumably accessing memory on remote NUMA nodes (a performance impact), when CPU ready times are at 15%. It would make more sense to execute these VMs on their NUMA home nodes.

Is there something I'm missing? Any suggestions? Thanks.

RobWilkes1
Contributor

Since it's not a memory intensive workload, should I consider simply disabling NUMA?

It's more important that all cores are utilised than that each VM uses the cores closest to its memory.

Although that's not really what the NUMA scheduler seems to be doing right now, since a single socket is being loaded up with, as far as I can tell, most of the memory sitting on remote NUMA nodes.

FDenneman01
VMware Employee

Please, please, please do not disable NUMA. NUMA isn't solely a software feature; it's a hardware layout. In my VMware talk 60 minutes of NUMA (2019 and 2020) I explain the hardware layout and some of the scheduling decisions made. With NUMA disabled, writes are done in a round-robin way, 4 KB to one node, 4 KB to the next, so from a write latency perspective you get an average 20% performance hit. Read performance, however, goes down the drain: you read the data from wherever it happens to be stored, and every time the cache is flushed you have to retrieve it again, giving you a wonderful performance hit of more than 70%.

The reason you see a particular NUMA node becoming the "meeting point" for all these VMs is relational scheduling (better known as action affinity). The scheduler detects that the VMs seem to "communicate" with each other; simply put, they show the same data access patterns, and what better way to optimize performance than to have these VMs use the data that is already in the L3 cache. On an Intel system, accessing local memory takes roughly 75 to 80 ns, while accessing the L3 cache takes about 15 ns. So you typically get better performance.
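
As a rough back-of-the-envelope illustration of why sharing the L3 cache pays off (the ~15 ns and ~78 ns figures are the numbers above; the hit rates are purely illustrative assumptions, not measurements):

# Average access latency for a blend of L3 hits and local DRAM accesses
L3_NS = 15.0    # ~L3 cache hit latency from the figures above
DRAM_NS = 78.0  # ~local memory latency from the figures above

def avg_latency(l3_hit_rate: float) -> float:
    """Blend L3-hit and local-DRAM latency for a given (assumed) hit rate."""
    return l3_hit_rate * L3_NS + (1.0 - l3_hit_rate) * DRAM_NS

for hit_rate in (0.5, 0.8):  # assumed hit rates for illustration only
    print(f"L3 hit rate {hit_rate:.0%}: ~{avg_latency(hit_rate):.0f} ns average access")
# an 80% hit rate averages ~28 ns instead of ~78 ns for DRAM-only access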

In most situations this behavior isn't noticeable, due to the heterogeneous set of applications running on an ESXi host. You've got some databases, some web servers, some file servers, maybe a few Kubernetes clusters, and they all have different access patterns. But here you have quite a homogeneous distribution of applications, and that's when this behavior stands out quite clearly.

You already found the setting to disable relational scheduling. It will be interesting to see what performance you get with and without it. Last week I had a meeting with a customer about the same issue; they had a single node running all their SQL machines. From a CPU utilization perspective it was completely unbalanced, 87% against 1.5% (dual-socket system), but the overall performance degradation was 13% even with the ready time they were seeing, just because of the benefit of having the data so close to the CPU.

Please try to measure the application performance by finding a metric that is generated by the application rather than at the hypervisor level (transactions in SQL, for example; I'm not sure what such a metric would be in your application) and make a comparison between the two scenarios. It might be better in your situation to keep action affinity disabled, given that you are running a homogeneous application distribution.
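
A minimal sketch of the kind of comparison I mean, assuming you can export that application-level metric as one numeric sample per line for each scenario (the file names are placeholders):

from statistics import mean, median

def load_samples(path: str) -> list[float]:
    """Read one application-metric sample per line (e.g. calls or transactions per second)."""
    with open(path) as f:
        return [float(line) for line in f if line.strip()]

on = load_samples('metric_action_affinity_on.txt')    # placeholder file name
off = load_samples('metric_action_affinity_off.txt')  # placeholder file name

delta_pct = (mean(off) - mean(on)) / mean(on) * 100
print(f"mean with action affinity:    {mean(on):.1f}")
print(f"mean without action affinity: {mean(off):.1f} ({delta_pct:+.1f}%)")
print(f"median with / without:        {median(on):.1f} / {median(off):.1f}")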

On my site numa.af I published a set of NUMA articles on vSphere, and you can download the Host Resources Deep Dive book for free at hostdeepdive.com; it contains roughly 300 pages about NUMA. Hope this helped.

 

RobWilkes1
Contributor

Thanks for the thorough reply.

This particular application is a voice application; it's not about processing or high throughput, it's about getting the packets through the box and back onto the wire as fast as possible.

The inter-VM communication is quite low, some VRRP packets and some SIP signalling; the majority of the traffic (RTP) enters the box, hits a single media VM, then hairpins back out onto the wire rather than traversing multiple VMs.

I think high CPU ready would be detrimental to call quality. Analyzing the impact is complex, and whether a call sounds good is largely subjective; there are tools and methods to analyze voice, and while they don't show anything wrong, I would suspect that 15% CPU ready (and climbing as more channels are added) would result in increased latency, possibly increased jitter, and eventually a degraded experience.

Do the VMs need to be restarted after disabling relational scheduling? The VMware KB article does not say so, however we did not see any rebalancing after disabling the setting.

I'll have a read through your articles, they sound interesting.

Thanks again.

FDenneman01
VMware Employee

I can imagine that CPU ready time is unwanted in this situation. Is it possible to vMotion these machines (via maintenance mode)? Typically, vMotion is something you do not want with voice applications, but in this case it could speed things up: put the host into maintenance mode and take it out of maintenance mode as soon as possible so the machines can flow back.
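
If you want to script that maintenance-mode bounce, a rough pyVmomi sketch (untested; the vCenter and host names are placeholders, and it assumes DRS is fully automated so entering maintenance mode actually vMotions the VMs off):

from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim
import ssl

ctx = ssl._create_unverified_context()
si = SmartConnect(host='vcenter.example.com', user='administrator@vsphere.local',
                  pwd='***', sslContext=ctx)
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
host = next(h for h in view.view if h.name == 'esxi-host-01.example.com')

# Enter maintenance mode (DRS evacuates the running VMs), then exit straight away
WaitForTask(host.EnterMaintenanceMode_Task(timeout=0))
WaitForTask(host.ExitMaintenanceMode_Task(timeout=0))

view.Destroy()
Disconnect(si)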

If you disable the setting, the NUMA scheduler will not generate relational scheduling events and thus won't move VMs together onto the same NUMA node; it will just look at the overall balance of the different NUMA nodes. It does track the "cost/benefit" in some way to see whether a particular move makes sense, and it might be that in your scenario a single move does not generate enough "benefit" over the "cost" involved.

It may make sense to file a support ticket and have GSS look at this to determine whether there are any other blockers present.
