Contributor

NUMA issue, again

Hi there.

I have 5 ESXi hosts, all running 5.1 Update 2.

I'm having issues with the NUMA scheduler: poor balancing.

Let me get straight to the point.

I'm playing (doing tests) with one host for now. Just one.

Dell R910, 4 Opterons --> 4 NUMA nodes: each --> 6 cores + 64 GB RAM. Total: 24 cores + 256 GB RAM.

10 VMs, with 4, 4, 4, 4, 2, 2, 1, 1, 1, 1 vCPUs respectively. Very well sized: each of them uses 80-90% of its vCPUs.

No under- or oversized situations. Memory is between 1 and 16 GB per VM, and there's no problem with memory. The issue is strictly CPU related.

OK. esxtop + m + f. NUMA statistics.

NUMA node    vCPUs (per VM)

0    4,4,1      9 vCPUs on 6 cores (!!). Terribly balanced. The 3 VMs have high CPU ready times.

1    2,2,1,1    Exactly fills the node's core count: 6. OK here.

2    4,1        5 vCPUs. OK here.

3    4          4 vCPUs. OK here.

So. Why? :(

I waited an entire day and the VMs stayed there. No rebalancing. Nothing.

So I fixed it manually. I moved the VMs between nodes using resource settings --> advanced memory / CPU (specifying NUMA node and CPU affinity*).

* Fact: I've read in the official documentation that specifying only the NUMA node under advanced memory does not work. You need to specify the CPU affinity too.

So, for example, for NUMA node 1 the CPUs are 6,7,8,9,10,11. I specified 6-11, which is the same :)
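For reference, the same pinning can be expressed directly as .vmx options. This is a sketch assuming NUMA node 1 owns pCPUs 6-11 as above; the exact keys the vSphere Client writes may vary by build:

```
numa.nodeAffinity = "1"                # advanced memory: home the VM on NUMA node 1
sched.cpu.affinity = "6,7,8,9,10,11"   # advanced CPU: pCPUs of node 1 (same as 6-11)
```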

The VMs moved instantly.

Result on esxtop:

0          4,1,1

1          4,1,1

2          4,2

3          4,2

Excellent. That's balance: VMs spread across the NUMA nodes, filling the 6 cores per node.

Yes, of course: memory locality (N%L) is 97-100%, every time. No problem there, as I noted at the beginning.

CPU ready time dropped to 0-50 ms on every VM. Perfect. Before, we were talking about 2000-5000 ms (!!!).


Another fact:

I've read that once a VM arrives on an ESXi host (by automatic vMotion, for example), the scheduler considers only the VM's memory: it puts the VM on the first NUMA node that has enough free memory to hold it. That's all.

It does not care about the vCPU count, which can deliver poor CPU performance in the short term.

Now, after one hour, I removed every affinity-related advanced setting from each VM.

After another hour, I checked CPU ready times on each VM. All doing fine except two.

I went to esxtop. Again. NUMA nodes imbalanced again.

One of the NUMA nodes held VMs totaling 7 vCPUs, and another, 8.

Why !!!

So, what am I doing right now and from now on?

I balance manually and then set, on each ESXi host: software --> advanced settings --> Numa.RebalanceEnable = 0.
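For anyone scripting this instead of clicking through the client, the same toggle should be settable with esxcli. A sketch for ESXi 5.x, assuming the /Numa/RebalanceEnable option path mirrors the Numa.RebalanceEnable name shown in the UI:

```shell
# Turn the NUMA rebalancer off on this host
esxcli system settings advanced set -o /Numa/RebalanceEnable -i 0

# Confirm the current value
esxcli system settings advanced list -o /Numa/RebalanceEnable
```

These are host configuration commands, so run them in the ESXi shell (or via SSH) on the affected host.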

The VMs stay right on the NUMA node I put them on.

Excellent CPU ready times up to now.


2 questions:

1) Is there a way to fix this using one or more of the NUMA advanced attributes? I want the VMs to be placed across the NUMA nodes taking each VM's vCPU count into account too, not only the memory!! It's obvious and essential!!! Otherwise you get the obvious bridge crossing (that's what I call it) between physical nodes, adding latency. Instant latency. I want each VM to stay on one NUMA node. No remote memory and no remote CPU!!

2) Is this, in some way, fixed in vSphere 5.5? Is the NUMA balancer/scheduler better there? This is quite worrying.

Thanks !!!!!!!

PS: the "again" in the subject is version related. I've seen NUMA poor-balancing issues in other discussion threads, for vSphere 4.1 and 5.0.

25 Replies
Contributor

So, as I said, I disabled the NUMA balancer, but... although ready times dropped to 0-100 ms, I can see in esxtop that there is no more NUMA: the NHN column now shows all the nodes for every VM. I thought this attribute would disable the balancing action, not NUMA itself.

So, with NUMA disabled, plain SMP scheduling takes over. Memory locality went to 100% for all VMs.

So, is this correct? I don't want to disable NUMA. I just want it to be fair.

Is there any way to force this, i.e. make it consider VM vCPU counts when it balances, by tuning any of the advanced attributes?

numa.automemaffinity

numa.costopskewadjust

numa.largeinterleave

numa.localityweightactionaffinit

numa.localityweightmem

ETC ??

Hoping to hear from you guys soon.

Contributor

So, after a little investigation (reading http://www.vmware.com/files/pdf/techpaper/VMware-vSphere-CPU-Sched-Perf.pdf ), the NUMA scheduler seems to consider:

     - inter-VM communication (?) and

     - long-term fairness (every NUMA node should have a similar CPU load)

... when balancing between nodes.

So I re-enabled the balancer (the attribute I had disabled, Numa.RebalanceEnable, back to 1) and set these two attributes to 0:

Numa.LocalityWeightActionAffinity

Numa.LTermFairnessInterval
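In case it helps anyone following along, these changes can also be applied from the shell. A sketch, assuming the ESXi 5.x esxcli namespaces and that the /Numa/* option paths match the attribute names above:

```shell
# Re-enable the NUMA rebalancer
esxcli system settings advanced set -o /Numa/RebalanceEnable -i 1

# Zero the two balancing inputs discussed above
esxcli system settings advanced set -o /Numa/LocalityWeightActionAffinity -i 0
esxcli system settings advanced set -o /Numa/LTermFairnessInterval -i 0
```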

If you think about it, this is useful in the sense that you can see which VMs get put together, and then balance the others manually.

So:

1) Leave it all by default.

2) See the balance on esxtop + m

3) Manually rebalance the wrongly balanced nodes by moving VMs with affinity settings.

4) Set the two attributes I mentioned to 0.

That's it.

So, this is fine if you have one or two ESXi hosts and a couple of VMs. Easy to follow.

But what about 20 ESXi hosts and 600 VMs?

Is it a dead end?

Contributor

Hi there.

Anyone?

Thanks!

Hot Shot

Good post so far... I'm wondering too, and I have quite a few hosts. I need some time to have a look at the balancing in our own environment...

VMware Employee

Hi there,

3 questions from my side:

Are you saying that setting Numa.LocalityWeightActionAffinity to 0 and waiting for a day does not cause balance migrations that re-distribute the VMs?

If you set it to 0 and vMotion everything off, then back on, how does it look?

While you do see high ready time, do you see an actual performance issue in your workloads?

Cheers,

Valentin

Contributor

Hi vbondzio, thanks for your post.

So, going directly to your questions.

1) I've opened a support request, of course. I'm still waiting for VMware to tell me what "VM communication" exactly is, for example.

VMs on the same datastore? VMs that talk to each other (an app VM talking to a database VM, for example)? Etc.

No, it didn't work. I wanted it to work like that, but the VMs moved from one NUMA node to another. I saw no balance again, and ready times went through the roof. I had to balance them manually, again. Ready times --> 0 ms. Perfection.

I'm thinking of moving to 5.5; nevertheless, I still can't believe the NUMA scheduler is that dumb. It does not care about VM ready times. It only cares about NUMA load (?): it tries to balance the nodes so every node has the same/similar CPU load.

Then again, what about VM communication? What is that?

If the NUMA scheduler doesn't care about VM ready times, then VMware should stop recommending sizing VMs with a core count that fits in a NUMA node.

2) Already tried that one... several times. Every time I did it, I saw the same thing: the VMs are positioned randomly --> if a node has enough free memory to hold the VM, up you go. No regard for vCPU count. No regard for ready times. A behaviour that would annoy anyone.

3) Yes, of course. I see slowness within the VMs. On Linux you see high load averages. Apps (Java, Apache) start to act erratically.

So, questions to be answered.

What do you guys think these attributes do... exactly?

a) LocalityWeightMem. Benefit of improving memory locality by 1 pct. Currently set to 1.

So a VM will be moved between NUMA nodes if that increases its N%L by 1%?

b) LocalityWeightActionAffinity. Benefit of improving action locality by 1 pct. Currently set to 0; default: 130.

This one is about "VM communication": the NUMA scheduler puts VMs together (same NUMA node) if they communicate a lot.

What is that communication, exactly?


c) MigImbalanceThreshold. Minimum percent load imbalance between nodes to trigger a migration. Default: 10.

Minimum. I set this to 200 (the maximum allowed). Is this OK? That way the VMs won't be moved until the NUMA scheduler sees a very large load imbalance. But again, this would probably only work after I do some manual balancing.


d) MigThreshold. Minimum percent load-imbalance improvement to allow a single migration. Default: 2.

I set it to 100. What's the difference from the previous one?
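Before reverting anything, it can help to audit which advanced settings are off their defaults. A sketch, assuming your build's esxcli supports the -d (delta) flag that lists only changed options:

```shell
# Show only advanced settings that differ from their defaults
esxcli system settings advanced list -d

# Or inspect one option at a time
esxcli system settings advanced list -o /Numa/MigImbalanceThreshold
esxcli system settings advanced list -o /Numa/MigThreshold
```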


I really don't want to touch these attributes. That's not the idea. Again, I want the NUMA scheduler to handle the balancing... perfectly.

Is this really fixed in 5.5?


Thanks.

VMware Employee

Hi,

As far as I understand, my colleague has already asked you for a performance log bundle; very good. Before you collect those, though, can I ask you to do the following:

1) Revert _all_ NUMA advanced settings back to their defaults, including Numa.LocalityWeightActionAffinity, which would need to be set to 130 again (don't forget to remove the CPU / memory affinity settings).

2) Wait until you can reproduce the unbalanced scenario, then take the performance snapshots.

3) Once those are done recording, set Numa.LocalityWeightActionAffinity back to 0.

4) Put the host into maintenance mode; once the host is empty, migrate the VMs back onto it, and once that is finished, capture a regular support log bundle.

5) Wait until there is enough load / "un-balance" that you see a visible performance impact on some of the VMs, record another performance log bundle, and upload as usual.

"Communication" means that VMs are using the same worldlets ("threads") for sending network packets among themselves (on the same host) or for sending frames out of the physical NIC. These worldlets can share a memory region to communicate with each other, i.e. the VM's vCPU notifies the vNIC worldlet that a new packet is available to be sent out by writing to that shared region. That worldlet will read from that region and do whatever is necessary based on the configuration, e.g. add a VLAN header. Then it will write into the same region that the frame is ready to be picked up by the pNIC worldlet, which will read that the frame can be sent out and write back that it has been sent, etc. (Note that this is a VERY rough explanation but, for all intents and purposes, correct enough to get the picture across.)

Having that shared memory region local to the last-level cache (LLC), i.e. the CPU cache that multiple cores of a socket share (all cores in a NUMA node always share an LLC), means that the constant communication between those worlds and worldlets (reads and writes to that region) doesn't have to go all the way to memory (or worse, remote memory). Since some worldlets, e.g. those communicating with the pNIC, are limited in number, some VMs might "huddle" together on one or two nodes to make use of the LLC locality.

Note that, theoretically, the benefit from the locality should make up for the added ready time (and it does in most benchmarks / tests we run). I.e. even though you see higher ready time on average, the decreased cost for networking results in an overall improvement for network-intensive workloads. Once ready time becomes too high, though, the VMs should be balance-migrated to other nodes. That being said, sometimes the locality does _not_ make up for the increased ready time, and it stays below the levels necessary to cause a balance migration. In those cases you can set Numa.LocalityWeightActionAffinity to 0, and that should be enough.

It is possible that this is something else in your environment and not what I explained above, but we should figure that out via the SR. Once we reviewed the logs we can give you some more concrete information.

Cheers,

Valentin

P.S.

Here's a PDF slide export that covers some of the above; while it's missing some explanatory animations, it should still fill in some of the gaps.

https://horizonworkspace.vmware.com:443/data/shf/i+xTgUEKR1SPUfKy72BX6AAFMTUyODIA

Contributor

Hi. Thanks for the info! Really useful.

Just so you know, I removed all the VMs from 2 hosts, waited a few minutes, and put them all back in.

Let me introduce these 2 "lab rats" :)

1) Dell R815 - Opteron 6140. 4 sockets. 8 cores per socket = 32 cores.

8 NUMA nodes. 4 cores per node.

----- On this one I have 18 VMs --> vCPUs --> 4 - 4 - 2 - 2 - 2 - 2 - 2 - 2 - 1 - 1 - 1 - 1 - 1 - 1 - 1 - 1 - 1 - 1 = 32.

They fit perfectly.

2) Dell R910 - Opteron 8439 SE. 4 sockets. 6 cores per socket = 24 cores.

4 NUMA nodes. 6 cores per node.

----- On this one I have 9 VMs --> vCPUs --> 6 - 4 - 4 - 4 - 2 - 1 - 1 - 1 - 1 = 24.

Same here: they fit perfectly.

Both have the default NUMA-related attributes.

I left them all night. Today I checked and I see:

- low ready times on ~80% of the VMs on the first one. Not bad.

- high ready times on ~80% of the VMs on the second one. I checked esxtop and I can see, for example, a 6-vCPU and a 2-vCPU VM on the same NUMA node. Not good.

I would like to add 2 things:

- There are never memory-locality issues: all VMs are at 99-100% (N%L).

- These are not under/oversized VMs. As I said in one of the first posts, they all run at 70-90% (CPU-wise).

Finally, I would like to take this to another level: is there a known problem with certain Opteron models?

Thank you!

VMware Employee

gferreyra wrote:

[..]

2) Dell R910 - Opteron 8439 SE. 4 sockets. 6 cores per socket = 24 cores.

[..]

Just a minor thing: this can't be an R910, as the "0" indicates an Intel-based server; "5" is for AMD.

Based on the Opteron model you specified, I think this is actually a PowerEdge R905; can you confirm?

VMware Employee

There is no known issue with specific CPU models. Depending on what the underlying issue is, it might be more pronounced with certain NUMA configurations, VM sizes and workload characteristics. Get vm-support -p from your two lab rats as well and upload the bundles to the SR. We'll look at those two as well, but I'm more interested in the logs I asked for in my last post.

While there might be more tests coming your way, we'll try our best to get to the bottom of this as soon as possible 🙂

Cheers,

Valentin

Contributor

Yes, sorry about that one.

R905. Confirmed.

Contributor

I've just uploaded 2 log bundles with performance data (15-minute collection, samples every 10 seconds), one for each ESXi host.

Hope to hear from you soon!

Immortal

Hi gferreyra,

I'm sorry to hear that the default NUMA scheduler doesn't provide an optimal placement decision. In the following example, having 9 vCPUs on a 6-core node is indeed not good while node 3 has two cores available. What puzzles me is that at least a 1-vCPU VM should be migrated to node 3. I wonder whether the status below is persistent.

Numa nodes and cores (VMs)

0          4,4,1            9 cores (!!). Terribly balanced. The 3 VMs have high CPU ready times.

1          2,2,1,1          Completing numa core count. 6 cores. Ok here.

2          4,1                5 cores. Ok here.

3          4                  4 cores. Ok here.

Anyway, this kind of placement is a local optimum, and breaking out of it is tricky because moving one 4-vCPU VM to another node would cause even higher CPU imbalance. Note that the current NUMA scheduler performs either a one-way move or a two-way swap (meaning switching VM A and VM B between two nodes). In your example a one-way move won't improve the situation, while multiple two-way swaps might reach a better state.

As a two-way swap is more computation-heavy, it is not performed as frequently as one-way moves. So it may help to break out of such a local optimum by attempting two-way swaps more frequently. You can achieve this by reducing SwapInterval:

/Numa/SwapInterval = 1

Secondly, a proposed NUMA migration may be rejected if it doesn't reduce load imbalance significantly. Otherwise the present migration may trigger a future migration, which we call "thrashing". To avoid thrashing, we require that the load imbalance be reduced significantly. I've seen that relaxing this requirement helps to break out of the local optimum:

/Numa/MigThrashThreshold = 90 seems helpful.
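For convenience, the two tweaks above can also be applied from the host shell. A sketch assuming the ESXi 5.x esxcli syntax:

```shell
esxcli system settings advanced set -o /Numa/SwapInterval -i 1
esxcli system settings advanced set -o /Numa/MigThrashThreshold -i 90

# Verify that both values took effect
esxcli system settings advanced list -o /Numa/SwapInterval
esxcli system settings advanced list -o /Numa/MigThrashThreshold
```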

The side effect is that you'll have more NUMA migrations. At least I don't see extremely high ready times. It's not ideal, but hopefully it works for your case.

Let me know if the above suggestions don't work for you.

Thanks!

Contributor

Hi. Nice stuff. I didn't know about that one-way-move / two-way-swap behaviour.

Everyone, I'm attaching 2 captures from the Dell R905 (4 NUMA nodes, 6 cores each).

As you can see, VMs 7661478 (4 vCPUs), 7661137 (4 vCPUs) and 7662517 (2 vCPUs) are all on the same NUMA node (0), with high ready times for all three.

VM 7661469 has 6 vCPUs and it's alone on a NUMA node. That's good.

ESXi-1.jpg

ESXi-2.jpg

Thanks!

Immortal

Hi gferreyra,


I wonder if you have had a chance to try the following tweak:


/Numa/SwapInterval = 1

/Numa/MigThrashThreshold = 90


Please note that this kind of local optimum needs a few prerequisites to happen:

- All NUMA nodes have consistently high CPU load (~80%?).

- The majority of VMs are configured close to the NUMA node size.

Thanks!

Contributor

Hi there. I'm still doing tests.

But no, it's not working: still high CPU ready times.

I guess I'll try vSphere 5.5 directly. I'll get back with new data.

Cheers

Contributor

Hi there.

The documentation tells us to consider the cores per NUMA node when sizing a VM.

Example --> each NUMA node = 2 cores --> a VM should have 2 cores, maximum.

No need to use another NUMA node --> no boundary crossing --> no added latency.

Question.

Dell R815, 4 sockets (Opteron 63xx), 8 cores each, 32 cores total.

VMware tells me that I have 8 NUMA nodes, not 4. Leaving that fact aside for a moment, let's concentrate on NUMA efficiency vs. VM sizing.

I can create a VM with up to 8 cores, right? CPU ready time will be fine, right?

Consider that the VM is in fact using the 8 cores (~95% load).

The VM stays on one socket: 2 NUMA nodes, but 1 socket.

No added latency from crossing from one socket to another.

Can anyone confirm this? If you have more than 1 NUMA node per socket, can you just consider the total cores per socket when sizing a VM?
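One way to see what the host actually detects, rather than inferring it from the socket count. A sketch assuming ESXi 5.x esxcli; the NUMA node count reported here is what the scheduler works with:

```shell
# Reports physical memory plus the NUMA node count the host detected
esxcli hardware memory get

# Lists logical CPUs with their package (socket) IDs
esxcli hardware cpu list
```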

Thanks!!!

Contributor

Anyone?

Immortal

Some Opterons have a more complex internal architecture... internally they are basically dual-node (see the Bulldozer architecture).

PS: you always write R910, but with the final 0 it must be an Intel box, not AMD... I suppose you mean an R905.

Have you verified your CPU architecture, to see if the hosts are the same? Or similar?

Andre | http://about.me/amauro | http://vinfrastructure.it/ | @Andrea_Mauro