Contributor

NUMA issue, again

Hi there.

I have 5 ESXi hosts, all running 5.1 Update 2.

I'm having issues with the NUMA scheduler: poor balancing.

Let me get straight to the point.

I'm running tests right now on a single host. Just one.

Dell R910, 4 Opteron CPUs → 4 NUMA nodes, each with 6 cores + 64 GB RAM. Total: 24 cores + 256 GB RAM.

10 VMs, with 4, 4, 4, 4, 2, 2, 1, 1, 1, 1 vCPUs respectively. Very well sized: each one uses 80-90% of its vCPUs.

No under- or oversized situations. Memory is between 1 and 16 GB per VM, with no memory problems. The issue is strictly CPU related.

OK. esxtop, then "m" (memory view), then "f" to add the NUMA statistics fields.

NUMA node   VMs (vCPUs)    Total

0           4, 4, 1        9 cores (!!). Terribly balanced; all 3 VMs have high CPU ready times.

1           2, 2, 1, 1     6 cores, matching the node's core count. OK here.

2           4, 1           5 cores. OK here.

3           4              4 cores. OK here.

So. Why? :(

I waited an entire day and the VMs stayed there. No rebalance. Nothing.

So I fixed it manually, moving the VMs between nodes via resource settings → advanced memory / CPU (specifying the NUMA node and the CPU affinity*).

* Fact: I've read in the official documentation that specifying only the NUMA node under advanced memory does not work. You need to specify the CPU affinity too.

So, for example, the CPUs for NUMA node 1 are 6, 7, 8, 9, 10, 11. I specified 6-11, which is the same thing. :)
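For reference, the UI steps above map to per-VM advanced configuration parameters. A sketch of the equivalent .vmx entries for pinning a VM to NUMA node 1 (option names as commonly documented; treat this as an assumption and verify against your vSphere version):

```
numa.nodeAffinity = "1"
sched.cpu.affinity = "6,7,8,9,10,11"
```

Removing both entries hands the VM back to the automatic NUMA scheduler.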

The VMs move instantly.

Result in esxtop:

0          4, 1, 1

1          4, 1, 1

2          4, 2

3          4, 2

Excellent. That's balance: VMs spread across the NUMA nodes, filling the 6 cores per node.

Yes, of course: memory locality is 97-100%, every time. No problem there, as I noted at the beginning.

CPU ready time dropped to 0-50 ms on every VM. Perfect. Before, we were talking about 2,000-5,000 ms (!!!).
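Incidentally, the manual layout above is just a bin-packing exercise. A minimal sketch in Python (illustrative only; this is not ESXi's actual placement algorithm) showing that a first-fit-decreasing pass over these VM sizes fully packs all four 6-core nodes:

```python
def first_fit_decreasing(vcpus, nodes=4, cores_per_node=6):
    """Place each VM (largest first) on the first NUMA node
    with enough free cores to hold all of its vCPUs."""
    placement = [[] for _ in range(nodes)]
    load = [0] * nodes
    for v in sorted(vcpus, reverse=True):
        for n in range(nodes):
            if load[n] + v <= cores_per_node:
                placement[n].append(v)
                load[n] += v
                break
        else:
            raise ValueError(f"no node can hold a {v}-vCPU VM")
    return placement, load

# The 10 VMs from the post: 4,4,4,4,2,2,1,1,1,1 vCPUs.
placement, load = first_fit_decreasing([4, 4, 4, 4, 2, 2, 1, 1, 1, 1])
print(placement)  # [[4, 2], [4, 2], [4, 1, 1], [4, 1, 1]]
print(load)       # [6, 6, 6, 6]
```

This matches the hand-balanced result: one 4-vCPU VM per node, with the smaller VMs filling the remainder.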


Another fact:

I've read that once a VM arrives on a new ESXi host (via automatic vMotion, for example), the scheduler considers only the VM's memory: it places the VM on the first NUMA node with enough free memory to hold it. That's all.

It does not care about the vCPU count, which can deliver poor CPU performance in the short term.

Then, after an hour, I removed all the affinity-related entries from every VM's advanced settings.

After another hour, I checked CPU ready times on each VM. All doing fine, except two.

I went back to esxtop. AGAIN: NUMA nodes imbalanced.

One NUMA node held VMs totaling 7 cores, and another 8 cores.

Why?!

So, what am I doing right now and from now on?

Manual balancing, then: ESXi host → Software → Advanced Settings → Numa.RebalanceEnable = 0.
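For what it's worth, the same toggle can be set from the host shell. A sketch assuming the /Numa/RebalanceEnable advanced-option path (verify the path on your build before relying on it):

```
# Disable automatic NUMA rebalancing (assumed option path)
esxcli system settings advanced set -o /Numa/RebalanceEnable -i 0

# Confirm the current value
esxcli system settings advanced list -o /Numa/RebalanceEnable
```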

The VMs stay right on the NUMA node where I put them.

Excellent CPU ready times so far.


2 questions:

1) Is there a way to fix this using one or more of the NUMA advanced attributes? I want the VMs placed on the NUMA nodes considering each VM's vCPU count, too, not only its memory. That's obvious and essential! Otherwise you pay for the obvious "bridge cross" (as I call it) between physical cores on different sockets, adding instant latency. I want each VM to stay on one NUMA node: no remote memory, no remote CPU.

2) Is this fixed in vSphere 5.5? Is the NUMA balancer/scheduler better there? This behavior is quite frightening.

Thanks !!!!!!!

PS: the "again" in the subject is version related. I've seen NUMA balancing issues reported in other discussion threads for vSphere 4.1 and 5.0.

25 Replies
Contributor

Please check the last post: "Dell R815, Opteron 6300 series."

Thanks!

Contributor

Is there any known issue between VMware vSphere and AMD CPU families?

Specifically the Opteron 6200/6300 series, e.g. in a Dell R815.

VMware splits a 4-socket server (8 cores per socket) into 8 NUMA nodes, not 4.

One would then expect NUMA node pairs 0/1, 2/3, 4/5, and 6/7 to each lie on the same socket.

Is this correct?

We are using the latest vSphere 5.5 version, latest patch. I'm still seeing the worst NUMA allocation I've ever seen.

4 VMs (20 cores total) on a 32-core ESXi host. According to esxtop:

- vm#1 (8 cores): on NUMA nodes 1 and 4
- vm#2 (8 cores): on NUMA nodes 2 and 7
- vm#3 (4 cores): on NUMA node 5
- vm#4 (2 cores): on NUMA node 0

Ready time for vm1 and vm2 is through the roof. Why?

They should be "happy" on NUMA nodes 0/1 and 2/3, respectively.

My questions:

1) NUMA node split on AMD: please confirm. Are nodes 0 and 1 on the same socket? 2/3 another socket, 4/5 another, 6/7 another?

2) Is there any issue between VMware and the AMD architecture regarding VM-to-NUMA allocation?

3) Should vSphere 6 be better regarding this matter (VM-to-NUMA allocation)?

4) Does this feature work correctly with Intel? We need some proof before making such an architecture change (AMD → Intel).

5) We have not experienced this behaviour with KVM or Xen. Same VMs, same physical server.

Thanks!!!

Immortal

Hi gferreyra,

My questions:

1) NUMA node split on AMD: please confirm. Are nodes 0 and 1 on the same socket? 2/3 another socket, 4/5 another, 6/7 another?

2) Is there any issue between VMware and the AMD architecture regarding VM-to-NUMA allocation?

3) Should vSphere 6 be better regarding this matter (VM-to-NUMA allocation)?

4) Does this feature work correctly with Intel? We need some proof before making such an architecture change (AMD → Intel).

5) We have not experienced this behaviour with KVM or Xen. Same VMs, same physical server.

(1) Recent AMD CPUs have two NUMA nodes per socket, so having 8 NUMA nodes on a 4-socket Opteron 6200/6300 host is correct.

      Nodes 0 and 1 are on the same socket; ditto for 2/3, etc.
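In other words, with two nodes per socket and consecutive numbering, the node-to-socket mapping is just integer division. A small illustrative sketch (assumes the numbering described above):

```python
def socket_of(node, nodes_per_socket=2):
    """Map a NUMA node ID to its socket, assuming consecutive
    node IDs share a socket as described above."""
    return node // nodes_per_socket

# 8 NUMA nodes on a 4-socket Opteron 6200/6300 host:
print([socket_of(n) for n in range(8)])  # [0, 0, 1, 1, 2, 2, 3, 3]
```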

(2) I'm not aware of any issues on vSphere and AMD.

(3) vSphere 6 has the same scheduling policy as before.

(4) Again, I don't expect a difference between AMD and Intel regarding the NUMA scheduler. What may matter is how many NUMA nodes exist per host and how many cores per node.

We are using the latest vSphere 5.5 version, latest patch. I'm still seeing the worst NUMA allocation I've ever seen.

4 VMs (20 cores total) on a 32-core ESXi host. According to esxtop:

- vm#1 (8 cores): on NUMA nodes 1 and 4
- vm#2 (8 cores): on NUMA nodes 2 and 7
- vm#3 (4 cores): on NUMA node 5
- vm#4 (2 cores): on NUMA node 0

Ready time for vm1 and vm2 is through the roof. Why?

They should be "happy" on NUMA nodes 0/1 and 2/3, respectively.

Let me assume the ESXi host has 4 sockets, 32 cores, and 8 NUMA nodes, so 4 cores per node.

VM1 (8 vCPUs) is placed on nodes 1 and 4, probably because nodes 1 and 4 are one hop apart.

VM2 (8 vCPUs) is placed on nodes 2 and 7, by similar logic.

Note that nodes 0 and 1 are also one hop away: the access latency between nodes 0 and 1 is the same as between nodes 1 and 4. However, the *bandwidth* between 0 and 1 should be higher than between 1 and 4.

So it would be better to place VM1 on nodes 0 and 1. Currently, ESXi does not consider the interconnect bandwidth between NUMA nodes.
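A toy model of this point (assumed topology, not measured data): among one-hop node pairs with equal latency, a bandwidth-aware policy would break the tie in favor of the same-socket pair.

```python
def same_socket(a, b, nodes_per_socket=2):
    """Assumed numbering: consecutive NUMA node IDs share a socket."""
    return a // nodes_per_socket == b // nodes_per_socket

def pick_pair(one_hop_pairs):
    """Among equal-latency (one-hop) node pairs, prefer the
    same-socket pair, which has the wider interconnect."""
    return max(one_hop_pairs, key=lambda p: same_socket(*p))

# (1, 4) and (0, 1) are both one hop apart in this topology,
# but only (0, 1) shares a socket:
print(pick_pair([(1, 4), (0, 1)]))  # (0, 1)
```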

VM3 and VM4 are placed on NUMA nodes that hold no other VMs, which seems like good placement to me.

I don't understand why VM1 and VM2 have high ready time. How high is it?

Just based on your description, there doesn't seem to be any CPU over-commitment.

Were there other VMs you didn't mention? What workload did you run?

Do you have a specific performance problem, or are you mainly concerned about the ready time?

BTW, you mentioned 4 VMs with 20 vCPUs, but the actual description totals 22 vCPUs.

It may have been suggested before, but it would be best to file an SR to further debug any performance issue.

Regarding (5), can you kindly describe what the behavior was on KVM or Xen?

Thanks.

Contributor

Hello seongbeom

So:

1) Nodes 0 and 1 on the same socket. Same for 2/3, 4/5, and so on. Great.

2) No issues with that processor family. OK.

3) No scheduling changes in vSphere 6. OK.

4) My obvious question is: why is VM#1, with 8 cores, not placed on nodes 0 and 1?

Again: Dell R815, 4 sockets, 8 cores per socket → 8 NUMA nodes, 4 cores per NUMA node.

I want the VMkernel to place VMs across the NUMA nodes so that we ALWAYS see LOW CPU ready times, which is the number-one problem in any virtual infrastructure.

We are not using automatic DRS, because it does not factor CPU ready time into its calculations, only host CPU usage.

We run DRS in Manual mode, across 20 ESXi hosts and 500 VMs. We no longer trust automatic DRS.

When I say high ready time, I mean >500-700 ms, up to 2 seconds. We have very latency-sensitive apps; there's no time to spend waiting for CPU cycles.

I understand that nodes 1 and 4 may be one hop apart, just like 0 and 1, but it still doesn't make sense.

The bridge/interconnect between sockets is very slow compared to the interconnect within a socket.

Why would the VMkernel choose nodes 1 and 4, when 0 and 1 is the way to go?

Thanks for your time!!!

Contributor

Nothing? No idea?

VMware Employee

gferreyra

I just noticed you are still interested in this thread.  Unfortunately, @seongbeom is no longer with VMware.

For increased visibility, I'd suggest starting a new thread (with a pointer to the old thread).  Feel free to shoot me a link to the new thread in a private message, and I'll make sure you get a prompt response.
