      • 15. Re: NUMA issue, again
        Novice

        Hi gferreyra,


        I wonder if you have had a chance to try the following tweak:


        /Numa/SwapInterval = 1

        /Numa/MigThrashThreshold = 90
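
        (For reference, these are host-level advanced options. A minimal sketch of setting and verifying them from the ESXi shell, assuming esxcli access on 5.x -- please double-check the option names on your build before applying them:)

        # set the two NUMA scheduler options
        esxcli system settings advanced set -o /Numa/SwapInterval -i 1
        esxcli system settings advanced set -o /Numa/MigThrashThreshold -i 90

        # confirm the current values
        esxcli system settings advanced list -o /Numa/SwapInterval
        esxcli system settings advanced list -o /Numa/MigThrashThreshold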


        Please note that this kind of local optimum only happens under a few prerequisites:

        --All NUMA nodes have consistently high CPU load (~80%?).

        --The majority of VMs are sized close to the NUMA node size.

         

        Thanks!

        • 16. Re: NUMA issue, again
          gferreyra Novice

          Hi there. I'm still running tests.

           

          But no, it is not working. CPU ready times are still high.

          I guess I will try directly with vSphere 5.5 and will get back with new data.

           

          Cheers

          • 17. Re: NUMA issue, again
            gferreyra Novice

            Hi there.

            The documentation tells us to consider cores per NUMA node when sizing a VM.

            Example --> each NUMA node = 2 cores --> the VM should have 2 cores, maximum.

            No need to use another NUMA node --> no crossing the boundary --> no added latency.
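
            (A quick way to check what you are sizing against -- a sketch assuming ESXi shell access; these esxcli namespaces exist on 5.x, but verify the output on your build:)

            esxcli hardware memory get        # reports "NUMA Node Count"
            esxcli hardware cpu global get    # reports total "CPU Cores" (plus packages/threads)

            # cores per NUMA node = CPU Cores / NUMA Node Count (e.g. 32 / 8 = 4 on this R815)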

             

            Question.

             

            Dell R815, 4 sockets (Opteron 63xx). 8 cores each. 32 cores total.

            VMware tells me that I have 8 NUMA nodes, not 4. Leaving that fact aside for a moment, let's concentrate directly on NUMA efficiency vs. VM sizing.

             

            I can create a VM with up to 8 cores, right? CPU ready time will be fine, right?

            Assume the VM is in fact using all 8 cores (~95% load).

             

            The VM stays on one socket: 2 NUMA nodes, but 1 SOCKET.

            No added latency from crossing one socket to another.

             

            Can anyone confirm this?

            If you have more than one NUMA node on one socket, you can just consider the total cores per socket when you need to size a VM.
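
            (Not an official answer, but one documented knob here for vSphere 5.x is the virtual NUMA topology. A sketch of .vmx advanced settings that would make an 8-vCPU VM present two 4-vCPU virtual nodes, matching this host's 4-cores-per-node layout -- the option names are from VMware's vNUMA documentation, please verify the defaults on your release before using them:)

            numa.vcpu.min = 8                 # allow vNUMA on an 8-vCPU VM (the default threshold is 9 vCPUs)
            numa.vcpu.maxPerVirtualNode = 4   # 4 vCPUs per virtual NUMA node -> two virtual nodes
            cpuid.coresPerSocket = 4          # keep the virtual sockets aligned with the virtual NUMA nodes

            (The guest then sees the same 2 x 4 layout the host has, instead of assuming one flat 8-core node.)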

             

            Thanks!!!

            • 18. Re: NUMA issue, again
              gferreyra Novice

              Anyone?

              • 19. Re: NUMA issue, again
                AndreTheGiant Guru
                User Moderators, vExpert

                Some Opterons have a more complex internal architecture... mainly they are internally dual SMP (see the Bulldozer architecture).

                 

                PS: you always write about the R910, but a model ending in 0 must be Intel... not AMD... I suppose you mean an R905.

                Have you verified your CPU architecture, to see if the CPUs are the same, or similar?

                • 20. Re: NUMA issue, again
                  gferreyra Novice

                  Please check the last post: "Dell R815, Opteron 6300 series."

                   

                  Thanks!

                  • 21. Re: NUMA issue, again
                    gferreyra Novice

                    Is there any issue between VMware vSphere and the AMD CPU families?

                    Specifically: the Opteron 6200/6300 series, on a Dell R815 for example.

                     

                    VMware splits a 4-socket server (8 cores each) into 8 NUMA nodes, not 4.

                     

                    Now, one would expect that NUMA nodes 0,1 ... 2,3 ... 4,5 ... 6,7 each lie on the same socket.

                    Is this correct?

                     

                    We are using the latest vSphere 5.5 version, with the latest patch. I'm still seeing the worst NUMA allocation I've ever seen.

                    4 VMs (20 cores total) on a 32-core ESXi host. According to esxtop:

                    - VM#1 (8 cores): on NUMA nodes 1 and 4.

                    - VM#2 (8 cores): on NUMA nodes 2 and 7.

                    - VM#3 (4 cores): on NUMA node 5.

                    - VM#4 (2 cores): on NUMA node 0.

                    Ready time for VM#1 and VM#2 is through the roof. Why?

                    They should be "happy" on NUMA nodes 0,1 and 2,3, respectively.
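
                    (For anyone reproducing this, a sketch of how the placement can be read -- in esxtop press 'm' for the memory view, then 'f' and enable the NUMA STATS field group; the NHN column shows each VM's current home node(s), NMIG the number of NUMA migrations, and N%L the percentage of memory that is node-local. Exact field letters may vary by build. For a one-shot capture:)

                    esxtop -b -n 1 > /tmp/esxtop-numa.csv   # batch mode; grep for the VM names to pull the NUMA columns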

                     

                    My questions:

                    1) NUMA node split within AMD: please confirm the split. Are 0 and 1 on the same socket? 2,3 on another socket... 4,5 on another... 6,7 on another?

                    2) Is there any issue between VMware and the AMD architecture regarding VM vs. NUMA allocation?

                    3) Should vSphere 6 be better in this regard (VM vs. NUMA allocation)?

                    4) Does this work properly with Intel? We need some proof in order to justify such an architecture change (AMD --> Intel).

                    5) We've not experienced this behaviour with KVM or Xen. Same VMs, same physical server.

                     

                    Thanks!!!

                    • 22. Re: NUMA issue, again
                      seongbeom Novice

                      Hi gferreyra,

                      My questions:

                      1) NUMA node split within AMD: please confirm the split. Are 0 and 1 on the same socket? 2,3 on another socket... 4,5 on another... 6,7 on another?

                      2) Is there any issue between VMware and the AMD architecture regarding VM vs. NUMA allocation?

                      3) Should vSphere 6 be better in this regard (VM vs. NUMA allocation)?

                      4) Does this work properly with Intel? We need some proof in order to justify such an architecture change (AMD --> Intel).

                      5) We've not experienced this behaviour with KVM or Xen. Same VMs, same physical server.

                       

                      (1) Recent AMD CPUs have two NUMA nodes per socket, so it is correct to see 8 NUMA nodes on a 4-socket Opteron 6200/6300 system.

                            Nodes 0 and 1 are on the same socket; ditto for 2/3, etc.

                       

                      (2) I'm not aware of any such issues with vSphere on AMD.

                       

                      (3) vSphere 6 has the same scheduling policy as before.

                       

                      (4) Again, I don't expect a difference between AMD and Intel regarding the NUMA scheduler. What may matter is how many NUMA nodes exist per host and how many cores per node.

                       

                      We are using the latest vSphere 5.5 version, with the latest patch. I'm still seeing the worst NUMA allocation I've ever seen.

                      4 VMs (20 cores total) on a 32-core ESXi host. According to esxtop:

                      - VM#1 (8 cores): on NUMA nodes 1 and 4.

                      - VM#2 (8 cores): on NUMA nodes 2 and 7.

                      - VM#3 (4 cores): on NUMA node 5.

                      - VM#4 (2 cores): on NUMA node 0.

                      Ready time for VM#1 and VM#2 is through the roof. Why?

                      They should be "happy" on NUMA nodes 0,1 and 2,3, respectively.

                       

                      Let me assume that the ESXi host has 4 sockets, 32 cores, and 8 NUMA nodes. So, there are 4 cores/node.

                      For VM1 (8 vCPUs), it is placed on nodes 1 and 4, probably because nodes 1 and 4 are one hop apart.

                      For VM2 (8 vCPUs), it is placed on nodes 2 and 7, following similar logic.

                       

                      Note that node 0 and node 1 are also one hop away. The access latency between node 0 and node 1 is the same as between node 1 and node 4. However, the *bandwidth* between nodes 0 and 1 should be higher than between nodes 1 and 4. So, it would be better to place VM1 on nodes 0 and 1. Currently, ESXi does not consider the inter-node bandwidth between NUMA nodes.
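
                      (One possible workaround, sketched here rather than something the scheduler does on its own: VMware documents a per-VM advanced option that constrains which NUMA nodes a VM may use. For VM1 it would look roughly like the .vmx entry below. Note that it overrides automatic NUMA rebalancing for that VM, so use it sparingly and verify the exact syntax in the vSphere Resource Management guide for your release:)

                      numa.nodeAffinity = "0,1"   # restrict this VM's NUMA clients to physical nodes 0 and 1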

                       

                      For VM3 and VM4, those are placed on NUMA nodes where there are no other VMs, which seems like good placement to me.

                       

                      I don't understand why VM1 and VM2 have high ready time. How much is it?

                      Just based on your description, there doesn't seem to be CPU over-commitment.

                      Were there other VMs that you didn't mention? What workload did you run?

                      Do you have a specific performance problem, or are you mainly concerned about the ready time?

                       

                      BTW, you mentioned 4 VMs with 20 vCPUs, but the actual description totals 22 vCPUs.

                       

                      Maybe it was suggested before, but it would be best for you to file an SR to further debug any performance issue.

                       

                      Regarding (5), can you kindly describe what the behavior was on KVM or Xen?

                       

                      Thanks.

                      • 23. Re: NUMA issue, again
                        gferreyra Novice

                        Hello seongbeom

                         

                        So:

                         

                        1) Nodes 0 and 1 are on the same socket; same for 2,3... 4,5, and so on. Great.

                         

                        2) No issues with that processor family. OK.

                         

                        3) No changes to the scheduling in vSphere 6. OK.

                         

                        4) My obvious question is: why is VM#1, with 8 cores, not placed on nodes 0 and 1?

                        Again: Dell R815, 4 sockets, 8 cores per socket => 8 NUMA nodes, 4 cores per node.

                        I want the VMkernel to place the VMs appropriately among the NUMA nodes so that we ALWAYS see LOW CPU ready times, which is the number 1 problem on any virtual infrastructure.

                        We are not using DRS, because it does not take CPU ready time into account, only host CPU usage.

                        We are on Manual. We could use DRS, but we keep it manual. And I'm talking about 20 ESXi hosts and 500 VMs. We do not trust automatic DRS anymore.

                         

                        When I say high ready time, I mean > 500-700 ms, up to 2 seconds. We have very sensitive apps; there is no time to spend waiting for CPU cycles.
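
                        (Side note on units, assuming those figures are the CPU ready summation from the vCenter real-time charts, which use 20,000 ms samples -- I believe VMware KB 2002181 describes the conversion. A quick shell sketch of the arithmetic:)

                        # ready %  =  ready summation (ms) / sample length (ms) * 100
                        echo $((  700 * 100 / 20000 ))   #  700 ms -> ~3% ready
                        echo $(( 2000 * 100 / 20000 ))   # 2000 ms -> 10% ready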

                         

                        I understand that maybe nodes 1 and 4 are one hop apart, just like 0 and 1, but it still does not make sense to me.

                        The interconnect between sockets is very slow compared to the interconnect inside the same socket.

                         

                        Why would the VMkernel choose nodes 1 and 4, when 0 and 1 is the way to go?

                         

                        Thanks for your time!!!

                        • 24. Re: NUMA issue, again
                          gferreyra Novice

                          Nothing? No idea?

                          • 25. Re: NUMA issue, again
                            dmorse Enthusiast
                            VMware Employees

                            gferreyra

                             

                            I just noticed you are still interested in this thread.  Unfortunately, @seongbeom is no longer with VMware.

                             

                            For increased visibility, I'd suggest starting a new thread (with a pointer to the old thread).  Feel free to shoot me a link to the new thread in a private message, and I'll make sure you get a prompt response.
