QuazzieM
Contributor

Virtual Server Design

Hey all,

Quick question (I hope)..

We have a VM that runs a reporting application. Its data set is quite large (around the 190GB mark), and the application works by basically loading the whole data set into vRAM.

I've spoken to the application vendor and they believe we need to add more vCPUs to the VM, as some of the reports it generates are quite CPU dependent.

I've asked our infrastructure vendor to increase the VM from 10 vCPUs to 14, but they've come back with the following:

Your host is a 4 socket / 10 core server with 512GB of RAM, or 128GB per socket. The VM is configured with 190GB of vRAM. Therefore, part of the vRAM is located outside the NUMA alignment and, as advised, the best practice is to limit the number of cores to 1 socket. Maybe you should look at splitting the server into multiple smaller VMs.


We can't currently split the VM into smaller VMs. We are working on re-designing the data set and splitting it into separate stores, but in the meantime I need to get this one running more efficiently.

Therefore, given that we have a 3 host cluster that is only at 35% utilization, and based on what I've read about how NUMA works, couldn't we spread the VM across 2 sockets instead of one?


I'm mostly trying to understand exactly how NUMA works. If the host has 4 sockets with 10 cores per socket and 512GB of RAM, then each socket is assigned 128GB of RAM. If we changed the VM from 1 socket x 10 cores to 2 sockets x 7 cores (or 10 cores), would the VM then be assigned 2 sockets and 256GB of RAM, and therefore fall within the NUMA alignment?
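To make the arithmetic behind that question concrete, here is a rough Python sketch. It only counts how many NUMA nodes a VM would need to fit its vCPUs and vRAM, using the host figures quoted above (10 cores and 128GB per socket). The function name is made up for illustration, it is not how ESXi actually decides placement, and it ignores the memory the hypervisor itself uses on each node.

```python
import math

def numa_nodes_needed(vcpus, vram_gb, cores_per_node, mem_per_node_gb):
    """Smallest number of NUMA nodes that can hold both the vCPUs and the vRAM.

    Simplified model: one NUMA node per physical socket, all cores and all
    memory on the node available to this one VM.
    """
    by_cpu = math.ceil(vcpus / cores_per_node)
    by_mem = math.ceil(vram_gb / mem_per_node_gb)
    return max(by_cpu, by_mem)

# Host from this thread: 4 sockets x 10 cores, 512GB total -> 128GB per node
CORES_PER_NODE = 10
MEM_PER_NODE_GB = 512 / 4

for vcpus, vram_gb in [(10, 128), (10, 190), (14, 190), (14, 230)]:
    nodes = numa_nodes_needed(vcpus, vram_gb, CORES_PER_NODE, MEM_PER_NODE_GB)
    print(f"{vcpus} vCPU / {vram_gb}GB vRAM -> at least {nodes} NUMA node(s)")
```

By this count, only the 10 vCPU / 128GB case fits in one node. The current 190GB VM already needs two nodes for its memory alone, regardless of whether it has 10 or 14 vCPUs.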


The other option would be to set the VM up with a wide/flat vCPU layout of 10 sockets x 1 core to see if that helps, but I'm not sure whether that would work or whether it's the best idea.

vfk
Expert

What version of vSphere are you currently using? There have been many NUMA-related improvements in later versions of vSphere.

Secondly, best practices are just that, best practices, and you have to evaluate each one on its merits based on your environment and implementation. However, when it comes to NUMA and vNUMA, you should configure the VM to closely match the underlying physical topology. vSphere will then expose the NUMA topology to the guest OS, which can take advantage of it if the guest is NUMA aware.

I can understand the concerns regarding NUMA alignment and the performance penalty, but this traffic goes over the CPU socket interconnect, which is a really fast bus, and in my opinion it is unlikely to cause a massive performance issue. Data access to disks/LUNs is more likely to cause performance issues than memory access across CPU sockets.

Here is the performance best practices guide for vSphere 5.0; it is a good read: http://www.vmware.com/pdf/Perf_Best_Practices_vSphere5.0.pdf

--- If you found this or any other answer helpful, please consider the use of the Helpful or Correct buttons to award points. vfk Systems Manager / Technical Architect VCP5-DCV, VCAP5-DCA, vExpert, ITILv3, CCNA, MCP
JPM300
Commander

Your vendor is correct when they state this:

"I've requested that our infrastructure vendor to increase the vCPU size from 10 vCPU's to 14.. But they've come back with the following..

Your host is a 4 socket / 10 core server with 512GB of RAM, or 128GB per socket. The VM is configured with 190GB of vRAM. Therefore, part of the vRAM is located outside the NUMA alignment and, as advised, the best practice is to limit the number of cores to 1 socket. Maybe you should look at splitting the server into multiple smaller VMs."

If your ESXi host is a 4 socket / 10 core box with 512GB of RAM available, they are correct in stating that you should stay at 10 vCPUs and 128GB of memory to stay within NUMA alignment. However, if you were to jump to 14 or 20 vCPUs with more memory, some CPU and memory requests may cross the NUMA boundary. Let's look at how this works a little more:

Let's take a host with the following configuration:

2 sockets, 4 cores per socket

64GB of memory (32GB per socket)

NumaExplain.PNG

Now when VM1 runs, all of its CPU scheduling and memory scheduling stays in one block, as each CPU in the host has a memory block associated with it and its requests. You can typically see this very clearly on the board layout if you pop the case off. So everything that CPU has to do stays local to it, which is VERY FAST. However, if you look at VM2, since it is using more vCPUs than a socket can handle, any work for the 5th and 6th vCPU and/or memory over 32GB now has to go across the CPU bus (the NUMA bus in the picture). This is still fast, just not AS fast as if you were to keep it local on the same bus. This is what your vendor is trying to tell you.
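As a rough illustration of that spill-over, the sketch below works out how many vCPUs and how much memory no longer fit on the first socket of the example host (4 cores and 32GB per socket). The function name and the VM sizes are assumptions for illustration only, since the picture is not reproduced here; VM2 is taken to have 6 vCPUs because the text mentions a 5th and 6th vCPU.

```python
def remote_share(vcpus, mem_gb, cores_per_socket, mem_per_socket_gb):
    """Return (vCPUs, GB of memory) that do not fit on the first socket.

    Simplified model: the VM fills its first socket completely and
    whatever is left over has to live on another socket.
    """
    remote_vcpus = max(0, vcpus - cores_per_socket)
    remote_mem_gb = max(0, mem_gb - mem_per_socket_gb)
    return remote_vcpus, remote_mem_gb

# Example host above: 2 sockets x 4 cores, 64GB total -> 32GB per socket
CORES_PER_SOCKET = 4
MEM_PER_SOCKET_GB = 64 // 2

# Hypothetical VM sizes as (vCPUs, memory in GB); not taken from the picture
vms = {"VM1": (4, 32), "VM2": (6, 48)}

for name, (vcpus, mem_gb) in vms.items():
    r_vcpus, r_mem = remote_share(vcpus, mem_gb, CORES_PER_SOCKET, MEM_PER_SOCKET_GB)
    print(f"{name}: {r_vcpus} vCPU(s) and {r_mem}GB would sit on the other socket")
```

With these assumed sizes, VM1 stays entirely local, while VM2 has 2 vCPUs and 16GB that have to be reached across the interconnect.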

You could try bumping the vCPU and memory to what you want and you may get a minimal gain; however, typically when you're jumping the CPU bus for CPU and memory, the returns tend to cap out and your gains are minimal. Still, there is no harm in trying. To give you an example, we had a SQL deployment recently with similar requirements. The physical machine was a 16 core, 128GB memory SQL box. When we virtualized it and stress tested it, we saw better gains with 8 vCPUs and 64GB of memory than with 16 vCPUs and 128GB of memory. This was in large part due to the way that particular SQL box functioned; however, typically once you're jumping that barrier you will eventually hit a wall where throwing more CPU/memory at it will not get you good returns, and you're better off scaling out into smaller VMs to get your performance.

I hope this has helped

vfk
Expert

The ESXi CPU scheduler is smart enough to figure out the physical topology and allocate resources in the best possible way. Later versions of ESXi support wide NUMA: a virtual machine that has more vCPUs than the number of cores per NUMA node is referred to as a “wide” virtual machine, because it is too wide to fit into a single NUMA node. Wide virtual machines are also managed by the NUMA scheduler by being split into multiple “NUMA clients”, each of which fits within the size of a NUMA node in terms of physical CPU cores. A home node is then assigned to each client.

wide-numa.PNG
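A minimal sketch of the splitting idea described above, assuming the vCPUs are simply divided as evenly as possible into the fewest groups that each fit inside one NUMA node. The function name is hypothetical and the real scheduler's policy may differ by ESXi version and settings.

```python
import math

def split_into_numa_clients(vcpus, cores_per_node):
    """Split a VM's vCPUs into the fewest near-equal groups that each fit in one node.

    Returns a list with the vCPU count of each NUMA client.
    """
    clients = math.ceil(vcpus / cores_per_node)
    base, extra = divmod(vcpus, clients)
    # The first `extra` clients take one more vCPU when the split is uneven
    return [base + 1 if i < extra else base for i in range(clients)]

# 10 cores per NUMA node, as on the host discussed in this thread
for vcpus in (10, 14, 20):
    print(vcpus, "vCPUs ->", split_into_numa_clients(vcpus, cores_per_node=10))
```

Under this simplified model a 10 vCPU VM stays a single client, while a 14 vCPU VM becomes two clients of 7 vCPUs each.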

In your case, there is nothing wrong with adding more RAM, as long as you are running a recent version of Windows/Linux and ESXi. This is just a temporary situation, as you are already planning to break up the data set. As JPM300 already pointed out, there is little gain; you are better off scaling out. The management overhead of monster VMs is a little high, e.g. vMotion between ESXi hosts could take a long time unless you have a 10GbE vMotion network.

Please refer to the CPU scheduler whitepaper; it is a good read if you have time to spare.

The CPU Scheduler in VMware vSphere 5.1 : http://www.vmware.com/files/pdf/techpaper/VMware-vSphere-CPU-Sched-Perf.pdf

JPM300
Commander

^What vfk said :)

vfk, you just had to one-up my crayon drawing, didn't ya :) Good work on the description!

QuazzieM
Contributor

Thanks vfk and JPM300 for your input...

The cluster is currently running 5.0 U2, with 8Gb FC and a clustered SAS/SSD SAN, if that's of any help. The application running on the VM is QlikView, and by all accounts it's an application that's not really designed to run in a virtualized environment, due to its high compute requirements and how the data is stored. Having said that, we don't really have the option of moving it to a physical box.

The vendor has recommended that, with the data set size we're running, we need to increase the vRAM from 128GB to 230GB and the vCPUs from 8 to 14, due to the vCPUs maxing out during a lot of the reports.

Basically I'm trying to work out the best course of action. As I said, there are 3 hosts within the cluster, each running at around 35% utilization.

I can:

  • Dedicate 1 host purely to this VM and allow DRS to manage the rest of the VMs between the other 2 hosts
  • Change the VM's vCPU count anyway, as the vRAM is already crossing the NUMA boundary, and see if it makes any difference. If I go this way or with the option below, should it really cause any great performance penalty to the other VMs on the host?
  • Change the VM to a wide vCPU configuration of 10 sockets / 1 core

Thanks again for all your time.

JPM300
Commander

Does the application have the ability to split out roles? Maybe 1 server for reports, 1 server for application purposes, 1 server for indexing? If not, any of those 3 options is good to try to get the performance you want. The nice thing is that they are all easy changes that require a short outage, so you can always test and then, if it's not working out, flip it back.

If possible, you will probably get the best results if you can break down the one large VM into 2 or 3 smaller ones, but if you can't, then the 3 options you listed are all viable to test.

QuazzieM
Contributor

It will eventually be broken into separate data sets, but for the time being upper management want to try to get more performance out of the current design.

As for the changes, I know they would be simple. However, and this is another issue I'm working through, I don't have control over the cluster, so all these changes have to go through the vendor, who charge a stupid amount for such a simple re-configuration and require 1hr outage windows.

That's why I was trying to work out which option would be the best to try first.

vfk
Expert

lol @ JPM300 - no such thing.

Just a note on the wide-vm config, don't do 1-socket 1-core, match the underlying physical topology.  Divide equally to get your requirements. 


JPM300's suggestion about breaking up the server into three smaller VMs is a great idea. The best you can do is present the different options to upper management; the time you are spending firefighting to keep this monster VM up and running is also a cost to the company, as it is preventing you from working on other projects and so on. Upper management needs to have a word with your account manager @Qlik. Good luck, and let us know how you choose to proceed.



QuazzieM
Contributor

vfk wrote:

lol @ JPM300 - no such thing.

Just a note on the wide-vm config, don't do 1-socket 1-core, match the underlying physical topology.  Divide equally to get your requirements.


Hey vfk, can you please clarify what you mean by the above statement?

Do you mean I shouldn't change the vCPU layout on the VM from 1 socket / 10 cores to 10 sockets / 1 core? Ideally the apps team want to try a 14 vCPU config.

vfk
Expert

If your physical host has 4 sockets, then I would assign 2 virtual sockets with 7 cores each, which would then create two vNUMA nodes. Ideally you want to closely match the physical topology. Hope that makes sense.

QuazzieM
Contributor

Makes perfect sense. That's what I would have thought made more sense than running the 10 socket / 1 core config.

Thanks again to you and JPM300 for all your input.
