VMware Cloud Community
davebaker87
Enthusiast

7x ESXi Hosts CPU 100%, slot size and HA question...

Hi All,

I'm currently working on a Horizon deployment in an environment that appears to be hugely overcommitted. CPU utilization sits at 99.99% throughout the working day across all 7 hosts, with very little fluctuation. Memory is not as bad (generally 70-80%). I am confident the environment is overcommitted based on my understanding below, but I want to understand how strict HA is in reserving 20% of the resources for failover (or does it 'dip into' those resources, within some tolerance, depending on workload?)

Here's the scenario:

Total VMs = 261 (Win 10 1909)

VM configuration = 4 GB RAM, 2 sockets and 2 cores = 4 vCPU.

Cluster Details:

7x ESXi hosts in an HA cluster with 20% of cluster resources reserved for failover capacity. DRS is enabled.

CPU: Each host has Xeon Gold 6148 @ 2.4 GHz = 2 sockets x 20 cores per socket = 40 cores. It shows in vSphere as 80 logical processors, so presumably this is due to hyperthreading?

Host CPU in GHz: Each host has 2.4 GHz x 40 cores = 96 GHz.

Total Cluster Logical Cores = 7 x 80 = 560

Total Cluster CPU Resources = 7 hosts x 96 GHz = 672 GHz

Total Cluster Memory Resources: Each host has 382 GB RAM = 382 x 7 = 2.674 TB

So total requirements for the 261 VMs:

RAM = 261 x 4 GB = 1.044 TB

CPU = 261 x 4 vCPU = 1044 vCPU
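For anyone checking the sums, here is a minimal Python sketch of the figures above. It assumes the 382 GB per host value used in the memory calculation, and it shows 2 vCPU per VM alongside the 4 vCPU listed here purely as a comparison case, since the socket/core setting can be read either way. The variable and function names are illustrative only.

```python
# Figures quoted in this post; names are illustrative, not from any VMware tool.
HOSTS = 7
CORES_PER_HOST = 2 * 20      # 2 sockets x 20 cores per socket
LOGICAL_PER_HOST = 80        # logical processors shown in vSphere
RAM_PER_HOST_GB = 382        # assumption: the value used in the post's sum
VMS = 261
RAM_PER_VM_GB = 4


def overcommit(vcpus_per_vm):
    """Return vCPU totals and ratios for a given per-VM vCPU count."""
    total_vcpus = VMS * vcpus_per_vm
    physical_cores = HOSTS * CORES_PER_HOST
    logical_cpus = HOSTS * LOGICAL_PER_HOST
    return {
        "total_vcpus": total_vcpus,
        "vcpu_per_core": round(total_vcpus / physical_cores, 2),
        "vcpu_per_logical": round(total_vcpus / logical_cpus, 2),
    }


ram_demand_gb = VMS * RAM_PER_VM_GB
ram_supply_gb = HOSTS * RAM_PER_HOST_GB

for vcpus in (4, 2):
    print(vcpus, "vCPU per VM:", overcommit(vcpus))
print("Configured RAM:", ram_demand_gb, "GB of", ram_supply_gb, "GB")
```

On these numbers the configured memory is well under the installed memory, so the pressure is on the CPU side: the vCPU count is several times the physical core count when each VM has 4 vCPU.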

The environment is suffering from slow logons and sluggish performance on desktops. I can see that CPU ready values on the individual hosts are around 300 ms (this is during 'downtime', i.e. 6pm on a Friday evening...), but typically it's 1200 ms or more at peak use.
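As a rough aid for reading those numbers: a CPU ready summation in milliseconds can be converted to a percentage of the chart's sample interval. The sketch below assumes the values come from the real-time performance chart, which uses a 20-second interval, and that each value is for a single vCPU. Both are assumptions, since it is not stated which chart or object the figures were read from, and the function name is hypothetical.

```python
def cpu_ready_percent(ready_ms, interval_s=20, vcpus=1):
    """Convert a CPU ready summation (ms) to a percentage.

    interval_s is the chart sample interval (20 s for the real-time chart).
    vcpus spreads the summed value across the vCPUs it was collected for.
    """
    return ready_ms * 100 / (interval_s * 1000 * vcpus)


for ready_ms in (300, 1200):
    print(ready_ms, "ms ->", cpu_ready_percent(ready_ms), "%")
```

If the readings were taken from a chart with a longer interval, or are summed over several vCPUs, pass the matching `interval_s` and `vcpus` values, because the same millisecond figure then corresponds to a much lower percentage.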

So taking the above figures into account, and NOT including the 20% reserved capacity for HA failover, this environment must be hugely overcommitted? If I take 20% off the total CPU and RAM resources and then compare the VM requirements against what is left, it's not pretty, right? Will HA be reserving an aggregate of 20% of cluster resources across all 7 hosts to accommodate a failure?
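To put numbers on "taking 20% off", here is a small Python sketch. It is plain arithmetic on the figures in this post (7 hosts, 96 GHz and 382 GB per host, 261 VMs at 4 vCPU and 4 GB each) and does not model how HA admission control actually counts reservations. All names are illustrative.

```python
# Arithmetic only; assumes the per-host figures quoted in this post.
HOSTS = 7
HOST_GHZ = 96.0
HOST_RAM_GB = 382
CORES_PER_HOST = 40
RESERVED_FRACTION = 0.20
VMS, VCPUS_PER_VM, RAM_PER_VM_GB = 261, 4, 4

cluster_ghz = HOSTS * HOST_GHZ
cluster_ram_gb = HOSTS * HOST_RAM_GB

# Capacity left once 20% of the cluster is set aside for failover.
usable_ghz = round(cluster_ghz * (1 - RESERVED_FRACTION), 1)
usable_ram_gb = round(cluster_ram_gb * (1 - RESERVED_FRACTION), 1)

# One host is 1/7 of the cluster, so 20% is a little more than one host.
one_host_share_pct = round(100 / HOSTS, 1)

# vCPU to physical core ratio if one host is lost.
vcpus = VMS * VCPUS_PER_VM
ratio_after_failure = round(vcpus / ((HOSTS - 1) * CORES_PER_HOST), 2)

print("Usable CPU:", usable_ghz, "GHz of", cluster_ghz)
print("Usable RAM:", usable_ram_gb, "GB; configured:", VMS * RAM_PER_VM_GB, "GB")
print("One host =", one_host_share_pct, "% of the cluster")
print("vCPU per core with 6 hosts:", ratio_after_failure)
```

On these figures the configured RAM still fits inside the post-reservation capacity, while the vCPU to core ratio gets worse again if a host is actually lost.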

I'm not 100% sure my workings above are correct, so any pointers here would be appreciated. The business is aware of this and is building another host (just 1, I believe), which of course is not sufficient, but is there any other method (esxtop?) that we can use to illustrate how overcommitted this environment is? I also noticed the MTU size is 1500 (which doesn't help...).

I'm typically at the 'master image and pool management' end of Horizon administration, so I am out of my comfort zone here and any assistance would be appreciated.

Thanks

10 Replies
lucasbernadsky
Hot Shot

Hi, Dave. Sorry to read that. It seems you are facing a difficult scenario.

About HA: Hopefully this official document will help you understand how Admission Control manages the reservations - vSphere HA Admission Control. Maybe Admission Control only reserves memory and not CPU capacity.

Just to be clear, HA Admission Control reserves cluster resources to guarantee VM availability after a host failure. I'm not sure I would keep it enabled in your cluster, since performance is very poor. Disabling it may improve your users' experience a little.

About performance: Those CPU Ready values and logon times are not good at all. MTU 1500 is not wrong, depending on the scenario, but you may want to check for packet fragmentation on the switch ports to see whether it is enough. Jumbo frames are recommended for vMotion, Fault Tolerance and vSAN networks.

About overcommitment: vROps provides an out-of-the-box dashboard with that information, called "Cluster capacity allocation" or something like that. Capacity Allocation Overview Dashboard

If you don't have vROps installed, you can always calculate overcommitment manually: How to decide VMware vCPU to physical CPU ratio

What can you do from your perspective:

  • Suggest disabling admission control
  • Verify that hyperthreading is enabled both in the BIOS and in vSphere. Enable Hyperthreading
  • Optimize the OS for Horizon View VMs with the VMware OS Optimization Tool - VMware OS Optimization Tool | VMware Flings
  • Do not enable spare VMs in Horizon pools.
  • Configure the desktop pools to power off or suspend VMs an hour after a logoff or idle session to optimize resources.
  • Verify that your VMs have the latest Horizon Agent and VMware Tools installed.
  • Power off idle VMs and resize oversized VMs in your vCenter environment.

Hopefully all these actions will improve VM performance a little. But maybe you should expand your cluster ASAP.

Regards!

nachogonzalez
Commander

Hi, hope you are doing fine.
Do you see a high CPU Ready value?

Regarding your View environment, are you using full VMs or linked clones?

davebaker87
Enthusiast

Thank you, that's massively helpful, and it got me thinking of some other ideas too - I will implement the immediate suggestions re. power policy and spare VMs (feel dumb for not thinking of those myself!). OSOT is already used (I will be testing the LoginVSI 'Win10LikeAPro' template...).

Will get back to you re. the vROps and Admission Control suggestions. Again, thank you!

lucasbernadsky
Hot Shot

Great, Dave! Another thing just came to my mind. Verify that the ESXi power policy is set to "High performance" and check that the hosts' IPMI power policy is set to "Managed by OS". That way the ESXi hosts will consume more power, but you will be sure that they are performing at their best.

davebaker87
Enthusiast

Hey - yeah - CPU ready values are high - this is a purely instant clone environment.

davebaker87
Enthusiast

Hyperthreading is enabled and the power plan is high performance. VMware Tools and agents are all at the correct version. The McAfee Endpoint Security suite is running, not the McAfee MOVE software (which is supposedly built for VDI...), so we've reviewed the exceptions and whitelisting for it, to little avail. I suspect it will boil down to one of the following:

1. An overly intrusive AV agent (we have no Stratusphere or means of measuring desktop performance at scale - any suggestions for standalone 'promiscuous' apps that can record system activity at a VM level?)

2. A lot of single-threaded apps in use.

3. Overcommitted VM spec (2 CPU, 4 GB).

I doubt we can drop the VM spec to 1 vCPU per machine for Win 10 1909?!

sjesse
Leadership

2 CPUs is really the bare minimum for Windows 10 desktops, and they really should have 4 cores for any reasonable performance.

davebaker87
Enthusiast

If each VM has 2 CPU and 2 Cores per Socket as its settings - presumably this = 2 vCPU..?

So if each host has 40 physical cores, with hyperthreading they have 80 vCPU cores. Is this correct?

lucasbernadsky
Hot Shot

Dave, just for testing purposes, can you deploy a desktop pool without AV? Maybe it's consuming too many resources in your VMs and it's time to evaluate alternatives. Now it's all about optimizing every little bit you can.

If you are using NSX, Trend Micro Deep Security is a great host-based AV with very light OS agents.

nachogonzalez
Commander

Dave, just as Lucas said:

Keep in mind that it is not optimal to use agent-based antivirus on View deployments.

Please consider deploying an NSX Guest Introspection based antivirus; they are all agentless and have a smaller footprint on the guest OS.

You have Trend Micro Deep Security and VMware Carbon Black, and Kaspersky has another one.

Also consider that you will have to purchase an NSX license.

Another quick question: have you checked this guide on how to optimize the desktop pool? Creating an Optimized Windows Image for a VMware Horizon Virtual Desktop | VMware

Please pay special attention to the "Running the OS Optimization Tool to Optimize, Generalize, and Finalize the OS" section.
