Hi All,
I'm currently working on a Horizon deployment in an environment that appears to be hugely overcommitted. CPU utilization sits at 99.99% throughout the working day across all 7 hosts, with very little fluctuation. Memory is not as bad (generally 70-80%). I am confident the environment is overcommitted based on my understanding below, but I want to understand how strict HA is in reserving 20% of the resources for failover (or does it 'dip into' those resources based on some tolerance level that depends on workload?)
Here's the scenario:
Total VMs = 261 (Win 10 1909)
VM configurations = 4gb RAM, 2 sockets and 2 cores CPU = 4vCPU.
Cluster Details:
7x ESXi hosts, in a HA cluster with 20% Cluster resources reserved for failover capacity. DRS is enabled.
CPU: Each host has: Xeon Gold 6148 @ 2.4 ghz = 2 sockets x 20 cores per socket = 40 cores. It shows in vSphere as 80 logical processors so presumably this is due to hyperthreading?
Host CPU in Ghz: Each host has 2.4ghz x 40 cores = 96 ghz.
Total Cluster Logical Cores = 7x80 = 560
Total Cluster CPU Resources = 7 hosts x 96 ghz = 672ghz
Total Cluster Memory Resources: Each host has 382gb RAM = 382x7 = 2.674tb
So total requirements for the 261 VM's
RAM = 261x 4gb = 1.044tb
CPU = 261 x 4vCPU = 1044 vCPU
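To sanity-check the workings above, the same figures can be turned into overcommitment ratios with a few lines of Python. This is only a rough sketch built on the numbers in this post; the variable names are my own, and it uses 382 GB per host because that is the figure in the multiplication above.

```python
# Figures quoted in this post; variable names are illustrative only.
hosts = 7
cores_per_host = 2 * 20        # 2 sockets x 20 cores (Xeon Gold 6148)
threads_per_core = 2           # hyperthreading, hence 80 logical processors per host
ram_per_host_gb = 382          # assumption: the value used in the multiplication above
vms = 261
vcpus_per_vm = 2 * 2           # 2 sockets x 2 cores per VM
ram_per_vm_gb = 4

physical_cores = hosts * cores_per_host
logical_cpus = physical_cores * threads_per_core
total_vcpus = vms * vcpus_per_vm

# Ratios of configured vCPUs to what the hardware physically provides.
vcpu_per_core = total_vcpus / physical_cores
vcpu_per_logical = total_vcpus / logical_cpus
ram_allocated_share = (vms * ram_per_vm_gb) / (hosts * ram_per_host_gb)

print(f"Physical cores: {physical_cores}, logical CPUs: {logical_cpus}")
print(f"vCPU per physical core: {vcpu_per_core:.2f}")
print(f"vCPU per logical CPU: {vcpu_per_logical:.2f}")
print(f"Configured VM RAM as share of cluster RAM: {ram_allocated_share:.0%}")
```

On these inputs that is roughly 3.7 vCPU per physical core before any capacity is set aside for HA.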
The environment is suffering from slow logons and sluggish performance on desktops. I can see that CPU ready values on the individual hosts are around 300ms (this is during 'downtime', i.e. 6pm on a Friday evening...) but typically it's 1200ms or more at peak use.
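Ready time in milliseconds is easier to judge as a percentage of the sampling interval. A rough sketch, assuming the values come from the vCenter real-time performance chart (20-second samples); the function name is made up for illustration.

```python
def cpu_ready_percent(ready_ms, interval_s=20):
    """Convert a CPU ready summation (ms) into a percentage of the sample interval.

    interval_s=20 is an assumption: it matches the real-time chart in vCenter.
    Other chart ranges use longer intervals and need a different value here.
    """
    return ready_ms / (interval_s * 1000) * 100


for ms in (300, 1200):
    print(f"{ms} ms ready = {cpu_ready_percent(ms):.1f}% of a 20 s sample")
```

If the figure is a total across several vCPUs or VMs rather than a single vCPU, it would need dividing by that count before comparing it with per-vCPU guidance.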
So taking the above figures into account, and NOT including the 20% reserved capacity for HA failover, this environment must be hugely overcommitted? If I take 20% off the total CPU and RAM resources and then compare the VM requirements against what is left, it's not pretty, right? Will HA be reserving an aggregate of 20% of cluster resources across all 7 hosts to accommodate a failure?
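To put numbers on the 20% question, here is a small sketch of what is left once a fifth of the cluster is set aside, using the GHz and core figures from this post. It treats the reservation as a flat share of raw capacity, which is a simplification for illustration and not a description of how HA calculates anything internally. All names are my own.

```python
# Inputs taken from the figures in this post.
hosts = 7
ghz_per_host = 96.0            # 2.4 GHz x 40 cores
cores_per_host = 40
total_vcpus = 261 * 4
ha_reserved_fraction = 0.20    # "20% cluster resources reserved for failover"

cluster_ghz = hosts * ghz_per_host
cluster_cores = hosts * cores_per_host

# Simplifying assumption: the reservation removes a flat share of raw capacity.
usable_ghz = cluster_ghz * (1 - ha_reserved_fraction)
usable_cores = cluster_cores * (1 - ha_reserved_fraction)
vcpu_per_usable_core = total_vcpus / usable_cores

# For comparison: the share of the cluster that a single host represents.
one_host_fraction = 1 / hosts

print(f"Cluster: {cluster_ghz:.0f} GHz, usable after reserve: {usable_ghz:.1f} GHz")
print(f"vCPU per usable physical core: {vcpu_per_usable_core:.2f}")
print(f"One host is {one_host_fraction:.1%} of the cluster")
```

With these inputs the ratio rises from about 3.7 to about 4.7 vCPU per physical core once the reserve is excluded, and 20% is a little more than the share of one host out of seven.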
I'm not 100% sure my workings above are correct so any pointers here would be appreciated. The business is aware of this and is building another (1 host) I believe, which of course is not sufficient but is there any other method (ESXtop?) that we can use to illustrate how over committed this environment is? I also noticed MTU size is 1500 (which doesn't help..).
I'm typically at the 'master image and pool management' end of Horizon administration so am out of my comfort zone, so any assistance would be appreciated.
Thanks
Hi, Dave. Sorry to read that. It seems you are facing a difficult scenario.
About HA: Hopefully this official document will help you understand how Admission Control manages the reservations - vSphere HA Admission Control. Maybe Admission Control only reserves memory and not CPU capacity.
Just to be clear, HA Admission Control reserves cluster resources to guarantee VM availability after a host failure. I'm not sure I would enable it in your cluster, since performance is very poor. By disabling it you may improve your users' experience a little.
About performance: Those values of CPU Ready and logon times are not good at all. MTU 1500 is not wrong, depending on the scenario, but you may want to check packet fragmentation on the switch ports to see whether it is sufficient. Jumbo frames are recommended for vMotion, Fault Tolerance and vSAN networks.
About overcommitment: vROps provides an out-of-the-box dashboard with that information, "Cluster capacity allocation" or something like that - Capacity Allocation Overview Dashboard
If you don't have vROps installed, you can always calculate overcommitment manually: How to decide VMware vCPU to physical CPU ratio
What can you do from your perspective:
Hopefully all these actions will improve VM performance a little. But you should probably expand your cluster ASAP.
Regards!
Hi, hope you are doing fine.
Do you see a high CPU Ready value?
Regarding your view environment, are you using full vms, linked clones?
Thank you, that's massively helpful, and it got me thinking of some other ideas too - I will implement the immediate suggestions re. power policy and spare VMs (feel dumb for not thinking of those myself!). OSOT is already used (I will be testing the LoginVSI 'Win10LikeAPro' template...).
Will get back to you re. the vROps and Admission Control suggestions. Again, thank you!
Great Dave! Another thing just came to my mind. Verify that the ESXi power policy is "High performance" and check that the hosts' IPMI power policy is set to "Managed by OS". That way the ESXi hosts will consume more power, but you can be sure they are performing at their best.
Hey - yeah - CPU ready values are high - this is a purely instant clone environment.
Hyper threading is enabled and power plan is high performance. VMware Tools and agents are all correct version. The McAfee Endpoint Security suite is running, which is not the McAfee MOVE software (which is supposedly built for VDI...) so we've reviewed the exceptions and whitelisting for this, to little avail. Suspect it will boil down to one of the following:
1. An overly intrusive AV agent (we have no Stratusphere or means of measuring desktop performance at scale - any suggestions for standalone 'promiscuous' apps that can record system activity at a vm-level?)
2. A lot of single threaded apps at use.
3. Overcommitted VM spec (2 CPU, 4gb).
I doubt we can drop VM spec to 1vCPU per machine for Win 10 1909?!
2 CPUs is really the bare minimum for Windows 10 desktops, and they really should have 4 cores for any reasonable performance.
If each VM has 2 CPU and 2 Cores per Socket as its settings - presumably this = 2vCPU..?
So if each host has 40 physical cores , with hyperthreading they have 80 vCPU cores. Is this correct?
Dave, just for testing purposes, can you deploy a desktop pool without AV? Maybe it's consuming too many resources in your VMs and it's time to evaluate alternatives. Now it's all about optimizing every little bit you can.
If you are using NSX, Trend Micro Deep Security is a great Host-based AV with very light OS agents
Dave, just as Lucas said.
Keep in mind that it is not optimal to use agent-based antivirus on View deployments.
Please consider deploying an NSX Guest Introspection based antivirus; these are agentless and have a smaller footprint on the guest OS.
Options include Trend Micro Deep Security and VMware Carbon Black, and Kaspersky has another one.
Also consider that you will have to purchase an NSX license.
Another quick question: have you checked this guide on how to optimize the desktop pool? Creating an Optimized Windows Image for a VMware Horizon Virtual Desktop | VMware
Please pay special attention to the Running the OS Optimization Tool to Optimize, Generalize, and Finalize the OS section