I have a cluster with six ESXi 6.5 hosts. Two of the hosts show the same problem.
There are times when usage of the physical cores increases and usage of the hyperthreading cores drops to more or less 0.
I have opened a support case, but has this happened to anyone else, or could you help me?
"There are times when physical cores increase and hyperthreading go to 0 more or less"
I don't understand this statement. Could you explain better, please?
You really need to expand on the problem indeed. What are you seeing, where are you seeing this, why is this a problem?
Hi, hope you are doing fine.
Can you please give more insight?
Do all the hosts have the same hardware?
Is EVC enabled?
How many VMs are there? How many VMs are affected? How are the vCPUs distributed?
Have you checked esxtop?
Have you checked the performance charts?
Each ESXi host has 2 sockets with 12 cores each (Intel Xeon Gold 6246).
EVC is not enabled.
I uploaded a screenshot, but it does not show up.
When this happens, co-stop increases a lot in the VM.
I have checked with esxtop that when this happens, the first 24 cores (0-23) show a very low value and cores 24-47 show a much higher value.
So you are saying that occasionally the co-stop goes up and that is a problem? Co-stop means that the hypervisor has trouble scheduling a certain number of vCPUs at the same time. This typically means you are overcommitting from a vCPU point of view. You could lower the number of vCPUs on the VMs that are not actively using all of them, which should make scheduling easier for the hypervisor.
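To judge whether a co-stop spike is significant, it can help to turn the chart value into a percentage. The sketch below assumes the vCenter real-time chart reports co-stop as a summation in milliseconds per 20-second sample, in the same way as CPU ready; the function name and the 600 ms reading are hypothetical.

```python
def costop_percent(summation_ms, interval_s=20):
    """Convert a co-stop summation (ms) into a percentage of the sample interval.

    Assumption: the chart value is a summation in milliseconds and the
    real-time chart uses 20-second samples.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return summation_ms / (interval_s * 1000) * 100


# Hypothetical reading: 600 ms of co-stop within one 20 s sample.
print(f"{costop_percent(600):.1f}%")
```

For longer chart intervals, pass the matching sample length in seconds as `interval_s`.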
Thanks for your answer.
You can see in the screenshot that the ESXi host is normally at 50-60% CPU usage.
The ESXi server has 24 physical cores, 48 with hyperthreading.
It has 4 virtual machines (12, 12, 4 and 8 cores) with a total of 36 configured cores.
I find it very strange that sometimes the hyperthreading cores are not used.
I also find it strange that I get co-stop with hardly any core oversubscription.
The co-stop and the idle hyperthreading cores coincide in time.
I would like to understand and correct the fact that the hyperthreading cores are not used. This will surely solve the co-stop.
Thanks in advance.
When you go to your host in the vCenter UI, then go to Configure and click on "Overview" under "Hardware", does it show that hyperthreading is active?
Yes, I can see that hyperthreading is enabled.
Just realized that this could be a result of the L1 Terminal Fault security issue and the mitigation around it. The problem/concern is described here in detail: VMware Knowledge Base
One thing you need to consider is that your 12-vCPU VMs will wait until 12 CPU threads are available, which makes scheduling very hard on what I'm assuming are two 12-core processors. Consider a restaurant with 24 seats available. A group of 2 people will get a seat quicker than a group of 12, and it's even worse if only 2 people out of those twelve are the ones actually eating. Unless the 12 vCPUs are actively being used, lowering the count may make things better.
Depending on whether you have the L1 Terminal Fault fixes in place, as mentioned, VMs might not be able to be co-scheduled on the same CPUs, so if you're running with two 12-core CPUs, having two 12-vCPU VMs would be an issue.
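The restaurant analogy can be put into rough numbers with a toy probability model. This is only a sketch: it assumes every core is free independently with the same probability and that a VM needs all of its vCPUs placed at once, which is a simplification and not how the ESXi scheduler actually decides. The function name and the 50% figure are hypothetical.

```python
from math import comb


def prob_enough_free(total_cores, needed, p_free):
    """Probability that at least `needed` of `total_cores` are free.

    Toy assumption: each core is free independently with probability p_free.
    """
    return sum(
        comb(total_cores, k) * p_free**k * (1 - p_free) ** (total_cores - k)
        for k in range(needed, total_cores + 1)
    )


# Hypothetical load: each of the 24 cores is free half of the time.
for group in (2, 4, 8, 12):
    print(group, round(prob_enough_free(24, group, 0.5), 3))
```

In this model the chance of finding enough free cores drops as the group grows, which is the point of the analogy: the 12-seat party waits longer than the 2-seat one.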
I will look into L1.
I can see that co-stop goes up when hyperthreading is not working.
You can see it in the screenshot.
If these mitigations are affecting operation, then that is a mistake.
OP is using Cascade Lake CPUs and these are not affected by this particular issue.
(this is mentioned in the KB)
I don't see a warning to enable SCAv2 on my Cascade Lake hosts.
I can see that the Intel(R) Xeon(R) Gold 6246 CPU @ 3.30GHz is not affected by L1.
Any other ideas?
Thanks in advance.
Like we said, try to figure out whether you can lower the number of vCPUs on some of the VMs, as this is causing the co-stop issues for sure. Other than that, I doubt anyone can solve this problem for you.
You are right, Zbigniew, I read the CPU details in this thread incorrectly.
A few things to consider.
Your host has 24 physical cores (24 pCPU), and your VMs have 36 cores allocated (36 vCPU).
Your CPU is overallocated 3:2, or 150%.
As a general rule of thumb, CPU overallocation ratios of 2:1 or even 3:1 are considered acceptable.
Please do consider, though, that this assumes a bigger scale, like a vSphere cluster with several servers, high-core-count CPUs and a few hundred VMs, generally something that hosts not only priority production VMs but also lower-priority things like dev/test VMs that are mostly sleeping.
This is not your case - you have one host that is overallocated, and your VMs seem to be contending for access to the CPU resources.
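The ratio above is simple arithmetic and can be reproduced for any host. A minimal sketch, using the VM sizes quoted earlier in the thread (12, 12, 4 and 8 vCPUs on 24 physical cores); the function name is hypothetical.

```python
from fractions import Fraction


def overcommit(vcpus_per_vm, physical_cores):
    """Return total allocated vCPUs and the vCPU:pCPU ratio as a fraction."""
    total = sum(vcpus_per_vm)
    return total, Fraction(total, physical_cores)


total, ratio = overcommit([12, 12, 4, 8], 24)
# Prints: 36 3:2 150%
print(total, f"{ratio.numerator}:{ratio.denominator}", f"{float(ratio):.0%}")
```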
Some possible ideas on how to solve this:
1. Downsize your VMs - your CPU utilization seems to be fairly average (40%?). That means your VMs are not really running at 100%, they are just spread over too many cores. Check your bigger VMs (those with 12 and 8 cores) and consider removing a few vCPUs from them. This way you will force the OS scheduler inside the VMs to use fewer cores, but to a greater extent.
This is actually your best bet.
2. Ensure that your ESXi server is using Turbo - check the server BIOS settings, enable P-States, disable C-States, set the power management mode to the option described as "OS Controlled", and enable Turbo.
On the vSphere side, check that ESXi recognizes the power saving modes, and verify that your cores can run above base speed.
This might give you some extra performance in peak situations, but do not expect miracles. Your base clock is pretty high as it is.
3. This one is something I'd really like to ask others to weigh in on as to whether it helps:
You may add in the advanced settings of the ESXi host the following line:
My take on this setting is that it lets the NUMA scheduler squeeze more inside each NUMA node.
As I understand it, by giving the NUMA scheduler more space you might minimize the number of migrations between NUMA nodes or avoid situations where a VM is scheduled wide.
Still, this is not a magic wand, and if your VMs are asking for too much CPU it won't create cores out of thin air.
4. The last solution is kind of obvious - buy a new, bigger server.
The new Cascade Lake Refresh CPUs are not that bad, but if you manage to convince someone to buy a server with a single AMD EPYC 7702P you might really have a blast 🙂
Of course, dumping a server that is at most 15 months old is not really a solution, but it is also something to consider in the long run.
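For idea 1, a rough way to pick a starting vCPU count is to size each VM so its current load would land at a chosen target utilization. This is a back-of-the-envelope sketch only: the 40% average is the guess from above, the 70% target is an arbitrary assumption, and the function name is hypothetical. Peaks matter more than averages, so check the VM's own history before resizing.

```python
import math


def suggested_vcpus(current_vcpus, avg_utilization, target_utilization=0.7):
    """Estimate a vCPU count that would run the same load at the target utilization.

    avg_utilization and target_utilization are fractions between 0 and 1.
    The result never exceeds the current size and is never below 1.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    busy_vcpus = current_vcpus * avg_utilization
    return max(1, min(current_vcpus, math.ceil(busy_vcpus / target_utilization)))


# VM sizes from the thread, with an assumed 40% average utilization each.
for size in (12, 12, 8, 4):
    print(size, "->", suggested_vcpus(size, 0.40))
```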
Hope that helps.