Rhys1979
Contributor

CPU throttling issue

Hello all,

I have three Dell R740xd hosts in three physical locations, all with identical physical and VM configurations.  

On one host, I am having an issue with the CPUs appearing to be throttled.  The host runs ESXi 8.01 with dual Xeon Gold 6144 CPUs and 128GB RAM.  VM 1 is assigned 8 cores and 16GB of RAM; the guest shows around 18GHz used, but host utilization for that VM is only around 9GHz.  VM 2 has the same virtual hardware configuration; the guest shows around 15GHz used, but host utilization is only around 4.5GHz.

On the other two hosts, host CPU utilization matches VM CPU utilization.  I've checked every bit of configuration I can find, and everything matches across all three hosts, so I'm at a loss as to what is going on.  Anyone have any ideas?

Tibmeister
Expert

So you're saying the only indication of throttling you have is that the guest numbers don't match the host numbers?  Well, they rarely ever will.  The reason is that the guest and the host use different methods to arrive at their measurements.

For instance, most OSes (Linux and Windows) use a method called a watchdog timer to estimate CPU usage.  The OS starts a low-priority thread on the CPU(s) and waits to see how long it takes for that thread to complete; that timing becomes the usage measurement.  The logic is that the low-priority thread will not complete until all other threads have run, so in theory it's a viable ballpark measurement.  Because this works the same way regardless of the underlying hardware, it is hardware agnostic.
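As a rough illustration of the idea (not how any particular guest OS actually implements it), here's a minimal Python sketch: time a fixed chunk of work, compare it to a baseline measured on an idle system, and infer busyness from the slowdown.  The function names and the 0.10 s baseline are made up for the example.

```python
# Minimal sketch of the "low-priority probe" idea described above.
# Simplification: this probe runs at normal priority; a real watchdog-style
# estimator would lower the thread priority so everything else preempts it.
import time

def probe_duration(iterations: int = 2_000_000) -> float:
    """Run a fixed chunk of CPU-bound work and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    total = 0
    for i in range(iterations):
        total += i  # trivial busy work
    return time.perf_counter() - start

def estimated_busy_fraction(baseline_s: float, observed_s: float) -> float:
    """If the probe took 4x longer than on an idle system, call the CPU ~75% busy."""
    if observed_s <= baseline_s:
        return 0.0
    return 1.0 - (baseline_s / observed_s)

# Hypothetical baseline of 0.10 s measured once on an idle machine.
print(f"~{estimated_busy_fraction(0.10, probe_duration()) * 100:.0f}% busy")
```

The point of the sketch is the blind spot: the probe cannot tell whether its delay came from other threads inside the guest or from the hypervisor scheduling the vCPU elsewhere, which is exactly why a guest on a busy host over-reports its own CPU usage.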

So, you have a VM that is reporting high CPU usage because its timer threads are taking a while to return.  Taking into account that the vCPUs are scheduled across a finite number of pCPUs, and that other VM workloads can impact this, you will often see a VM guest reporting higher CPU usage than what the host reports for the same VM.  In this case, the host is actually correct, because it knows about the scheduling and can take that into account, along with all the other VMs running on it.  The guest has no knowledge of this, so it is blind and thinks things are more utilized than they really are.

Now, there's a whole slew of other factors in this: %RDY, IOWAIT, CO-STOP, etc.  One thing to keep in mind is that, for the most part, the vCPU of the VM will be used to process data that would normally be handled by a storage controller or NIC, which is what the IOWAIT measurement reflects.  If this is high, the VM is waiting on the vCPU to process IO from either the storage or network stack, which causes the watchdog threads to take much longer to complete.  The VM then thinks its CPUs are heavily utilized when in fact that is not the case; you have an IO bottleneck somewhere.  Often, if the storage doesn't have high latency, this will be something in the network stack, like a long-running SQL query or a large single-threaded data transfer.

Now, one may think to just throw more vCPUs at the problem, but that only makes the situation worse, not only for the VM in question but for all VMs on the host.  This is why the term "right sizing" is so heavily stressed: you have to properly size the VM's resources to the actual workload and observe.  Often, VMs are given resources just because, or "because the vendor says so," and then people wonder why this situation occurs.

Also, hyperthreading is not your friend, because despite popular belief it is not a full added core; in reality it gives at best around a 50% increase in performance.  So having 8 cores and 16 threads does not equal having 16 cores.  Sometimes you can get lucky, but most of the time you will see your VM report high CPU utilization that is not actually real.

You need to look at the VM counters on the host for %RDY, CO-STOP (%CSTP), and IOWAIT.  That's a good start for determining what is going on with your VM.  Also, do not override vNUMA by changing the default cores per socket from 1.  Leave that alone unless you have some software that still thinks it's a great idea to license that way.  You don't actually gain any real benefit from messing with this setting, and it can cause more harm than good.  Also, disable Hot-Add for both memory and CPU; it's another performance killer.
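If you pull those counters from the vCenter performance charts instead of esxtop, remember that cpu.ready comes back as a summation in milliseconds per sample interval, not a percentage.  Here is a small sketch of the usual conversion, assuming the 20-second real-time chart interval (the example numbers are hypothetical):

```python
# Convert a vCenter "cpu.ready" summation (ms per sample) into a %RDY figure.
# Assumptions: 20-second real-time chart interval; dividing by the vCPU count
# gives the average per-vCPU value, which is what the usual thresholds target.

def ready_percent(ready_ms: float, interval_s: int = 20, vcpus: int = 1) -> float:
    return (ready_ms / (interval_s * 1000.0)) * 100.0 / vcpus

# Hypothetical example: an 8-vCPU VM showing 16,000 ms of ready time in one
# 20-second sample works out to 10% ready per vCPU on average, which is high.
print(f"{ready_percent(16_000, interval_s=20, vcpus=8):.1f}% RDY per vCPU")
```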

Lastly, right-size the VMs.  Do you actually need 8 vCPUs assigned to the VM?  Most folks think that if the CPU utilization of a VM is above 50%, then more CPUs need to be added.  That's absolutely wrong; in a VM, if you normally run between 70% and 80%, you are right-sized for sure.  Measure this by taking 1-minute samples over 90 days and then using only the 95th percentile; you don't care about spikes, only plateaus (see the sketch below).  I ran a very large infrastructure on that basic principle, and not only did things perform better with fewer vCPUs, several million dollars in equipment was avoided.  It works.
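Here's a minimal sketch of that sizing rule, assuming you've already exported the 1-minute samples as utilization fractions of the currently assigned vCPUs; the 70-80% target band comes from the advice above, and the helper name and sample data are made up:

```python
# Right-sizing sketch: take the 95th percentile of 1-minute usage samples
# (plateaus, not spikes) and pick a vCPU count that puts it in the 70-80% band.
import math
from statistics import quantiles

def suggested_vcpus(samples: list[float], current_vcpus: int,
                    target: float = 0.75) -> int:
    # 95th percentile of utilization, converted into "vCPUs actually needed".
    p95_needed = quantiles(samples, n=100)[94] * current_vcpus
    return max(1, math.ceil(p95_needed / target))

# In practice `samples` would be ~90 days of exported stats; a short fake
# series just to show the shape of the calculation:
samples = [0.30, 0.35, 0.32, 0.40, 0.38, 0.45, 0.33, 0.31, 0.36, 0.34]
print(suggested_vcpus(samples, current_vcpus=8))  # suggests shrinking 8 -> 5
```

In practice you'd feed this from exported vCenter stats rather than a hand-written list, but the arithmetic is the same.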

One more thing: not every VM is made the same, even if the same software is installed on each one.  Small variations in the workload, how the workload is used, and how the IO stack is used will cause large variations in how each one behaves.  You must treat each VM as its own entity for fine-grained tuning; t-shirt sizes are a great starting point but not the end of the conversation.

Look up VM right sizing on this forum; you will find a lot of good discussions, possibly including some of my past ones, that go into far more depth than I have here.

Alfista_PS
Hot Shot

Hi,

I can only tell you that I had a similar problem in Zabbix when monitoring Linux servers, and I still have it: the reported CPU usage is over 100% (about 150-190%).  I found that the problem is that the system doesn't calculate usage correctly across multiple processors.

It can also depend on whether the VM is configured with more virtual sockets and fewer (or no extra) cores per socket, or fewer sockets with more cores per socket.  That can also change how the guest OS calculates usage.  VMware also publishes guidance on which layout performs better for which guest OS.

I think the same problem applies here: the guest OS doesn't calculate it correctly, while ESXi sees the real usage.

You don't need to worry about it; just go by what ESXi shows you, since that reflects the complete host resource usage.

 

Alfista
Kinnison
Commander

Hi,


In my opinion @Tibmeister provided an excellent explanation of what is sometimes misperceived as a "problem" when it really isn't.


Regards,
Ferdinando

Rhys1979
Contributor

So, I think I did not explain the problem well.  The issue was not that the VM OS (Windows Server 2022 in this case) was reporting high CPU utilization.  While it was, that was not my concern, as that is perfectly normal.

The issue was that the VM CPU utilization reported in vCenter was 5-10x the physical CPU utilization reported in vCenter for the same VM, while the host CPUs were only at roughly 20-25% physical utilization.  esxtop was also reporting extremely high IOWAIT times for all VMs even though 50-75% of the physical resources were idle.
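For anyone who wants to capture those counters over time rather than watch them live, esxtop batch mode (something like `esxtop -b -d 10 -n 60 > stats.csv`) writes perfmon-style CSV that you can scan afterwards.  The header pattern and the 10% threshold in this sketch are assumptions; check them against your own export before trusting the output.

```python
# Rough scan of an esxtop batch export for per-VM CPU pressure counters.
# Assumption: headers look like "\\host\Group Cpu(<id>:<vmname>)\% Ready";
# verify the exact column names in your own stats.csv first.
import csv

INTERESTING = ("% Ready", "% CoStop", "% Wait")

def scan(csv_path: str, threshold: float = 10.0) -> None:
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = [i for i, name in enumerate(header)
                if "Group Cpu(" in name and any(c in name for c in INTERESTING)]
        for row in reader:
            for i in cols:
                if row[i] and float(row[i]) > threshold:
                    print(f"{header[i]} = {row[i]}")

scan("stats.csv")  # path is hypothetical
```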

I have, however, resolved the issue.  Of all things, turning the server fully off, giving it a few minutes, and turning it back on resolved it.  I'm not certain why, but there is a factor I did not mention that I think may have been the root cause, which I'll describe in case anyone else ever encounters a similar issue.

One of the VMs on these servers is a 64-camera VMS system (Geovision).  There is an NVIDIA T1000 video card installed in the host and passed through to that VM.  For some reason, Lifecycle Manager generated a snapshot of the VM that was meant to be automatically removed, and removal of this snapshot was not possible while the VM was running (still investigating what the deal is with that).  The snapshot had ballooned to consume the entire 84TB spinning-rust pool that the camera storage is on, which necessitated shutting the VM down to merge the snapshot.  As we are a retail organization, having the camera system down for multiple days while the snapshot was consolidated was not an option, so a temporary VMS VM was created for service continuity while the primary VM's consolidation proceeded (it took a week!).

I think the issue may have stemmed from the video card having been unassigned from the primary VM, reassigned to the temporary VM, and then moved back again.  The problem began when the primary VM was powered back on after the consolidation completed, and it was not resolved until the host server got a full power cycle, so that reassignment is my best guess at the root cause.

Tibmeister
Expert

I think you just described the issue: high IOWAIT.  I'll bet that once the snapshot is taken into account, you'll find your storage was the main problem.
