VMware Cloud Community

CPU Ready vs Co Stop vs Contention vs Steal

We're running several hundred VMs on one of our clusters, and multiple business units manage the servers on these clusters at the OS level. One business unit runs its own monitoring software on its Windows servers, and that software reports that 'CPU Steal' is very high and attributes it to CPU contention on the hosts. We manage the underlying infrastructure, so I'm trying to match vSphere metrics against what they are reporting.

I'm not familiar with CPU Steal, but I would typically review the CPU Ready values of a VM experiencing CPU performance issues. As a general rule of thumb in my experience, a CPU Ready value below 5% is nothing to worry about.

Looking at one particular VM with reported high CPU Steal as an example, CPU Ready is very low, peaking at 1.2%. However, CPU Co-Stop reached peaks of 250 ms during these 1.2% ready peaks. If these two values indicate the same (or similar) information (the VM's vCPU is waiting to run on the host's physical CPU), how can the values be so different?

Looking at the Max VM CPU Contention values from vROps at the cluster level, they range from 4 to 16. What is an acceptable value for this metric?

2 Replies

So this is my understanding, but I could be wrong as I am not a Windows support engineer.

CPU steal is similar to co-stop.

CPU Ready is the amount of time the VM was ready to execute on the physical cores but couldn't. Important: note "the VM was ready".

CPU Co-Stop is the amount of time from when the first vCPU was able to schedule on a physical core to when the last one was (co-scheduling).

A VM with 1 vCPU will not have any co-stop, so co-stop only affects VMs with multiple vCPUs. Also, in your case you are looking at VM ready as a % and co-stop in ms; you would need to use the same unit for both.
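To put both counters in the same unit, a summation value in ms can be divided by the length of the sampling interval. The sketch below assumes the standard vSphere chart intervals (20 seconds for real-time charts, longer for historical ones); the function name and the optional per-vCPU division are my own illustration, and the interval used for the 250 ms figure in the question is an assumption, since the post does not say which chart it came from.

```python
# Sampling intervals (in seconds) of the standard vSphere performance charts.
CHART_INTERVAL_SECONDS = {
    "realtime": 20,
    "past_day": 300,
    "past_week": 1800,
    "past_month": 7200,
    "past_year": 86400,
}


def summation_ms_to_percent(value_ms, interval_seconds, vcpus=1):
    """Convert a summation counter (ms) into a percentage of the sample interval.

    vcpus=1 gives the VM-level total; pass the vCPU count to get a
    per-vCPU average instead (hypothetical convenience parameter).
    """
    if interval_seconds <= 0 or vcpus <= 0:
        raise ValueError("interval_seconds and vcpus must be positive")
    return value_ms / (interval_seconds * 1000 * vcpus) * 100


# Assumption: the 250 ms co-stop peak was read from a real-time chart.
costop_pct = summation_ms_to_percent(250, CHART_INTERVAL_SECONDS["realtime"])
print(f"{costop_pct:.2f}%")  # 1.25%
```

If that assumption holds, a 250 ms co-stop peak is in the same range as the 1.2% ready peak, so the two numbers may not be as far apart as they first look.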

So how could they be different?

Let's say you have a VM with 8 vCPUs. If NO physical cores are free at the time of execution, your VM was ready but could not execute, so ready time will be high. Because NO physical cores were free up to this point, none of the vCPUs got scheduled, so there is no time between the first and the last vCPU and your co-stop will be 0.

If it takes 500 ms for the first core to be free, then 300 ms later 4 more are free, and 200 ms after that the last one is, your co-stop will be 500 ms but your ready time will be 1 second. Ready incorporates the full event, whereas co-stop doesn't care about the first 500 ms.
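The arithmetic of that example can be written out as a small sketch. It only encodes the simplified model described in this reply (ready runs from the start of waiting to the last vCPU being placed, co-stop from the first vCPU to the last); it is not how the ESXi scheduler actually accounts for these counters, and the function name is made up for illustration.

```python
def ready_and_costop_ms(core_available_at_ms):
    """Return (ready_ms, costop_ms) under the simplified model in this reply.

    core_available_at_ms: for each vCPU, the time in ms (measured from the
    moment the VM became ready) at which a physical core became free for it.
    """
    if not core_available_at_ms:
        raise ValueError("need at least one vCPU")
    first = min(core_available_at_ms)
    last = max(core_available_at_ms)
    ready_ms = last            # whole wait, from becoming ready to the last vCPU
    costop_ms = last - first   # only the gap between first and last vCPU
    return ready_ms, costop_ms


# First core free at 500 ms, four more at 800 ms, the last one at 1000 ms.
print(ready_and_costop_ms([500, 800, 800, 800, 800, 1000]))  # (1000, 500)

# All 8 cores become free at the same moment: ready is high, co-stop is 0.
print(ready_and_costop_ms([1000] * 8))  # (1000, 0)
```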

There are other things that affect co-stop (hyper-threading accounting, co-run, scheduling affinity and more). If you want to put yourself to sleep, feel free to buy the host resource deep dive book.

Both will point to performance problems, but you can't expect that if one is high the other will be too, as they are based on two different things.


I see this thread has aged a bit, but the subject is still relevant.

I like the explanation and it feels comfortably understandable, but I'm having a hard time confirming whether it's correct or not.

VMware writes about Ready time:

"Percentage of time the resource pool, virtual machine, or world was ready to run, but was not provided CPU resources on which to execute."



"Percentage of time that the virtual machine was ready, but could not get scheduled to run on the physical CPU.

CPU ready time is dependent on the number of virtual machines on the host and their CPU loads."


VMware writes about stolen time:

“….“stolen time;” that is, time when the guest operating system was ready to run, but the virtual machine was descheduled by the host scheduler.”


"stolen time—that is, the amount of time when the kernel would have run a nonidle process but was descheduled."


So the difference seems to be "could not get scheduled" for ready time versus "was descheduled" for stolen time.
There's no explicit mention of having access to a number of cores, as the answer suggests, but this might indeed be what's meant.

VMware writes about Co-Stop:

"As storage I/O for snapshots grows, co-stop (%CSTP) values for a VM with multiple vCPUs can increase as the vCPUs wait on I/O completion."



"Percentage of time a resource pool spends in a ready, co-deschedule state."


That sounds like an I/O-related issue caused by the creation of snapshots.

Do I understand the answer correctly: is it true that %RDY counts from when the VM starts waiting for the CPU until the last core is accessible, while steal counts from the first core becoming accessible until the last one does?
Does that make %ST a subset of %RDY, and how do I correctly convert the performance counter for stolen time (which is measured in ms) to a % stolen time?
Would the following formula be correct? ( stolen time in ms * 100 ) / vCPU-count
