VMware Cloud Community
ldclancy
Enthusiast

CPU Latency

I am trying to find a definition of Latency (measured in Percent) as found in the CPU chart options in ESXi 5.0.

The "counter description" suggests "the percent of time the VM is unable to run because it is contending for the access to the physical CPU(s)", which sounds very similar to the CPU Ready description that suggests the "percentage of time that the virtual machine was ready, but could not get scheduled to run on the physical CPU".

However, in my observations the CPU Latency and CPU Ready do not move together.

Any help much appreciated.

Thanks, Liam.

6 Replies
jrmunday
Commander

Hi Liam,

If you have a look at the rollup and units columns, these are actually quite different. Ready time is a summation in milliseconds, and Latency is an average in percent. I suspect that some sort of conversion will need to be done to measure these in the same units - taking into account the update interval for the chart you are looking at.

This is similar to the summation-to-percent conversion for ready time described here:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=200218...

I'll see if I can find some VMs with ready time and try to do the maths.
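For reference, here is a rough sketch of that conversion in Python. It assumes the usual rule of dividing the summation value by the chart's update interval; the interval table holds the commonly documented chart defaults and should be checked against the KB article linked above. The function name and dictionary keys are made up for illustration.

```python
# Default update intervals of the performance charts, in seconds.
# Assumption: commonly documented defaults; verify against your environment.
CHART_INTERVAL_SECONDS = {
    "realtime": 20,
    "past_day": 300,
    "past_week": 1800,
    "past_month": 7200,
    "past_year": 86400,
}


def ready_ms_to_percent(ready_ms: float, chart: str = "realtime") -> float:
    """Convert a CPU Ready summation value (ms) into a percentage of the sample interval."""
    interval_ms = CHART_INTERVAL_SECONDS[chart] * 1000
    return ready_ms * 100.0 / interval_ms


# 1000 ms of ready time inside one 20 s real-time sample.
print(ready_ms_to_percent(1000))  # 5.0
```

Note that the same summation value means very different percentages depending on the chart: 1000 ms is 5% of a real-time sample but only about 0.33% of a past-day sample.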

Cheers,

Jon

vExpert 2014 - 2022 | VCP6-DCV | http://www.jonmunday.net | @JonMunday77

ldclancy
Enthusiast

Thanks Jon

I agree that some conversion would be required to move from milliseconds/summation to percent/average. However, I'm not convinced that they really do relate to the same metric. I've been struggling to find a really good example, but have added two charts for a little test I did.

In the first chart (Capture CPU) the CPU Ready seems to be slowly declining and does not react at all to the slight rise in Latency.

The CPU Latency seems to be more closely associated with the rise in disk latency (second chart - Capture Disk).

I'm inclined to think that CPU Ready measures how long a virtual CPU spends waiting for its turn on a physical CPU, and that once it is scheduled on the CPU, CPU Latency measures how long the CPU is doing nothing (waiting for the disk). However, this is not what the description says, so I'd prefer to have some authority before I go spouting this to the team.

Regards, Liam.

Iwan_Rahabok
VMware Employee

Good question. I'm unable to find the answer either. I've posted this at Socialcast (an awesome tool for group discussion). If I hear an answer, I'll share it here.

I've added 2 charts comparing the values. Latency is _always_ higher than Ready. So I'd track Latency instead of Ready. Even if I go back 7 days, there is never a time where Ready was higher than Latency. I'm using VC Ops 5.6 for the chart.

e1

lenzker
Enthusiast

Interesting topic. Could you compare %wait as well?

Since the 3 CPU states are running, ready and waiting, the latency is probably %wait + %ready. It's just an assumption. I will try to check this out later.

VCP,VCAP-DCA,VCI -> https://twitter.com/lenzker -> http://vxpertise.net

Iwan_Rahabok
VMware Employee

There are 4 CPU states. At least according to the vSphere 5.1 CPU scheduler white paper 🙂

Below is my understanding. Do correct me if I'm wrong.

The 4 states (RUN, READY, CSTP and WAIT) add up to 100%.
100% = %RUN + %READY + %CSTP + %WAIT

When first added, a VM is either in RUN or in READY state depending on the availability of a physical CPU at the ESX layer.

A VM in READY state is dispatched by the vmkernel CPU scheduler and enters RUN state. In RUN state, the VM is being served by the hypervisor and can do what it is expected to do.

CPU Ready and CPU Latency are similar:
Latency: % of time the VM is unable to run because it is contending for access to the physical CPU.
Ready: % of time that the VM was ready, but could not get scheduled to run on the physical CPU. [e1: this is the total of all VMs in the ESX]
So it seems like Ready comes after Latency. Latency will go up first, as the preferred NUMA core might not be available, then Ready will follow when there is no core available at all.


A running VM can later be de-scheduled by the vmkernel, and then enters either the READY or the COSTOP state. Co-Stop happens if the VM has more than 1 vCPU and one of them is waiting for the other. The waiting happens because ESXi does not have enough physical CPUs to serve it. This is why you need to right-size the VM. %RDY also includes %MLMTD, which is the time the VM was ready to execute but was not scheduled for CPU time because of a Limit. You should not use Limit.

The co-stopped VM is co-started later and enters READY state, where it is ready to run. So Ready time does not include Co-Stop time. You need to measure both.

A VM in RUN state might enter WAIT state. Normally it is because it is waiting for a resource and is later woken up once the resource becomes available. This is normally IO work, e.g. waiting for a disk command to come back from the array.

When a VM is idle, not doing any work, it enters WAIT_IDLE, a special type of WAIT state. So it is not actually waiting for anything. An idle world is woken up whenever it is interrupted.

WAIT also includes %SWPWT, which is the time the VM spends waiting for the VMkernel to swap memory.
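To make the accounting above concrete, here is a small Python sketch. It only restates the relationships described in this post: the four states sum to 100%, %RDY contains %MLMTD, and WAIT contains both idle time and %SWPWT. The class, its field names and the sample numbers are made up for illustration and are not an esxtop or vSphere API.

```python
from dataclasses import dataclass


@dataclass
class CpuStates:
    """Percentages for one world over one sample (hypothetical container)."""
    run: float
    ready: float        # includes mlmtd
    cstp: float
    wait: float         # includes idle and swpwt
    mlmtd: float = 0.0  # part of ready caused by a CPU Limit
    idle: float = 0.0   # part of wait spent in WAIT_IDLE
    swpwt: float = 0.0  # part of wait spent waiting on VMkernel swapping

    def total(self) -> float:
        # The four top-level states should account for the whole interval.
        return self.run + self.ready + self.cstp + self.wait

    def ready_excluding_limit(self) -> float:
        # Ready time that is not explained by a Limit.
        return self.ready - self.mlmtd

    def non_idle_wait(self) -> float:
        # Wait time in which the VM was really blocked on something.
        return self.wait - self.idle


# Invented sample values, chosen only so that the states sum to 100.
sample = CpuStates(run=30.0, ready=5.0, cstp=1.0, wait=64.0,
                   mlmtd=2.0, idle=60.0, swpwt=0.5)
assert abs(sample.total() - 100.0) < 1e-9
print(sample.ready_excluding_limit(), sample.non_idle_wait())  # 3.0 4.0
```

In this invented sample most of the WAIT time is plain idle time, which is why WAIT on its own says little about contention.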

e1

vbondzio
VMware Employee

CPU Latency, i.e. %LAT_C in esxtop, includes ready, cstp, HT busy time and the effects of dynamic voltage and frequency scaling; it doesn't include mlmtd, though. Note that this is the same as CPU Contention in e.g. vCOps.
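As a rough illustration only, the components listed above could be combined like this in Python. The simple addition is an assumption, since the exact formula is not given here, and `ht_busy` and `dvfs` are hypothetical inputs standing in for the hyper-threading and frequency-scaling effects. Because %RDY includes %MLMTD (as noted earlier in the thread), the limit-induced part is subtracted from ready first.

```python
def estimate_cpu_latency(ready: float, cstp: float, mlmtd: float = 0.0,
                         ht_busy: float = 0.0, dvfs: float = 0.0) -> float:
    """Additive estimate of CPU Latency in percent (an assumption, not the real formula).

    ready   : %RDY, which includes %MLMTD
    cstp    : %CSTP
    mlmtd   : %MLMTD, excluded from latency
    ht_busy : hypothetical percentage lost to a busy hyper-threading sibling
    dvfs    : hypothetical percentage lost to frequency scaling
    """
    return (ready - mlmtd) + cstp + ht_busy + dvfs


# Invented values: with no Limit in play, the estimate is never below Ready.
print(estimate_cpu_latency(ready=5.0, cstp=1.0, ht_busy=1.5, dvfs=0.5))  # 8.0
```

Under this reading, Latency can only fall below Ready when a Limit is throttling the VM, which would fit the charts earlier in the thread where Latency was always the higher value.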

Cheers,

Valentin