VMware Cloud Community
FreddyFredFred
Hot Shot

help interpreting esxtop vm cpu values

I've been looking at esxtop stats lately to try and understand why VMs sometimes seem to feel sluggish. Before I go on to tackle the storage side of things, I figured I'd start with the hosts. I've been reading a bunch of stuff but I'm still having a hard time understanding so hopefully someone can help me.

In watching esxtop, I saw a high %VMWAIT on a VM so I expanded the VM and saw this:

      ID      GID NAME             NWLD   %USED    %RUN    %SYS   %WAIT %VMWAIT    %RDY   %IDLE  %OVRLP   %CSTP  %MLMTD  %SWPWT
30289238 57794167 vmx                 1    0.81    0.06    0.75   96.40       -    0.01    0.00    0.00    0.00    0.00    0.00
30289240 57794167 vmast.30289239      1    0.02    0.02    0.00   96.45       -    0.00    0.00    0.00    0.00    0.00    0.00
30289241 57794167 vmx-vthread-4:V     1    0.00    0.00    0.00   96.48       -    0.00    0.00    0.00    0.00    0.00    0.00
30289242 57794167 vmx-vthread-5:V     1    0.00    0.00    0.00   96.48       -    0.00    0.00    0.00    0.00    0.00    0.00
30289243 57794167 vmx-mks:VM863_W     1    0.00    0.00    0.00   96.47       -    0.00    0.00    0.00    0.00    0.00    0.00
30289244 57794167 vmx-svga:VM863_     1    0.20    0.22    0.00   96.17       -    0.09    0.00    0.00    0.00    0.00    0.00
30289245 57794167 vmx-vcpu-0:VM86     1    9.31    9.31    0.00   86.69   29.53    0.48   57.16    0.23    0.00    0.00    0.00


Can someone tell me what %VMWAIT is telling me in this case? It looks like the VM is doing nothing so what is the vm "waiting" for? The value goes back down to zero after a few refreshes but seems to come back a little later.

Another thing I see is %USED > %RUN and sometimes %RUN > %USED, and I'm not sure what that means. Is %USED what is physically used on the processor and %RUN what the VM is actually trying to use? If so, does that mean that when %RUN > %USED, I have a resource problem?

Example:

On a host with 30% CPU used (as reported by the vSphere Client and esxtop), %USED > %RUN (although sometimes I see things flip and %RUN > %USED for a few seconds, but generally the values are within 10 points of each other, with the VM using the most CPU being closer to 5-10).

On a 2nd host with 46% reported in the vSphere Client, the load averages in esxtop are more like 0.8 and %RUN > %USED (some VMs have %RUN twice as much as %USED).

Why is there such a difference between the GUI and esxtop in regards to load for the 2nd host? Is the 2nd host overloaded because %RUN > %USED for almost all vms? %RDY is anywhere from 1 to 10%. I think this is telling me that even though I seem to have cpu speed to spare, I have too many VMs on the 2nd host and trying to schedule them all is making it so none of them is getting enough CPU time?

Thanks

2 Replies
tedg_vCrumbs
Enthusiast

Do not rely on the GUI for real-time stats. Esxtop will always be in real time.

You are also seeing %OVRLP.

I would look at I/O bottlenecks. Is that VM on different storage or requesting a lot of I/O?



Info on esxtop below.  Hope some of this helps.

http://www.yellow-bricks.com/esxtop/

Interpreting esxtop 4.1 Statistics

http://www.running-system.com/wp-content/uploads/2012/08/esxtop_english_v11.pdf

"Another thing I see is %USED > %RUN and sometimes %RUN > %USED and I'm not sure what that means. Is %USED what is physically used on processor and %RUN what the VM is actually trying to use? If so, does that mean that when %RUN > %USED, I have a resource problem?"


From the esxtop bible.

  • "%USED"

The percentage physical CPU time accounted to the world. If a system service runs on behalf of this world, the time spent by that service (i.e. %SYS) should be charged to this world. If not, the time spent (i.e. %OVRLP) should not be charged against this world. See notes on %SYS and %OVRLP.

%USED = %RUN + %SYS - %OVRLP

  • "%RUN"

The percentage of total scheduled time for the world to run.

Q: What is the difference between %USED and %RUN?

A: %USED = %RUN + %SYS - %OVRLP. (%USED takes care of the system service time.) Details above.

Q: What does it mean if %RUN of a VM is high?

A: The VM is using lots of CPU resource. It does not necessarily mean the VM is under resource constraint. Check the description of %RDY below, for determining CPU contention.

  • "%WAIT"

The percentage of time the world spent in wait state.

This %WAIT is the total wait time. I.e., the world is waiting for some VMKernel resource. This wait time includes I/O wait time, idle time and among other resources. Idle time is presented as %IDLE.

--------------

"%VMWAIT" does not take into account idle time.
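
One way to read the two definitions above: if %WAIT includes idle time and %VMWAIT does not, then %WAIT minus %IDLE should land close to %VMWAIT. Here is a quick check in Python using the vmx-vcpu-0 row from the first post. Treat it as a rough sketch; the helper name is made up for illustration, and the subtraction is an interpretation of the descriptions rather than a formula quoted from the doc.

```python
def non_idle_wait(wait_pct, idle_pct):
    """Wait time left over once idle time is taken out of %WAIT."""
    return round(wait_pct - idle_pct, 2)

# Values copied from the vmx-vcpu-0 row in the original post.
wait = 86.69     # %WAIT: total wait time, idle included
idle = 57.16     # %IDLE
vmwait = 29.53   # %VMWAIT as reported by esxtop

derived = non_idle_wait(wait, idle)
print(f"%WAIT - %IDLE = {derived:.2f}  (esxtop %VMWAIT: {vmwait:.2f})")
```

For that row the two numbers line up, so the 29.53 is the part of the vCPU's wait time that was not plain idling.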



------ tedg Don't forget to mark posts as helpful or correct if they deserve it!
FreddyFredFred
Hot Shot

I've read all that stuff multiple times, but I still have a hard time grasping everything. I figured maybe someone might be able to explain it in a slightly different way.

Anyway, the %USED formula confuses me because the numbers never add up. Taking another VM as an example (because the %VMWAIT on the original doesn't seem to be going high anymore):

%USED: 120.70

%RUN: 190.43

%SYS: 1.07

%WAIT: 619.36

%VMWAIT: 6.15

%RDY: 0.44

%IDLE: 5.75

%OVRLP: 0.25

%CSTP: 0.05

The used/run numbers are high, but I don't think that's a problem because the VM is actually busy doing work. The problem is that %USED = %RUN + %SYS - %OVRLP doesn't add up in this example. What am I missing?
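
To make the mismatch concrete, here is the arithmetic as a small Python sketch. The function name is only for illustration. It just plugs the numbers above into the formula from the esxtop doc; it does not explain where the gap comes from.

```python
def expected_used(run_pct, sys_pct, ovrlp_pct):
    """%USED as given by the quoted formula: %RUN + %SYS - %OVRLP."""
    return run_pct + sys_pct - ovrlp_pct

# Values copied from the esxtop readings listed above.
reported_used = 120.70
expected = expected_used(run_pct=190.43, sys_pct=1.07, ovrlp_pct=0.25)
gap = expected - reported_used

print(f"formula gives {expected:.2f}, esxtop reports {reported_used:.2f}")
print(f"difference: {gap:.2f} points")
```

So the formula gives 191.25 while esxtop reports 120.70, a difference of about 70 points, which is far too large to be rounding.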

As for the IO potentially causing VMWAIT to be high, this is what I am seeing for that VM:

CMDS/s: 72.83

READS/s: 0.35

WRITES/s: 72.48

MBREAD/s: 0.02

MBWRITE/s: 2.23

Lat/rd: 3.98

Lat/wr: 1.46

We have a couple of EqualLogic SANs (not grouped together) carved into 1.5TB LUNs and presented to the hosts. There are roughly 400 VMs total and we end up with 20-30 VMs per LUN.
