Understanding VirtualCenter Performance Statistics

Introduction

VirtualCenter (VC) is the entry point for virtual platform management but is used less often than esxtop for performance analysis. On the surface, VC appears insufficient for performance analysis, but this is not necessarily the case. By default, VC collects a reduced set of performance counters to minimize the data kept in its database. The set of counters that VC maintains can be changed, and detailed analysis can be performed with those counters. This document provides the details needed to understand and enable VC's performance monitoring capabilities.

Refer to the Performance Monitoring and Analysis page for information on using these counters.

VirtualCenter Statistic Archival

Our stats infrastructure has a lot of counters but our documentation has  traditionally been quite thin in terms of descriptions. I got so sick  of asking what stats are available at what stats level that I decided to  start this page. Obviously it needs to be made more readable, but  hopefully it is a start.

Remember that stats in VC are generally organized into 2 archival categories:

  1. Not archived: these are the "real-time" (past-hour) stats, which are  refreshed every 20 seconds, and are displayed for the past hour in the  VI client. These stats are not stored in the database.
  2. Archived: these stats are aggregations (rollups) of the real-time stats. They are aggregated at different sampling intervals and stored in the database. We follow the MRTG standard.
    • Past day: past day stats take the real-time stats and roll  them up so that there is 1 data point for every 5 minutes. Thus, there  are 12 data points per hour and 288 per day.
    • Past week: past week stats take the past day stats and roll  them up so that there is 1 data point for every 30 minutes. Thus, there  are 48 data points per day and 336 per week.
    • Past month: past month stats take the past week stats and  roll them up so there is 1 data point per 2 hours. Thus, there are 12  data points per day and 360 per month (30-day month).
    • Past year: past year stats take the past month stats and roll  them up so there is 1 data point per day. Thus, there are 365 data  points per year.
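
The data-point counts above follow directly from the sampling intervals. Here is a minimal Python sketch of that arithmetic. The names `ROLLUP_INTERVAL_S`, `WINDOW_S`, and `points_in` are illustrative only and are not part of any VC API; the month is assumed to be 30 days and the year 365 days, as in the list above.

```python
# Sampling interval, in seconds, for each archived category (from the list above).
ROLLUP_INTERVAL_S = {
    "past day": 5 * 60,
    "past week": 30 * 60,
    "past month": 2 * 60 * 60,
    "past year": 24 * 60 * 60,
}

# Length of each category's window, in seconds (30-day month, 365-day year).
WINDOW_S = {
    "past day": 24 * 3600,
    "past week": 7 * 24 * 3600,
    "past month": 30 * 24 * 3600,
    "past year": 365 * 24 * 3600,
}


def points_in(category):
    """Number of data points that one full window of this category holds."""
    return WINDOW_S[category] // ROLLUP_INTERVAL_S[category]


for category in ROLLUP_INTERVAL_S:
    print(category, points_in(category))
```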


The basic flow is this: an ESX host stores statistics at 20s granularity for a period of 1 hour. Therefore, using the Host Client one can view the stats for a host/VM for the past hour, or one can view those stats using the VI client attached to VirtualCenter. ESX will also aggregate the statistics into past-day statistics and store them for up to 1 day. These past-day statistics are sent to VC periodically and then stored in the database. The database is responsible for periodically taking these past-day stats and rolling them up into 30-minute past-week stats, then rolling the past-week stats up into past-month stats, and so on. Because past-day, past-week, past-month, and past-year stats are stored in the database, I call them "archived" stats.
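
As a rough illustration of one rollup step in this flow, the sketch below averages 20s real-time samples into 5-minute past-day points. Averaging is an assumption made for illustration, since this page does not say which rollup function each counter uses, and `rollup_average` is a hypothetical helper, not a VC or ESX function.

```python
def rollup_average(samples, source_interval_s, target_interval_s):
    """Average consecutive samples so each output point covers target_interval_s.

    Assumes the target interval is a whole multiple of the source interval.
    Only complete groups produce a point; a trailing partial group is dropped.
    """
    if target_interval_s % source_interval_s != 0:
        raise ValueError("target interval must be a multiple of source interval")
    group = target_interval_s // source_interval_s
    return [
        sum(samples[i:i + group]) / group
        for i in range(0, len(samples) - group + 1, group)
    ]


# One hour of made-up 20s samples (180 values).
realtime = [float(i % 15) for i in range(180)]

# 20s -> 5 minutes: 15 samples per point, 12 points for the hour.
past_day = rollup_average(realtime, 20, 300)

# 5 minutes -> 30 minutes: 6 points per point, 2 points for the hour.
past_week = rollup_average(past_day, 300, 1800)

print(len(past_day), len(past_week))
```

The same helper can be chained for the 2-hour and 1-day intervals, which mirrors how each archived category is built from the one before it.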

VirtualCenter Statistics Level

Statistics level is a means of organizing statistics for archiving purposes. It's worth noting that only stats levels one and two are useful for deployment performance monitoring and analysis. Levels three and four provide granularity and visibility that is useful only for developers.

The concept of "stats level" applies only to the archived stats: we only store a stat in the database if we are at the appropriate stats level for that particular statistic. Non-archived stats are unaffected by stats level. In other words, every metric listed below is collected at 20s granularity and stored on the ESX host for 1 hour. However, unless VC is set to the stats level appropriate to that statistic, we will not store the data in the database or roll up the stat into a past-day stat on the ESX host. You can specify the stats level independently for each of the archiving intervals. In other words, you might want to store level 4 stats for up to 1 day, but level 3 stats for 1 week.

In practice, we use stats level to vary the level of detail for  statistics that are archived. At stats level 1, we have pretty  coarse-grained stats, while stats level 4 contains very detailed  statistics, and also includes statistics for various instances (e.g.,  for each NIC of a VM).

There are 3 important calls that I often use for stats (please refer to the SDK documentation for more information):

  1. QueryStatsByLevel: this tells you what stats are available at what  stats level. This is what I used to generate the tables below.
  2. QueryAvailableMetrics: this tells you what stats are available for a given entity during a specified time period.
  3. QueryPerf: this call takes a QuerySpec as an input and collects the  stats for the specified entity over the specified time interval.


Let me give a concrete example of stats level.

Suppose I want to know the value of mem.consumed.maximum for a given VM. This is the maximum amount of machine memory allocated to a VM (including overhead memory) over a specified interval. As shown below, this is a "level 4" statistic. This means that if I've set the stats level to 4 for past-day stats and then formulate a QuerySpec that asks for the value of this data 20 minutes ago at "past-day" granularity (i.e., at 5-minute granularity), then I will get a value. If the stats level is 2 for past-day (5-minute granularity) statistics, however, then such a query will not return a value, because it is a level-4 stat and only level-2/level-1 stats are being stored at 5-minute granularity. In contrast, even if the stats level is 1, if I formulate a QuerySpec with 20s (i.e., "real-time" or "past hour") as the interval of collection, I will get this value, because this data is stored for up to one hour at 20s granularity no matter what the stats level.
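
The rule in this example can be condensed into a small predicate. This is a sketch of the logic described above, not VC code: `is_queryable`, `REALTIME_INTERVAL_S`, and the `configured_levels` mapping are hypothetical names chosen for illustration.

```python
REALTIME_INTERVAL_S = 20  # real-time ("past hour") sampling interval


def is_queryable(counter_level, interval_s, configured_levels):
    """Return True if a counter of counter_level has data at interval_s.

    configured_levels maps each archive interval (in seconds) to the
    stats level currently set for that interval.
    """
    if interval_s == REALTIME_INTERVAL_S:
        # Real-time stats are kept for an hour regardless of stats level.
        return True
    level = configured_levels.get(interval_s)
    if level is None:
        # Not one of the configured archive intervals.
        return False
    # Archived only if the interval's level is at least the counter's level.
    return counter_level <= level


MEM_CONSUMED_MAXIMUM_LEVEL = 4  # level of mem.consumed.maximum, per the text

print(is_queryable(MEM_CONSUMED_MAXIMUM_LEVEL, 300, {300: 4}))  # past-day at level 4
print(is_queryable(MEM_CONSUMED_MAXIMUM_LEVEL, 300, {300: 2}))  # past-day at level 2
print(is_queryable(MEM_CONSUMED_MAXIMUM_LEVEL, 20, {300: 1}))   # real-time
```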

Update Interval

Understanding the update interval is a key component to understanding  the performance statistics.  The Virtual Infrastructure Client (VIC)  displays live stats at a 20s update frequency.  Archived stats are  archived at their archive frequency.  This is key to understanding the  relative amounts of data presented by VC.

For instance, a ready time of 1,000 ms in the VIC's live stats graph translates into 5% ready time (1,000 / 20,000). The same percentage of ready time at a five-minute archival frequency would be 15,000 ms (5% of 300,000 ms).
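
The conversion in this example can be written as a pair of small helpers. This is a minimal sketch; `ready_percent` and `ready_ms_for_percent` are illustrative names, not VC functions, and the interval is whatever update or archive frequency the value was sampled at.

```python
def ready_percent(ready_ms, interval_s):
    """Convert a ready-time value in ms to a percentage of the sample interval."""
    return 100.0 * ready_ms / (interval_s * 1000)


def ready_ms_for_percent(percent, interval_s):
    """Ready time in ms that corresponds to a given percentage of the interval."""
    return percent * interval_s * 1000 / 100.0


print(ready_percent(1000, 20))         # live stats, 20s interval
print(ready_ms_for_percent(5.0, 300))  # same percentage over 5 minutes
```

Dividing by the interval first makes values from the live graph and from the archived graphs directly comparable.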

Counter Index

For a list of all counters, see the vCenter Performance Counters page.

Comments

This is useful information.

Would it be possible here or in the counters list to start providing definitions for some of the key performance counters? I am doing some CPU accounting and have struggled to understand the relationships between the following CPU counters: usage, usagemhz, system, wait, ready, extra, used, and guaranteed. This article: http://kb.vmware.com/kb/1002356 provided a good start, but I still have questions. The counter definitions in the Programming Guide are self-referencing so not useful.

Some specific questions:

- Is usage % a percentage of multiple potential PCPUs, if the number of VCPUs > 1? Which of the time components represented by the msec counters are included in this % "busy"?

- What does a CPU time unit of MHz mean? I'm used to that metric as a clock rate, not a consumption metric.

- Which of the counters (system, wait, ready, extra, guaranteed) are also included in the counter "used"?

- What is a short definition for the counters "extra" and "guaranteed"?

Hello,

I included information in this document and the VirtualCenter Performance Counters page to answer your questions. Note that wait, extra, guaranteed and system are level three counters that provide you no information to guide your own monitoring and analysis work.

Scott

Version history
Revision #: 1 of 1
Last update: 05-15-2008 12:15 PM