Understanding VirtualCenter Performance Statistics

Introduction

VirtualCenter (VC) is the entry point for virtual platform management but is less frequently used for performance analysis than esxtop. On the surface, VC appears insufficient for performance analysis, but this is not necessarily the case. The set of performance counters that VC collects is reduced by default to minimize the data maintained in VC's database. That set can be changed, and detailed analysis can be performed based on the resulting counters. This document provides the details necessary for understanding and enabling VC's performance monitoring capabilities.

Refer to Performance Monitoring and Analysis for information on using these counters.

VirtualCenter Statistic Archival

Our stats infrastructure has a lot of counters but our documentation has traditionally been quite thin in terms of descriptions. I got so sick of asking what stats are available at what stats level that I decided to start this page. Obviously it needs to be made more readable, but hopefully it is a start.

Remember that stats in VC are generally organized into 2 archival categories:

Not archived: these are the "real-time" (past-hour) stats, which are refreshed every 20 seconds, and are displayed for the past hour in the VI client. These stats are not stored in the database.
Archived: these stats are aggregations (rollups) of the real-time stats. They are aggregated at different sampling intervals and stored in the database. We follow the MRTG standard.
- Past day: past day stats take the real-time stats and roll them up so that there is 1 data point for every 5 minutes. Thus, there are 12 data points per hour and 288 per day.
- Past week: past week stats take the past day stats and roll them up so that there is 1 data point for every 30 minutes. Thus, there are 48 data points per day and 336 per week.
- Past month: past month stats take the past week stats and roll them up so there is 1 data point per 2 hours. Thus, there are 12 data points per day and 360 per month (30-day month).
- Past year: past year stats take the past month stats and roll them up so there is 1 data point per day. Thus, there are 365 data points per year.
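The data point counts in the list above follow directly from dividing each archival window by its sampling interval. A minimal sketch of that arithmetic, where the names INTERVALS and data_points are illustrative and not part of any VC API:

```python
# Sampling interval and window length for each archival category,
# taken from the list above and expressed in seconds.
INTERVALS = {
    "past day": (5 * 60, 24 * 3600),
    "past week": (30 * 60, 7 * 24 * 3600),
    "past month": (2 * 3600, 30 * 24 * 3600),   # 30-day month
    "past year": (24 * 3600, 365 * 24 * 3600),
}


def data_points(sample_seconds, window_seconds):
    """Number of rolled-up data points kept for one archival window."""
    return window_seconds // sample_seconds


for name, (sample, window) in INTERVALS.items():
    print(f"{name}: {data_points(sample, window)} data points")
```

Running it prints one line per archival category, which can be checked against the numbers quoted in the list.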

The basic flow is this: an ESX host stores statistics at 20s granularity for a period of 1 hour. Therefore, using the Host Client one can view the stats for a host/VM for the past hour, or one can view those stats using the VI client attached to VirtualCenter. ESX will also aggregate the statistics into past-day statistics and store them for up to 1 day. These past-day statistics are sent to VC periodically and then stored in the database. The database is responsible for periodically taking these past-day stats and rolling them up into 30-minute weekly stats, and then doing the same for converting the weekly stats to monthly stats, etc. Because past-day, past-week, past-month, and past-year stats are stored in the database, I call them "archived" stats.
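The first rollup step in this flow (20s real-time samples into 5-minute past-day points) can be sketched as follows. This is only an illustration under stated assumptions: the function roll_up and the sample data are hypothetical, and averaging is assumed as the default way to combine samples, whereas the actual rollup applied depends on the counter (a ".maximum" counter, for example, would presumably keep the maximum).

```python
from statistics import mean

REALTIME_SECONDS = 20       # real-time sampling period from the text
PAST_DAY_SECONDS = 5 * 60   # past-day sampling period from the text


def roll_up(samples, source_seconds, target_seconds, combine=mean):
    """Collapse fixed-period samples into coarser data points.

    `combine` is an assumption for illustration: mean by default,
    but any function of a list (max, min, sum) can be passed in.
    Trailing samples that do not fill a whole point are dropped.
    """
    per_point = target_seconds // source_seconds
    return [
        combine(samples[i:i + per_point])
        for i in range(0, len(samples) - per_point + 1, per_point)
    ]


# One hour of made-up real-time samples: 180 values at 20s each.
realtime = [float(i % 15) for i in range(180)]

past_day = roll_up(realtime, REALTIME_SECONDS, PAST_DAY_SECONDS)
past_day_max = roll_up(realtime, REALTIME_SECONDS, PAST_DAY_SECONDS, combine=max)

print(len(past_day))      # number of 5-minute points in one hour
print(past_day[0], past_day_max[0])
```

The same function could be applied again with 300 and 1800 seconds to mimic the past-day to past-week step.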

VirtualCenter Statistics Level

Statistics level is a means of organizing statistics for archiving purposes. It's worth noting that only stats levels one and two are useful for deployment performance monitoring and analysis. Levels three and four provide granularity and visibility that is useful only for developers.

The concept of "stats level" applies only to the archived stats: we only store a stat in the database if we are at the appropriate stats level for that particular statistic. Non-archived stats are unaffected by stats level. In other words, every metric listed below is collected at 20s granularity and stored on the ESX host for 1 hour. However, unless VC is set to the stats level appropriate to that statistic, we will not store the data in the database or roll up the stat into a past-day stat on the ESX host. You can specify the stats level independently for each of the archiving intervals. In other words, you might want to store level 4 stats for up to 1 day, but level 3 stats for 1 week.

In practice, we use stats level to vary the level of detail for statistics that are archived. At stats level 1, we have pretty coarse-grained stats, while stats level 4 contains very detailed statistics, and also includes statistics for various instances (e.g., for each NIC of a VM).

There are 3 important calls that I often use for stats (please refer to the SDK documentation for more information):

QueryStatsByLevel: this tells you what stats are available at what stats level. This is what I used to generate the tables below.
QueryAvailableMetrics: this tells you what stats are available for a given entity during a specified time period.
QueryPerf: this call takes a QuerySpec as an input and collects the stats for the specified entity over the specified time interval.

Let me give a concrete example of stats level.

Suppose I want to know the value of mem.consumed.maximum for a given VM. This is the maximum amount of machine memory allocated to a VM (including overhead memory) over a specified interval. As shown below, this is a "level 4" statistic. This means that if I've set the stats level to 4 for past-day stats and then formulate a QuerySpec that asks for the value of this data 20 minutes ago at "past-day" granularity (i.e., at 5-minute granularity), then I will get a value. If the stats level is 2 for past-day (5-minute granularity) statistics, however, then such a query will not return a value, because it is a level-4 stat and only level-1 and level-2 stats are being stored at 5-minute granularity. In contrast, even if the stats level is 1, if I formulate a QuerySpec with 20s (i.e., "real-time" or "past hour") as the interval of collection, I will get this value, because this data is stored for up to one hour at 20s granularity no matter what the stats level.
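The rule in this example can be written down as a small predicate. This is a sketch of the logic only, not of the SDK: the function is_stored, the interval names, and the dictionary of per-interval levels are all hypothetical stand-ins for whatever VC is actually configured with.

```python
REALTIME = "real-time"  # the 20s, past-hour interval


def is_stored(stat_level, interval, interval_levels):
    """Return True if a stat of `stat_level` can be queried at `interval`.

    Real-time stats ignore the stats level entirely. For an archived
    interval, the stat is stored only if the level configured for that
    interval is at least the stat's own level.
    """
    if interval == REALTIME:
        return True
    return interval_levels[interval] >= stat_level


# mem.consumed.maximum is a level-4 stat in the example above.
MEM_CONSUMED_MAXIMUM_LEVEL = 4

# Past-day stats level set to 4: the past-day query returns a value.
print(is_stored(MEM_CONSUMED_MAXIMUM_LEVEL, "past-day", {"past-day": 4}))

# Past-day stats level set to 2: the same query returns nothing.
print(is_stored(MEM_CONSUMED_MAXIMUM_LEVEL, "past-day", {"past-day": 2}))

# Stats level 1, but queried at real-time granularity: still available.
print(is_stored(MEM_CONSUMED_MAXIMUM_LEVEL, REALTIME, {"past-day": 1}))
```

Because the levels are kept per interval, the same dictionary can hold a different level for each archived interval, matching the "level 4 for 1 day, level 3 for 1 week" configuration described earlier.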

Update Interval

Understanding the update interval is a key component to understanding the performance statistics. The Virtual Infrastructure Client (VIC) displays live stats at a 20s update frequency. Archived stats are archived at their archive frequency. This is key to understanding the relative amounts of data presented by VC.

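For instance, a ready time of 1,000 ms in the VIC's live stats graph translates into 5% ready time (1,000 / 20,000), because each live sample covers a 20s (20,000 ms) interval. The same 5% of ready time at a five-minute archival frequency, where each sample covers 300,000 ms, would appear as 15,000 ms.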
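The conversion between a millisecond value and a percentage depends only on the length of the sampling interval. A minimal sketch of the arithmetic, with hypothetical helper names:

```python
def ready_percent(ready_ms, interval_seconds):
    """Ready time as a percentage of one sampling interval."""
    return 100.0 * ready_ms / (interval_seconds * 1000)


def ready_ms_for_percent(percent, interval_seconds):
    """Milliseconds of ready time that `percent` represents in one interval."""
    return percent * interval_seconds * 1000 / 100.0


# 1,000 ms of ready time in a 20s live sample.
live_pct = ready_percent(1000, 20)

# The same percentage expressed over a five-minute archived sample.
archived_ms = ready_ms_for_percent(live_pct, 5 * 60)

print(live_pct, archived_ms)
```

When comparing a live graph with an archived one, converting both to a percentage first avoids misreading the larger archived millisecond values as a performance regression.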

Counter Index

For a list of all counters, see the vCenter Performance Counters page.

pwgayek · 06-09-2008

This is useful information.

Would it be possible here or in the counters list to start providing definitions for some of the key performance counters? I am doing some CPU accounting and have struggled to understand the relationships between the following CPU counters: usage, usagemhz, system, wait, ready, extra, used, and guaranteed. This article: http://kb.vmware.com/kb/1002356 provided a good start, but I still have questions. The counter definitions in the Programming Guide are self-referencing so not useful.

Some specific questions:

- Is usage % a percentage of multiple potential PCPUs, if the number of VCPUs > 1? Which of the time components represented by the msec counters are included in this % "busy"?

- What does a CPU time unit of MHz mean? I'm used to that metric as a clock rate, not a consumption metric.

- Which of the counters (system, wait, ready, extra, guaranteed) are also included in the counter "used"?

- What is a short definition for the counters "extra" and "guaranteed"?

drummonds · 06-09-2008

Hello,

I included information in this document and the VirtualCenter Performance Counters page to answer your questions. Note that wait, extra, guaranteed and system are level three counters that provide you no information to guide your own monitoring and analysis work.

Scott
