Time-based Measurements in Virtual Machines

Introduction

All benchmarking relies on accurate timekeeping so that work can be measured with respect to the passage of time. Because hosted and hypervisor products virtualize the hardware timer, small fluctuations in guest timekeeping can occur. Details on virtual machine timekeeping are provided in many locations.

During benchmarking, these fluctuations in time can cause unexpected results. Performance measurements can appear inflated if time slows down while work occurs, deflated if time speeds up while work occurs, or somewhere in between. This topic provides some details on this phenomenon.

If the hypervisor (or the host operating system, for a hosted product) is busy with other tasks, it may stall slightly when delivering timer interrupts to the VM, so the guest clock appears to run slow. VMware products correct for these deviations by delivering the backlog of timer interrupts at a faster rate until guest time catches up, but these "slow downs" and "catch ups" may occur at different points in a benchmark.
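One way to observe this from inside a guest is sketched below. This is a hypothetical illustration, not a VMware tool: it samples the guest's monotonic clock around fixed 100ms sleeps and prints how far each interval deviates from the requested length. Intervals that read noticeably short or long may reflect these slow downs and catch ups, though ordinary scheduler jitter also contributes.

    /* A minimal sketch, assuming a POSIX guest: sample the guest clock
     * around fixed 100ms sleeps and print how far each interval deviates
     * from the requested length. */
    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <time.h>

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        const struct timespec req = {0, 100 * 1000 * 1000};  /* 100ms */
        for (int i = 0; i < 50; i++) {
            double start = now_sec();
            nanosleep(&req, NULL);
            double elapsed = now_sec() - start;
            /* large deviations may indicate virtual timer slow downs
             * or catch ups (or simply a busy host) */
            printf("interval %2d: %7.3f ms (deviation %+7.3f ms)\n",
                   i, elapsed * 1e3, (elapsed - 0.100) * 1e3);
        }
        return 0;
    }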

Artificially High Results

Consider the case where a great number of IO operations are measured over a long period of time. If a benchmark runs 10,000 IO operations and the native system takes 10ms to process each operation, the benchmark would measure 10ms per operation on the native system. However, if the benchmark is run on a virtual system and the virtualization software is busy servicing the IO operation instead of updating time, time may not progress properly during the operation. Although 10ms or more may actually have passed, the VM may only have been informed of the passage of 9ms. In that case, the operation appears to run faster on the VM than on the native system.

For benchmarks where many operations are measured over a large time window, this is not a problem. In our 10,000 operation benchmark, if time were started before operation one and stopped after operation 10,000, a fluctuation of 1ms would make no difference. After all, that is only a 1ms inaccuracy on a sequence that takes 100s to run.
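A minimal sketch of this whole-window approach follows; the do_io_operation function and its 10ms sleep are hypothetical stand-ins for real work, not part of any benchmark suite.

    /* Whole-window timing sketch: one timestamp before the first operation,
     * one after the last. A transient 1ms clock deviation is negligible
     * against a ~100s total. */
    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <time.h>

    static void do_io_operation(void)      /* hypothetical 10ms IO */
    {
        const struct timespec d = {0, 10 * 1000 * 1000};
        nanosleep(&d, NULL);
    }

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        const int n = 10000;
        double start = now_sec();
        for (int i = 0; i < n; i++)
            do_io_operation();
        double total = now_sec() - start;
        printf("total %.1f s, avg %.3f ms per op\n", total, total / n * 1e3);
        return 0;
    }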

However, if the benchmark measured each operation individually and summed the measurements to produce a result, each individual 1ms inaccuracy would accumulate over the entire run. The benchmark would report an average IO length of 9ms, even though observation of wall time would still show the passage of 100s during the 10,000 operation run.
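For contrast, here is a per-operation sketch of the same hypothetical workload. Each sample is taken inside a window where, per the example above, the guest clock may under-report by about 1ms, so the summed result drifts toward 90s while wall time shows roughly 100s.

    /* Per-operation timing sketch: each operation is measured individually
     * and the samples are summed. If the guest clock under-reports each
     * 10ms operation by ~1ms, the sum comes to ~90s even though ~100s of
     * wall time actually passed. */
    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <time.h>

    static void do_io_operation(void)      /* hypothetical 10ms IO */
    {
        const struct timespec d = {0, 10 * 1000 * 1000};
        nanosleep(&d, NULL);
    }

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        const int n = 10000;
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            double start = now_sec();
            do_io_operation();
            sum += now_sec() - start;      /* each sample carries its own error */
        }
        printf("summed %.1f s, avg %.3f ms per op\n", sum, sum / n * 1e3);
        return 0;
    }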
