VMware

Virtual Performance

Scott Drummonds works in a variety of performance areas at VMware: VDI, application best practices, competitive analysis, customer performance investigations, and outward bound communications. This blog will detail some of my musings on these subjects.

3 Posts tagged with the memory tag

Love Your Balloon Driver

Posted by drummonds VMware Sep 9, 2009

A couple of days ago we finally got out one of my favorite papers from our ongoing vSphere launch activities. This paper on ESX memory management, written by Fei Guo in performance engineering, has three graphs that are absolute gems. These graphs show balloon driver memory savings next to throughput numbers for three common benchmarks. The conclusion is inescapable: the balloon driver reclaims memory from over-provisioned VMs with virtually no impact to performance. This is true on every workload save one: Java.

Example 1: Kernel Compile

Linux kernel compilation models a common developer environment involving a large number of code compiles. This process is CPU and IO intensive but uses very little memory.

Picture 1.png

Results of two experiments are shown on this graph: in one, memory is reclaimed only through ballooning, and in the other, memory is reclaimed only through host swapping. The bars show the amount of memory reclaimed by ESX and the lines show the workload performance. The steadily falling green line reveals a predictable deterioration of performance due to host swapping. The red line demonstrates that as the balloon driver inflates, kernel compile performance is unchanged.

Kernel compilation performance remains high with ballooning because this workload needs very little memory and the guest OS can easily take unused pages from the application. Performance falls with swapping because ESX randomly selects virtual machine pages for swapping, whether those pages are in use by the application or not. The guest OS is better at selecting pages for reclamation than ESX is.
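The difference between the two reclamation policies can be sketched with a toy model. This is not ESX's or any guest OS's actual algorithm: the function name, the page counts, and the idea that the guest always gives up idle pages first are simplifying assumptions made only for illustration.

```python
import random


def active_pages_taken(total, active, reclaim, policy, seed=0):
    """Count in-use pages lost when `reclaim` of `total` pages are taken.

    Pages 0..active-1 stand in for the application's working set.
    All names and numbers here are hypothetical.
    """
    if not (0 <= active <= total and 0 <= reclaim <= total):
        raise ValueError("active and reclaim must be between 0 and total")
    if policy == "balloon":
        # Toy guest OS: hand over idle pages first and touch active pages
        # only once the idle pages are exhausted.
        idle = total - active
        return max(0, reclaim - idle)
    if policy == "swap":
        # Toy host swap: choose victims uniformly at random, with no
        # knowledge of which pages the application is using.
        victims = random.Random(seed).sample(range(total), reclaim)
        return sum(1 for page in victims if page < active)
    raise ValueError(f"unknown policy: {policy}")
```

With 1000 pages of which 200 are in use, `active_pages_taken(1000, 200, 500, "balloon")` returns 0, while the `"swap"` policy takes some of the active pages even at small reclaim amounts, which mirrors the shape of the two lines in the graph.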

Example 2: Oracle/Swingbench

Oracle's database is best tested against Swingbench, the OLTP load generation tool provided by Oracle. Database workloads utilize all system resources but show a non-linear dependence on memory. Memory can be safely reclaimed from OSes running databases until the cache becomes smaller than needed by the workload. The following figure shows this.

Picture 2.png

As before, the virtual machine using only ballooning maintains higher performance under memory pressure than the virtual machine whose memory is being swapped away by the host. Performance is constant and shows no negative impact due to ballooning until the balloon encroaches on the SGA. Again, ESX's host swapping randomly selects pages to send to disk which degrades performance even at small swap amounts.

As with kernel compile, the balloon driver safely reclaims memory from over-provisioned VMs with little impact to application performance.

Example 3: Java/SPECjbb

Java provides a special challenge in virtual environments due to the JVM's introduction of a third level of memory management. The balloon driver normally draws memory from the virtual machine without impacting throughput because the guest OS efficiently claims pages that its processes are not using. But in the case of Java, the guest OS is unaware of how the JVM is using memory and is forced to select memory pages as arbitrarily and inefficiently as ESX's swap routine.

Picture 3.png

Neither ESX nor the guest OS can efficiently take memory from the JVM without significantly degrading performance. Memory in Java is managed internally by the JVM, and efforts by either the host or the guest to remove pages will hurt Java application performance equally. In these environments it is wise to manually set the JVM's heap size and specify memory reservations for the virtual machine in ESX to account for the JVM, OS, and heap.
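As a rough sketch of that sizing exercise, the reservation can be treated as the sum of the three components named above. The helper names and the example sizes are hypothetical. Pinning the initial heap to the maximum with `-Xms` and `-Xmx` is one common way to fix the heap size, assumed here rather than taken from the paper.

```python
def jvm_heap_flags(heap_mb):
    """Return JVM flags that fix the heap at heap_mb (initial == maximum)."""
    if heap_mb <= 0:
        raise ValueError("heap_mb must be positive")
    return [f"-Xms{heap_mb}m", f"-Xmx{heap_mb}m"]


def reservation_mb(heap_mb, jvm_overhead_mb, guest_os_mb):
    """Sum the heap, the JVM's non-heap memory, and the guest OS footprint."""
    for value in (heap_mb, jvm_overhead_mb, guest_os_mb):
        if value < 0:
            raise ValueError("sizes must be non-negative")
    return heap_mb + jvm_overhead_mb + guest_os_mb
```

For example, a 2048 MB heap with an assumed 512 MB of JVM overhead and 1024 MB for the guest OS gives `reservation_mb(2048, 512, 1024)`, or 3584 MB. The overhead and OS figures should be measured for the real workload rather than copied from this sketch.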

Conclusions and Scott's Special Recommendation

Love your balloon driver. Your application owners are always asking for more memory than they need. With great comfort you can over-provision memory some and rely on ESX and the balloon driver to reclaim what is not in use. Without the balloon driver, ESX will be forced to use its last technology for managing memory over-commit: host swapping. And host swapping always decreases performance.

So here is my special recommendation for you: never, ever disable the balloon driver. Disabling it forces the host to swap that virtual machine's memory, should that resource become scarce. And where ballooning usually will not hurt performance, swapping always will. If you must protect an application from memory reclamation due to memory over-commitment, use reservations. They make admission control more effective, they self-document the needs of the VM, and they are easily configured.


Newer processors are much more important to virtualized environments than to physical, un-virtualized ones. The generational improvements haven't just increased raw compute power; they've also reduced the overheads associated with virtualization. This blog entry will describe three key changes that have particularly impacted virtual performance.

Hardware Assist Is Faster

In 2008, AMD became the first CPU vendor to produce a hardware memory management unit equipped to support virtualization. They called this technology Rapid Virtualization Indexing (RVI). This year Intel did the same with Extended Page Tables (EPT) on its Xeon 5500 line. Both vendors have been providing the ability to virtualize privileged instructions since 2006, with continually improving results. Consider the following graph showing the latency of one key instruction from Intel:

vmexit_latencies.png

This instruction, VMEXIT, is called each time the guest exits to the kernel. The graph shows its latency (delay) in completing this instruction, which represents a wait time incurred by the guest. Clearly Intel has made great strides in reducing VMEXIT's wait time from its Netburst parts (Prescott and Cedar Mill) to its Core architecture (Merom and Penryn) and on to its current generation, Core i7 (Nehalem). AMD processors have shown commensurate gains with AMD-V.

Pipelines Are Shorter

The longest pipelines in the x86 world were in Intel's Netburst processors. These processors' pipelines had twice as many stages as their counterparts at AMD and twice as many as the generation of Intel CPUs that followed. The increased pipeline length would have enabled support for 8 GHz silicon, had it arrived. Instead, silicon switching speeds hit a wall at 4 GHz and Intel (and its customers) were forced to suffer the drawbacks of long pipelines.

Long pipelines aren't necessarily a problem for desktop environments, where single-threaded applications used to dominate the market. But in the enterprise, application thread counts were larger. Furthermore, consolidation in virtual environments drew thread counts even higher. With more contexts in the processor, the number of pipeline stalls and flushes increased, and performance fell.

Because of decreased efficiency of consolidated workloads on processors with long pipelines, VMware has often recommended that performance-intensive VMs be run on processors no older than 2-3 years. This excludes Intel's Netburst parts. VI3 and vSphere will do a fine job at virtualizing your less-demanding applications on any supported processors. But use newer parts for the applications that hold your highest performance expectations.

Caches Are Larger

A cache is highly effective when it fully contains the software's working set. The addition from the hypervisor of even a small amount of code will change the working set and reduce the cache hit rate. I've attempted to illustrate this concept with the following simplified view of the relationship between cache hit rates, application working set, and cache sizes:

cache_hit_rates.png

This graph is based on a model that greatly simplifies working sets and the hypervisor's impact on them. Assuming that ESX increases the working set by 256 KB, this graph shows the difference in cache hit rate due to the contributions of the hypervisor. Notice that with very small caches and very small application working sets, the cache hit rate suffers greatly due to the addition of even 256 KB of virtualization support instructions. And even up to 2 MB, a 10% decrease in cache hit rate can be seen in some applications. With a 256 KB contribution by the kernel, cache hit rates do not change significantly with cache sizes of 4 MB and beyond.
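The model behind the graph is not spelled out here, so the following is only a minimal stand-in. It assumes uniform access over the working set, so the hit rate is simply the fraction of the working set that fits in the cache. The function names are hypothetical, and the 256 KB figure is the one assumed above.

```python
HYPERVISOR_KB = 256  # working-set growth attributed to ESX in the text


def hit_rate(cache_kb, working_set_kb):
    """Toy hit rate: the fraction of the working set that fits in the cache."""
    if cache_kb <= 0 or working_set_kb <= 0:
        raise ValueError("sizes must be positive")
    return min(1.0, cache_kb / working_set_kb)


def virtualization_penalty(cache_kb, app_working_set_kb,
                           hypervisor_kb=HYPERVISOR_KB):
    """Drop in hit rate when the hypervisor's code joins the working set."""
    native = hit_rate(cache_kb, app_working_set_kb)
    virtual = hit_rate(cache_kb, app_working_set_kb + hypervisor_kb)
    return native - virtual
```

Under these assumptions the worst case occurs when the application's working set exactly fills the cache: `virtualization_penalty(2048, 2048)` is about 0.11, while `virtualization_penalty(4096, 4096)` falls to about 0.06, and any working set that fits with 256 KB to spare sees no penalty at all.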

In some cases a 10% improvement in cache hit rate can double application throughput. This means that a doubling of cache size can profoundly affect the performance of virtual applications as compared to native. Given ESX's small contribution to the working set, you can see why we at VMware recommend that customers run their performance-intensive workloads on CPUs with 4 MB caches or larger.


I recently attended a practice talk for next week's Partner Exchange hosted by Kit Colbert, one of our senior engineers, who is leading a whole bunch of cool efforts around performance. I wanted to "leak" one slide that he showed us that we'll be touching up for publication. Those of you who are curious about memory counters and want a different take from Memory Performance Analysis and Monitoring may find this interesting.

guest_host_memory.png

Some of this stuff won't make sense outside of Kit's presentation, but let me point out a few things that may help consume the information in this incredible chart:

  • One of the key messages from Kit's presentation is that ESX reports memory with respect to the guest (the VM) and the host. The very top rectangle shows memory stats reported for each VM. The second rectangle shows the single VM's memory stats reported by each host.
  • As can be seen from the above, the consumed memory in the host represents everything in the VM, minus the savings due to page sharing.
  • This graph doesn't yet highlight the difference between ballooned memory and swapped memory from the guest perspective. From the guest's perspective, swapped memory is much more attractive than ballooned memory, as the guest doesn't know that the swapped memory is gone. But it does see the ballooned memory as pinned. ESX is clever enough to deflate the balloon driver, if possible, when the guest starts to access swapped memory to avoid the host's swapping of guest memory.
  • The final rectangle shows memory of all VMs from the host's perspective. Don't pay attention to the reserved and unreserved memory; I'm told those are unnecessary distractions that will be removed.
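The second point in the list above reduces to a single subtraction. The sketch below only restates it, with hypothetical names and numbers. It is not how ESX computes its counters, and it ignores anything the chart may break out separately.

```python
def host_consumed_mb(vm_memory_mb, sharing_savings_mb):
    """Host-side consumed memory: the VM's memory minus page-sharing savings."""
    if not 0 <= sharing_savings_mb <= vm_memory_mb:
        raise ValueError("savings must be between 0 and the VM's memory")
    return vm_memory_mb - sharing_savings_mb
```

For instance, a VM with 4096 MB in use and 512 MB saved through page sharing would show `host_consumed_mb(4096, 512)`, or 3584 MB, as consumed on the host.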

Kit is going to be in Orlando with me next week to talk about ESX and guest memory management. He's going to explain the difficult process of recovering unused memory from guests to enable over-commitment. Be sure and see him if you're in town!
