VMware

Virtual Performance

Scott Drummonds works in a variety of performance areas at VMware: VDI, application best practices, competitive analysis, customer performance investigations, and outward bound communications. This blog will detail some of my musings on these subjects.

4 Posts tagged with the esx tag

I spend a great deal of time answering customers' questions about the scheduler. Never have so many questions been asked about such an abstruse component over which users have so little influence. But CPU scheduling is central to system performance, so VMware strives to provide as much information on the subject as possible. In this blog entry, I want to point out a few nuggets of information on the CPU scheduler. These four items answer 95% of the questions I get asked.

Item 1: ESX 4's Scheduler Better Uses Caches Across Sockets

On UMA systems with low load levels, virtual machine performance improves when each virtual CPU (vCPU) is placed on its own socket. This is because giving each vCPU its own socket also gives it the entire cache on that socket. On page 18 of a recent paper on the scheduler written by Seongbeom Kim, a graph highlights the case where vCPU spreading improves performance.

Picture 2.png

The X-axis represents different combinations of VM and vCPU counts. SPECjbb is memory intensive and shows great gains with increases in CPU cache. The few cases that show dramatic benefit due to the ESX 4.0 scheduler are benefiting from the distribution of vCPUs across sockets. Very large gains are possible in this somewhat uncommon case.
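
As a rough illustration of why spreading helps, the sketch below estimates how much last-level cache each vCPU sees when a VM's vCPUs are packed onto one socket versus spread across sockets. It assumes that vCPUs sharing a socket split that socket's cache evenly, which is a simplification. The function name and the 6 MB cache size are hypothetical and are not taken from the paper.

```python
from math import ceil


def llc_per_vcpu_mb(vcpus, sockets, llc_mb_per_socket, spread):
    """Estimate last-level cache per vCPU on the busiest socket.

    Assumption (not from the paper): vCPUs on the same socket
    split that socket's cache evenly.
    """
    if vcpus < 1 or sockets < 1:
        raise ValueError("vcpus and sockets must be positive")
    # Spread: distribute vCPUs round-robin across sockets.
    # Packed: every vCPU lands on the same socket.
    on_busiest_socket = ceil(vcpus / sockets) if spread else vcpus
    return llc_mb_per_socket / on_busiest_socket


# Hypothetical 2-socket host with 6 MB of cache per socket, one 2-vCPU VM.
packed = llc_per_vcpu_mb(2, 2, 6.0, spread=False)
spread = llc_per_vcpu_mb(2, 2, 6.0, spread=True)
print(packed, spread)  # 3.0 6.0
```

In this toy case spreading doubles the cache available to each vCPU, which is the kind of gain a cache-hungry workload like SPECjbb can exploit.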

Item 2: Overuse of SMP Only Slows Consolidated Environments At Saturation

For years customers have asked me how many vCPUs they should give to their VMs. The best guidance, "as few as possible", seems too vague to satisfy. It remains the only correct answer, unfortunately. But a recent experiment performed by Bruce Herndon's team sheds some light on this VM sizing question.

In this experiment we ran VMmark against VMs that were configured outside of VMmark specifications. In one case some of the virtual machines were given too few vCPUs and in another they were given too many. Because VMmark's workload is fixed, changing VM sizes does not alter the amount of work performed by the VMs. In other words, the system's score does not depend on the VMs' vCPU count. Until CPU saturation, that is.

Picture 3.png

Notice that the scores are similar between the undersized, right-sized, and over-sized VMs. Up until tile 10 (60 VMs) they are nearly identical. There is a slight difference in processor utilization that begins to impact throughput (score) as the system runs out of CPU. At that point wasted cycles dedicated to unneeded vCPUs negatively impact the system performance. Two points I will call out from this work:

  • Sloppy VI admins who provide too many vCPUs need not worry about performance when their servers are under low load. But performance will suffer when CPU utilization spikes.
  • The penalty of over-sizing VMs gets worse as VMs get larger. Using a 2-way VM is not that bad, but unneeded use of a 4-way VM when one or two processors suffice can cost up to 15% of your system throughput. I presume that unnecessarily using eight vCPUs would be criminal.

Item 3: ESX Has Not Strictly Co-scheduled Since ESX 2.5

I have documented ESX's relaxation of co-scheduling previously (Co-scheduling SMP VMs in VMware ESX Server). But this statement cannot be repeated too frequently: ESX has not strictly co-scheduled virtual machines since version 2.5. This means that ESX can place vCPUs from SMP VMs individually. It is not necessary to wait for physical cores to be available for every vCPU before starting the VM. However, as Item 2 pointed out, this does not give you free license to over-size your VMs. Be frugal with your SMP VMs and assign vCPUs only when you need them.
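
The difference between strict and relaxed co-scheduling can be sketched with a toy model. This is only an illustration of the idea in this item, not ESX's algorithm: the function is hypothetical and ignores everything else the scheduler takes into account.

```python
def vcpus_started(idle_cores, vcpus, strict):
    """Toy model: how many of a VM's vCPUs can start right now."""
    if strict:
        # Strict co-scheduling: all vCPUs start together or not at all.
        return vcpus if idle_cores >= vcpus else 0
    # Relaxed: vCPUs are placed individually on whatever cores are idle.
    return min(idle_cores, vcpus)


# A 4-way VM on a host with only 3 idle cores.
print(vcpus_started(3, 4, strict=True))   # 0
print(vcpus_started(3, 4, strict=False))  # 3
```

Even in the relaxed case, every extra vCPU is one more thing for the scheduler to place, which is why frugality still pays.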

Item 4: The Cell Construct Has Been Eliminated in ESX 4.0

In the performance best practices deck that I give at conferences I talk about the benefits of creating small virtual machines over large ones. In versions of ESX up to ESX 3.5, the scheduler used a construct called a cell that would contain and lock CPU cores. The vCPUs from a single VM could never span a cell. With ESX 3.x's cell size of four, this meant that VMs never spanned multiple four-core sockets. Consider this figure:

http://communities.vmware.com/servlet/JiveServlet/downloadImage/38-4886-6688/Picture+1.png

What this figure shows is that a four-way VM on ESX 3.5 can only be placed in two locations on this hypothetical two-socket configuration. There are 12 combinations for a two-way VM and eight for a uniprocessor VM. The scheduler has more opportunities to optimize VM placement when you provide it with smaller VMs.

In ESX 4 we have eliminated the cell lock so VMs can span multiple sockets, as Item 1 states. Continue to think of this placement problem as a challenge to the scheduler that you can alleviate. By choosing multiple, smaller VMs you free the scheduler to pursue opportunities to optimize performance in consolidated environments.
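
The placement counts quoted for the figure can be reproduced with a few lines of arithmetic. The sketch below treats cores as distinct and a VM's vCPUs as interchangeable, and assumes the hypothetical host from the figure: two cells of four cores each. The function names are illustrative only.

```python
from math import comb


def placements_with_cells(vcpus, cells=2, cores_per_cell=4):
    """Ways to put a VM's vCPUs on distinct cores inside a single cell."""
    return cells * comb(cores_per_cell, vcpus)


def placements_without_cells(vcpus, total_cores=8):
    """Ways to put a VM's vCPUs on distinct cores anywhere on the host."""
    return comb(total_cores, vcpus)


for n in (1, 2, 4):
    print(n, placements_with_cells(n), placements_without_cells(n))
# 1 8 8
# 2 12 28
# 4 2 70
```

The first column of counts matches the figure (8, 12, and 2). The second shows how many more options exist once vCPUs are free to span sockets, and in both cases smaller VMs leave the scheduler more room than a four-way VM confined to a cell.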


I was recently copied on an internal thread discussing a performance tweak for VMware vSphere. The thread discussed gains that can be derived from an adjustment to the CPU scheduler. In ESX 3.5, ESX's cell construct limited vCPU mobility between different sockets. ESX 4.0 has no such limitation, and its aggressive migrations are suboptimal in some cases.

The excerpt from the thread below details how to apply this change in ESX 4 and provides some insight into its impact. This scheduler modification is going to be baked into the first update to ESX 4.

On a 4-socket (or larger) Dunnington (or any non-NUMA) platform, the VMmark score can be further improved by enabling CoschedHandoffLLC. In the console OS, it can be enabled via vsish (available from VMware*debug-tools*.rpm):

vsish -e set /config/Cpu/intOpts/CoschedHandoffLLC 1

I believe that config parameter is also tunable through VC or the VI client (I haven't confirmed this myself).

The degree of improvement depends on the configuration, but in one case the improvement was about 10-20%.

In the default setting, VMmark may suffer many inter-package vCPU migrations, which cause performance degradation. Setting CoschedHandoffLLC reduces the number of inter-package vCPU migrations and recovers the lost performance.

The fix is disabled by default in ESX 4.0 GA but will be enabled by default in ESX 4.0 U1.

Try this out and let me know if you see a significant change on any of your workloads.


My colleague in product management, Praveen Kannan, has been working to extend Perfmon to show some ESX performance counters. This capability is automatically installed with VMware Tools on vSphere 4. But Praveen and I have made a stand-alone version available to those of you who are still on VI3. Download it here to give it a try.

To install, place the file in an appropriately-named directory on any Windows VM on VI3. Double-click the executable, which will self-extract the files into the same directory. Run "install.bat" and you're done.

Once you bring up Perfmon you'll see two new performance objects on your computer: "VM Memory" and "VM Processor". These objects contain counters exposed by ESX that accurately reflect the VM's memory and CPU usage. Here's Perfmon on my test VM after I've installed the tool.

new_counters.png

This makes collection of host stats a breeze. Windows Management Instrumentation (WMI) programs can now easily get access to reliable host statistics. And anyone with access to Perfmon can see their VM's resource usage. Unlike guest-based statistics, the host statistics shown through these counters accurately reflect resource usage in the presence of virtualization overheads and time slicing of VMs.
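
For scripted collection, one option is to wrap the standard Windows typeperf utility, which prints Perfmon counters as CSV. The sketch below is a minimal example under several assumptions: the helper names are hypothetical, and the counter names used in the usage note are guesses that should be checked against what Perfmon lists under "VM Memory" and "VM Processor" on your VM.

```python
import csv
import io
import subprocess


def counter_path(obj, counter, instance=None):
    r"""Build a Perfmon counter path such as \Object(Instance)\Counter."""
    inst = f"({instance})" if instance else ""
    return f"\\{obj}{inst}\\{counter}"


def parse_typeperf_csv(text):
    """Turn typeperf CSV output into a list of {counter_path: value} dicts."""
    rows = [r for r in csv.reader(io.StringIO(text)) if r]
    names = None
    samples = []
    for row in rows:
        if names is None:
            # The header row starts with the PDH-CSV format marker.
            if row[0].startswith("(PDH-CSV"):
                names = row[1:]
            continue
        try:
            values = [float(v) for v in row[1:]]
        except ValueError:
            continue  # status lines such as "Exiting, please wait..."
        if len(values) == len(names):
            samples.append(dict(zip(names, values)))
    return samples


def sample(paths, samples=1):
    """Run typeperf (Windows only) and return the parsed samples."""
    result = subprocess.run(
        ["typeperf", *paths, "-sc", str(samples)],
        capture_output=True, text=True, check=True,
    )
    return parse_typeperf_csv(result.stdout)
```

On a Windows VM with the counters installed, a call such as `sample([counter_path("VM Processor", "% Processor Time", "_Total")])` would return one dictionary per sample, assuming that counter name exists on your system.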

Disclaimer:

This is a pre-release "sneak peek" version. Eventually this tool will be available for download on vmware.com and supported by VMware. But today there is no support for this tool and you're using it "as-is". Use at your own risk and do not contact VMware support for help with this release.
That's VMware's official position on this tool. But feel free to comment here with any ideas about this great new feature.


I recently attended a practice talk for next week's Partner Exchange hosted by Kit Colbert, one of our senior engineers, who is leading a whole bunch of cool efforts around performance. I wanted to "leak" one slide that he showed us, which we'll be touching up for publication. Those of you who are curious about memory counters and want a different take from Memory Performance Analysis and Monitoring may find this interesting.

guest_host_memory.png

Some of this stuff won't make sense outside of Kit's presentation, but let me point out a few things that may help consume the information in this incredible chart:

  • One of the key messages from Kit's presentation is that ESX reports memory with respect to the guest (the VM) and the host. The very top rectangle shows memory stats reported for each VM. The second rectangle shows the single VM's memory stats reported by each host.
  • As can be seen from the above, the consumed memory in the host represents everything in the VM, minus the savings due to page sharing.
  • This graph doesn't yet highlight the difference between ballooned memory and swapped memory from the guest perspective. From the guest's perspective, swapped memory is much more attractive than ballooned memory, as the guest doesn't know that the swapped memory is gone. But it does see the ballooned memory as pinned. ESX is clever enough to deflate the balloon driver, if possible, when the guest starts to access swapped memory to avoid the host's swapping of guest memory.
  • The final rectangle shows memory of all VMs from the host's perspective. Don't pay attention to the reserved and unreserved memory; I'm told those are unnecessary distractions that will be removed.
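
The relationship in the second bullet reduces to simple arithmetic. The sketch below assumes that "everything in the VM" means the guest memory currently backed by the host, called granted here, and uses made-up numbers. The function and parameter names are illustrative and are not taken from Kit's slide.

```python
def consumed_mb(granted_mb, shared_savings_mb):
    """Host memory consumed by a VM after page-sharing savings."""
    if not 0 <= shared_savings_mb <= granted_mb:
        raise ValueError("savings must be between 0 and granted_mb")
    return granted_mb - shared_savings_mb


# Hypothetical VM: 4096 MB backed by the host, 768 MB saved by page sharing.
print(consumed_mb(4096, 768))  # 3328
```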

Kit is going to be in Orlando with me next week to talk about ESX and guest memory management. He's going to explain the difficult process of recovering unused memory from guests to enable over-commitment. Be sure to see him if you're in town!
