VMware

Virtual Performance

Scott Drummonds works in a variety of performance areas at VMware: VDI, application best practices, competitive analysis, customer performance investigations, and outward bound communications. This blog will detail some of my musings on these subjects.

6 Posts tagged with the vmmark tag

I spend a great deal of time answering customers' questions about the scheduler. Never have so many questions been asked about such an abstruse component over which so little user influence is possible. But CPU scheduling is central to system performance, so VMware strives to provide as much information on the subject as possible. In this blog entry, I want to point out a few nuggets of information on the CPU scheduler. These four items answer 95% of the questions I get asked.

Item 1: ESX 4's Scheduler Better Uses Caches Across Sockets

On UMA systems with low load levels, virtual machine performance improves when each virtual CPU (vCPU) is placed on its own socket. This is because providing each vCPU its own socket also gives it the entire cache on that CPU. On page 18 of a recent paper on the scheduler written by Seongbeom Kim, a graph highlights the case where vCPU spreading improves performance.

Picture 2.png

The X-axis represents different combinations of VM and vCPU counts. SPECjbb is memory intensive and shows great gains with increases in CPU cache. The few cases that show dramatic benefit due to the ESX 4.0 scheduler are benefiting from the distribution of vCPUs across sockets. Very large gains are possible in this somewhat uncommon case.

Item 2: Overuse of SMP Only Slows Consolidated Environments At Saturation

For years customers have asked me how many vCPUs they should give to their VMs. The best guidance, "as few as possible", seems too vague to satisfy. It remains the only correct answer, unfortunately. But a recent experiment performed by Bruce Herndon's team sheds some light on this VM sizing question.

In this experiment we ran VMmark against VMs that were configured outside of VMmark specifications. In one case some of the virtual machines were given too few vCPUs and in another they were given too many. Because VMmark's workload is fixed, changing VM sizes does not alter the amount of work performed by the VMs. In other words, the system's score does not depend on the VMs' vCPU count. Until CPU saturation, that is.

Picture 3.png

Notice that the scores are similar between the undersized, right-sized, and over-sized VMs. Up until tile 10 (60 VMs) they are nearly identical. There is a slight difference in processor utilization that begins to impact throughput (score) as the system runs out of CPU. At that point wasted cycles dedicated to unneeded vCPUs negatively impact the system performance. Two points I will call out from this work:

  • Sloppy VI admins that provide too many vCPUs need not worry about performance when their servers are under low load. But performance will suffer when CPU utilization spikes.
  • The penalty of over-sizing VMs gets worse as VMs get larger. Using a 2-way VM is not that bad, but unneeded use of a 4-way VM when one or two processors suffice can cost up to 15% of your system throughput. I presume that unnecessarily using eight vCPUs would be criminal.

Item 3: ESX Has Not Strictly Co-scheduled Since ESX 2.5

I have documented ESX's relaxation of co-scheduling previously (Co-scheduling SMP VMs in VMware ESX Server). But this statement cannot be repeated too frequently: ESX has not strictly co-scheduled virtual machines since version 2.5. This means that ESX can place vCPUs from SMP VMs individually. It is not necessary to wait for physical cores to be available for every vCPU before starting the VM. However, as Item 2 pointed out, this does not give you free license to over-size your VMs. Be frugal with your SMP VMs and assign vCPUs only when you need them.
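To make the difference concrete, here is a minimal Python sketch of the two policies. It is a toy model and not the ESX scheduler: the function names are mine, and it ignores any limits a real scheduler places on how far sibling vCPUs may drift apart.

```python
def can_start_strict(free_cores, vcpus):
    """Strict co-scheduling: the VM runs only if every vCPU gets a core at once."""
    return free_cores >= vcpus


def vcpus_runnable_relaxed(free_cores, vcpus):
    """Relaxed co-scheduling (simplified): vCPUs are placed individually,
    so as many as there are free cores can make progress."""
    return min(free_cores, vcpus)


# Hypothetical situation: a 4-way VM arrives while only 3 cores are free.
print(can_start_strict(free_cores=3, vcpus=4))        # strict policy must wait
print(vcpus_runnable_relaxed(free_cores=3, vcpus=4))  # relaxed policy runs some vCPUs
```

Under the strict rule the 4-way VM waits for a fourth core. Under the relaxed rule three of its vCPUs can run right away, which is why over-sized VMs no longer stall outright but still consume cycles.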

Item 4: The Cell Construct Has Been Eliminated in ESX 4.0

In the performance best practices deck that I give at conferences I talk about the benefits of creating small virtual machines over large ones. In versions of ESX up to ESX 3.5, the scheduler used a construct called a cell that would contain and lock CPU cores. The vCPUs from a single VM could never span a cell. With ESX 3.x's cell size of four, this meant that VMs never spanned multiple four-core sockets. Consider this figure:

http://communities.vmware.com/servlet/JiveServlet/downloadImage/38-4886-6688/Picture+1.png

What this figure shows is that a four-way VM on ESX 3.5 can only be placed in two locations on this hypothetical two-socket configuration. There are 12 combinations for a two-way VM and eight for a uniprocessor VM. The scheduler has more opportunities to optimize VM placement when you provide it with smaller VMs.

In ESX 4 we have eliminated the cell lock so VMs can span multiple sockets, as item one states. Continue to think of this placement problem as a challenge to the scheduler that you can alleviate. By choosing multiple, smaller VMs you free the scheduler to pursue opportunities to optimize performance in consolidated environments.
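The placement counts quoted for the two-socket figure can be checked with a short Python sketch. It assumes a placement is simply a set of distinct cores and that, with cells, all of a VM's vCPUs must sit inside one cell. The function name and the "no cells" comparison are my own illustration, not scheduler code.

```python
from itertools import combinations


def count_placements(num_cores, vcpus, cell_size=None):
    """Count the distinct core sets that can host a VM with `vcpus` vCPUs.

    With cell_size set, all vCPUs must fall inside a single cell.
    With cell_size=None, any set of cores is allowed.
    """
    cores = list(range(num_cores))
    if cell_size is None:
        cells = [cores]
    else:
        cells = [cores[i:i + cell_size] for i in range(0, num_cores, cell_size)]
    return sum(1 for cell in cells for _ in combinations(cell, vcpus))


# Two four-core sockets, as in the hypothetical figure.
for vcpus in (4, 2, 1):
    with_cells = count_placements(8, vcpus, cell_size=4)
    without_cells = count_placements(8, vcpus)
    print(f"{vcpus}-vCPU VM: {with_cells} placements with cells, "
          f"{without_cells} without")
```

With a cell size of four this reproduces the 2, 12, and 8 placements described above. Dropping the cell restriction raises the four-way count to 70 and the two-way count to 28 under the same counting assumption, and smaller VMs still leave the scheduler the most freedom relative to their size.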

2 Comments Permalink

Last week Chris Wolf moderated a debate on virtual platform performance between myself and Simon Crosby, CTO of Citrix. A recording of the debate was put online shortly after its conclusion.

Simon and I disagreed on a few issues and demonstrated different strategies in the discussion. My goal in representing the fine efforts of our performance team was to show the audience VMware's commitment to product performance. This commitment is demonstrated through a never-ending series of benchmark publications and continual product improvement. In the years since I joined VMware we have quantified ESX's ability to serve web pages (SPECweb), enable massive numbers of database transactions (TPC-C, with disclaimers), and establish industry leadership in consolidated workloads (VMmark). As we released these and dozens of other numbers, Citrix has remained silent on its own product's performance.

I was pleased that the event's format gave me the opportunity to discuss our accomplishments. My only regret was that I lacked the time to dispense with the most important of several factual inaccuracies from Simon. At one point in the discussion Simon claimed that VMmark is not run by anyone except VMware. In fact, it is closer to the truth to say that VMmark is run by everyone except VMware. A quick view of the VMmark results page will show results from every major server vendor, with no submissions from VMware.

Thanks to the Burton Group and Chris Wolf for letting me participate. It was a pleasure.

0 Comments Permalink

I was recently copied on an internal thread discussing a performance tweak for VMware vSphere. The thread discussed gains that can be derived from an adjustment to the CPU scheduler. In ESX 3.5, ESX's cell construct limited vCPU mobility between different sockets. ESX 4.0 has no such limitations and its aggressive migrations are non-optimal in some cases.

This thread details the application of this change in ESX 4 and provides some insight into its impact. This scheduler modification is going to be baked in to the first update to ESX 4.

On a 4-socket (or larger) Dunnington (or any non-NUMA) platform, the VMmark score can be further improved by enabling CoschedHandoffLLC. In the console OS, it can be enabled via vsish (available from VMware*debug-tools*.rpm):

vsish -e set /config/Cpu/intOpts/CoschedHandoffLLC 1

I believe that config parameter is also tunable through VC or VI client. (I haven't confirmed this myself.)

The degree of improvement depends on the configurations but in one case, the improvement was about 10 - 20%.

With the default setting, VMmark may suffer many inter-package vCPU migrations, which cause performance degradation. Setting CoschedHandoffLLC reduces the number of inter-package vCPU migrations and recovers the lost performance.

The fix is disabled by default in ESX 4.0 GA but will be enabled by default in ESX 4.0 u1.

Try this out and let me know if you see a significant change on any of your workloads.

0 Comments Permalink

It's been about 10 days since I posted the YouTube video showing Hyper-V's stability problems in consolidated environments. I immediately received a lot of questions about the configuration that I answered to the best of my ability in my "Video on Hyper-V Crashes" blog entry. Many respondents were not surprised by stability problems with a first-generation product and some people requested more detail on this issue for further discussion. But there were too many comments to address them all.

One of the more interesting emails I received pointed out that it is unreasonable to blame Hyper-V for the collapse of these very large and very busy websites. Hyper-V's stability issues would bring down individual VMs or small groups when the parent partition blue screened. I think that this is a reasonable observation, so it's worth including here. I can't say that Hyper-V was responsible for the MSDN and TechNet crashes. That would be for Microsoft to say, when and if they choose to expose the issue behind the outage.

Lastly, all comments come from people that fall into one of two categories: one camp thinks the video captures are bogus and the other believes they're based on a real, reasonable, repeatable workload. I'm not going to try and move you from one camp to the other.

It is clear that a small, vocal, and surprisingly profane number of you think that I made this whole thing up. The premise of this group appears to be that Microsoft wouldn't make a product that a customer could crash under normal conditions. If this is your reasoning then no video, discussion or demonstration is going to change your mind. I'll let everyone else make their decisions based on Microsoft's track record and his or her experience with Microsoft products.

Update: 5/15/09

The team responsible for the research has decided to post details: Setting the Record Straight on the Hyper-V Video

0 Comments Permalink

Video on Hyper-V Crashes

Posted by drummonds VMware May 15, 2009

Since I posted the YouTube video showing Hyper-V blue screens last Friday I've received a lot of comments, questions, compliments and complaints. The video and descriptive text have raised more questions than answers, so here are a few details to help fill out the story.

  • The workload was not technically VMmark. There are two reasons for this:
    • VMmark's run rules specify that the VMs must be configured with a single virtual disk. Because this configuration can't make use of Hyper-V's paravirtualized SCSI driver, which requires a second virtual disk, the run rules were violated to make Hyper-V produce its best results.
    • The vendors that provided requirements for VMmark included use of SMP Linux guests. Hyper-V's lack of support for these configurations means that it is unable to run VMmark according to the rules. Those rules were ignored by the test team and the ESX and Hyper-V tests were run with uniprocessor Linux guests so that Hyper-V was able to produce some number.
  • The server ran 15 tiles* when ESX was installed. So, the hardware is good.
  • The server successfully ran 10 tiles* when Hyper-V was installed, although at a much higher CPU utilization and lower throughput than ESX. The server seems to run Hyper-V correctly.
  • The 11-tile* run was tried many, many times. Hyper-V was unable to run 11 tiles without guest blue screens or the parent partition crashing and bringing down the server.

(*) As detailed in the first bullet, these aren't real "tiles". They have been dumbed down (Linux SMP) and reconfigured (extra virtual disk) to work around Hyper-V limitations.

I'm hoping to convince the people responsible for the test to shed their anonymity and come out with an official paper. I'll provide those details as soon as I can get them.

Update: 5/15/09

The team responsible for the research has posted details of the experiment. Read more at Setting the Record Straight on the Hyper-V Video.

0 Comments Permalink

DPM Power/Performance Video

Posted by drummonds VMware Nov 6, 2008

Back in September the performance team here at VMware embarked on a project to measure power savings as a result of using VI3's distributed power management (DPM). This feature, experimentally supported in VI3 with full support planned for the next release, leverages DRS to consolidate idle and lightly-loaded VMs onto as few servers as possible. Once the workload has been consolidated to the bare minimum hardware required, spare servers are powered down. The end result is flexible performance from automated load balancing and a halving of total power usage.

The experiment that we performed was based on a workload derived from VMmark. In fact, it was precisely the VMmark workload. But the execution of the test against a cluster of systems makes the results invalid for comparison against other systems. VMmark run rules require that the test be run against VMs on a single server.


We started the test with 13 tiles worth of VMs (108 VMs in all) on the DRS cluster. With all of these VMs idle, DPM consolidated them to a single host and turned off three servers. As the load was applied to the VMs at 9:00 AM and driven through an eight-hour workday, DRS and DPM powered on servers and balanced load, as needed. When the day ended at 5:00 PM, the load was again consolidated and servers were powered down. The video we shot includes power meters of the systems under test and screenshots of activity induced by DRS and DPM.
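The consolidate-then-power-down behavior can be pictured with a small Python sketch. This is only an illustration using first-fit-decreasing bin packing. It is not how DRS or DPM actually decide placement, and the per-VM load figures and host capacity below are hypothetical numbers chosen to match the one-host-idle, four-hosts-busy story.

```python
def consolidate(vm_loads, host_capacity, num_hosts):
    """Pack VM loads onto as few hosts as possible (first-fit decreasing).

    Returns the summed load of each host that must stay powered on.
    """
    hosts = []  # hosts[i] is the load currently placed on powered-on host i
    for load in sorted(vm_loads, reverse=True):
        if load > host_capacity:
            raise ValueError("a single VM exceeds one host's capacity")
        for i, used in enumerate(hosts):
            if used + load <= host_capacity:
                hosts[i] += load
                break
        else:
            # No powered-on host has room, so power on another one.
            if len(hosts) == num_hosts:
                raise ValueError("cluster cannot hold the offered load")
            hosts.append(load)
    return hosts


# Hypothetical units: each host offers 1000 units of CPU capacity.
idle = consolidate([5] * 108, host_capacity=1000, num_hosts=4)
busy = consolidate([30] * 108, host_capacity=1000, num_hosts=4)
print(len(idle), "host(s) needed when idle;", 4 - len(idle), "can be powered off")
print(len(busy), "host(s) needed under load")
```

With these made-up loads the idle VMs fit on one host, leaving three to power off, and the loaded VMs need all four hosts again.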


Check out the video on YouTube and let me know what you think. I'm considering recording some of the other amazing things that we're doing with our products and would love your feedback on what you'd like to see.

2 Comments 0 References Permalink
