The new source for Scott Drummonds' comments on the virtualization industry can be found at Pivot Point.
My colleague in product management, Praveen Kannan, has been working to extend Perfmon to show some ESX performance counters. This capability is automatically installed with VMware Tools on vSphere 4. But Praveen and I have made a stand-alone version available to those of you that are still on VI3. Download it here to give it a try.
To install, place the file in an appropriately-named directory on any Windows VM on VI3. Double-click the executable, which will self-extract the files into the same directory. Run "install.bat" and you're done.
Once you bring up Perfmon you'll see two new performance objects on your computer: "VM Memory" and "VM Processor". These objects contain counters exposed by ESX that accurately reflect the VM's memory and CPU usage. Here's Perfmon on my test VM after I've installed the tool.
This makes collection of host stats a breeze. Windows Management Instrumentation (WMI) programs can now easily get access to reliable host statistics. And anyone with access to Perfmon can see their VM's resource usage. Unlike guest-based statistics, the host statistics shown through these counters accurately reflect resource usage in the presence of virtualization overheads and time slicing of VMs.
This is a pre-release "sneak peek" version. Eventually this tool will be available for download on vmware.com and supported by VMware. But today there is no support for this tool and you're using it "as-is". Use at your own risk and do not contact VMware support for help with this release.
That's VMware's official position on this tool. But feel free to comment here with any ideas about this great new feature.
A couple of days ago we finally got out one of my favorite papers from our ongoing vSphere launch activities. This paper on ESX memory management, written by Fei Guo in performance engineering, has three graphs that are absolute gems. These graphs show balloon driver memory savings next to throughput numbers for three common benchmarks. The conclusion is inescapable: the balloon driver reclaims memory from over-provisioned VMs with virtually no impact to performance. This is true on every workload save one: Java.
Linux kernel compilation models a common developer environment involving a large number of code compiles. This process is CPU and IO intensive but uses very little memory.
Results of two experiments are shown on this graph: in one memory is reclaimed only through ballooning and in the other memory is reclaimed only through host swapping. The bars show the amount of memory reclaimed by ESX and the line shows the workload performance. The steadily falling green line reveals a predictable deterioration of performance due to host swapping. The red line demonstrates that as the balloon driver inflates, kernel compile performance is unchanged.
Kernel compilation performance remains high with ballooning because this workload needs very little memory and the guest OS can easily take unused pages from the application. Performance falls with swapping because ESX randomly selects virtual machine pages for swapping, whether those pages are in use by the application or not. The guest OS is better at selecting pages for reclamation than ESX is.
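The intuition behind that result can be sketched with a toy model (all numbers invented for illustration): the guest OS hands over pages it knows are idle, while host swapping picks victims at random and inevitably hits pages the workload is actively using.

```python
# Toy model: guest-driven ballooning vs. random host swapping.
# Page counts are illustrative assumptions, not measurements.
import random

TOTAL_PAGES = 10_000
HOT_PAGES = 1_000    # pages the workload actively uses
RECLAIM = 4_000      # pages the host wants back

def hot_pages_lost_balloon() -> int:
    # The guest OS reclaims idle pages first; with 9,000 idle pages
    # available, a 4,000-page balloon never touches the hot set.
    idle = TOTAL_PAGES - HOT_PAGES
    return max(0, RECLAIM - idle)

def hot_pages_lost_swap(seed: int = 0) -> int:
    # ESX swaps randomly chosen pages, hot or not, so roughly
    # RECLAIM * (HOT_PAGES / TOTAL_PAGES) = 400 hot pages get hit.
    random.seed(seed)
    victims = random.sample(range(TOTAL_PAGES), RECLAIM)
    return sum(1 for p in victims if p < HOT_PAGES)

print(hot_pages_lost_balloon())   # 0 hot pages touched by ballooning
print(hot_pages_lost_swap())      # ~400 hot pages swapped to disk
```

Each hot page swapped out costs a disk access on next touch, which is why the swapping line falls steadily while the ballooning line stays flat.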
Oracle's database is best tested against Swingbench, the OLTP load generation tool provided by Oracle. Database workloads utilize all system resources but show a non-linear dependence on memory. Memory can be safely reclaimed from OSes running databases until the cache becomes smaller than needed by the workload. The following figure shows this.
As before, the virtual machine using only ballooning maintains higher performance under memory pressure than the virtual machine whose memory is being swapped away by the host. Performance is constant and shows no negative impact due to ballooning until the balloon encroaches on the SGA. Again, ESX's host swapping randomly selects pages to send to disk which degrades performance even at small swap amounts.
As with kernel compile, the balloon driver safely reclaims memory from over-provisioned VMs with little impact to application performance.
Java provides a special challenge in virtual environments due to the JVM's introduction of a third level of memory management. The balloon driver draws memory from the virtual machine without impacting throughput because the guest OS efficiently claims pages that its processes are not using. But in the case of Java, the guest OS is unaware of how the JVM is using memory and is forced to select memory pages as arbitrarily and inefficiently as ESX's swap routine.
Neither ESX nor the guest OS can efficiently take memory from the JVM without significantly degrading performance. Memory in Java is managed internally by the JVM, and efforts by either the host or the guest to remove pages will hurt Java application performance equally. In these environments it is wise to manually set the JVM's heap size and specify memory reservations for the virtual machine in ESX that account for the JVM, OS, and heap.
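As a sketch of that sizing advice, here's a hypothetical worksheet: fix the JVM heap, then reserve enough VM memory to cover heap plus JVM overhead plus the OS. The overhead figures are illustrative assumptions, not VMware guidance.

```python
# Hypothetical sizing worksheet for a Java VM's memory reservation.
# Overhead numbers below are illustrative assumptions only.

def vm_memory_reservation_mb(heap_mb: int,
                             jvm_overhead_mb: int = 512,
                             os_mb: int = 1024) -> int:
    """Reserve enough that neither ballooning nor host swapping
    ever needs to take pages from the JVM."""
    return heap_mb + jvm_overhead_mb + os_mb

# e.g. a 4 GB heap (-Xmx4096m) calls for roughly a 5.5 GB reservation
print(vm_memory_reservation_mb(4096))  # 5632
```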
Love your balloon driver. Your application owners are always asking for more memory than they need. You can comfortably over-provision memory a bit and rely on ESX and the balloon driver to reclaim what is not in use. Without the balloon driver, ESX is forced to fall back on its last resort for managing memory over-commit: host swapping. And host swapping always decreases performance.
So here is my special recommendation for you: never, ever disable the balloon driver. This forces the host to swap that virtual machine's memory, should that resource become scarce. And where ballooning usually will not hurt performance, swapping always will. If you must protect an application from memory reclamation due to memory over-commitment, use reservations. They make admission control more effective, they self-document the needs of the VM, and they are easily configured.
I spent a great deal of time answering customers' questions about the scheduler. Never have so many questions been asked about such an abstruse component for which so little user influence is possible. But CPU scheduling is central to system performance, so VMware strives to provide as much information on the subject as possible. In this blog entry, I want to point out a few nuggets of information on the CPU scheduler. These four bullets answer 95% of the questions I get asked.
On UMA systems with low load levels, virtual machine performance improves when each virtual CPU (vCPU) is placed on its own socket. This is because providing each vCPU its own socket also gives it the entire cache on that CPU. On page 18 of a recent paper on the scheduler written by Seongbeom Kim, a graph highlights the case where vCPU spreading improves performance.
The X-axis represents different combinations of VM and vCPU counts. SPECjbb is memory intensive and shows great gains with increases in CPU cache. The few cases that show dramatic benefit due to the ESX 4.0 scheduler are benefiting from the distribution of vCPUs across sockets. Very large gains are possible in this somewhat uncommon case.
For years customers have asked me how many vCPUs they should give to their VMs. The best guidance, "as few as possible", seems too vague to satisfy. It remains the only correct answer, unfortunately. But a recent experiment performed by Bruce Herndon's team sheds some light on this VM sizing question.
In this experiment we ran VMmark against VMs that were configured outside of VMmark specifications. In one case some of the virtual machines were given too few vCPUs and in another they were given too many. Because VMmark's workload is fixed, changing VM sizes does not alter the amount of work performed by the VMs. In other words, the system's score does not depend on the VMs' vCPU count. Until CPU saturation, that is.
Notice that the scores are similar between the undersized, right-sized, and over-sized VMs. Up until tile 10 (60 VMs) they are nearly identical. There is a slight difference in processor utilization that begins to impact throughput (score) as the system runs out of CPU. At that point wasted cycles dedicated to unneeded vCPUs negatively impact the system performance. Two points I will call out from this work:
Sloppy VI admins who provide too many vCPUs need not worry about performance when their servers are under low load. But performance will suffer when CPU utilization spikes.
The penalty of over-sizing VMs gets worse as VMs get larger. Using a 2-way VM is not that bad, but unneeded use of a 4-way VM when one or two processors suffice can cost up to 15% of your system throughput. I presume that an unnecessary eight vCPUs would be criminal.
I have documented ESX's relaxation of co-scheduling previously (Co-scheduling SMP VMs in VMware ESX Server). But this statement cannot be repeated too frequently: ESX has not strictly co-scheduled virtual machines since version 2.5. This means that ESX can place vCPUs from SMP VMs individually. It is not necessary to wait for physical cores to be available for every vCPU before starting the VM. However, as Item 3 pointed out, this does not give you free license to over-size your VMs. Be frugal with your SMP VMs and assign vCPUs only when you need them.
In the performance best practices deck that I give at conferences I talk about the benefits of creating small virtual machines over large ones. In versions of ESX up to ESX 3.5, the scheduler used a construct called a cell that would contain and lock CPU cores. The vCPUs from a single VM could never span a cell. With ESX 3.x's cell size of four, this meant that VMs never spanned multiple four-core sockets. Consider this figure:
What this figure shows is that a four-way VM on ESX 3.5 can only be placed in two locations on this hypothetical two-socket configuration. There are 12 combinations for a two-way VM and eight for a uniprocessor VM. The scheduler has more opportunities to optimize VM placement when you provide it with smaller VMs.
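The counts in that figure come from simple combinatorics, which can be checked with a few lines of code (assuming the hypothetical two-socket, four-cores-per-socket host from the figure):

```python
# Count placements for a VM whose vCPUs must all stay inside one
# cell (here, one four-core socket), as in ESX 3.x's cell scheduler.
from math import comb

CELL_SIZE = 4   # cores per cell in ESX 3.x
SOCKETS = 2     # hypothetical two-socket, four-cores-per-socket host

def placements(vcpus: int) -> int:
    # Choose which cores within one cell host the vCPUs,
    # times the number of cells (sockets) to choose from.
    return SOCKETS * comb(CELL_SIZE, vcpus)

print(placements(4), placements(2), placements(1))  # 2 12 8
```

The smaller the VM, the more placement choices the scheduler has to work with.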
In ESX 4 we have eliminated the cell lock so VMs can span multiple sockets, as item one states. Continue to think of this placement problem as a challenge to the scheduler that you can alleviate. By choosing multiple, smaller VMs you free the scheduler to pursue opportunities to optimize performance in consolidated environments.
Just over a week ago I had the privilege of riding along with VMware's Professional Services Organization as they piloted a possible performance offering. We are considering two possible services: one for performance troubleshooting and another for infrastructure optimization. During this trip we piloted the troubleshooting service, focusing on the customer's disappointing experience with SQL Server's performance on vSphere.
If you have read my blog entries (SQL Server Performance Problems Not Due to VMware) or heard me speak, you know that SQL performance is a major focus of my work. SQL Server is the most common source of performance discontent among our customers, yet 100% of the problems I have diagnosed were not due to vSphere. When this customer described the problem, I knew this SQL Server issue was stereotypical of my many engagements:
"We virtualized our environment nearly a year ago and quickly determined that virtualization was not right for our SQL Servers. Performance dropped by 75% and we know this is VMware's fault because we virtualized on much newer hardware on the exact same SAN. We have since moved the SQL instance back to native."
Most professionals in the industry stop here, incorrectly bin this problem as a deficiency of virtualization, and move on with their deployments. But I know that vSphere's abilities with SQL Server are phenomenal, so I expect to make every user happy with their virtual SQL deployment. I start by challenging the assumptions and trust nothing that I have not seen for myself. Here are my first steps on the hunt for the source of the problem:
Instrument the SQL instance that has been moved back to native to profile its resource utilization. Do this by running Perfmon to collect stats on the database's memory, CPU, and disk usage.
Audit the infrastructure and document the SAN configuration. Primarily I will need RAID group and LUN configuration and an itemized list of VMDKs on each VMFS volume.
Use esxtop and vscsiStats to measure resource utilization of important VMs under peak production load.
There are about a dozen other things that I could do here, but my experience in these issues is that I can find 90% of all performance problems with just these three steps. Let me start by showing you the two RAID groups that were most important to the environment. I have greatly simplified the process of estimating these groups' performance, but the rough estimate will serve for this example:
RAID group A: RAID5 using 4 15K disks, 4 x 200 = 800 IOPS
RAID group B: RAID5 using 7 10K disks, 7 x 150 = 1050 IOPS
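The back-of-the-envelope arithmetic can be written out as a sketch; the per-disk IOPS figures (200 for a 15K disk, 150 for a 10K disk) are the same rough estimates used here, not measured values, and the model ignores RAID5 write penalties and array cache.

```python
# Rough IOPS estimates for the two RAID groups described above.
# Per-disk figures are back-of-the-envelope assumptions.

def raid_group_iops(disks: int, iops_per_disk: int) -> int:
    """Crude aggregate throughput: disks x per-disk IOPS.
    Ignores RAID5 write penalty and controller cache."""
    return disks * iops_per_disk

group_a = raid_group_iops(4, 200)   # 800 IOPS
group_b = raid_group_iops(7, 150)   # 1050 IOPS

# Native layout: each SQL instance on its own group.
native_capacity = group_a + group_b      # ~1850 IOPS available

# First virtual layout: both instances share group A.
virtual_capacity = group_a               # 800 IOPS, split two ways

print(group_a, group_b, native_capacity, virtual_capacity)
```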
We found two SQL instances in their environment that were generating significant IO: one that had been moved back to native and one that remained in a virtual machine. By using Perfmon for the native instance and vscsiStats for the virtual one, we documented the following demands during a one-hour window:
In the customer's first implementation of the virtual infrastructure, both SQL Servers, X and Y, were placed on RAID group A. But in the native configuration SQL Server X was placed on RAID group B. This meant that the storage bandwidth of the physical configuration was approximately 1850 IOPS. In the virtual configuration the two databases shared a single 800 IOPS RAID volume.
It does not take a rocket scientist to realize that users are going to complain when a critical SQL Server instance goes from 1050 IOPS to 400. And this was not news to the VI admin on-site, either. What we found as we investigated further was that virtual disks requested by the application owners were used in unexpected and undocumented ways and frequently demanded more throughput than originally estimated. In fact, through vscsiStats analysis (Using vscsiStats for Storage Performance Analysis), my contact and I were able to identify an "unused" VMDK with moderate sequential IO that we immediately recognized as log traffic. Inspection of the application's configuration confirmed this.
Despite the explosion of VMware into the data center we remain the new kid on the block. As soon as performance suffers the first reaction is to blame the new kid. But next time you see a performance problem in your production environment, I urge you to look at the issue as a consolidation challenge, and not a virtualization problem. Follow the best practices you have been using for years and you can correct this problem without needing to call me and my colleagues to town.
Of course, if you want to fly us out to help you correct a specific problem or optimize your design, I promise we will make it worth your while.
Last week Chris Wolf moderated a debate on virtual platform performance between myself and Simon Crosby, CTO of Citrix. A recording of the debate was put online shortly after its conclusion.
Simon and I disagreed on a few issues and demonstrated different strategies in the discussion. My goal in representing the fine efforts of our performance team was to show the audience VMware's commitment to product performance. This commitment is demonstrated through a never-ending series of benchmark publications and continual product improvement. In the years since I joined VMware we have quantified ESX's ability to serve web pages (SPECweb), enable massive numbers of database transactions (TPC-C, with disclaimers), and establish industry leadership in consolidated workloads (VMmark). As we released these and dozens of other numbers, Citrix has remained silent on its own product's performance.
I was pleased that the event's format gave me the opportunity to discuss our accomplishments. My only regret was that I lacked the time to dispense with the most important of several factual inaccuracies from Simon. At one point in the discussion Simon claimed that VMmark is not run by anyone except VMware. In fact, it is closer to the truth to say that VMmark is run by everyone except VMware. A quick view of the VMmark results page will show results from every major server vendor, with no submissions from VMware.
Thanks to the Burton Group and Chris Wolf for letting me participate. It was a pleasure.
I was recently copied on an internal thread discussing a performance tweak for VMware vSphere. The thread discussed gains that can be derived from an adjustment to the CPU scheduler. In ESX 3.5, ESX's cell construct limited vCPU mobility between different sockets. ESX 4.0 has no such limitations and its aggressive migrations are non-optimal in some cases.
This thread details the application of this change in ESX 4 and provides some insight into its impact. This scheduler modification is going to be baked in to the first update to ESX 4.
On a 4-socket (or more) Dunnington (or any non-NUMA) platform, the VMmark score can be further improved by enabling CoschedHandoffLLC. In the console OS, it can be enabled via vsish (available from VMwaredebug-tools.rpm):
vsish -e set /config/Cpu/intOpts/CoschedHandoffLLC 1
I believe that config parameter is also tunable through VC or VI client. (haven't confirmed myself)
The degree of improvement depends on the configurations but in one case, the improvement was about 10 - 20%.
In the default setting, VMmark may suffer many inter-package vcpu migrations, which cause performance degradation. Setting CoschedHandoffLLC reduces the number of inter-package vcpu migrations and recovers the performance loss.
The fix is disabled by default in ESX 4.0 GA but will be enabled by default in ESX 4.0 u1.
Try this out and let me know if you see a significant change on any of your workloads.
There's been no shortage of comments on the Hyper-V video I posted. I made a comment on this action in a VMTN blog entry. Read up and comment here or there.
A few weeks ago our communities' administrators set up an XML aggregation of all blogs in VMware's performance community. In addition to the regular postings coming from VROOM! and me, there are several other members of our performance team that irregularly contribute new content. If you follow the aggregator and its RSS feed then you'll be notified of new performance content as it goes live.
The aggregator can be found at http://www.vmware.com/vmtn/planet/vmware/performance.xml.
Newer processors matter much more to virtualized environments than to physical, un-virtualized ones. The generational improvements haven't just increased raw compute power; they've also reduced the overheads associated with virtualization. This blog entry will describe three key changes that have particularly impacted virtual performance.
In 2008, AMD became the first CPU vendor to produce a hardware memory management unit equipped to support virtualization. They called this technology Rapid Virtualization Indexing (RVI). This year Intel did the same with Extended Page Tables (EPT) on its Xeon 5500 line. Both vendors have been providing the ability to virtualize privileged instructions since 2006, with continually improving results. Consider the following graph showing the latency of one key instruction from Intel:
This operation, VMEXIT, occurs each time the guest exits to the hypervisor. The graph shows the latency (delay) of completing this transition, which represents wait time incurred by the guest. Clearly Intel has made great strides in reducing VMEXIT's wait time from its Netburst parts (Prescott and Cedar Mill) to its Core architecture (Merom and Penryn) and on to its current generation, Core i7 (Nehalem). AMD processors have shown commensurate gains with AMD-V.
The longest pipelines in the x86 world were in Intel's Netburst processors. These processors' pipelines had twice as many stages as their counterparts at AMD and twice as many as the generation of Intel CPUs that followed. The increased pipeline length would have enabled support for 8 GHz silicon, had it arrived. Instead, silicon switching speeds hit a wall at 4 GHz and Intel (and its customers) were forced to suffer the drawbacks of large pipelines.
Large pipelines aren't necessarily a problem for desktop environments, where single threaded applications used to dominate the market. But in the enterprise, application thread counts were larger. Furthermore, consolidation in virtual environments drew thread counts even higher. With more contexts in the processor, the number of pipeline stalls and flushes increased, and performance fell.
Because of decreased efficiency of consolidated workloads on processors with long pipelines, VMware has often recommended that performance-intensive VMs be run on processors no older than 2-3 years. This excludes Intel's Netburst parts. VI3 and vSphere will do a fine job at virtualizing your less-demanding applications on any supported processors. But use newer parts for the applications that hold your highest performance expectations.
A cache is highly effective when it fully contains the software's working set. Even the small amount of code the hypervisor adds will change the working set and reduce the cache hit rate. I've attempted to illustrate this concept with the following simplified view of the relationship between cache hit rates, application working set, and cache sizes:
This graph is based on a model that greatly simplifies working sets and the hypervisor's impact on them. Assuming that ESX increases the working set by 256 KB, this graph shows the difference in cache hit rate due to the contributions of the hypervisor. Notice that with very small caches and very small application working sets, the cache hit rate suffers greatly due to the addition of even 256 KB of virtualization support instructions. And even up to 2 MB, a 10% decrease in cache hit rate can be seen in some applications. With a 256 KB contribution by the kernel, cache hit rates do not change significantly with cache sizes of 4 MB and beyond.
In some cases a 10% improvement in cache hit rate can double application throughput. This means that a doubling of cache size can profoundly affect the performance of virtual applications as compared to native. Given ESX's small contribution to the working set, you can see why we at VMware recommend that customers run their performance-intensive workloads on CPUs with 4 MB caches or larger.
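To see why a modest hit-rate swing has such an outsized effect on throughput, consider a toy average-memory-access-time (AMAT) model; the latency numbers below are illustrative assumptions, not measurements of any particular CPU.

```python
# Toy AMAT model: a 10-point drop in cache hit rate multiplies
# average memory latency several times over. Latencies in cycles
# are illustrative assumptions.

CACHE_LATENCY = 2     # assumed cache access latency
MEMORY_LATENCY = 100  # assumed main-memory access latency

def amat(hit_rate: float) -> float:
    return hit_rate * CACHE_LATENCY + (1 - hit_rate) * MEMORY_LATENCY

# Memory-bound throughput scales roughly with 1 / AMAT, so going
# from a 88% to a 98% hit rate speeds such a workload up severalfold.
speedup = amat(0.88) / amat(0.98)
print(round(speedup, 2))
```

With these assumed latencies the ten-point swing is worth more than a 3x difference in memory-bound throughput, so a doubling of application throughput is well within reach.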
At VMworld Europe 2009 my engineering colleague Chethan Kumar and I presented the results of a six-month investigation into the performance of SQL Server on ESX. Tomorrow (May 12 at 09:00 PDT) we're going to offer an updated version of this session to the general public. If you have any interest in virtualized SQL Server deployments, please register and attend the presentation to discover what we learned in our investigation.
I provided some notes on that presentation in a blog entry (SQL Server Performance Problems Not Due to VMware) right after the show. But the large numbers of attendees and exceptionally high ratings encouraged me to set up this encore session. And since Chethan's research on SQL Server performance tuning has continued, we have some updates to the experimental results.
In tomorrow's webinar we will tell the story of our exploration into persistent rumors of SQL Server performance problems. The search began after VMworld 2008 when I decided to engage every customer with a complaint on SQL Server performance. At the same time Chethan investigated every possible application, operating system, and hypervisor parameter that could impact SQL performance. I talked to dozens of customers and Chethan spent hundreds of hours on this work.
This presentation will detail the results of our investigation and leave its attendees with a clear understanding of SQL performance on VMware. Our conclusions are surprisingly simple and certain to help you get the most out of your virtual infrastructure.
There's a lot of confusion out there on VMware's support for the CPU vendors' virtualization assist technology. VMware has always led the industry with its support for hardware assist. We were the first vendor to support AMD-V and Intel VT-x in 2006, the first to support AMD RVI in 2008, and will be the first to support Intel EPT when vSphere 4 becomes publicly available. These technologies, which we call hardware assist, provide value to the part of ESX we call the monitor.
As we prepare for vSphere's general availability we're generating a lot of documentation to help people get the most out of the new version of ESX. One of my colleagues started a document that details the role of the monitor and how it flexibly uses different hardware assist technologies. I've summarized the default behavior of our monitor in several situations in ESX Monitor Modes. Of course vSphere's users will be able to override these defaults if they want to experiment with their workloads.
I wanted to include a textual summary of the role of the monitor in virtualization but found myself getting bogged down with the writing. So, I thought I'd try something new. Let me know what you think of this short video clip explaining the role of the monitor and how it might leverage hardware assist.
I recently attended a practice talk for next week's Partner Exchange hosted by Kit Colbert, one of our senior engineers, who is leading a whole bunch of cool efforts around performance. I wanted to "leak" one slide that he showed us that we'll be touching up for publication. Some of you that are curious about memory counters and want a different take from Memory Performance Analysis and Monitoring may find this interesting.
Some of this stuff won't make sense outside of Kit's presentation, but let me point out a few things that may help consume the information in this incredible chart:
One of the key messages from Kit's presentation is that ESX reports memory with respect to the guest (the VM) and the host. The very top rectangle shows memory stats reported for each VM. The second rectangle shows the single VM's memory stats reported by each host.
As can be seen from the above, the consumed memory in the host represents everything in the VM, minus the savings due to page sharing.
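In other words (with made-up numbers for illustration):

```python
# A sketch of the accounting described above: host "consumed" memory
# is the memory granted to the VM minus the savings from page sharing.
# All figures below are invented for illustration.

def consumed_mb(granted_mb: int, shared_savings_mb: int) -> int:
    return granted_mb - shared_savings_mb

# A VM granted 4096 MB, of which 512 MB is backed by shared pages,
# consumes 3584 MB of host machine memory.
print(consumed_mb(4096, 512))  # 3584
```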
This graph doesn't yet highlight the difference between ballooned memory and swapped memory from the guest perspective. From the guest's perspective, swapped memory is much more attractive than ballooned memory, as the guest doesn't know that the swapped memory is gone. But it does see the ballooned memory as pinned. ESX is clever enough to deflate the balloon driver, if possible, when the guest starts to access swapped memory to avoid the host's swapping of guest memory.
The final rectangle shows memory of all VMs from the host's perspective. Don't pay attention to the reserved and unreserved memory; I'm told those are unnecessary distractions that will be removed.
Kit is going to be in Orlando with me next week to talk about ESX and guest memory management. He's going to explain the difficult process of recovering unused memory from guests to enable over-commitment. Be sure to see him if you're in town!
Microsoft SQL Server runs at roughly 80% of native on VI3 in most benchmarked environments. In production environments, and under loads that model those conditions, SQL Server runs at 90-95% of native on ESX 3.5. I can say this with confidence despite a large amount of the industry's skepticism because I've spent so much time on SQL Server in the past half year. I'd like to share some of my research on the subject and observations with you.
Two weeks ago my colleague Chethan Kumar and I presented on SQL Server in Cannes, France for VMworld Europe 2009. This presentation was the culmination of six months of investigation that was started at VMworld 2008 in Las Vegas. At that event I heard so many concerns about SQL Server performance that I was resolved to identify the problems. I talked with every customer I could find that claimed that SQL ran at anything less than 70% of native. So many of these contacts claimed that they had measured SQL at 25% of native or worse, that I knew that something was going wrong.
First, let me show you a slide that Chethan presented at the show in Cannes:
Chethan spent three months investigating SQL Server to find out how much he could improve virtual performance from the "out of the box" experience. As this figure details, the sum total of performance improvements was 15%. Here's another break-down of these results:
The only option that we found in ESX to improve virtual performance was static transmit coalescing, which is documented on page four of one of our SPECweb papers. Large pages and SQL's priority boost, which are best practices provided by Microsoft for SQL Server configuration, provide the largest gains in performance.
The key messages that we communicated to our audience were that a properly running SQL Server should run at 80% of native or better, and that in most production cases it can run at performance indistinguishable from native speed. If performance is lagging, there aren't many changes to ESX that will yield any performance gains at all.
This raises the question: "If ESX can't be tuned to double SQL performance, what is causing these reports of terrible SQL Server throughput?" The great majority of the problems are coming from mis-configured storage. But a variety of other items such as poor hardware selection or use of the wrong virtualization software contribute to the confusion, as well. I've been documenting these issues in Best Practices for SQL Server on this community and will continue to update that document as more problems are discovered.
If you have a SQL Server running un-virtualized in your environment, I'd like you to try virtualizing it again. Follow our best practices document and pay close attention to your storage configuration during deployment. I feel confident that once you've setup your environment properly, you're going to like what you see.