
I was recently copied on an internal thread discussing a performance tweak for VMware vSphere: an adjustment to the CPU scheduler.  In ESX 3.5, the cell construct limited vCPU mobility between sockets.  ESX 4.0 has no such limitation, and its more aggressive migrations are non-optimal in some cases.


This thread details the application of this change in ESX 4 and provides some insight into its impact.  This scheduler modification is going to be baked into the first update to ESX 4.



On a four-socket (or larger) Dunnington (or any non-NUMA) platform, the VMmark score can be further improved by enabling CoschedHandoffLLC.  In the console OS, it can be enabled via vsish (available from VMwaredebug-tools.rpm):


vsish -e set /config/Cpu/intOpts/CoschedHandoffLLC 1
I believe that config parameter is also tunable through VC or the VI client, though I haven't confirmed this myself.


The degree of improvement depends on the configuration, but in one case it was about 10-20%.


With the default setting, VMmark may suffer many inter-package vCPU migrations, which degrade performance. Setting CoschedHandoffLLC reduces the number of inter-package vCPU migrations and recovers the lost performance.


The fix is disabled by default in ESX 4.0 GA but will be enabled by default in ESX 4.0 U1.



Try this out and let me know if you see a significant change on any of your workloads.


My Hyper-V Video

Posted by drummonds Jun 10, 2009


There's been no shortage of comments on the Hyper-V video I posted.  I responded to the reaction in a VMTN blog entry.  Read up and comment here or there.







Drink From the Fire Hose

Posted by drummonds Jun 3, 2009

A few weeks ago our communities' administrators set up an XML aggregation of all blogs in VMware's performance community.  In addition to the regular postings from VROOM! and me, several other members of our performance team contribute new content from time to time.  If you follow the aggregator or its RSS feed, you'll be notified of new performance content as it goes live.


The aggregator can be found at



Newer processors matter much more in virtualized environments than in physical, un-virtualized ones.  The generational improvements haven't just increased raw compute power; they've also reduced the overheads associated with virtualization.  This blog entry will describe three key changes that have particularly impacted virtual performance.


Hardware Assist Is Faster

In 2008, AMD became the first CPU vendor to produce a hardware memory management unit equipped to support virtualization.  They called this technology Rapid Virtualization Indexing (RVI).  This year Intel did the same with Extended Page Tables (EPT) on its Xeon 5500 line.  Both vendors have been providing the ability to virtualize privileged instructions since 2006, with continually improving results.  Consider the following graph showing the latency of one key instruction from Intel:



This operation, VMEXIT, occurs each time the guest exits to the kernel.  The graph shows its latency, the delay incurred by the guest while the transition completes.  Clearly Intel has made great strides in reducing VMEXIT's wait time from its Netburst parts (Prescott and Cedar Mill) to its Core architecture (Merom and Penryn) and on to its current generation, Core i7 (Nehalem).  AMD processors have shown commensurate gains with AMD-V.


Pipelines Are Shorter

The longest pipelines in the x86 world were in Intel's Netburst processors.  These processors' pipelines had twice as many stages as their counterparts at AMD and twice as many as the generation of Intel CPUs that followed.  The increased pipeline length would have enabled support for 8 GHz silicon, had it arrived.  Instead, silicon switching speeds hit a wall at 4 GHz and Intel (and its customers) were forced to suffer the drawbacks of long pipelines.


Long pipelines aren't necessarily a problem for desktop environments, where single-threaded applications used to dominate the market.  But in the enterprise, application thread counts were larger.  Furthermore, consolidation in virtual environments drew thread counts even higher.  With more contexts in the processor, the number of pipeline stalls and flushes increased, and performance fell.


Because consolidated workloads run less efficiently on processors with long pipelines, VMware has often recommended that performance-intensive VMs be run on processors no older than two to three years.  This excludes Intel's Netburst parts.  VI3 and vSphere will do a fine job of virtualizing your less-demanding applications on any supported processor.  But use newer parts for the applications that hold your highest performance expectations.


Caches Are Larger

A cache is highly effective when it fully contains the software's working set.  The addition by the hypervisor of even a small amount of code changes the working set and reduces the cache hit rate.  I've attempted to illustrate this concept with the following simplified view of the relationship between cache hit rates, application working set, and cache size:



This graph is based on a model that greatly simplifies working sets and the hypervisor's impact on them.  Assuming that ESX increases the working set by 256 KB, this graph shows the difference in cache hit rate due to the contributions of the hypervisor.  Notice that with very small caches and very small application working sets, the cache hit rate suffers greatly due to the addition of even 256 KB of virtualization support instructions.  And even up to 2 MB, a 10% decrease in cache hit rate can be seen in some applications.  With a 256 KB contribution by the kernel, cache hit rates do not change significantly with cache sizes of 4 MB and beyond.
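To make the shape of that relationship concrete, here is a minimal Python sketch of a similarly simplified model.  The proportional hit-rate function and the 1 MB application working set are illustrative assumptions of mine, not the actual model behind the graph:

```python
def hit_rate(cache_kb, working_set_kb):
    # Toy model: hits scale with the fraction of the working set that
    # fits in cache, saturating at 100% once it fits entirely.
    return min(1.0, cache_kb / working_set_kb)

HYPERVISOR_KB = 256   # assumed hypervisor addition to the working set
APP_WS_KB = 1024      # hypothetical 1 MB application working set

for cache_kb in (512, 1024, 2048, 4096, 8192):
    native = hit_rate(cache_kb, APP_WS_KB)
    virt = hit_rate(cache_kb, APP_WS_KB + HYPERVISOR_KB)
    print(f"{cache_kb:>5} KB cache: native {native:.0%}, "
          f"virtualized {virt:.0%}, loss {native - virt:.0%}")
```

Even in this crude model the pattern appears: the extra 256 KB costs 10-20% in hit rate when the cache is near the application's working-set size, and essentially nothing once the cache is comfortably larger.  With bigger working sets, the loss persists to correspondingly larger caches.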


In some cases a 10% improvement in cache hit rate can double application throughput.  This means that a doubling of cache size can profoundly affect the performance of virtual applications as compared to native.  Given ESX's small contribution to the working set, you can see why we at VMware recommend that customers run their performance-intensive workloads on CPUs with 4 MB caches or larger.
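A quick back-of-the-envelope calculation shows how a modest hit-rate gain can translate into a large speedup for memory-bound code.  The latencies below (10 cycles for a hit, 200 for a miss) are hypothetical round numbers of my choosing, not measured figures:

```python
def amat(hit_rate, hit_cycles=10, miss_cycles=200):
    # Average memory access time in CPU cycles, given assumed latencies.
    return hit_rate * hit_cycles + (1 - hit_rate) * miss_cycles

before = amat(0.85)   # roughly 38.5 cycles per access
after = amat(0.95)    # roughly 19.5 cycles per access
print(f"85% hits: {before:.1f} cycles per access")
print(f"95% hits: {after:.1f} cycles per access")
print(f"speedup for memory-bound code: {before / after:.2f}x")
```

With these assumed latencies, a ten-point hit-rate improvement nearly halves the average access time, which is the mechanism behind the "double the throughput" figure for applications dominated by memory stalls.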