ssetty
VMware Employee

Achieving High Web Throughput Scaling with vSphere 4 on Intel Xeon 5500 series (Nehalem) servers

We just published a SPECweb2005 benchmark score of 62,296 -- the highest result published to date on a virtual configuration. This result was obtained on an HP ProLiant DL380 G6 server running VMware vSphere 4, featuring Intel Xeon 5500 series (Nehalem) processors and Intel 82598EB 10 Gigabit AF network interface cards. While driving network throughput from a single host to just under 30 Gbps, this benchmark score still stands at 85% of the level achieved in native (non-virtualized) execution on an equivalent hardware configuration. These results clearly demonstrate that VMware software works very efficiently with HP systems and Intel processors to provide high performance virtualization solutions that meet the performance and scaling needs of modern data centers.

The benchmark result just published includes the following distinctive characteristics:

  • Use of VMDirectPath for virtualizing network I/O that builds upon Intel's VT-d technology

  • High performance and linear scaling with the addition of virtual machines

  • A highly simplified setup that does not require binding of interrupts to CPUs

  • 85% of native performance while driving ~30 Gbps of network traffic

We will elaborate upon each of these characteristics. We focus first on the use of VMDirectPath for this publication. We will then describe the workload configuration before discussing the remaining three aspects listed above.

Use of VMDirectPath

In VMware vSphere, network I/O can be virtualized using device emulation, paravirtualization, or the VMDirectPath capability. The result we just published differs notably from our previous results in that this time we used the VMDirectPath feature to take advantage of the higher performance it makes possible. To explain why this is the case, let us first describe how the three methods of network I/O virtualization work, and their implications in a customer environment.

Emulation: Under emulation, the hypervisor presents to the guest a virtual device such as an e1000 NIC. A key benefit of the emulation approach is that the guest does not need to be modified, as it already has driver support for the commonly emulated devices. The guest remains unaware of the actual hardware through which the hypervisor conducts the I/O; it simply behaves as though it were running on a physical platform containing the emulated device, even if the actual platform has a radically different type of network hardware for which the guest has no driver available. This flexibility and simplicity comes at some performance cost: each interaction of the guest with the emulated device causes a transition into the hypervisor, which incurs CPU overhead. Even so, we would like to note that improvements in the virtualization technology of modern processors, as well as in the algorithms that VMware employs, have been effective in keeping the overhead from emulation low or tolerable in the majority of customer environments.

Paravirtualization: Under paravirtualization, the operating system in a guest uses a device driver explicitly designed to drive traffic through a virtual device such as vmxnet2 or vmxnet3; these implementations are made available by VMware for nearly all popular guest operating systems. This technique permits the guest to interact with the hypervisor through a send-receive interface designed specifically for efficient mediation by the hypervisor. Compared to emulation, paravirtualization reduces the number of transitions through the hypervisor, reducing latency and CPU usage.

Paravirtualization in combination with VMware NetQueue is a proven high performance network I/O virtualization technique that is applicable and recommended in most customer environments. Note that NetQueue and paravirtualization complement each other in 10 Gigabit Ethernet consolidation environments. Let us briefly expand on NetQueue. Ordinarily, I/O operations from different VMs have to be multiplexed over a single I/O channel in a NIC adapter. VMware's NetQueue capability takes advantage of the multiple I/O channels in NICs such as Intel's 82598EB to organize the traffic from different VMs into separate queues, removing the need for the hypervisor to perform that multiplexing in software.
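To make the idea concrete, here is a minimal, purely illustrative sketch (not VMware code) of NetQueue-style receive demultiplexing: the NIC filters incoming frames into a per-VM queue keyed by the destination MAC of each VM's vNIC, so the hypervisor no longer has to sort a single shared receive queue in software. The MAC addresses and frame format below are made up for illustration.

```python
# Toy model of NetQueue-style receive demultiplexing (illustrative only).
from collections import defaultdict, deque

class MultiQueueNic:
    def __init__(self):
        self.queues = defaultdict(deque)       # one RX queue per registered MAC

    def register_vm(self, mac):
        """Dedicate a hardware RX queue to this VM's vNIC MAC address."""
        self.queues[mac]                       # create the queue lazily

    def receive(self, frame):
        """The NIC steers the frame into the queue for its destination MAC."""
        self.queues[frame["dst_mac"]].append(frame)

nic = MultiQueueNic()
for mac in ("00:50:56:aa:00:01", "00:50:56:aa:00:02"):
    nic.register_vm(mac)

nic.receive({"dst_mac": "00:50:56:aa:00:01", "payload": b"http response"})
print({mac: len(q) for mac, q in nic.queues.items()})
```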

We used paravirtualization in combination with VMware NetQueue in our prior world record SPECweb2005 results, which featured fifteen virtual machines that together handled close to sixteen gigabits per second of web traffic on a single ESX host.

VMDirectPath: VMDirectPath is a more recent technique in vSphere 4 that builds upon the Intel VT-d (Virtualization Technology for Directed I/O) capability engineered into recent Intel processors. The technique allows a guest operating system to access an I/O device directly, bypassing the virtualization layer. This direct path, or pass-through, can improve performance in situations that require driving large amounts of network traffic from a single VM.
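One practical consequence of pass-through is that the guest sees the physical NIC itself rather than a VMware virtual device. As a rough sanity check (assuming the pciutils lspci tool is installed in the guest), a sketch like the following lists the Ethernet devices the guest sees; with VMDirectPath the Intel 82598EB should appear directly, instead of an emulated or vmxnet adapter.

```python
# List the Ethernet controllers visible inside the guest (assumes 'lspci'
# from pciutils is installed). With VMDirectPath, the physical Intel 82598EB
# shows up here rather than a VMware virtual NIC.
import subprocess

out = subprocess.run(["lspci"], capture_output=True, text=True).stdout
for line in out.splitlines():
    if "Ethernet" in line:
        print(line)
```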

We note that most customers whose performance requirements are met by either emulation or paravirtualization would not find the VMDirectPath option compelling, since features such as VMotion, Fault Tolerance, VMsafe, and memory overcommitment rely on the hypervisor-controlled abstraction that properly decouples virtual machines from the hardware infrastructure.

However, for the high bandwidth requirement we targeted in this benchmark test, in which each VM drove close to eight gigabits per second of traffic, it was critical to employ VMDirectPath technology. The vmxnet3 implementation presently faces a potential interrupt handling bottleneck when a guest must deal with a large amount of network traffic, because receive side scaling (RSS) remains to be implemented for certain guest OSes. While such a situation is not frequent, VMDirectPath can certainly help when it arises. In the more common case, where the objective is to satisfy aggregate network I/O requirements of multiple gigabits spread across multiple VMs, paravirtualization is generally sufficient. For the extreme per-VM bandwidth targeted here, however, VMDirectPath provides near-complete avoidance of hypervisor overheads, enabling the peak results we demonstrated in our publication.

We proceed in the next section to describe the workload and configuration details.

Workload

The SPECweb2005 benchmark consists of three workloads: Banking, Ecommerce, and Support, each with different characteristics representing three widespread uses of web servers. Each workload measures the number of simultaneous user sessions a web server can support while still meeting stringent quality-of-service and error-rate requirements. The aggregate metric reported by the SPECweb2005 benchmark is a normalized metric based on the performance scores obtained on all three workloads.
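Purely to illustrate the idea of a normalized aggregate (the official formula and reference values are defined in the SPECweb2005 run rules, not here), the sketch below divides each workload score by a reference value and combines the normalized ratios with a geometric mean. All numbers are made-up placeholders.

```python
# Illustrative only: the idea behind a normalized aggregate metric.
# Reference values and the combining rule are placeholders, not SPEC's
# official definition; see the SPECweb2005 run rules for the real formula.
from math import prod

scores     = {"Banking": 100_000, "Ecommerce": 150_000, "Support": 80_000}  # hypothetical
references = {"Banking": 1_000,   "Ecommerce": 1_500,   "Support": 800}     # placeholders

normalized = [scores[w] / references[w] for w in scores]
aggregate  = prod(normalized) ** (1 / len(normalized))   # geometric mean of ratios
print(round(aggregate))
```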

Benchmark Configuration

In our test configuration, the system under test (SUT) was an HP ProLiant DL380 G6 server with two quad-core Intel Xeon X5570 2.933 GHz processors and 96 GB of memory. The SUT ran VMware vSphere 4, which hosted four virtual machines. Each virtual machine was configured with 4 vCPUs and 21 GB of memory, and each used a separate Intel 82598EB 10 Gigabit AF NIC (configured with VMDirectPath) for client traffic.

We used the 64-bit SuSE Linux Enterprise Server (SLES) 11 release as the guest operating system in all four virtual machines. The Linux kernel in SLES 11 is based on the 2.6.27.x kernel source, which incorporates TX multi-queue support and MSI-X improvements. In conjunction with VMDirectPath, these improvements in the SLES 11 kernel further reduce the interactions between the guest OS and the hypervisor, and thereby help improve performance and scaling.
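A quick way to confirm inside the guest that the NIC is indeed using multiple MSI-X vectors (one per queue) is to look at /proc/interrupts. The sketch below counts the vectors associated with an interface; the interface name eth0 is an assumption and may differ on your system.

```python
# Count the interrupt vectors associated with the NIC inside the guest.
# The interface name 'eth0' is an assumption; adjust for your configuration.
with open("/proc/interrupts") as f:
    vectors = [line for line in f if "eth0" in line]

print(f"{len(vectors)} interrupt vectors associated with eth0")
for line in vectors:
    print(line.rstrip())
```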

The web serving software consisted of the Rock Webserver and the Rock JSP server. The same web serving software was used in native benchmark submissions.

We next describe three distinctive aspects of our SPECweb2005 publication: (1) high performance with linear scaling, (2) a highly simplified setup, and (3) competitiveness of the virtualized system's performance with that of prior native results on equivalent hardware.

High Performance with Linear Scaling

In a consolidated server environment, one can expect multiple virtual machines with high network I/O demands. Although VMDirectPath bypasses the virtualization layer to a large extent for network interactions, we still face a measurable number of guest OS and hypervisor interactions, such as those needed for the hypervisor to vector interrupts to the guests that own each of the physical network adapters. The possibility exists, therefore, that the hypervisor can become a scaling limiter in a multi-VM environment. The scaling data obtained in our tests and charted in Figure 1 removes this concern.

Figure 1 shows the aggregate throughput of 1, 2, 3, and 4 virtual machines for each of the three SPECweb2005 workloads. As depicted in Figure 1, performance scales linearly as we add more VMs.

[Figure 1. Aggregate throughput for 1, 2, 3, and 4 virtual machines across the three SPECweb2005 workloads]
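As a rough way to quantify what "linear" means here, the sketch below computes scaling efficiency from aggregate throughput at each VM count. The throughput values are hypothetical stand-ins consistent with the roughly 30 Gbps total mentioned above, not the actual measurements behind Figure 1.

```python
# Scaling efficiency = (aggregate throughput at N VMs) / (N * throughput at 1 VM).
# Throughput numbers (Gb/s) are made up for illustration.
throughput_gbps = {1: 7.4, 2: 14.7, 3: 22.1, 4: 29.5}

base = throughput_gbps[1]
for n, agg in throughput_gbps.items():
    efficiency = agg / (n * base)
    print(f"{n} VM(s): {agg:5.1f} Gb/s  scaling efficiency = {efficiency:.2f}")
```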

Highly Simplified Setup

A technique commonly employed in SPECweb2005 submissions is to bind device interrupts to specific processors, which maximizes performance by removing the overhead and scaling hurdles caused by unbalanced interrupt loads. Results published on the SPECweb2005 website reveal the complexity of the "interrupt pinning" that is common in native configurations, generally employed to make full use of all the cores in today's multicore processors.

By comparison, our results show that the virtualization approach can dramatically simplify the networking configuration by dividing the load among multiple VMs, each of which is smaller and therefore easier to keep core-efficient.
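For context, the sketch below shows the kind of manual per-IRQ pinning that native setups typically script: writing a CPU bitmask to /proc/irq/<n>/smp_affinity so each NIC queue's interrupts land on a chosen core. The IRQ numbers are placeholders; the point of the virtualized setup described here is that none of this was needed.

```python
# Illustration of the manual IRQ pinning that native configurations often
# script. IRQ numbers are hypothetical; running this requires root privileges.
irq_to_cpu = {24: 0, 25: 1, 26: 2, 27: 3}     # hypothetical IRQ -> CPU core

for irq, cpu in irq_to_cpu.items():
    mask = format(1 << cpu, "x")               # e.g. CPU 2 -> bitmask "4"
    try:
        with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
            f.write(mask)
    except OSError as e:
        print(f"IRQ {irq}: could not set affinity ({e})")
```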

Virtualization Performance

VMware vSphere 4 is designed for high performance. With a number of superior optimizations, even the most I/O intensive applications perform well when deployed on vSphere 4.

The table below compares the performance of the industry-standard SPECweb2005 workload on a virtualized system with the prior native results published on equivalent hardware featuring Intel Xeon X5570 2.933 GHz processors, 96 GB of memory, and Intel 82598EB 10 Gigabit NICs.

As shown in the table, the aggregate performance obtained in the virtualized environment was close to 85% of the scores obtained on equivalent native configurations. For more details concerning the test configuration, tuning, and performance results, please refer to the full disclosure reports published at the SPECweb2005 website.
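For a rough sense of the numbers: taking the published virtualized score of 62,296 and the stated ratio of about 85%, the implied native score works out to roughly 73,000. The sketch below just performs that back-of-the-envelope arithmetic; the exact native scores are in the SPEC full disclosure reports.

```python
# Back-of-the-envelope arithmetic from the figures quoted in this post.
virtual_score = 62_296      # published SPECweb2005 score on vSphere 4
ratio         = 0.85        # stated fraction of native performance

implied_native = virtual_score / ratio
print(f"Implied native score: ~{implied_native:,.0f}")
print(f"Virtualization overhead: ~{(1 - ratio) * 100:.0f}%")
```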

Conclusion

10 Gigabit networks and the increasing number of cores per system pose new challenges in a virtualized server environment. Our SPECweb2005 result shows that VMware, together with its partners Intel and HP, is able to provide innovative virtualization solutions that can, in this instance, achieve a network throughput of 30 Gbps and reach a highly respectable performance level of 85% of the best reported native results on an equivalent physical configuration. In addition, the simplification achieved through such consolidation helps reduce the cost of setting up and administering the software environment.

2 Replies
cskowmh
Enthusiast

Interesting test. Do you have a full write-up of any tuning you did to the SLES 11 VM? I'm sure you didn't use VMI since the hardware solution on the 5500s is considered superior, but I'm interested to see if you did any fiddling with sysctl.conf, or if you tested using the more likely vmxnet3 instead of VMDirectPath.

Edit: to be honest, I would have been much more interested in a scaling test with, say, 1 or 2 vCPUs and 1 to 8 GB of memory. Using 21 GB of memory for a web server, even with a pretty large JVM heap, is pretty extreme.


ssetty
VMware Employee

We did very minimal tuning to the SLES 11 VM. We did not use VMI, as VMware is planning to drop support for VMI in the future. The only significant tuning we used was adjusting the InterruptThrottleRate value when loading the ixgbe driver inside the VM. Note that Intel only supports ixgbe as a loadable module at this time. When I load the ixgbe driver I use tuning options including 'InterruptThrottleRate=1500' and 'RSS=2'. The default throttle rate of 8000 turns out to be quite sub-optimal in the VMDirectPath configuration when running the SPECweb workload. RSS sets the number of receive queues, basically to fan out the interrupt processing to multiple vCPUs. You can check out all the tunings and configuration information in the full disclosure report, if you are interested, at the SPEC website: http://www.spec.org/osg/web2005/results/res2010q1/web2005-20100129-00145.txt
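To give a feel for what InterruptThrottleRate=1500 means at these traffic levels, the sketch below estimates packets per second per VM at roughly 8 Gb/s, assuming standard 1500-byte Ethernet frames (an assumption; SPECweb2005 packet sizes vary), and how many packets each interrupt ends up covering at ITR 1500 versus the default 8000.

```python
# Rough estimate of interrupt coalescing at different InterruptThrottleRate
# settings. Assumes ~8 Gb/s per VM and ~1500-byte frames; treat the results
# as order-of-magnitude numbers only.
per_vm_gbps   = 8.0
frame_bytes   = 1500
packets_per_s = per_vm_gbps * 1e9 / (frame_bytes * 8)

for itr in (8000, 1500):                       # interrupts per second
    print(f"ITR={itr:5d}: ~{packets_per_s / itr:5.0f} packets per interrupt")
```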

We did try out vmxnet3, and we realized some CPU savings when we enabled VMDirectPath instead.

In one of our previous tests we already demonstrated the consolidation of as many as fifteen VMs, each configured with 6 GB of memory and 1 vCPU, to drive 16 Gb/s of network throughput. In fact, by breaking down the serialization points in the web layer, we showed that the aggregate performance of multiple virtual machines running in a virtual environment was higher than the score posted on equivalent native machines that use a single web server stack. Do check out the post: http://blogs.vmware.com/performance/2009/02/vmware-sets-performance-record-with-specweb2005-result.html
