Details on the ESX Server scheduler are commonly requested when I engage customers and partners. People want to know more about how the scheduler works, when SMP should be used, and what the deal is with SMP co-scheduling. This page will answer these questions and others as they arise in the forum or the discussion portion of this page.
In VMware parlance, the monitor is the part of our products that provides a virtual interface to the guest operating systems. The VMkernel is the part of our products that manages interactions with the devices, handles memory allocation, and schedules access to the CPU resources, among other things. This is shown in the following figure.
This document will provide information on one part of the VMkernel: the scheduler.
It is a critical requirement for enterprise deployments that an operating system provide fast and fair access to the underlying resources. As a critical part of this design, the scheduler has undergone countless engineer-years of development to guarantee that this requirement is met. We've now released dozens of papers showing linear scaling of workloads as vCPU count is scaled up within a single VM and VM count is scaled up within a single host. Here are a few such papers that contain supporting data.
The scheduler's ability to scale fairly up to and beyond fully committed CPU resources is no accident. In fact, in a conversation I had with a QA manager I was assured that the VMkernel's scheduler would fairly distribute CPU resources to all VMs at least up to 4x CPU overcommitment. Of course, on a system with the CPU overcommitted by 4x, each VM will run at only 1/4 native speed, but the scheduler keeps every VM at that level of performance rather than running one at 1/8 speed, one at 1/10 speed, and another at 1/4 speed.
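The arithmetic behind the 1/4-speed figure can be sketched as follows. This is only an illustration, not the VMkernel's algorithm: it assumes every vCPU is fully busy and has an equal share, and the function name fair_speed_fraction is made up for this example.

```python
from typing import List


def fair_speed_fraction(physical_cores: int, vcpus_per_vm: List[int]) -> float:
    """Fraction of native speed each busy vCPU gets under equal sharing.

    Assumption (not from the VMkernel): all vCPUs are 100% busy and
    carry equal shares, so capacity is divided evenly among them.
    """
    demand = sum(vcpus_per_vm)
    if physical_cores <= 0 or demand <= 0:
        raise ValueError("need at least one core and one vCPU")
    # A vCPU can never run faster than native speed, hence the cap at 1.0.
    return min(1.0, physical_cores / demand)


# Sixteen 1-way VMs on four cores: 4x overcommitment.
print(fair_speed_fraction(4, [1] * 16))      # 0.25
# A mix of 1-way, 2-way, and 4-way VMs on four cores: 2x overcommitment.
print(fair_speed_fraction(4, [1, 1, 2, 4]))  # 0.5
```

The point of the sketch is that a fair scheduler gives every VM the same fraction; an unfair one would show different fractions for VMs with the same entitlement.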
As ESX Server supports uniprocessor (UP) and symmetric multiprocessor (SMP) VMs, the fair-and-fast requirement for the scheduler must be upheld in the presence of concurrently executing UP and SMP VMs. Internal testing of this requirement shows fair scheduling even in the presence of concurrently executing 1-way, 2-way, and 4-way VMs.
In fact, the ability to fairly execute under such environments is a very tricky problem for a scheduler. We've run analysis on competitors' products and found that the ability to fairly balance differently-sized VMs is something of which ESX Server alone is capable. Stay tuned in the coming months as we back this claim up with performance data.
One construct that assists the scheduler in optimally placing VMs on a heavily utilized system is a cell. A cell is a logical grouping of a subset of CPU cores in the system. In ESX 3 versions the cell size is four cores. Because cells are statically assigned to physical cores, each four-core processor is in exactly one cell. When only dual-core processors are present, a cell spans two sockets. The most important thing to know about cells is the following:
A VM cannot span more than one cell.
This means that four-way VMs run on only one socket at a time in systems with quad-core CPUs. For this case, the number of options presented to the scheduler is equal to the number of sockets. In future versions of ESX we plan to increase the cell size to eight. In some cases (such as systems with hexa-core CPUs) a modification of the cell size can improve performance. See KB article 1007361 for more information.
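To make the effect of the cell constraint concrete, here is a rough counting sketch. It assumes that a placement is simply a set of cores inside one cell and that cells tile the host evenly; the function name placements is hypothetical and nothing here reflects how the scheduler actually enumerates its options.

```python
from math import comb
from typing import Optional


def placements(total_cores: int, vcpus: int, cell_size: Optional[int] = None) -> int:
    """Count the distinct sets of cores that could host all vCPUs of one VM.

    cell_size=None means no cell constraint. Otherwise the VM must fit
    entirely inside a single cell (assumed model, for illustration only).
    """
    if cell_size is None:
        return comb(total_cores, vcpus)
    if total_cores % cell_size != 0:
        raise ValueError("this sketch assumes cells tile the host evenly")
    cells = total_cores // cell_size
    # comb() returns 0 when vcpus > cell_size: the VM cannot span cells.
    return cells * comb(cell_size, vcpus)


print(placements(16, 4, cell_size=4))  # 4: one option per quad-core socket
print(placements(16, 4))               # 1820 without any cell constraint
print(placements(16, 8, cell_size=4))  # 0: an 8-way VM does not fit in a cell
```

With a cell size of four on a 16-core host of quad-core sockets, the count for a four-way VM equals the number of sockets, which matches the statement above.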
When and if to use SMP is a common question from VMware users. The simple answer to this is to only use SMP when needed. Why only use SMP when needed? There are two reasons:
Back in the days of ESX Server 2.5, SMP VMs had to have their vCPUs co-scheduled at the same instant to begin running. Because only 2-way VMs were supported at this time, that meant that two CPU cores had to be available simultaneously to launch a 2-way VM. On a server with a total of only two cores, this meant that the VM could not be launched concurrently with any other process on the server. This would include the service console, the web interface, or any other process.
This requirement was loosened in ESX Server 3.0 through a process called relaxed co-scheduling. Effectively, SMP VMs can have their vCPUs scheduled at slightly different times, and idle vCPUs do not necessarily have to be scheduled concurrently with running vCPUs. More details on this are available in the Co-scheduling SMP VMs in VMware ESX Server page.
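The difference between the two models can be sketched with a toy admission check. This is a deliberately simplified model and not the actual ESX algorithm: it ignores timing skew entirely and only captures the idea that strict co-scheduling needs a free core for every vCPU, while the relaxed model needs one only for each non-idle vCPU. Both function names are invented for this example.

```python
from typing import List


def can_start_strict(free_cores: int, vcpus: int) -> bool:
    """Strict co-scheduling (ESX Server 2.5 style, simplified):
    every vCPU needs its own free core at the same instant."""
    return free_cores >= vcpus


def can_start_relaxed(free_cores: int, vcpu_idle: List[bool]) -> bool:
    """Relaxed co-scheduling (simplified assumption): only vCPUs that
    are not idle need a free core right now."""
    runnable = sum(1 for idle in vcpu_idle if not idle)
    return free_cores >= runnable


# A 2-way VM on a two-core server where one core is busy with another process.
print(can_start_strict(free_cores=1, vcpus=2))                   # False
# Same situation, but one of the VM's vCPUs is idle.
print(can_start_relaxed(free_cores=1, vcpu_idle=[False, True]))  # True
```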
Support for non-uniform memory access (NUMA) architectures was introduced in ESX Server 2. This meant that the scheduler became aware that memory was not uniform across each CPU. Each CPU node had access to its own local memory and a larger pool of remote memory (which was divided up as local memory for the other CPU nodes). Access to local memory is much faster than access to remote memory, so the scheduler should favor placing processes on the nodes that hold those processes' memory.
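The placement preference described above can be sketched as a simple selection rule. This is an assumed heuristic for illustration only: the function pick_numa_node, its parameters, and the tie-breaking rule are hypothetical and are not taken from the ESX Server scheduler.

```python
from typing import Dict, Optional


def pick_numa_node(
    memory_mb_by_node: Dict[int, int],
    free_cores_by_node: Dict[int, int],
    vcpus: int,
) -> Optional[int]:
    """Pick the node that holds the most of the VM's memory among the
    nodes with enough free cores. Returns None if no node can fit the VM.

    Ties are broken in favor of the lowest node id (arbitrary choice).
    """
    candidates = [n for n, free in free_cores_by_node.items() if free >= vcpus]
    if not candidates:
        return None
    return max(candidates, key=lambda n: (memory_mb_by_node.get(n, 0), -n))


# The VM has 1 GB on node 0 and 3 GB on node 1; both nodes have free cores.
print(pick_numa_node({0: 1024, 1: 3072}, {0: 4, 1: 4}, vcpus=2))  # 1
# Node 1 is too busy, so the VM lands on node 0 and pays for remote access.
print(pick_numa_node({0: 1024, 1: 3072}, {0: 4, 1: 1}, vcpus=2))  # 0
```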
Subsequent generations of ESX Server continued to optimize for the use of NUMA memory. This included placement of vCPUs next to needed memory and startup of VMs at NUMA nodes with resources available for execution. All of this is transparently handled by the scheduler but it should be noted that the newer your version of ESX Server, the better its NUMA scheduling is.
Scott,
Thanks for this document. I think it will be very helpful over time. Keep up the good work!
KLC
Ken Cline
Technical Director, Virtualization
TVAR Solutions, A Wells Landers Group Company
VMware Communities User Moderator
Hi Scott,
Can you please update this docnote to explain how the cell system works in ESX4?
Also, this OFFICIAL VMWARE document, http://www.vmware.com/pdf/tips_tricks_infrastructure_services.pdf, states on page 4 that there are 1820 ways of scheduling a 4 vCPU workload on a 16-core ESX3 system. Can you please confirm that this is patently incorrect?
Regards,
Alex
The count of 1820 possibilities considers individual vCPU mappings to specific cores. This means that a system with only two cores has two ways to place a 2-way VM. But there is no performance difference in these two positions so I eliminate the second choice, which flips the vCPU to CPU mapping. I think that both documents are correct.
I'll update this document for ESX 4 once we release our vSphere scheduler document. In short, there is no longer a cell with VMware vSphere.
Scott
More information on my communities blog and on Twitter:
Hi Scott,
Do you have an ETA on the vSphere scheduler document? Something definitive that shows the benefits of vSphere would be very much appreciated.
Many thanks,
Kevin
Two weeks.
Where can I find information on how the VMkernel behaves when VMs are assigned more resources than one physical ESX host has? For example, the cluster has two ESX hosts, each with a 2 GHz quad-core CPU, and you reserve 10 GHz for a VM in the cluster.
A virtual machine can only use 100% of the number of vCPUs provided to it. Increasing limits above this number, if this is even possible, would not provide a performance gain.
Scott
So it is not possible for a virtual machine to have more CPU resources than the physical ESX server on which the VM resides?