Richard A. Brunner

An Office of the CTO Guest Blog
By Richard A. Brunner, Chief Platform Architect, Office of the CTO

 

 

 


Introduction

 

At VMware, we are committed to ensuring that our customers, service providers, and partners have reliable vSphere support for the full gamut of new and existing microprocessor and server technologies in a timely manner. Our goal is to ensure support of new microprocessor generations and servers on the very day that they launch with available vSphere releases. (Because computer engineers like to start counting at zero, we call the microprocessor launch day, "day-zero".) The result of this support is that customers can go to VMware's Certified Compatibility Guide on day-zero for a given microprocessor generation and find a number of servers from our partners already listed there.

 

Since vSphere releases are not naturally aligned with microprocessor launches, a release may have "latent" support for a future microprocessor generation that launches after the release itself in order to ensure day-zero support. The "latent" support will be officially announced on VMware's Certified Compatibility Guide when the microprocessor generation and its supporting servers are actually launched. Typically, vSphere update releases provide support only for existing known features in these new microprocessor generations; major vSphere releases add support for new features. For example:

 

  • vSphere 4.0 update 1 launched in December of 2009, but already had latent support for Intel Xeon 7500 Series Processors that launched in March of 2010;
  • vSphere 4.1 launched in July of 2010 and added x2apic support for Intel Xeon 7500 Series Processors.
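
The distinction between latent and announced support in the examples above comes down to a date comparison. The sketch below is only an illustration: the function name is hypothetical, the day of the month is an assumption (the text gives only months), and it assumes the release already contains support for the processor.

```python
from datetime import date

def support_kind(release_date: date, cpu_launch_date: date) -> str:
    """Classify a release's support for a CPU by comparing dates.

    "latent": the release shipped before the CPU launched, so support is
    present but only announced on the CPU's launch day (day-zero).
    "announced": the CPU had already launched when the release shipped.
    """
    return "latent" if release_date < cpu_launch_date else "announced"

# Months come from the examples above; the day of month is an assumption.
vsphere_40u1 = date(2009, 12, 1)
vsphere_41 = date(2010, 7, 1)
xeon_7500_launch = date(2010, 3, 1)

print(support_kind(vsphere_40u1, xeon_7500_launch))  # latent
print(support_kind(vsphere_41, xeon_7500_launch))    # announced
```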

For our server-vendor partners to list a new microprocessor and server combination on VMware's Certified Compatibility Guide, the microprocessor must first be internally validated and qualified by VMware. VMware has a very thorough process for this validation, which is discussed in detail later in this article. Once VMware's internal qualification completes, our partners use the VMware Workbench Server Certification plug-in to set up and execute a rigorous set of server certification test suites on their particular server. Test results are automatically sent to VMware for review, and if they are correct, the server with the new microprocessor shows up on the guide.

 

Planning

 

The complete validation for day-zero enablement can be a daunting task. For each vSphere release, we conceptually must validate all the combinations of software features, server vendors, microprocessor vendors, and microprocessor generations. (For brevity, I am neglecting the additional complications from multiple I/O vendors and storage array vendors.) Of course, there are simplifying assumptions we use to make the task slightly more manageable. In general, we know that we can test on representative samples of each category to gain sufficient test coverage and so significantly reduce the number of combinations. But that still leaves a large number of combinations to test, so it was clear to us several years back that day-zero enablement would require very careful planning and coordination. In early 2009, my colleagues and I formed an internal server roadmap team that has been meeting weekly since then to plan for new microprocessor launches up to two years in advance. By tracking the microprocessor launches, we are generally able to support the launch of new servers, which are usually aligned with them. In this way we can ensure that we have timely support for the latest and greatest microprocessor and server technologies.
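
To give a feel for the combinatorial problem described above, here is a small sketch of how sampling representative servers shrinks the test matrix. Every name and count in it is a hypothetical placeholder rather than VMware's real test matrix, and microprocessor vendors are folded into the generation list for simplicity.

```python
from itertools import product

# All values below are hypothetical placeholders for illustration only.
releases = ["release-A", "release-B", "release-C", "release-D"]
features = [f"feature-{i}" for i in range(10)]
server_vendors = [f"vendor-{i}" for i in range(8)]
cpu_generations = [f"generation-{i}" for i in range(13)]

# Exhaustive matrix: every release x feature x vendor x generation.
full_matrix = list(product(releases, features, server_vendors, cpu_generations))

# Representative sampling: cover each generation on a small subset of vendors.
sample_vendors = server_vendors[:2]
sampled_matrix = list(product(releases, features, sample_vendors, cpu_generations))

print(len(full_matrix))     # 4 * 10 * 8 * 13 = 4160
print(len(sampled_matrix))  # 4 * 10 * 2 * 13 = 1040
```

Even with these modest made-up numbers, the sampled matrix still holds over a thousand combinations, which is why the planning described here matters.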

 

Planning a few years in advance for the intersection of vSphere releases with new microprocessor generations and their supporting server platforms is challenging. We must first plan at the vSphere release level for which new generations to support. Then we must also plan around the schedule, quantity, and allocation of the prototype platforms that we will receive for development and validation across multiple releases of vSphere. Note that each new microprocessor generation has its own independent timeline for prototype availability, development, validation, and launch. That timeline is usually not aligned with those of other generations that may be launching a few months earlier or later.

 

Our actual development and validation are greatly assisted by AMD, Intel, and our key server-vendor partners, who provide us with very early access to prototype microprocessors and platforms. A fair number of back-up plans are generated and employed because we are subject to the industry norm of microprocessor schedules that shift left and right by several months just at the time when we can least afford it. Fortunately, the planning process is interactive with our key partners; we are constantly sharing and updating each other on our advance roadmaps so that we can effectively plan for these future launches.

 

 

In general, our release planning and validation processes have to account for a large number of new and existing microprocessor generations and servers. Below are some simple statistics for VMware vSphere releases in 2010 that highlight the complexity of the task.

 

  • VMware vSphere releases: 4 -- ESX 3.5u5, vSphere 4.0u1, vSphere 4.0u2, vSphere 4.1
  • Between AMD and Intel, we had to support 45 different microprocessor series over 13 major generations (see the tables below). Microprocessor generations already supported by older releases of vSphere still required validation on the new vSphere releases.
  • Unique server platforms that certified across the above releases and above microprocessor generations: 288.

 

Intel Microprocessor Generations Supported by vSphere 4.x

Generation  Microprocessor Series Based on the Generation
Prescott    Intel Xeon (Prescott) Series
            Intel Xeon MP (Cranford/Potomac) Series
            Intel Dual-Core Xeon (Irwindale) Series
Paxville    Intel Dual-Core Xeon DP (Paxville) Series
            Intel Xeon 70xx Series
CedarMill   Intel Xeon 50xx Series
            Intel Xeon 71xx Series
Merom       Intel Xeon 30xx Series
            Intel Xeon 32xx Series
            Intel Xeon 51xx Series
            Intel Xeon 53xx Series
            Intel Xeon 72xx Series
            Intel Xeon 73xx Series
Penryn      Intel Xeon 31xx Series
            Intel Xeon 33xx Series
            Intel Xeon 52xx Series
            Intel Xeon 54xx Series
            Intel Xeon 74xx Series
Nehalem     Intel Xeon 34xx Lynnfield Series
            Intel Xeon 35xx Series
            Intel Xeon 55xx Series
            Intel Xeon 65xx Series
            Intel Xeon 75xx Series
Westmere    Intel i3/i5 Clarkdale Series
            Intel Xeon 34xx Clarkdale Series
            Intel Xeon 36xx Series
            Intel Xeon 56xx Series

AMD Microprocessor Generations Supported by vSphere 4.x

Generation                Microprocessor Series Based on the Generation
K8-Cx                     AMD Opteron 2xx Rev-C Series
                          AMD Opteron 8xx Rev-C Series
K8-Ex                     AMD Opteron 1xx Rev-E (Dual-Core) Series
                          AMD Opteron 2xx Rev-E (Single Core) Series
                          AMD Opteron 2xx Rev-E (Dual Core) Series
                          AMD Opteron 8xx Rev-E (Single Core) Series
                          AMD Opteron 8xx Rev-E (Dual Core) Series
K8 rev-F (Santa Rosa)     AMD Opteron 12xx Series
                          AMD Opteron 22xx Series
                          AMD Opteron 82xx Series
"Barcelona" & "Shanghai"  AMD Opteron 13xx Series
                          AMD Opteron 23xx Series
                          AMD Opteron 83xx Series
"Istanbul"                AMD Opteron 14xx Series
                          AMD Opteron 24xx Series
                          AMD Opteron 84xx Series
"Lisbon" & "Magny Cours"  AMD Opteron 41xx Series
                          AMD Opteron 61xx Series

 

The day-zero program itself faced its most intense validation period during the first half of 2010. We had been planning since early 2009 for a large number of systems, based on 7 different microprocessor generations, to launch in that window. These microprocessors needed support from VMware ESX 3.5u5 and vSphere 4.0u1, 4.0u2, and 4.1, which were all launching during that window. Each of these microprocessors has different requirements for VMotion, Enhanced VMotion, and our recently introduced Fault Tolerance (FT) feature. Here are the microprocessor generations that we were tracking for day-zero enablement:

 

  • Intel Xeon 3400 ("Clarkdale")
  • Intel Xeon 3600 and Intel Xeon 5600 ("Westmere")
  • Intel Xeon 6500 and Intel Xeon 7500 ("Nehalem-EX")
  • AMD Opteron 6100 ("Magny-Cours") and AMD Opteron 4100 ("Lisbon")

Validation of a New Microprocessor Generation

 

While we have discussed validation of new microprocessor generations in the aggregate, it may be helpful to see the process we apply to each new generation as it arrives at VMware. The goal of the process is to finish the internal development and validation of a new generation on each relevant release of vSphere early enough to allow server certification to complete before the day-zero date of the microprocessor. If we can meet that goal, then our server partners are able to list their new servers on VMware's Certified Compatibility Guide on day-zero.

 

The process accounts for the limited availability and immaturity of new microprocessor prototypes by judiciously routing them to the various internal teams at VMware in a phased approach; serial and parallel deployment within each phase occurs as needed. As each team works with the new prototypes, new or fixed code is checked into the source-code trees of the various supporting vSphere releases under development, and new builds of each are generated. While there are many teams at VMware that are critical to new microprocessor enablement, I only have space to describe a few of them below.

 

  • Monitor Development and Verification: these engineers develop, extend, and validate our virtual machine monitor to support new and existing features on new microprocessor generations. This team also includes a special Enhanced VMotion inter-operability lab.
  • vmkernel ("ESX") Development: two teams of engineers -- one in our Platform "Core" Engineering team and the other in our Continuing Product Division -- that extend the vSphere kernel scheduler and resource manager to support new platform features.
  • I/O Device Driver Engineering: the driver experts in our Ecosystem Engineering team that provide support for new I/O devices found in both the standard microprocessor chipsets and add-in devices.
  • Hardware Enablement Quality Engineering: these engineers are quite honestly the miracle workers when it comes to new platform enablement. With limited hardware in a very limited amount of time, they manage to run many thousands of hours of rigorous internal validation tests to give VMware confidence that server certification can begin.
  • Performance Engineering: This is the team that developed VMware VMmark. They measure and consult on the performance of new releases of vSphere on a wide range of microprocessors and server platforms.
  • Prototype Engineering: this group in our Ecosystem Engineering team plans and manages the shipping, receiving, internal allocation, routing, first boot, upgrades, and repairs to literally hundreds of prototype systems "in flight" at VMware at any given instant.  These guys are in high demand, so we only occasionally let them leave the lab.

Based on our experience over the last few years, we have developed the process around four phases of availability and maturity of new microprocessor and prototype components (see figure below). The timelines for these phases can be described relative to the day-zero date of a given microprocessor generation. Note that, as mentioned earlier, every new microprocessor generation has its own independent timeline that is seldom aligned with any other. (The timeframes discussed below are for a new major generation, such as the introduction of the Intel "Nehalem" generation; the timeframes for minor generational changes, such as the introduction of the Intel "Westmere" generation, are more compressed.)

 

[Figure: the four prototype phases on a timeline relative to the day-zero date]


  • 1st Phase CPU Prototypes: this is when VMware gets the very first samples of a new microprocessor (CPU) generation in very fragile platforms directly from AMD and Intel. This phase starts 10 to 11 months before the day-zero date.
  • 2nd Phase CPU Prototypes: in this phase, VMware receives more mature microprocessor revisions that are adequate for us to finish our development processes. Typically, microprocessors in this phase show up 7 to 8 months before the day-zero date. The same microprocessors also tend to show up a few weeks later in the first OEM prototypes.
  • 3rd Phase OEM Prototypes: our server-vendor partners provide us the first prototypes of actual retail servers that will use the new microprocessor technology. This phase starts 5 to 6.5 months before the day-zero date.
  • 4th Phase OEM Production: this is the final step where VMware validates candidate releases of vSphere on near production-level server platforms. This phase is usually 2 to 3 months before the day-zero date. If we are successful in our final internal testing, the certification window for partners opens soon thereafter.
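
As a rough illustration of how the phase offsets above translate into calendar dates, the sketch below works backward from a day-zero date. The function name, the 30-day month approximation, and the example day-zero date are all assumptions made for illustration; only the month offsets come from the list above.

```python
from datetime import date, timedelta

# (phase name, earliest start, latest start) in months before day-zero,
# taken from the list above.
PHASES = [
    ("1st Phase CPU Prototypes", 11, 10),
    ("2nd Phase CPU Prototypes", 8, 7),
    ("3rd Phase OEM Prototypes", 6.5, 5),
    ("4th Phase OEM Production", 3, 2),
]

DAYS_PER_MONTH = 30  # assumption: a rough average, good enough for planning

def phase_windows(day_zero: date):
    """Return (name, earliest_start, latest_start) for each phase."""
    windows = []
    for name, earliest_months, latest_months in PHASES:
        earliest = day_zero - timedelta(days=round(earliest_months * DAYS_PER_MONTH))
        latest = day_zero - timedelta(days=round(latest_months * DAYS_PER_MONTH))
        windows.append((name, earliest, latest))
    return windows

# Hypothetical day-zero date, chosen only for illustration.
for name, earliest, latest in phase_windows(date(2010, 3, 30)):
    print(f"{name}: starts between {earliest} and {latest}")
```

Because each microprocessor generation has its own day-zero date, running this for several generations at once shows how their phases overlap on the calendar.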

In the 1st phase, we gratefully receive a very small number (three or fewer) of new microprocessor development platforms from the appropriate microprocessor vendor, either AMD or Intel. We immediately route those platforms to our VMware Monitor Verification team so they can begin an intensive multi-week validation using "FrobOS" -- a special-purpose OS that VMware developed to test the interactions of our Virtual Machine Monitor with the vagaries of x86-architectural and implementation-specific behavior. FrobOS is a pretty good silicon bug catcher, especially for the earliest revisions of a new microprocessor generation. As our Monitor Verification team gains confidence in these early platforms, one or two of them are then sent to our Monitor Developers to begin implementing new and advanced features for the microprocessor generation -- as was done for AMD's Rapid Virtualization Indexing (RVI) and Intel's Extended Page Tables (EPT) technologies.

 

Normally in the 2nd phase, the microprocessor development platforms have the required stability and platform features such that they can be routed to the vmkernel and I/O Device Driver Engineering teams for enablement. While the available number of platforms for a given generation can exceed twenty at this phase, the platforms arrive in small groups at staggered intervals forcing us to serially route the first few platforms between VMware teams. The 2nd phase is where our developers implement support for new platform features that are intrinsic to the new microprocessor and supporting chipsets. These are the features that will be present on most retail server-vendor platforms. Examples of these intrinsic features include:

 

  • improved microprocessor core topology detection and scheduling;
  • multi-socket NUMA (Non-Uniform Memory Access) memory affinity;
  • power management;
  • I/O device driver support for integrated chipset features such as SATA, USB, and Networking controllers.

Meanwhile, microprocessors provided during the 2nd phase allow our monitor team to ensure that Enhanced VMotion will work flawlessly on this new generation.

 

At the start of the 3rd phase, a number of the server vendors generously loan us early prototype platforms of their new servers populated with the new microprocessor generation. It is not possible to recognize all of our partners here, but companies such as AMD, Cisco, Dell, Fujitsu, IBM, Intel, HP, and many others have supported VMware in this way. These platforms allow enablement for server-vendor specific features by the vmkernel and I/O Device Driver Engineering teams. The platforms also serve as the workhorses for VMware's Hardware Enablement Quality Engineering (HWE-QE) team to begin rigorous validation and eventual qualification of the new microprocessor generation on the appropriate vSphere releases. It is critical to receive these platforms because often advanced features of a microprocessor generation only exhibit new and surprising bugs when present in an actual retail server platform. As validation uncovers bugs or anomalies, these are filed against the development teams to triage and dispose of as appropriate for the several vSphere releases in development. New builds are subsequently generated, and HWE-QE gets to repeat part or all of its testing again; this cycle may happen many times during the 3rd phase.

 

HWE-QE runs a large battery of directed tests on the platform to exercise each supported feature. One very simple but effective validation method used by HWE-QE is to start, stop, and restart a large number of virtual machines on all the microprocessor cores and threads of the platform. These kinds of tests are repeated non-stop on the platform for several weeks and include virtual machines running many different guest operating systems. This method is a great way to find hidden microprocessor and server instabilities and has repeatedly proven its value in the past.
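
The start, stop, and restart method described above can be pictured with a simple loop. The sketch below is purely a simulation: `SimulatedVM`, `power_cycle_test`, the guest OS names, and the CPU count are hypothetical stand-ins and do not represent VMware's actual test tooling or any real API.

```python
import itertools

class SimulatedVM:
    """Stand-in for a real virtual machine; purely hypothetical."""

    def __init__(self, name, guest_os):
        self.name = name
        self.guest_os = guest_os
        self.running = False
        self.power_cycles = 0

    def start(self):
        self.running = True

    def stop(self):
        self.running = False
        self.power_cycles += 1

def power_cycle_test(vms, iterations):
    """Start and stop every VM repeatedly; return names that misbehave."""
    failures = []
    for _ in range(iterations):
        for vm in vms:
            vm.start()
            if not vm.running:      # VM failed to power on
                failures.append(vm.name)
            vm.stop()
            if vm.running:          # VM failed to power off
                failures.append(vm.name)
    return failures

# One VM per logical CPU, cycling through a few guest OS types (all assumed).
guest_types = itertools.cycle(["linux", "windows", "solaris"])
vms = [SimulatedVM(f"vm-{cpu}", next(guest_types)) for cpu in range(16)]

failures = power_cycle_test(vms, iterations=100)
print(len(failures), vms[0].power_cycles)
```

In a real lab the loop would run for weeks rather than a fixed iteration count, and a failure would point at a possible microprocessor or server instability rather than a bug in the simulation.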

 

One of the last stops for a new microprocessor generation is at the lab of our Performance Engineering team. This team characterizes the performance improvement we can expect to see from a new microprocessor generation. Oftentimes they find performance bottlenecks that require attention in either our code or the microprocessor itself. One of the most critical activities they perform is to run VMware's VMmark benchmark on these prototype systems to ensure that performance expectations have been met. This analysis always happens in the 4th phase and may happen in the 3rd phase if the server vendor platforms are stable enough.

 

During the 4th phase, our server vendors loan us platforms that are very close to the retail production platforms that they will soon launch. In this phase, VMware's HWE-QE team focuses on finishing the rigorous internal qualification of these platforms several weeks in advance of the actual microprocessor launch. The same tests performed in the 3rd phase are run again. Usually during the 4th phase, the various vSphere releases that will support a new microprocessor generation have either already launched or will soon do so. As a result, fixes for very late-breaking bugs found in this phase will likely be deferred to vSphere patch vehicles or later update releases.

 

The output of the 4th phase is the "green light" from VMware to our server-vendor partners to begin server certification. We want the phase to end early enough so that our partners have sufficient time to run the certification suites and get the results sent back to VMware before the microprocessor launch date. This allows day-zero listing of servers using the new microprocessor on VMware's Certified Compatibility Guide.

 

Summary

 

Even by employing key simplifying assumptions, testing the combination of vSphere releases, software features, server vendors, and microprocessor generations is a tremendous planning and validation challenge. It is made more difficult by the natural misalignment of software and hardware schedules. It is also a pretty impressive juggling act to manage and route the large number of microprocessors and servers that are "in flight" at VMware at any given time. But the pay-off for this strategic and tactical complexity is worth the effort because it gives our customers the choice to run the current set of vSphere releases across a broad range of server vendors, microprocessor vendors, and microprocessor generations.

 

I am happy to report that we met those goals in 2010 using the processes described above. The day-zero program in 2010 ensured timely support of 288 new server platforms using 7 different microprocessor generations by four releases of vSphere. Each of these servers was listed on VMware's Certified Compatibility Guide on the very day that the server launched. This accomplishment was made possible by the collaboration between VMware and its microprocessor and server-vendor partners. It demonstrated the ingenuity and perseverance of the internal teams at VMware that are dedicated to the day-zero program.

 

Without breaking any confidences, I can say that the 2011 day-zero program has a similar set of challenges resulting from the planned intersection of multiple vSphere releases, microprocessor generations, and server-vendor platforms. We are already executing on the day-zero plan for AMD Bulldozer, Intel Sandy Bridge, and Intel Westmere-EX microprocessors. So far, the prospects look good. Our partners are doing a fantastic job of loaning us critical prototypes to meet our schedules. And with another year of experience, our engineers are getting even more efficient at day-zero enablement. It appears that we are well on our way to matching the success of the previous year and establishing an important precedent for future microprocessor and server support.