Re: GPU Virtualization

ruready511 · ‎01-26-2011

To whom it may concern,

I'm looking to develop a system to virtualize the client desktop machines at my company. The problem with my particular virtualization project is client machine load - these are very heavy machines. Our standard desktop applications include Adobe software as well as Autodesk's Revit Architecture application and several other pieces of Autodesk's software lineup. Revit in particular requires Direct3D rendering capabilities.

I am looking to NVIDIA to provide the GPUs that will exist in the host machines. More specifically, NVIDIA's Quadro FX lineup of either the 3800, 4800, or 5800 GPUs (possibly the new 6000 series, or the coming 7000 series GPUs if needed). I would like to make sure that the drivers that exist in either ESX 4.x or in the View Manager (if any) can effectively pool the GPU resources and present them to the VMs. This step is critical, because a one-to-one mapping of VM to GPU is not going to work. Doing so would severely limit my VM density and thus cause the project to grow far beyond expected costs.

I know this should be possible if I specify an Intel processor with VT-x support and a processor and motherboard combination that support VT-d (for Directed IO). Parallels was able to dedicate one GPU per VM per host, but this limits my VM density on the hosts to the number of PCI Express slots available. I don't really like Parallels' solution to the problem - and we already have our entire server infrastructure running on VMware.

So, to restate the question in a nutshell: can VMware's View 4.5, or ESX 4.x, pool the GPU resources that exist on the host and present them to the VMs to allow for Direct3D rendering?

I would love to hear from VMware on this issue - but any input is more than welcome!

Thanks,

-David

PS - ADMINISTRATORS: if I have posted in the wrong forum, please move this post to the appropriate forum.

AndreTheGiant · ‎01-26-2011

Welcome to the community.

As far as I know, this is not currently possible.

Andre

Andrew | http://about.me/amauro | http://vinfrastructure.it/ | @Andrea_Mauro

admin · ‎01-27-2011

If you are looking at 6000 / 7000 ultra-high-end class GPUs for your users at 1:1 today, do you think they will be able to share one in the future with the same level of performance?

WP

ruready511 · ‎01-28-2011

It looks like this is not possible on VMware's platform. ESX lacks the display drivers and virtual hardware to provide a WDDM device with D3D (or DirectX) support to the virtual machine. As much as I'd like to keep all my eggs in one basket and stay with VMware for desktop virtualization, I may have to migrate over to Microsoft's Hyper-V platform. I'll use the new RemoteFX technology in Windows Server 2008 R2 and the Remote Desktop Connection 7.1 client to achieve what I need. PNY also has some well priced enterprise graphics hardware that can assist in delivering multiple GPUs to multiple hosts (something like the PNY Quadro Plex 2200 S4). And with the way VMware is headed with their pricing, this will be a significantly less expensive solution.

I achieved a PoC in a setup similar to this last September, but when I develop a more well-rounded solution I'll post it here for reference.

Thanks for the info,

-David

admin · ‎01-28-2011

Do you have more details you are willing to share on what you want to achieve and your performance expectations?

- How many users per GPU do you need to get so you feel you have a well priced solution?

- Is price or performance more important for your use case?

- If your users use a high-end class GPU such as the 5800 / 6000 etc. today, will sharing the same GPU give them the same level of performance?

- Will the server system have enough bus bandwidth and power to handle four GPUs for the higher end 3D work / rendering to make it cost effective?

- Is HA important in the event of a failure?

- How many monitors per user at what resolution do your users need?

Best case scenario:

Using the components below you could potentially get 48 users on a single host. This assumes you will get 24 users per GPU. RemoteFX has only been tested up to 12, based on current public info. This also assumes the system will have enough bus bandwidth to support 4 GPUs on a shared bus, which is not likely. This might be possible for Aero workloads but is less likely for Aero plus CAD or Illustrator.

Next Case Scenario:

Because it's unlikely to get 4 GPUs on any of today's systems or any that are coming right away, it is more likely you could get 24 users per host. This is likely going to be OK for hardware-accelerated Aero and lightweight CAD and Illustrator work, but not workstation-class CAD/CAM design work.

HA Case Scenario:

If you need any level of availability for faster recovery in the event of a problem or failure, you are looking at two systems in a cluster with live migration and 12 users per host.

This all might be OK and a well-priced solution for your use case. It is hard to say without more details.

Can you share some more specific details, so we can better understand your use case and requirements?

I am not aware of our View pricing changing. Can you help us understand and map out the specifics of how you will end up with a cheaper solution?

Thanks,

WP

ruready511 · ‎02-11-2011

wponder,

Thank you for your response.

To address some of the points you brought up:

- The workstations we are currently purchasing are $5k on average (some going up to $7k). My expected cost per host is roughly $20k-$30k, so my VM density needs to be at least 6:1 or 7:1 at an absolute minimum - but I would like to see 12:1 or 15:1.

- Price and performance need to be very well balanced. The performance needs to be at least the level we are at now, and maybe a little extra, but not by much. Price is probably going to be the driving force in this decision - but the bottom line is this: a hardware refresh will cost me $220k at a minimum and up to $290k at the top. My preliminary estimates put the virtualization project at $150k to $190k. So, that is really the bracket that I'm looking at right now.

- My users currently use the NVIDIA Quadro FX 3800 cards in their workstations, so having them share a Quadro FX 5800 should be just fine. And to be honest, it's the video RAM that I'm after - not the card's advanced processing features. The Quadro FX 5800 has 4GB of dedicated video RAM.

- The bus bandwidth will be just fine. I'll most likely use the PNY Quadro Plex 2200 S4 (an external 1U server with 4x Quadro FX 5800 cards). The Quadro Plex 2200 S4 attaches to the host via 2 (full or half height) x16 PCI Express cards with custom cables - and all 4 cards can be utilized by one host. I'll also note that only 50% of the cost of this external piece of equipment is included in the unit cost for one host. I will attach one Quadro Plex 2200 S4 to two different hosts - giving each host 2 Quadro FX 5800 cards (this is a supported configuration).

- HA is important, which is why 2 or 3 hosts will be placed in each of our offices (or many hosts in one location, should we decide to fully centralize the desktops).

- Our users currently only have one monitor each. Some have one 19'' monitor and others have one 24'' monitor. Generally they have their resolution set to 1680x1050 or 1920x1200 respectively (none of our users have resolutions higher than that). Dual monitors are an option for the future, though.
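As a rough check on the density targets in the first point above, here is a minimal Python sketch of the break-even arithmetic. It uses only the workstation and host prices quoted in this post; the function name is made up for illustration, and it deliberately ignores licensing, storage, and thin client costs.

```python
import math

def break_even_density(host_cost, workstation_cost):
    """Smallest whole number of VMs per host at which the host costs
    no more per user than buying each user a workstation."""
    return math.ceil(host_cost / workstation_cost)

# Prices quoted in the post: workstations average $5k, hosts run $20k-$30k.
for host_cost in (20_000, 30_000):
    vms = break_even_density(host_cost, 5_000)
    print(f"${host_cost:,} host breaks even at {vms}:1")
```

On hardware cost alone, the $30k host lands at the 6:1 floor mentioned above; anything beyond that is margin for the costs this sketch leaves out.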

The theoretical (obviously untested) VM density I can achieve using only video RAM as my limitation is about 25:1 (2 GPUs per host). The problem here is that each user will need at a bare minimum 2GB of RAM to run the operating system and a few light applications and up to 8GB of RAM (or more) for the heavier applications and renderings. If I specify the minimum amount of RAM as 2GB on each VM with the maximum exceeding the host's capacity - then I really only need 64GB of RAM in each host (actually it would be 50GB, but that isn't really a 'round' number when dealing with storage capacity). All I need to do now is bet that not everyone will kick off a render at the same time (but this would fall under some extended capacity planning). This will, at least in theory, allow me to achieve a VM density of 25:1. Even if I don't get quite 25:1, I should still be able to exceed my expectations set previously.
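To put numbers on the "bet" above, here is a small Python sketch of the memory overcommit arithmetic. It is only an illustration under simple assumptions: the function name is hypothetical, hypervisor overhead is ignored, and each VM is treated as sitting at either its 2GB minimum or its 8GB maximum.

```python
def ram_plan(vm_count, min_gb, max_gb, host_gb):
    """Return (RAM with every VM at minimum, RAM with every VM at maximum,
    number of VMs that can sit at maximum at once on this host)."""
    floor_gb = vm_count * min_gb
    peak_gb = vm_count * max_gb
    headroom_gb = host_gb - floor_gb
    # Each VM that grows from min to max eats (max - min) GB of headroom.
    at_max = max(0, headroom_gb // (max_gb - min_gb))
    return floor_gb, peak_gb, min(vm_count, at_max)

floor_gb, peak_gb, at_max = ram_plan(vm_count=25, min_gb=2, max_gb=8, host_gb=64)
print(f"all at minimum: {floor_gb}GB, all at maximum: {peak_gb}GB")
print(f"VMs that can run at maximum simultaneously: {at_max}")
```

With 25 VMs on a 64GB host this gives the 50GB floor mentioned above, and it shows how little room is left for simultaneous renders, which is the capacity planning question.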

Now, about the VMware pricing. Although I am not looking to a specific product that has increased in price lately - VMware's overall cost has increased significantly over the last year or so. Microsoft's solution is looking much more attractive simply because of server and client licensing. In my current environment I would need to upgrade my vCenter server from Foundation to Standard to be able to add the additional host processors into my environment, plus I would need the licenses for the additional host, plus the View software, and full copies of Windows 7 client operating systems. With Microsoft's solution, the licenses for the server are included in the cost of the operating system (Windows Server 2008 R2 Enterprise), I would still need SCVMM, though. But then, when I purchase my lightweight Windows 7 client machines with Software Assurance I'll be able to skip the additional cost of purchasing retail instances to run in the datacenter. Maybe if I already had the Standard edition of vCenter and if View was cheaper, it might work better in my situation.

But aside from pricing, there is the subject of the platform's capability. While VMware's View 4.5 has PCoIP and is very attractive - it cannot virtualize the GPU to the VMs and therefore cannot meet the requirements of this project (as far as I am aware). Microsoft's solution using Hyper-V and RemoteFX can successfully virtualize the GPUs that exist in a host and divide them among the VMs. This is the functionality I am looking for - I must be able to access a WDDM display driver that is able to perform Direct3D rendering to meet my application workload.

Anyway, sorry about the late response - I got caught up here at work. This is all still up in the air, though, and it may still come back to favor VMware, but this is where I am at right now.

I'll try to keep posting to this thread as I get more details about the project.

Thank you so much for your responses, though.

-David

admin · ‎02-11-2011

David,

This is helpful. I broke things down like this.

Current Environment:

- Currently using workstations, not PCs, with Quadro 3800 GPUs 1:1

- Seems these are certainly 3D workloads and not typical knowledge worker workloads

Requirements:

- Support for 1920 x 1200 displays, dual monitor in the future

- Minimum 6:1 consolidation. Preferred 12:1 or 15:1 - Assumed possible is 25:1

- Desire is to provide performance equal to or slightly greater than today's

- HA is required

Proposed configuration:

- S4 GPU chassis with 4 x 5800 GPUs @ 4GB RAM each

- 2 hosts for VMs

Challenges I think you will face:

Please keep in mind I am not trying to mislead you or create any FUD, regardless of the direction you take. My info is based only on what I know, have insight into, and have experienced from our own research and development, which shaped our assumptions, expectations, and ultimately the approach we are taking to solve different customer use cases. Some of my points could be wrong here, and I am OK with being wrong. I do not work for MS, and all the public / final info is forthcoming in a few weeks, if anything has changed. I have several associates, current and past, working in this area, and new friends and associates here familiar with the challenges at hand. I say we are all cooking with water, to some degree.

I think your theoretical limit using only VRAM as your limiting factor should be more like 36 VMs per host using two GPUs and desktops @ 1 x 1920 x 1200.

With 2 x 1920 x 1200, 26 VMs per host using 2 GPUs (I am guessing you used dual display for your calculation).

- In both cases, this is twice as many VMs per GPU as what has been tested (per MS documentation)

- This is also before adding any variables like HA.
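For anyone checking the VRAM-only limits above, here is a short Python sketch that backs out the per-VM video memory those limits imply. It assumes the 2 x 4GB configuration discussed in this thread and an even split of VRAM between VMs; it does not use RemoteFX's actual per-resolution VRAM requirements, which are not given here, and the function name is just for illustration.

```python
def vram_per_vm_mb(gpus_per_host, vram_per_gpu_gb, vms_per_host):
    """Video memory available to each VM if VRAM is split evenly."""
    total_mb = gpus_per_host * vram_per_gpu_gb * 1024
    return total_mb / vms_per_host

# Two Quadro FX 5800 cards (4GB each) per host, as proposed in this thread.
scenarios = (("single 1920x1200 display", 36), ("dual 1920x1200 displays", 26))
for label, vms in scenarios:
    per_vm = vram_per_vm_mb(2, 4, vms)
    print(f"{label}: {vms} VMs per host, {vms // 2} per GPU, "
          f"about {per_vm:.0f} MB of VRAM each")
```

That works out to roughly 228 MB per VM in the single-display case and roughly 315 MB in the dual-display case.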

Can you hit your target of 12:1 or 15:1 per host?

- Older MS documentation stated 4 GPUs per host could be supported, current public documentation states 2 GPUs ( Granted this could change as server systems and GPUs change over time ) ( If I read the tea leaves I would say mid - late 2012). Again, speculation on my part.

- Note: a common Nehalem processor can only support one x16 PCI Express slot per socket (you will need at least 2 for full performance)

- An expansion chassis like the PNY interconnects with the host using one x16 slot, so you are sharing the bandwidth between two GPUs

- With a single host and the stated tested numbers using your monitor config, you are looking at 24 VMs per host. In that case, I think you can get 24 VMs per host and exceed your 12:1/15:1 target. I think this is more of a fit for hardware-accelerated Aero, though, and not 3D workloads like you seem to be dealing with. I think your users who are used to having a 3800 and will now be sharing a 5800 with 11 others will end up with a lesser experience. It also does not meet the HA requirement.

- With a two-host config for HA using 4 GPUs, two of your GPUs will essentially need to be in standby mode, ready for a failure. It will be active / active, and each host will need to be able to support the VMs from the other host in the event of a failure. Here you are looking at the following:

- 2 hosts

- 4 x 5800 GPUs @ 4GB RAM: 2 GPUs per host, one active and one standby

- Assuming 12 VMs per GPU and 12 VMs per host in an active / active config, you can get 24 VMs total, or 12 per host

- This will meet your 12:1 consolidation.
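The HA arithmetic above can be sketched in a few lines of Python. This is a simplified model, not a sizing tool: it assumes GPU slots are the only constraint, takes the 12 VMs per GPU figure cited in this thread, and reserves enough capacity for the surviving hosts to absorb every VM after a failure. The function name and parameters are made up for illustration.

```python
def ha_capacity(hosts, gpus_per_host, vms_per_gpu, failures_tolerated=1):
    """Return (total VMs the cluster can run, VMs per host in normal
    operation) such that all VMs still fit after the given failures."""
    surviving = hosts - failures_tolerated
    if surviving <= 0:
        raise ValueError("need at least one surviving host")
    total = surviving * gpus_per_host * vms_per_gpu
    return total, total // hosts

total, per_host = ha_capacity(hosts=2, gpus_per_host=2, vms_per_gpu=12)
print(f"{total} VMs total, {per_host} per host in normal operation")
```

With two hosts, half of the GPU capacity is always held in reserve, which is where the 24 total / 12 per host figure comes from.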

The big question then is: is it 12:1 doing knowledge worker hardware-accelerated Aero, or can I truly consolidate 3D workstation-class workloads? I think you will find you can meet the needs of a richer desktop experience using Aero and some basic 3D. I do not think your 3D workstation-class users will be happy at 12:1. This is a total guess on my part. I think if you drop down to 3:1 or 2:1 you could consolidate some heavier 3D stuff and meet their needs.

Summary:

I think you should for sure try it out if the math works for you. You have nothing to lose. I do not know anything about your app. Assuming it only needs D3D 9.0c support and nothing specifically from the NVIDIA driver, you will be set there.

If you go back and look at presentations from VMworld 2009, we laid out a vision for the role hardware assistance would play with virtual desktops before anyone else in the market started talking about it. I think we are dialed in pretty well. I would not assume we are doing nothing, do not have plans, or are even behind. We have been doing virtual graphics, starting with our personal products, longer than anyone. At a minimum I would work with your VMware sales team so you know what our plans are.

Sorry, I did not follow you on the pricing stuff. View pricing has not changed for some time. The View price includes vSphere. It sounds like you are using standard vSphere licensing rather than View licensing.

Good luck with your testing!!!

WP

ruready511 · ‎02-11-2011

wponder,

Thank you so much for your quick response. I cannot tell you how valuable it is to have a knowledgeable person make meaningful comments on my questions and theory. I also very much appreciate your attitude in not locking the thread or telling me that I cannot mention a competitor's product / solution.

I'll be taking this into very careful consideration throughout my research process. The project is simply an initiative I've started, but I'll need to know that 1.) the technology exists, and 2.) it can be implemented in the specific way I need it to be. Once I can get that under my belt, I'll see if I can get some capital for some preliminary purchases to build a working lab. Sometimes, though, virtualization gets me running in circles and it can be difficult to sift through all the products and solutions to find one that matches my needs.

Like I said, I have not written off VMware in this project. In fact, I would love to stay with VMware since our current server infrastructure is running on it. I just need to make sure that whatever solution I choose meets the needs of the project.

Thanks again,

-David

Gridlock2011101 · ‎07-27-2011

ruready511:

I wanted to ask what solution you ultimately selected?

I'm in a similar boat, in that, I am attempting to figure out a process for virtualizing racks of 3D accelerated systems.

We work with OpenGL and D3D applications, and RemoteFX is fickle when it comes to OpenGL. I don't suppose vSphere/View 5 added any better functionality for 3D acceleration in VMs?

Does anybody else have experience using the NVIDIA S4 stuff, or even the Dell C410x, to assign dedicated GPUs to VMs? On the same subject, I guess GPU virtualization or GPU resource pooling isn't exactly possible? No possibility of splitting an NVIDIA Tesla M2070Q card between a couple of VMs?

ruready511 · ‎07-28-2011

Gridlock,

I did not choose either solution. Lacking the capital required to properly test the system, my budget was trashed. However, to answer your other question – yes, GPU virtualization and GPU resource pooling are possible. I did get a working proof of concept around the time that RemoteFX had emerged onto the scene (last September/October). I used a Dell Precision T7500 that we use for rendering content from 3ds Max and V-Ray. The workstation was between users, so I reformatted it and started playing around with D3D virtualization configurations. The host GPU was a single NVIDIA Quadro FX 4800 and I used Hyper-V (WS2k8R2) as the hypervisor. This configuration ran 2 Windows 7 SP1 guests, each running our D3D application at the same time via RemoteFX and RDP 7.1 (on a separate client machine). This test was more about getting the technology to work than about trying to achieve high VM density.

The key to unlocking functional high density D3D GPU resource pooling is in the hardware. Just as a feature film budget is 50% equipment and 50% everything else – the solution to this problem is 50% equipment and 50% everything else. First, you’ll need to use an NVIDIA GPU that supports SLI Multi-OS. At the time I made the proof of concept, this was only the Quadro FX 3800, 4800, and 5800 GPUs. The mainboard (motherboard) chipset must support VT-d (Directed IO) and the processor must support VT-x (basically it’s an instruction set that makes the guests think they’re running on bare metal and not managed by a hypervisor) – this also allows Directed IO to function properly. After that it gets more complicated… (i.e., you’ll probably need to call the mainboard manufacturers to get the next pieces of information)

I’ve built a few renderless HD editing rigs for the commercial post-production world (for AVID, not junk like FCP). This next step is what makes those systems renderless, and is a key player in making functional high density D3D-capable VMs possible. Each PCI Express slot you intend to fill (with a GPU or a device like the S4) must be on a dedicated bus segment to the northbridge chipset controller (not the southbridge). There are newer processors available that take on the role of the northbridge controller; however, in this application I would shy away from such processors. I would imagine it would put too much stress on the processor under increased workload with all VMs requiring D3D functionality. The more dedicated bandwidth we can get from the GPU to the processor, the more users we can fit on that pipeline (until we fill the GPU’s onboard GDDRx memory, of course).

The next critical thing you want to avoid is bus bottlenecking. This happens on so many systems. Both data buses for the CPU and memory must match exactly. The memory speed and FSB (front side bus) must match exactly. And the total amount of bandwidth between the northbridge controller and the CPU, and between the northbridge controller and the memory – must be equal to or greater than the total bus bandwidth between the utilized PCI Express slot(s) and the northbridge controller (if you draw it out it makes sense). You may also want to note the bus bandwidth from the southbridge to the northbridge controller. You’ll want that to exceed your communications requirement. For instance, if you need several 1Gb aggregated links to high performance shared storage to boot guests from (like 4-10Gbps), you’ll want to make sure that the link from the southbridge to the northbridge has enough bandwidth for that and still does not cause PCI congestion within the northbridge controller. If you have no bus bottlenecks, you’ll be able to comfortably achieve high VM density without loss of functionality from the GPU or other critical parts of the system.
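If you'd rather not draw it out, the bandwidth comparison described above can be sketched in Python. Treat this as a rough illustration only: the per-lane figures are the nominal PCI Express 1.x and 2.0 rates per direction, the northbridge-to-CPU and northbridge-to-memory numbers in the example are hypothetical placeholders that you would replace with figures from your mainboard manufacturer, and the function names are made up.

```python
# Nominal usable bandwidth per lane, per direction, in MB/s.
PCIE_MB_PER_LANE = {1: 250, 2: 500}

def slot_bandwidth_gbs(gen, lanes):
    """Nominal one-direction bandwidth of a PCI Express slot in GB/s."""
    return PCIE_MB_PER_LANE[gen] * lanes / 1000

def has_bottleneck(slots, cpu_link_gbs, memory_gbs):
    """True if the populated slots can demand more bandwidth than the
    slower of the northbridge-to-CPU and northbridge-to-memory paths."""
    demand = sum(slot_bandwidth_gbs(gen, lanes) for gen, lanes in slots)
    return demand > min(cpu_link_gbs, memory_gbs)

# Two PCIe 2.0 x16 slots against hypothetical 12.8 GB/s CPU and
# 25.6 GB/s memory paths (placeholder numbers, not from any datasheet).
slots = [(2, 16), (2, 16)]
print("bottleneck:", has_bottleneck(slots, cpu_link_gbs=12.8, memory_gbs=25.6))
```

The same check can be repeated for the southbridge-to-northbridge link by swapping in the storage and network bandwidth you need instead of the PCI Express slots.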

Now, what makes this difficult is that there is no certification for this process yet. If we were purchasing a preconfigured purpose-built machine for AVID MediaComposer, Autodesk Automotive or Maya, or for industry specific software applications for geology, biology, finance, etc – it would be really easy to simply get a preconfigured HP Z800 workstation that is certified for the exact purpose we need it for (the HP Z800 workstations are famous for this in many industries). Unfortunately, we can’t get a Z800 certified for “Hypervisor Hosted D3D GPU Virtualization” (however, I have been wrong before). So for right now, we have to build these machines using the guidelines mentioned above – and others if anyone has any other helpful information on the topic.

About the NVIDIA Tesla cards – having never worked with a Tesla accelerator, I’m speaking purely from my own speculation here. However, I believe the preferred usage would be to put them into a high performance compute cluster (running Microsoft’s HPC 2008 R2, Red Hat, or Suse) and submit jobs to the host’s job scheduler from other clients on the same network. I don’t know how ‘virtualization-friendly’ they are.

Here are three articles for some of the stuff I mentioned:

- NVIDIA's SLI Multi-OS press release:

http://www.nvidia.com/object/io_1238408514209.html

- An old but informative article on PCI and PCI Express from Dell:

http://content.dell.com/us/en/corp/d/business~solutions~whitepapers~en/Documents~wp-2004_pciexpress....

- Very good overview of the PCI Express bus, and it's more current than the article above:

http://www.pcisig.com/specifications/pciexpress/resources/PCI_Express_White_Paper.pdf

I hope this helps!

-David

johnhearns · ‎07-29-2011

If you are working with graphics cards such as NVIDIA FX and Quadro, you should be looking at the hardware Teradici cards.

You install a 'host card' into a PCI slot beside the graphics card and connect the output of the graphics card via a short loop cable.

You can connect to the host card either via a hardware zero client, or the VMware View client.

Works fine with the graphics cards you mention.

I would give serious consideration in your situation to either racking the servers in a machine room, or using blade servers with built-in PCoIP cards. Plus deploy zero clients on the users' desktops.

Gridlock2011101 · ‎07-29-2011

@johnhearns

Since we're already sitting on racks of computers, we've definitely considered the host cards for the systems that require intensive graphics. We'd just like to consolidate as much as possible. I guess from there we could have a few host cards sprinkled around, and we could virtualize the rest of the non-intensive machines.

Since it will be a ground-up build, I don't suppose the View connection server can pool those physical hosts with my consolidated VMs? I didn't see any host cards on Teradici's page that said they were View ready. Would we be presented with options at the zero client level for which system we want to connect to? I'm not familiar with how View and the zero client interact, or whether a physical host/VM has to be set up in the connection server for a specific zero client.

Bladed servers are also on the table, as I believe we'd be covered on processing and a dedicated GPU per blade.

As with everything, I'm sure it will come down to budget again.

admin · ‎07-29-2011

Gridlock -

From prior posts in this thread you will see we break this down pretty specifically by use case. As you have found (and I mentioned), vGPU is not the be-all and end-all silver bullet solution for handling high-end graphics workloads. When the GPU is fully virtualized you lose a lot of the benefits of native driver capabilities etc. There is still a place for it though.

Have you zeroed in on your requirements yet? Do you feel you have Workstation class workloads that require a discrete GPU with low latency performance? Do you feel a vGPU solution will adequately meet your graphics requirements? Will your existing systems support multiple GPUs and allow you to consolidate even at lower ratios? There are trade-offs with any approach.

My team is looking for people to work with in this area and is qualifying opportunities / people to work with early around a few things we are working on.

http://www.youtube.com/watch?v=g4-WaxwIbC4

Thanks,

WP

johnhearns · ‎07-29-2011

Gridlock,

All of the Teradici host cards are compatible with VMware View.

The View Connection Broker can broker connections to virtual hosts or to PCI host cards (as far as I know)

Regarding the zero clients, they can be configured to use a connection Broker.

http://www.vmware.com/files/pdf/VMware-View-Using-PCoIP-HostCards-TN-EN.pdf

ruready511 · ‎07-29-2011

Someone correct me if I'm wrong, but I believe VMware assisted in the development efforts of the PCoIP protocol. I'm almost positive that all of Teradici's host encoders are compatible with View. In fact, I'm pretty sure that any device that is PCoIP compatible can be used as a View endpoint. To outline the industry competition:

- VMware has PCoIP

- Citrix has HDX

- Microsoft has RemoteFX

If you are looking to go the blade route with PCoIP hardware encoders, I wouldn't expect to get high VM density. In fact, you'll only get 1-to-1. You'll need one host, one GPU, one PCoIP hardware encoder, and one PCoIP zero client.

If you want to play around with this stuff, just ask your VMware reseller for a demo license of View and a Wyse P20 zero client demo unit. This way, you can test the package without actually purchasing anything. You may have trouble getting a demo unit of a PCoIP hardware encoder - however, the protocol can be rendered via software, so the hardware is only necessary if you need a performance boost on the host. I had a chance to demo a few thin and zero client solutions, and my absolute favorite is Pano Logic - hands down. Their management software is out of this world - it integrates with several directory services, DNS, View, Hyper-V, and XenDesktop. The problem with Pano Logic is that they refuse to support PCoIP, HDX, or RemoteFX in favor of their own (inferior) display protocol. So it really doesn't work for any of our needs here.

However, if you are interested in View, PCoIP, and zero clients - I would highly recommend the Wyse solution for the clients and client management software. You'll probably be wanting the Wyse P20 that I mentioned earlier.

Anyway, I can't really say much more about this type of setup because I have no experience with it.

-David

Gridlock2011101 · ‎08-16-2011

So....

Did anybody see the stuff at SIGGRAPH about NVIDIA's Virtual Graphics and Project Maximus?

http://www.engadget.com/2011/08/08/nvidias-project-maximus-takes-multi-gpu-mainstream-virtual-gr/

They referenced XenServer again... that's the type of stuff I'm looking for from VMware.

'Enables high-performance Quadro graphics in a virtual machine'