We have 3 PowerEdge R720 servers, all on ESXi 5.5 build 1892794 (Dell customized). Each server has two NVIDIA GRID K1 cards, and I am using vDGA passthrough for 8 VMs on each host (each VM gets one of the K1's GPUs). I've followed the VMware guide for vDGA setup, which includes configuring the pciHole.start parameter and reserving all guest memory on boot. The VMs are diskless and stream a Citrix Provisioning Services vDisk (VHD), caching all writes to guest memory. Each VM has just over 40 GB of memory.
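For completeness, the relevant .vmx entries look roughly like this (the reservation value matches our ~41 GB VMs and is illustrative; the passthrough device entries are normally filled in by the vSphere Client when you add the PCI device):

    # vDGA GPU passthrough (device IDs are added by the vSphere Client)
    pciPassthru0.present = "TRUE"
    # Per the vDGA guide, needed for VMs with more than 2 GB of RAM;
    # 2048 places the PCI hole at the 2 GB boundary
    pciHole.start = "2048"
    # "Reserve all guest memory" as it appears in the .vmx (value in MB)
    sched.mem.min = "41984"
    sched.mem.pin = "TRUE"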
All three servers will randomly (often when booting the VMs, but not always) fail with a PSOD (see attached). Here's what I've tried for troubleshooting so far, although none of these items has resolved the issue:
Hardware events on each host show a bus fatal error on the slot corresponding to the physical GPU location at exactly the same time the PSOD occurs. It looks like a hardware error, but seeing the same issue on all three servers is strange, especially on the host that already had its motherboard replaced. Could anything in the configuration be causing the PSOD? Any ideas are appreciated.
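For anyone who wants to check the same correlation on their own hosts, this is roughly what I ran from the ESXi shell (the grep pattern is just a starting point):

    # List the NVIDIA functions and note their PCI addresses
    lspci | grep -i nvidia
    # Look for PCIe fatal-error messages around the crash time
    grep -iE 'pcie|fatal' /var/log/vmkernel.log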
Hi jrpvt,
The first thing that comes to my mind is to update the firmware on your Dell R720 servers. I've had problems before with a GRID K1 card in an IBM server (PCI bus error), which went away after updating the BIOS.
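If it helps, you can check the BIOS version the host is actually running from the ESXi shell; smbiosDump prints the SMBIOS tables (the grep pattern below is approximate, as the output format varies between releases):

    # Dump SMBIOS data and look for the BIOS information block
    smbiosDump | grep -A 3 'BIOS Info'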
Hope this helps.
Forgot to mention that step. BIOS is the latest (2.2.3).
I've got a very similar issue. Dell R720, latest firmware, and it PSODs after about 5 days. Hope there is a fix soon. I've got cases open with both Dell and VMware.
Trying to understand whether the issue I'm seeing at a customer site (also Dell R720) is the same one. When does the PSOD appear?
BTW: The Nvidia ESXi 5.5 GRID driver is for vSGA and not required for vDGA.
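If you want to check whether the vSGA driver VIB is present on a host (and remove it, since vDGA doesn't need it), something like this works; the VIB name below is only an example, so use the exact name from the list output:

    # List any NVIDIA VIBs installed on the host
    esxcli software vib list | grep -i nvidia
    # Remove it by exact name (example name; host should be in maintenance mode)
    esxcli software vib remove -n NVIDIA-VMware_ESXi_5.5_Host_Driver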
It seems to occur when several VMs are restarted at once. If I'm using one of the VMs, the session will become unresponsive and eventually the server fails. Last week I converted the image from Provisioning Services to a standard VMDK and used MCS to deploy the VMs instead. Since then, all servers have been stable. If that's the fix then great, but I'd like to know why that would cause a hardware failure.
Quick update: another host failure after 3 weeks of stability. I thought changing the provisioning method to MCS had fixed it, but apparently not. More and more, this looks like a hardware failure.
I was wondering if you ever found a solution for this. I'm looking at buying one of these servers with a GRID K1, but if they're not stable, I'm not sure.
Hi Chal86,
I have several customers that are using Dell R720 and GRID K1 and haven't had any issues. What are you interested in doing with the GRID K1 cards?
Thanks,
Erik
We've had similar issues with our new HP DL580 Gen8 servers with NVIDIA GRID K2 cards.
We've done everything we can think of to troubleshoot this issue. HP has currently duplicated our environment in their lab to test and troubleshoot. So far, they have reproduced the failure but do not have a solution.
At this point we can make it fail on demand by putting a load on the graphics card. We take one of our hosts out of the cluster and migrate 20 test VMs to it, configured with 3D set to Automatic and 512 MB of VRAM. Within 15 minutes of playing YouTube videos on these VMs, the host PSODs.
As far as the card's resources go, we never exceed 50% utilization on the processors or memory.
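For anyone trying to reproduce the test, the VM-side settings correspond to roughly these .vmx entries (assuming the usual parameter names for the 3D options; 536870912 bytes = 512 MB):

    # 3D enabled with renderer selection left on Automatic
    mks.enable3d = "TRUE"
    mks.use3dRenderer = "automatic"
    # Video RAM size in bytes (512 MB)
    svga.vramSize = "536870912"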
My problem was solved by changing the HP Power Profile from "Balanced" to "Maximum Performance". Apparently the GPUs were asking for more power than the host was prepared to allocate under the balanced setting.
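If you want to confirm the power demand before changing the profile, the guest-side NVIDIA driver can report the card's draw and limits from inside a VM that has the GPU passed through:

    # Query GPU power draw and power limits
    nvidia-smi -q -d POWER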
We are experiencing the same issue.
We use XenApp 7.6 with PVS 7.6. From time to time, for no apparent reason, a host will ramp up to 100% fan speed.
We use HP DL380 Gen9 servers with NVIDIA GRID cards, all patched to the latest firmware.
We switched to the High Performance profile but still no luck.
Has anyone had an update or a fix for this issue?