VMware Cloud Community
jrpvt
Contributor

ESXi 5.5, Dell R720 with GRID K1s, PSOD

We have 3 PowerEdge R720 servers, all on ESXi 5.5 build 1892794 (Dell customized). Each server has two NVIDIA GRID K1 cards (four GPUs per card). I am using vDGA passthrough for 8 VMs on each host (each VM gets one GPU). I've followed the VMware guide for vDGA setup, which includes configuring the pciHole.start parameter and reserving all guest memory on boot. The VMs are diskless and stream a Citrix Provisioning Services VHD, caching all writes to guest memory. Each VM has just over 40 GB of memory.
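
The two VM-level settings mentioned above (pciHole.start and a full memory reservation) can be sanity-checked from the text of a VM's .vmx file. The following is only a minimal sketch: the function names are hypothetical, the key names memSize and sched.mem.min are assumed to be the .vmx entries for configured memory and memory reservation (both in MB), and the sample values (40960, 2048) are illustrative rather than taken from the hosts in this thread.

```python
def parse_vmx(text):
    """Parse .vmx-style `key = "value"` lines into a dict with lower-cased keys."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not a key/value pair.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip().lower()] = value.strip().strip('"')
    return settings


def check_vdga_settings(settings):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    # Only the presence of pciHole.start is checked; the right value comes from the vDGA guide.
    if "pcihole.start" not in settings:
        problems.append("pciHole.start is not set")
    # Assumption: memSize and sched.mem.min are both expressed in MB.
    mem_mb = int(settings.get("memsize", "0"))
    reserved_mb = int(settings.get("sched.mem.min", "0"))
    if reserved_mb < mem_mb:
        problems.append(
            f"memory reservation {reserved_mb} MB is below memSize {mem_mb} MB"
        )
    return problems


if __name__ == "__main__":
    # Illustrative .vmx fragment, not copied from a real VM.
    sample = 'memSize = "40960"\nsched.mem.min = "40960"\npciHole.start = "2048"\n'
    print(check_vdga_settings(parse_vmx(sample)))
```

Running the check against every passthrough VM's .vmx file would at least rule out a missed setting on one VM before attention turns to hardware.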

All three servers will randomly (often when booting the VMs, but not always) fail with a PSOD (see attached).  Here's what I've tried for troubleshooting so far, although none of these items has resolved the issue:

  • Disable the MMIO above 4 GB BIOS option (per VMware recommendation).  With this option disabled, one GRID card goes missing in ESXi.
  • Enable the MMIO above 4 GB BIOS option.  Both GRID cards are detected and function.
  • Remove one GRID card
  • Replace motherboard and GRID card
  • Swap GRID card locations
  • Install the Nvidia ESXi 5.5 GRID driver
  • Disable all C-states for power management

Hardware events on each host show a bus fatal error on the slot corresponding to the physical GPU location at exactly the same time that the PSOD occurs.  It appears this is a hardware error, but to see the same issue on all 3 servers is strange, especially on the host that had a motherboard replaced.  Could anything in the configuration be causing the PSOD to occur?  Any ideas are appreciated.

psod.png

11 Replies
admin
Immortal

Hi jrpvt,

The first thing that comes to my mind is to update the firmware on your Dell R720 servers. I've had problems before with a GRID K1 card on an IBM server (PCI bus error), which went away after updating the BIOS.

Hope this helps.

jrpvt
Contributor

Forgot to mention that step.  BIOS is the latest (2.2.3).

Johnnyk1
Contributor

I've got a very similar issue: a Dell R720 with the latest firmware that PSODs after about 5 days. Hope there is a fix soon. I've got cases open with Dell and VMware.

ebohnhorst
Contributor

I'm trying to understand whether this is the same issue I am seeing at a customer site (also a Dell R720). When does the PSOD appear?

BTW: The Nvidia ESXi 5.5 GRID driver is for vSGA and not required for vDGA.

jrpvt
Contributor

It seems to occur when several VMs are restarted at once.  If I'm using one of the VMs, the session will become unresponsive and eventually the server fails.  Last week I converted the image from Provisioning Services to a standard VMDK and used MCS to deploy the VMs instead.  Since then, all servers have been stable.  If that's the fix then great, but I'd like to know why that would cause a hardware failure.

jrpvt
Contributor

Quick update: another host failed after 3 weeks of stability.  I thought changing the provisioning method to MCS had fixed it, but not so fast.  More and more, this looks like a hardware failure.

chal86
Contributor

I was wondering if you ever found a solution for this.  I'm looking at buying one of these servers with a GRID K1, but if they're not stable, I'm not sure.

ebohnhorst
Contributor

Hi Chal86,

I have several customers that are using Dell R720 and GRID K1 and haven't had any issues. What are you interested in doing with the GRID K1 cards?

Thanks,

Erik

robsisk1972
Enthusiast

We've had similar issues with our new HP DL580 Gen 8 servers with Nvidia Grid K2 cards.

psod.jpg

We've done everything we can think of to troubleshoot this issue.  Currently HP has duplicated our environment in their lab to test and troubleshoot.  So far, they have reproduced the failure but do not have a solution.

At this point we can make it fail on demand by putting a load on the graphics card. We take one of our hosts out of the cluster and migrate 20 test VMs to it, each configured with 3D set to Automatic and 512 MB of VRAM.  Within 15 minutes of accessing YouTube videos on these VMs, it PSODs.

As far as card resource usage goes, we never go over 50% utilization on the processors or memory.

robsisk1972
Enthusiast

My problem has been solved by changing the HP Power Profile from "Balanced" to "Maximum Performance".  Apparently the GPUs were asking for more power than the host was prepared to allocate with the balanced setting.

Whocarez
Contributor

We are experiencing the same issue.

We use XenApp 7.6 with PVS 7.6. From time to time, for no apparent reason, a host will go to 100% fan speed.

We use HP DL380 Gen9 servers with NVIDIA GRID cards, all patched to the latest firmware.

We switched to High Performance but still no luck.

Knipsel.PNG

Has anyone had an update or a fix regarding this issue?
