We have 3 PowerEdge R720 servers, all on ESXi 5.5 build 1892794 (Dell customized). Each server has two NVIDIA GRID K1 cards, and I am using vDGA passthrough for 8 VMs on each host (each VM gets one of the K1's GPUs). I've followed the VMware guide for vDGA setup, which includes configuring the pciHole.start parameter and reserving all guest memory on boot. The VMs are diskless and stream a Citrix Provisioning Services vDisk (VHD), caching all writes to guest memory. Each VM has just over 40 GB of memory.
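For completeness, the relevant .vmx entries look roughly like this (the reservation value matches our ~41 GB VMs and is illustrative; the passthrough device entries are normally filled in by the vSphere Client when you add the PCI device):

    # vDGA GPU passthrough (device IDs are added by the vSphere Client)
    pciPassthru0.present = "TRUE"
    # Per the vDGA guide, needed for VMs with more than 2 GB of RAM;
    # 2048 places the PCI hole at the 2 GB boundary
    pciHole.start = "2048"
    # "Reserve all guest memory" as it appears in the .vmx (value in MB)
    sched.mem.min = "41984"
    sched.mem.pin = "TRUE"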
All three servers will randomly (often when booting the VMs, but not always) fail with a PSOD (see attached). Here's what I've tried for troubleshooting so far, although none of these items has resolved the issue:
Hardware events on each host show a bus fatal error on the slot corresponding to the physical GPU location at exactly the same time the PSOD occurs. It looks like a hardware error, but seeing the same issue on all three servers is strange, especially on the host that already had its motherboard replaced. Could anything in the configuration be causing the PSOD? Any ideas are appreciated.
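For anyone who wants to check the same correlation on their own hosts, this is roughly what I ran from the ESXi shell (the grep pattern is just a starting point):

    # List the NVIDIA functions and note their PCI addresses
    lspci | grep -i nvidia
    # Look for PCIe fatal-error messages around the crash time
    grep -iE 'pcie|fatal' /var/log/vmkernel.log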
Hi jrpvt,
The first thing that comes to my mind is to update the firmware on your Dell R720 servers. I've had problems before with a GRID K1 card in an IBM server (PCI bus error), which went away after updating the BIOS.
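If it helps, you can check the BIOS version the host is actually running from the ESXi shell; smbiosDump prints the SMBIOS tables (the grep pattern below is approximate, as the output format varies between releases):

    # Dump SMBIOS data and look for the BIOS information block
    smbiosDump | grep -A 3 'BIOS Info'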
Hope this helps.
Forgot to mention that step. BIOS is the latest (2.2.3).
I've got a very similar issue. Dell R720, latest firmware, and it PSODs after about 5 days. Hope there is a fix soon. I've got cases open with both Dell and VMware.
Trying to understand whether the issue I'm seeing at a customer site (also Dell R720) is the same one. When does the PSOD appear?
BTW: The Nvidia ESXi 5.5 GRID driver is for vSGA and not required for vDGA.
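If you want to check whether the vSGA driver VIB is present on a host (and remove it, since vDGA doesn't need it), something like this works; the VIB name below is only an example, so use the exact name from the list output:

    # List any NVIDIA VIBs installed on the host
    esxcli software vib list | grep -i nvidia
    # Remove it by exact name (example name; host should be in maintenance mode)
    esxcli software vib remove -n NVIDIA-VMware_ESXi_5.5_Host_Driver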
It seems to occur when several VMs are restarted at once. If I'm using one of the VMs, the session will become unresponsive and eventually the server fails. Last week I converted the image from Provisioning Services to a standard VMDK and used MCS to deploy the VMs instead. Since then, all servers have been stable. If that's the fix then great, but I'd like to know why that would cause a hardware failure.
Quick update: another host failure after 3 weeks of stability. I thought changing the provisioning method to MCS had fixed it, but apparently not. More and more, this looks like a hardware failure.
I was wondering if you ever found a solution for this. I'm looking at buying one of these servers with a GRID K1, but if they're not stable, I'm not sure.
Hi Chal86,
I have several customers that are using Dell R720 and GRID K1 and haven't had any issues. What are you interested in doing with the GRID K1 cards?
Thanks,
Erik
We've had similar issues with our new HP DL580 Gen8 servers with NVIDIA GRID K2 cards.
We've done everything we can think of to troubleshoot this issue. HP has currently duplicated our environment in their lab to test and troubleshoot. So far, they have reproduced the failure but do not have a solution.
At this point we can make it fail on demand by putting a load on the graphics card. We take one of our hosts out of the cluster and migrate 20 test VMs to it, configured with 3D set to Automatic and 512 MB of VRAM. Within 15 minutes of playing YouTube videos on these VMs, the host PSODs.
As far as the card's resources go, we never exceed 50% utilization on the processors or memory.
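For anyone trying to reproduce the test, the VM-side settings correspond to roughly these .vmx entries (assuming the usual parameter names for the 3D options; 536870912 bytes = 512 MB):

    # 3D enabled with renderer selection left on Automatic
    mks.enable3d = "TRUE"
    mks.use3dRenderer = "automatic"
    # Video RAM size in bytes (512 MB)
    svga.vramSize = "536870912"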
My problem was solved by changing the HP Power Profile from "Balanced" to "Maximum Performance". Apparently the GPUs were asking for more power than the host was prepared to allocate under the balanced setting.
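If you want to confirm the power demand before changing the profile, the guest-side NVIDIA driver can report the card's draw and limits from inside a VM that has the GPU passed through:

    # Query GPU power draw and power limits
    nvidia-smi -q -d POWER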
We are experiencing the same issue.
We use XenApp 7.6 with PVS 7.6. From time to time, for no apparent reason, a host will ramp up to 100% fan speed.
We use HP DL380 Gen9 servers with NVIDIA GRID cards, all patched to the latest firmware.
We switched to the High Performance profile but still no luck.
Has anyone had an update or a fix for this issue?