VMware Cloud Community
HelloFelix
Contributor

Deep investigation: GPU passthrough stopped working after upgrading from ESXi 6.5 to 6.7. What changed in PCIe resetting?

I have spent a month investigating the GPU passthrough issue on 6.7. Here is what I found.

Motherboard: MX32-4L0 (a Gigabyte server board that is officially announced to support ESXi 6.5; the board meets all ESXi passthrough requirements)

VM OS: Windows 10 1809 Oct

ESXi Version: ESXi 6.5u2 (with latest patch), ESXi 6.7u1 (with latest patch)

GPU: I tried both AMD RX590 and Nvidia 1660Ti

Passed-through devices: all sub-devices of the GPU, including HDMI audio and the related bus devices.

Issue:

Basically:

If I start the VM for the first time after the ESXi host boots, the GPU works like a charm.

If I restart or stop/start the VM, the GPU stops working and Device Manager shows a warning with error code 43.

If I disable the GPU in Device Manager before a VM restart or stop/start, I am able to re-enable the GPU after the VM reboots.

First, I'm pretty sure that none of the following tweaks helps:

  1. UEFI or Legacy boot of ESXi host
  2. UEFI or BIOS boot of Windows 10 VM
  3. ESXi 6.5(with latest patch) or ESXi 6.7(with latest patch)
  4. AMD Rx590 or Nvidia 1660 Ti
  5. pciPassthru.use64bitMMIO
  6. hypervisor.cpuid.v0
  7. pciHole.start/end
  8. svga.present

I tried them one by one, with ALL combinations, which took me several days, since server MBs are really slow to boot.

The conclusion is always the same:

The first time the VM starts after ESXi boots, the GPU just works. If I reboot or stop/start the VM, the GPU stops working with error code 43.

Then I realized it's a PCIe reset issue, so I tried the following /etc/vmware/passthrough.conf combinations:

# NVIDIA
10de  ffff  link    false
10de  ffff  bridge  false
10de  ffff  d3d0    false
10de  2182  link    false
10de  2182  bridge  false
10de  2182  d3d0    false

# AMD Video Card
1002  ffff  link    false
1002  ffff  bridge  false
1002  ffff  d3d0    false
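To make the structure of these entries explicit, here is a minimal Python sketch that parses them and lists which entries would apply to a given card. The column names (vendor ID, device ID, reset method, shareable flag) and the treatment of "ffff" as a wildcard device ID are my assumptions about the file format, not something confirmed in this thread.

```python
from typing import List, NamedTuple


class MapEntry(NamedTuple):
    vendor_id: str
    device_id: str     # "ffff" is treated as "any device" (assumption)
    reset_method: str  # e.g. link, bridge, d3d0
    shareable: bool    # meaning of the 4th column is an assumption


def parse_entries(text: str) -> List[MapEntry]:
    """Parse whitespace-separated entries, skipping comments and blank lines."""
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"expected 4 fields, got: {raw!r}")
        vendor, device, method, share = (f.lower() for f in fields)
        entries.append(MapEntry(vendor, device, method, share == "true"))
    return entries


def matching(entries: List[MapEntry], vendor: str, device: str) -> List[MapEntry]:
    """Return entries whose IDs match, counting 'ffff' as a wildcard."""
    return [e for e in entries
            if e.vendor_id == vendor and e.device_id in (device, "ffff")]


SAMPLE = """\
# NVIDIA
10de  ffff  bridge  false
10de  2182  d3d0    false
# AMD Video Card
1002  ffff  d3d0    false
"""

entries = parse_entries(SAMPLE)
for entry in matching(entries, "10de", "2182"):
    print(entry.device_id, entry.reset_method)
```

With the sample above, both the wildcard bridge entry and the device-specific d3d0 entry match the 1660 Ti (10de:2182), which is worth keeping in mind when testing combinations: leaving a wildcard line in place may change which reset method actually gets used.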

It took me a whole week to try ALL of those combinations. In the end, ONLY ONE combination worked for me:

  • ESXi 6.5
  • 10de  2182  d3d0   false

Then I upgraded ESXi to 6.7u1 with the SAME settings, and it stopped working.

I found something interesting in the log. When resetting the PCIe devices,

ESXi 6.5 resets them ONE BY ONE, about 4 seconds apart:

2019-03-07T05:56:29.586Z| vcpu-0| I125: UHCI: HCReset

2019-03-07T05:56:29.593Z| vcpu-0| I125: PCIPassthru: Resetting Device at 0000:50:00.0    // This is my GPU

2019-03-07T05:56:33.603Z| vcpu-0| I125: PCIPassthru: Resetting Device at 0000:50:00.1    // This is my GPU

2019-03-07T05:56:37.613Z| vcpu-0| I125: PCIPassthru: Resetting Device at 0000:50:00.2    // This is my GPU

2019-03-07T05:56:41.622Z| vcpu-0| I125: PCIPassthru: Resetting Device at 0000:50:00.3    // This is my GPU

2019-03-07T05:56:45.632Z| vcpu-0| I125: PCIPassthru: Resetting Device at 0000:72:00.0

2019-03-07T05:56:49.692Z| vcpu-0| I125: NVME-PCI: PCI reset on controller nvme0.

while ESXi 6.7 resets them in a batch, all within a few milliseconds:

2019-03-07T09:08:05.219Z| vcpu-0| I125: UHCI: HCReset

2019-03-07T09:08:05.223Z| vcpu-0| I125: PCIPassthru: Resetting Device at 0000:50:00.0    // This is my GPU

2019-03-07T09:08:05.224Z| vcpu-0| I125: PCIPassthru: Resetting Device at 0000:50:00.1    // This is my GPU

2019-03-07T09:08:05.225Z| vcpu-0| I125: PCIPassthru: Resetting Device at 0000:50:00.2    // This is my GPU

2019-03-07T09:08:05.225Z| vcpu-0| I125: PCIPassthru: Resetting Device at 0000:50:00.3    // This is my GPU

2019-03-07T09:08:05.227Z| vcpu-0| I125: PCIPassthru: Resetting Device at 0000:72:00.0

2019-03-07T09:08:09.258Z| vcpu-0| I125: NVME-PCI: PCI reset on controller nvme0.

There must be some difference between 6.5 and 6.7 in the way they reset PCIe devices.

Does anyone know what the difference is and how to make it work on 6.7?
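For anyone who wants to check their own vmware.log the same way, here is a rough Python sketch that measures the gap between consecutive "Resetting Device" lines. The regular expression is only fitted to the log lines quoted above; other builds may format these lines differently.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2019-03-07T05:56:29.593Z| vcpu-0| I125: PCIPassthru: Resetting Device at 0000:50:00.0
RESET_RE = re.compile(r"^(\S+?)Z\|.*PCIPassthru: Resetting Device at (\S+)")


def reset_gaps(log_text):
    """Return (device address, seconds since the previous reset) pairs."""
    events = []
    for line in log_text.splitlines():
        m = RESET_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%f")
            events.append((ts, m.group(2)))
    return [(addr, (ts - prev_ts).total_seconds())
            for (prev_ts, _), (ts, addr) in zip(events, events[1:])]


ESXI_65_LOG = """\
2019-03-07T05:56:29.593Z| vcpu-0| I125: PCIPassthru: Resetting Device at 0000:50:00.0
2019-03-07T05:56:33.603Z| vcpu-0| I125: PCIPassthru: Resetting Device at 0000:50:00.1
2019-03-07T05:56:37.613Z| vcpu-0| I125: PCIPassthru: Resetting Device at 0000:50:00.2
2019-03-07T05:56:41.622Z| vcpu-0| I125: PCIPassthru: Resetting Device at 0000:50:00.3
2019-03-07T05:56:45.632Z| vcpu-0| I125: PCIPassthru: Resetting Device at 0000:72:00.0
"""

ESXI_67_LOG = """\
2019-03-07T09:08:05.223Z| vcpu-0| I125: PCIPassthru: Resetting Device at 0000:50:00.0
2019-03-07T09:08:05.224Z| vcpu-0| I125: PCIPassthru: Resetting Device at 0000:50:00.1
2019-03-07T09:08:05.225Z| vcpu-0| I125: PCIPassthru: Resetting Device at 0000:50:00.2
2019-03-07T09:08:05.225Z| vcpu-0| I125: PCIPassthru: Resetting Device at 0000:50:00.3
2019-03-07T09:08:05.227Z| vcpu-0| I125: PCIPassthru: Resetting Device at 0000:72:00.0
"""

for name, log in (("6.5", ESXI_65_LOG), ("6.7", ESXI_67_LOG)):
    for addr, gap in reset_gaps(log):
        print(f"ESXi {name}: {addr} reset {gap:.3f} s after the previous device")
```

On the excerpts above, every gap on 6.5 is about 4.01 s, while on 6.7 the largest gap is 2 ms.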

13 Replies
daphnissov
Immortal

I am unable to confirm your statement that this motherboard (MX32-4L0) has support for any version of ESXi whatsoever. Second, you're using a brand-new version of Windows 10, which is known to cause problems across the board, and third, you're attempting to pass through consumer graphics adapters. So, from what I can tell, everything about what you're attempting here is either unsupported or shaky at best. The only thing I can suggest is to open a support case with VMware if you are entitled to support, but I suspect you aren't.

HelloFelix
Contributor

I appreciate your helpful reply,

This is a brand-new MB with the latest C246 chipset and Coffee Lake Xeon E-21xx support. According to Gigabyte's spec page (MX32-4L0 (rev. 1.0) | Server Motherboard - GIGABYTE Global), ESXi 6.5 is supported.

The latest Windows 10 is somewhat buggy, but the issue is unlikely to be OS-related.

I'm trying to pass through a consumer GPU because this is a home lab environment.

Of course I can't open a support case for a personal use case with my company's account. But this is the community forum, right? I'm not expecting commercial support here. Or should I move the post somewhere else, like reddit/homelab?

ltycomputer
Contributor

Just create a new VM and choose ESXi 6.5 compatibility.

Although the host is ESXi 6.7/7.0, passthrough then works fine, just like on ESXi 6.5.

Hossy_923
Contributor

Hi HelloFelix,

I know I'm digging up the past here, but I was wondering if you found a solution to this. I've tried the things you mentioned here (short of downgrading back to 6.5) and nothing seems to work. I have a Quadro P2200 that I'm trying to pass through to a Windows 2012 R2 VM. I'm getting "Windows has stopped this device because it has reported problems. (Code 43)" and Problem code 2B (0000002B). VID is 10de and DID is 1c31.

Hardware:

  • Intel NUC NUC9VXQNX (Xeon E-2286M)
  • 64GB ECC DDR4-2666
  • 2x 1TB Samsung 970 EVO Plus, 1x 512GB Samsung 970 Pro
  • PNY NVIDIA Quadro P2200

vSphere Info:

  • vCenter 7.0.0b
  • ESXi 6.7 EP 15

VM Info:

  • Windows 2012 R2
  • 6 GB RAM (100% reserved)
  • VM Hardware version 11

I do notice one thing (just now) in my logs that I'm going to investigate further, but I wanted to post them in case anyone had any ideas.

My logs:

0:00:00:05.463 cpu0:2097152)PCI: 2161: 0000:01:00.0: Device is disabled by the BIOS, Command register 0x0

0:00:00:05.464 cpu0:2097152)PCI: 488: 0000:01:00.0: PCIe v2 PCI Express Legacy Endpoint

0:00:00:05.464 cpu0:2097152)PCI: 248: 0000:01:00.0: Found support for extended capability 0x2 (Virtual Channel)

0:00:00:05.464 cpu0:2097152)PCI: 248: 0000:01:00.0: Found support for extended capability 0x18 (Latency Tolerance Reporting)

0:00:00:05.464 cpu0:2097152)PCI: 248: 0000:01:00.0: Found support for extended capability 0x4 (Power Budgeting)

0:00:00:05.464 cpu0:2097152)PCI: 248: 0000:01:00.0: Found support for extended capability 0x1 (Advanced Error Reporting)

0:00:00:05.464 cpu0:2097152)PCI: 248: 0000:01:00.0: Found support for extended capability 0xb (Vendor Specific)

0:00:00:05.464 cpu0:2097152)PCI: 248: 0000:01:00.0: Found support for extended capability 0x19 (Secondary PCI Express)

0:00:00:05.464 cpu0:2097152)PCI: 435: Found onboard instance 0x8101 from SMBIOS for 0000:01:00.0

0:00:00:05.464 cpu0:2097152)PCI: 2161: 0000:01:00.1: Device is disabled by the BIOS, Command register 0x0

0:00:00:05.464 cpu0:2097152)PCI: 488: 0000:01:00.1: PCIe v2 PCI Express Endpoint

0:00:00:05.464 cpu0:2097152)PCI: 248: 0000:01:00.1: Found support for extended capability 0x1 (Advanced Error Reporting)

0:00:00:05.464 cpu0:2097152)PCI: 423: 0000:01:00.1: PCIe v2 PCI Express Endpoint

0:00:00:05.464 cpu0:2097152)PCI: 1067: 0000:01:00.0: probing 10de:1c31 10de:131b

0:00:00:05.464 cpu0:2097152)PCI: 404: 0000:01:00.0: Adding to resource tracker under parent 0000:00:01.0.

0:00:00:05.464 cpu0:2097152)PCI: 1067: 0000:01:00.1: probing 10de:10f1 10de:131b

0:00:00:05.464 cpu0:2097152)PCI: 404: 0000:01:00.1: Adding to resource tracker under parent 0000:00:01.0.

0:00:00:05.487 cpu0:2097152)PCI: 1282: 0000:01:00.0: registering 10de:1c31 10de:131b

0:00:00:05.487 cpu0:2097152)PCI: 2234: 0000:01:00.0: Enabling device, Command register mask: 0x3

0:00:00:05.487 cpu0:2097152)PCI: 1282: 0000:01:00.1: registering 10de:10f1 10de:131b

0:00:00:05.487 cpu0:2097152)PCI: 2234: 0000:01:00.1: Enabling device, Command register mask: 0x2

2020-06-26T03:24:39.179Z cpu15:2097610)PCI: 814: 0000:01:00.0 to 3

2020-06-26T03:24:39.179Z cpu15:2097610)WARNING: PCI: 189: 0000:01:00.0: Bypassing non-ACS capable device in hierarchy

2020-06-26T03:24:39.180Z cpu15:2097610)PCI: 814: 0000:01:00.0 to 3

2020-06-26T03:24:39.180Z cpu15:2097610)WARNING: PCI: 189: 0000:01:00.0: Bypassing non-ACS capable device in hierarchy

2020-06-26T03:24:39.181Z cpu15:2097610)PCI: 814: 0000:01:00.1 to 3

2020-06-26T03:24:39.181Z cpu15:2097610)WARNING: PCI: 189: 0000:01:00.1: Bypassing non-ACS capable device in hierarchy

2020-06-26T03:24:42.610Z cpu2:2097622)WARNING: PCI: 189: 0000:01:00.0: Bypassing non-ACS capable device in hierarchy

2020-06-26T03:24:42.610Z cpu2:2097622)PCIPassthru: PCIPassthruAttachDev:222: Attached to device 0000:01:00.0

2020-06-26T03:24:42.611Z cpu2:2097622)WARNING: PCI: 189: 0000:01:00.1: Bypassing non-ACS capable device in hierarchy

2020-06-26T03:24:42.611Z cpu2:2097622)PCIPassthru: PCIPassthruAttachDev:222: Attached to device 0000:01:00.1

2020-06-26T03:26:50.601Z cpu15:2101532)PCI: 967: Skipping device reset on 0000:01:00.0 because PCIe link to the device is down.

2020-06-26T03:26:50.601Z cpu15:2101532)IOMMU: 2507: Device 0000:01:00.0 placed in new domain 0x43055a8e8900.

2020-06-26T03:26:50.601Z cpu15:2101532)PCI: 967: Skipping device reset on 0000:01:00.1 because PCIe link to the device is down.

2020-06-26T03:26:50.737Z cpu5:2101543)PCI: 967: Skipping device reset on 0000:01:00.0 because PCIe link to the device is down.

2020-06-26T03:26:50.738Z cpu5:2101543)PCI: 967: Skipping device reset on 0000:01:00.0 because PCIe link to the device is down.

2020-06-26T03:26:50.738Z cpu5:2101543)IOMMU: 2507: Device 0000:01:00.0 placed in new domain 0x43055a8e8900.

2020-06-26T03:26:50.738Z cpu5:2101543)PCI: 967: Skipping device reset on 0000:01:00.1 because PCIe link to the device is down.

2020-06-26T03:26:50.739Z cpu5:2101543)PCI: 967: Skipping device reset on 0000:01:00.1 because PCIe link to the device is down.

Hossy_923
Contributor

Some additional information.  I've made progress, but I'm puzzled as to what exactly is happening (or why).

So, I decided to start from a clean slate.

Windows Server 2019 VM

no VMware Tools

svga.present = "TRUE"

hypervisor.cpuid.v0 = "FALSE"

SMBIOS.reflectHost = "TRUE"

Installed Windows; added PCI pass-through for VGA and Audio device from Quadro P2200.

The interesting thing that I've found so far is that if I tell the host's BIOS (Intel NUC NUC9VXQNX) to use the internal graphics card, the P2200 doesn't work in the VM (Code 43).  BUT, if I set the host's BIOS to "Auto" for the display adapter, BIOS and the initial part of the ESXi boot process will work and then the video output will freeze (presumably because ESXi grabs the GPU for PCI pass-through).  When I boot the VM, the GPU works just fine.  The downside here is that I can no longer use Intel AMT to access the server remotely (it doesn't work at all without using the internal graphics) so I have no way to troubleshoot ESXi in the future via the console remotely.

Has anyone run into this?

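This is what I've observed in the vmkernel.log. Notice the lines for 0000:01:00.0 that differ between the two captures: "Device is disabled by the BIOS" and "Enabling device, Command register mask: 0x3" appear only with IGFX, and "PCI: 814: 0000:01:00.0 to 4" appears only with "Auto". Is ESXi not activating the GPU correctly?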

With the BIOS graphics set to "Auto":

[root@esxi:~] tail -n 10000 -f /var/log/vmkernel.log | grep '01:00.[01]'

0:00:00:05.469 cpu0:2097152)PCI: 488: 0000:01:00.0: PCIe v2 PCI Express Legacy Endpoint

0:00:00:05.469 cpu0:2097152)PCI: 248: 0000:01:00.0: Found support for extended capability 0x2 (Virtual Channel)

0:00:00:05.469 cpu0:2097152)PCI: 248: 0000:01:00.0: Found support for extended capability 0x18 (Latency Tolerance Reporting)

0:00:00:05.469 cpu0:2097152)PCI: 248: 0000:01:00.0: Found support for extended capability 0x4 (Power Budgeting)

0:00:00:05.469 cpu0:2097152)PCI: 248: 0000:01:00.0: Found support for extended capability 0x1 (Advanced Error Reporting)

0:00:00:05.469 cpu0:2097152)PCI: 248: 0000:01:00.0: Found support for extended capability 0xb (Vendor Specific)

0:00:00:05.469 cpu0:2097152)PCI: 248: 0000:01:00.0: Found support for extended capability 0x19 (Secondary PCI Express)

0:00:00:05.469 cpu0:2097152)PCI: 435: Found onboard instance 0x8101 from SMBIOS for 0000:01:00.0

0:00:00:05.469 cpu0:2097152)PCI: 2161: 0000:01:00.1: Device is disabled by the BIOS, Command register 0x0

0:00:00:05.469 cpu0:2097152)PCI: 488: 0000:01:00.1: PCIe v2 PCI Express Endpoint

0:00:00:05.469 cpu0:2097152)PCI: 248: 0000:01:00.1: Found support for extended capability 0x1 (Advanced Error Reporting)

0:00:00:05.469 cpu0:2097152)PCI: 423: 0000:01:00.1: PCIe v2 PCI Express Endpoint

0:00:00:05.469 cpu0:2097152)PCI: 1067: 0000:01:00.0: probing 10de:1c31 10de:131b

0:00:00:05.469 cpu0:2097152)PCI: 404: 0000:01:00.0: Adding to resource tracker under parent 0000:00:01.0.

0:00:00:05.469 cpu0:2097152)PCI: 1067: 0000:01:00.1: probing 10de:10f1 10de:131b

0:00:00:05.469 cpu0:2097152)PCI: 404: 0000:01:00.1: Adding to resource tracker under parent 0000:00:01.0.

0:00:00:05.494 cpu0:2097152)PCI: 1282: 0000:01:00.0: registering 10de:1c31 10de:131b

0:00:00:05.494 cpu0:2097152)PCI: 1282: 0000:01:00.1: registering 10de:10f1 10de:131b

0:00:00:05.494 cpu0:2097152)PCI: 2234: 0000:01:00.1: Enabling device, Command register mask: 0x2

0:00:00:05.509 cpu0:2097152)PCI: 814: 0000:01:00.0 to 4

2020-06-26T22:58:13.179Z cpu12:2097610)PCI: 814: 0000:01:00.0 to 3

2020-06-26T22:58:13.179Z cpu12:2097610)WARNING: PCI: 189: 0000:01:00.0: Bypassing non-ACS capable device in hierarchy

2020-06-26T22:58:13.180Z cpu12:2097610)PCI: 814: 0000:01:00.0 to 3

2020-06-26T22:58:13.180Z cpu12:2097610)WARNING: PCI: 189: 0000:01:00.0: Bypassing non-ACS capable device in hierarchy

2020-06-26T22:58:13.181Z cpu12:2097610)PCI: 814: 0000:01:00.1 to 3

2020-06-26T22:58:13.181Z cpu12:2097610)WARNING: PCI: 189: 0000:01:00.1: Bypassing non-ACS capable device in hierarchy

2020-06-26T22:58:16.311Z cpu8:2097622)WARNING: PCI: 189: 0000:01:00.0: Bypassing non-ACS capable device in hierarchy

2020-06-26T22:58:16.311Z cpu8:2097622)PCIPassthru: PCIPassthruAttachDev:222: Attached to device 0000:01:00.0

2020-06-26T22:58:16.311Z cpu8:2097622)WARNING: PCI: 189: 0000:01:00.1: Bypassing non-ACS capable device in hierarchy

2020-06-26T22:58:16.311Z cpu8:2097622)PCIPassthru: PCIPassthruAttachDev:222: Attached to device 0000:01:00.1

With the BIOS graphics set to "IGFX" (internal graphics):

[root@esxi:~] tail -n 10000 -f /var/log/vmkernel.log | grep '01:00.[01]'

0:00:00:05.455 cpu0:2097152)PCI: 2161: 0000:01:00.0: Device is disabled by the BIOS, Command register 0x0

0:00:00:05.455 cpu0:2097152)PCI: 488: 0000:01:00.0: PCIe v2 PCI Express Legacy Endpoint

0:00:00:05.455 cpu0:2097152)PCI: 248: 0000:01:00.0: Found support for extended capability 0x2 (Virtual Channel)

0:00:00:05.455 cpu0:2097152)PCI: 248: 0000:01:00.0: Found support for extended capability 0x18 (Latency Tolerance Reporting)

0:00:00:05.455 cpu0:2097152)PCI: 248: 0000:01:00.0: Found support for extended capability 0x4 (Power Budgeting)

0:00:00:05.455 cpu0:2097152)PCI: 248: 0000:01:00.0: Found support for extended capability 0x1 (Advanced Error Reporting)

0:00:00:05.455 cpu0:2097152)PCI: 248: 0000:01:00.0: Found support for extended capability 0xb (Vendor Specific)

0:00:00:05.455 cpu0:2097152)PCI: 248: 0000:01:00.0: Found support for extended capability 0x19 (Secondary PCI Express)

0:00:00:05.455 cpu0:2097152)PCI: 435: Found onboard instance 0x8101 from SMBIOS for 0000:01:00.0

0:00:00:05.455 cpu0:2097152)PCI: 2161: 0000:01:00.1: Device is disabled by the BIOS, Command register 0x0

0:00:00:05.455 cpu0:2097152)PCI: 488: 0000:01:00.1: PCIe v2 PCI Express Endpoint

0:00:00:05.455 cpu0:2097152)PCI: 248: 0000:01:00.1: Found support for extended capability 0x1 (Advanced Error Reporting)

0:00:00:05.455 cpu0:2097152)PCI: 423: 0000:01:00.1: PCIe v2 PCI Express Endpoint

0:00:00:05.455 cpu0:2097152)PCI: 1067: 0000:01:00.0: probing 10de:1c31 10de:131b

0:00:00:05.455 cpu0:2097152)PCI: 404: 0000:01:00.0: Adding to resource tracker under parent 0000:00:01.0.

0:00:00:05.455 cpu0:2097152)PCI: 1067: 0000:01:00.1: probing 10de:10f1 10de:131b

0:00:00:05.455 cpu0:2097152)PCI: 404: 0000:01:00.1: Adding to resource tracker under parent 0000:00:01.0.

0:00:00:05.477 cpu0:2097152)PCI: 1282: 0000:01:00.0: registering 10de:1c31 10de:131b

0:00:00:05.477 cpu0:2097152)PCI: 2234: 0000:01:00.0: Enabling device, Command register mask: 0x3

0:00:00:05.478 cpu0:2097152)PCI: 1282: 0000:01:00.1: registering 10de:10f1 10de:131b

0:00:00:05.478 cpu0:2097152)PCI: 2234: 0000:01:00.1: Enabling device, Command register mask: 0x2

2020-06-26T23:26:37.179Z cpu9:2097610)PCI: 814: 0000:01:00.0 to 3

2020-06-26T23:26:37.179Z cpu9:2097610)WARNING: PCI: 189: 0000:01:00.0: Bypassing non-ACS capable device in hierarchy

2020-06-26T23:26:37.180Z cpu9:2097610)PCI: 814: 0000:01:00.0 to 3

2020-06-26T23:26:37.180Z cpu9:2097610)WARNING: PCI: 189: 0000:01:00.0: Bypassing non-ACS capable device in hierarchy

2020-06-26T23:26:37.180Z cpu9:2097610)PCI: 814: 0000:01:00.1 to 3

2020-06-26T23:26:37.180Z cpu9:2097610)WARNING: PCI: 189: 0000:01:00.1: Bypassing non-ACS capable device in hierarchy

2020-06-26T23:26:40.606Z cpu4:2097622)WARNING: PCI: 189: 0000:01:00.0: Bypassing non-ACS capable device in hierarchy

2020-06-26T23:26:40.606Z cpu4:2097622)PCIPassthru: PCIPassthruAttachDev:222: Attached to device 0000:01:00.0

2020-06-26T23:26:40.607Z cpu4:2097622)WARNING: PCI: 189: 0000:01:00.1: Bypassing non-ACS capable device in hierarchy

2020-06-26T23:26:40.607Z cpu4:2097622)PCIPassthru: PCIPassthruAttachDev:222: Attached to device 0000:01:00.1
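Since the two captures differ only in a handful of lines, a small Python sketch like the one below can surface the differences by stripping the timestamp and CPU/world prefix before comparing. The prefix pattern is only fitted to the vmkernel.log lines quoted above, and the two excerpts are shortened to the 0000:01:00.0 lines that matter.

```python
import re

# Strips a leading "<timestamp> cpuN:<world id>)" prefix (pattern is an assumption
# based on the lines quoted in this post).
PREFIX_RE = re.compile(r"^\S+\s+cpu\d+:\d+\)")


def messages(log_text):
    """Return log messages with the timestamp and cpu prefix removed."""
    return [PREFIX_RE.sub("", line.strip(), count=1)
            for line in log_text.splitlines() if line.strip()]


def only_in_first(first, second):
    """Messages that appear in `first` but never in `second`, in order, deduplicated."""
    seen = set(messages(second))
    result = []
    for msg in messages(first):
        if msg not in seen and msg not in result:
            result.append(msg)
    return result


AUTO = """\
0:00:00:05.469 cpu0:2097152)PCI: 488: 0000:01:00.0: PCIe v2 PCI Express Legacy Endpoint
0:00:00:05.494 cpu0:2097152)PCI: 1282: 0000:01:00.0: registering 10de:1c31 10de:131b
0:00:00:05.509 cpu0:2097152)PCI: 814: 0000:01:00.0 to 4
2020-06-26T22:58:13.179Z cpu12:2097610)PCI: 814: 0000:01:00.0 to 3
"""

IGFX = """\
0:00:00:05.455 cpu0:2097152)PCI: 2161: 0000:01:00.0: Device is disabled by the BIOS, Command register 0x0
0:00:00:05.455 cpu0:2097152)PCI: 488: 0000:01:00.0: PCIe v2 PCI Express Legacy Endpoint
0:00:00:05.477 cpu0:2097152)PCI: 1282: 0000:01:00.0: registering 10de:1c31 10de:131b
0:00:00:05.477 cpu0:2097152)PCI: 2234: 0000:01:00.0: Enabling device, Command register mask: 0x3
2020-06-26T23:26:37.179Z cpu9:2097610)PCI: 814: 0000:01:00.0 to 3
"""

print("Only with Auto:", only_in_first(AUTO, IGFX))
print("Only with IGFX:", only_in_first(IGFX, AUTO))
```

This only shows which messages differ; it does not explain why the card behaves differently when the BIOS leaves it disabled at boot.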

WTP_TH
Contributor

Thanks HelloFelix, you gave me the lead to fix my problem!

My problem was that I could boot the VM (Server 2019, ESX 6.7U2 compatible) with a Quadro P2000 in passthrough, but when I rebooted the VM, the ESX host got a PSOD and restarted.

My hardware is an HP DL380p G8 with 2x Xeon 2670 and a PNY Quadro P2000 5GB, running ESX 6.7 U3 (HP customized).

The Quadro P2000 is set as the secondary GPU in my HP DL380 G8 server, so the internal graphics card is used for the ESX host, and iLO access and management work.

I read your post and saw that you edited /etc/vmware/passthrough.conf (on ESX 6.5, I think).

Your solution on 6.5 was to add the line "10de  2182  d3d0   false" under # NVIDIA.

My card has a different device ID (1c30 = Quadro P2000) because I use a different NVIDIA card than you.

I found mine with the lspci -nn command (it can also be found in the vSphere GUI, in the hardware section).

Then I wanted to add the line "10de  1c30  d3d0     false" to /etc/vmware/passthrough.conf, but this file doesn't exist in ESX 6.7 U3. In ESX 6.7 U3 the file is called /etc/vmware/passthru.map.

I added the line there under # NVIDIA ("10de ffff bridge false" was already there), directly beneath the existing line.

It looks like this:

# NVIDIA
10de  ffff  bridge   false
10de  1c30  d3d0     false

After that I rebooted the ESX 6.7 host, and once it was up again I started the VM with the Quadro P2000 in passthrough (the audio PCI device is passed through as well).

Then I connected to it over an RDP session and rebooted the VM. Now the VM reboots without causing a PSOD on ESX 6.7U3, and it came back up without any problems.

Thanks so much for the lead!

I hope you and/or others can use this information to solve their problem.

Kind regards,

Thijs

Hossy_923
Contributor

Two questions...

1. Have you added any non-default configuration to your VM/.vmx file?

2. Does the P2000 also have an audio controller like the P2200?  If so, does 1c30 represent the GPU or the audio controller?

WTP_TH
Contributor

To answer your questions in as much detail as possible, and with an update: only the VM reboot ever worked, and at the moment nothing works anymore.

Hardware: HP DL380p G8 - 2x Xeon E5-2670, BIOS P70 07/01/2015

Hypervisor: ESXi 6.7U3 Build 15160138 (custom HP)

VM: Server 2019 Standard, 4 cores, 8GB RAM, VMXNET 3 NIC, EFI bios mode without uefi secure boot enabled

I started out with these vmx settings:

  1. hypervisor.cpuid.v0
  2. pciPassthru.64bitMMIOSizeGB = "16" (I also experimented with the value 64, but this made no difference; I think that is because MMIO is not supported or used with this card and config)
  3. pciHole.start = "2048"

The VM would start, but when I shut down or rebooted the VM, the ESX host crashed and rebooted.

After I added the second line below (the d3d0 entry) to /etc/vmware/passthru.map and rebooted the ESX host, I could boot and reboot the VM without issues.

# NVIDIA
10de  ffff  bridge   false
10de  1c30  d3d0     false

But a shutdown of the VM still crashed the ESX host.

After that I updated the BIOS of my server from P70 07/01/2015 to P70 05/24/2019.

Since then I haven't even been able to start the VM without the whole ESX host crashing and restarting.

I tried several different vmx settings and different passthru.map settings, with no luck.

Tomorrow I'm starting from scratch with this lead: 6.7U1 vs 6.5U2 passthrough regression

I will report back when I get some good results.

Kind regards,

Thijs

m_anders
Contributor

Did you make any progress?

In my own experience, passthrough seemed to work OK on 6.5 with just the GPU's vendor/device ID and "d3d0 false" in the passthru.map file. On later versions (6.7+) I kept getting the dreaded error 43 after a guest reboot: it would work once after a host reboot, then the error kept coming back.

After wasting too much time on this, it seemed the consensus was to disable the GPU before a reboot and enable it afterwards.

Post #245 here outlines the general idea https://forums.servethehome.com/index.php?threads/troubleshooting-gpu-passthrough-esxi-6-5.12631/pos...

I think the way PCIe resets work changed in some fundamental way in 6.7 and newer, and/or something in the devices themselves changed with the newer NVIDIA cards.

aseniuk81
Contributor

I also have a DL380 G8, and after I upgraded my home lab to 6.7 I never thought it would be this hard to get my Quadro P2000 working in passthrough. I found this post after months of searching for an answer. I added the second line to my passthru.map and boom, it started working like it should.

DL380 G8, running ESXi 6.7 U3
Edit /etc/vmware/passthru.map and add both lines. At first I only had the top line, and with just that the host server would reboot.

# NVIDIA
10de ffff bridge false
10de 1c30 d3d0 false

On the VM I am still using the advanced config options
pciPassthru0.msiEnabled = FALSE
pciHole.start = 2048
SMBIOS.reflectHost = TRUE
hypervisor.cpuid.v0 = FALSE
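If you apply these options by editing the .vmx file directly rather than through the advanced configuration dialog, a small helper can keep the edit repeatable. Below is a minimal Python sketch that updates or appends key = "value" lines in .vmx text. The BASE_VMX content is a hypothetical stand-in, and the sketch works on a string only; it assumes the VM is powered off and that you back up the real file first.

```python
import re

KEY_RE = re.compile(r'^\s*([^\s=]+)\s*=')


def upsert_vmx(vmx_text, settings):
    """Replace existing keys in place and append the ones that are missing."""
    pending = dict(settings)
    out = []
    for line in vmx_text.splitlines():
        m = KEY_RE.match(line)
        if m and m.group(1) in pending:
            key = m.group(1)
            out.append(f'{key} = "{pending.pop(key)}"')
        else:
            out.append(line)
    # Anything not found above is appended at the end.
    out.extend(f'{key} = "{value}"' for key, value in pending.items())
    return "\n".join(out) + "\n"


# Hypothetical starting content, for illustration only.
BASE_VMX = 'displayName = "gpu-vm"\nhypervisor.cpuid.v0 = "TRUE"\n'

# The options listed in this post.
GPU_SETTINGS = {
    "pciPassthru0.msiEnabled": "FALSE",
    "pciHole.start": "2048",
    "SMBIOS.reflectHost": "TRUE",
    "hypervisor.cpuid.v0": "FALSE",
}

print(upsert_vmx(BASE_VMX, GPU_SETTINGS), end="")
```

Running it twice on its own output leaves the text unchanged, which makes it easy to toggle one option at a time while testing which ones actually matter.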

I am going to play around to see if these settings have anything to do with making it run.

Thank you for posting your solution.

vHojan
Contributor

Hi Hossy_923,

I'm having the exact same issues with the NUC 9. Did you find a solution yet (other than disabling access to the console)?

Thanks!

AJ_SAJJAN
Contributor

Hi Hossy

I have the same NUC and experience the same issue.

I think you are on to something here.

I have my BIOS set to IGFX so I can use the remote Intel AMT session.

I wonder if anyone has found a fix for this issue.

jrivam
Contributor

Using an ESXi 6.5-compatible VM didn't work for me, but disabling the GPU before restarting the virtual machine did. What I ended up using is a PowerShell script, registered in the Local Group Policy Editor, that disables my GPU before shutdown and enables it at startup. Some references:

Assign Computer Shutdown Scripts | Microsoft Learn

Automatically Disable or Enable your GPU (or any other device) when your laptop power state changes ... 
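As a rough illustration of the idea (not my exact script), the Python sketch below builds the two PowerShell one-liners that a shutdown and a startup script could run. It relies on the Get-PnpDevice, Disable-PnpDevice, and Enable-PnpDevice cmdlets from the PnpDevice module; the friendly-name pattern "*Quadro*" is a hypothetical placeholder that you would replace with whatever matches your card.

```python
def pnp_command(action, name_pattern):
    """Build a PowerShell one-liner that enables or disables matching display devices."""
    if action not in ("Disable", "Enable"):
        raise ValueError("action must be 'Disable' or 'Enable'")
    # Single quotes inside a single-quoted PowerShell string are doubled.
    escaped = name_pattern.replace("'", "''")
    return (f"Get-PnpDevice -Class Display -FriendlyName '{escaped}' | "
            f"{action}-PnpDevice -Confirm:$false")


# Hypothetical pattern; adjust it to the name shown in Device Manager.
GPU_PATTERN = "*Quadro*"

print("Shutdown script:", pnp_command("Disable", GPU_PATTERN))
print("Startup script: ", pnp_command("Enable", GPU_PATTERN))
```

The printed commands need to run elevated inside the guest. Restricting the match to the Display class is meant to avoid touching the GPU's audio function, which is an assumption about what needs to be disabled.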
