VMware Horizon Community
danwilnot123
Contributor

VMs not powering on with NVIDIA A40 GRID

I've tried everything I can think of to get the vGPU VMs to power on, but I keep getting this error: Could not initialize plugin '/usr/lib64/vmware/plugin/libnvidia-vgx.so' for vGPU 'nvidia_a40-8q'

1. Disabled ECC memory

2. Enabled SR-IOV in the BIOS on the R750 host

3. Set the graphics mode to Shared Direct in vSphere


Accepted Solutions
EmilianW1
Enthusiast

@danwilnot123 ,

Please check the ESXi version and the NVIDIA driver (VIB) version. The latest NVIDIA drivers require ESXi 7.0 U1 at minimum.

This might help: https://docs.nvidia.com/grid/latest/grid-software-quick-start-guide/index.html

9 Replies
fabio1975
Commander

Hi,

Can you post the output of the nvidia-smi -q command?

Fabio

Visit vmvirtual.blog
If you're satisfied give me a kudos

danwilnot123
Contributor

==============NVSMI LOG==============

Timestamp : Tue Feb 22 18:34:31 2022
Driver Version : 470.103.02
CUDA Version : Not Found

Attached GPUs : 2
GPU 00000000:17:00.0
Product Name : NVIDIA A40
Product Brand : NVIDIA
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1322221067050
GPU UUID : GPU-0be6efaf-fd2d-3fee-0b15-539707c2af4f
Minor Number : 0
VBIOS Version : 94.02.5C.00.03
MultiGPU Board : No
Board ID : 0x1700
GPU Part Number : 900-2G133-0100-030
Module ID : 0
Inforom Version
Image Version : G133.0200.00.05
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : Host VGPU
Host VGPU Mode : SR-IOV
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x17
Device : 0x00
Domain : 0x0000
Device Id : 0x223510DE
Bus Id : 00000000:17:00.0
Sub System Id : 0x145A10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 0 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 48687 MiB
Used : 0 MiB
Free : 48687 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 192 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 30 C
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 88 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 31.25 W
Power Limit : 300.00 W
Default Power Limit : 300.00 W
Enforced Power Limit : 300.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 555 MHz
Applications Clocks
Graphics : 1740 MHz
Memory : 7251 MHz
Default Applications Clocks
Graphics : 1740 MHz
Memory : 7251 MHz
Max Clocks
Graphics : 1740 MHz
SM : 1740 MHz
Memory : 7251 MHz
Video : 1530 MHz
Max Customer Boost Clocks
Graphics : 1740 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 712.500 mV
Processes : None

GPU 00000000:CA:00.0
Product Name : NVIDIA A40
Product Brand : NVIDIA
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1322221062802
GPU UUID : GPU-bb45f5bc-a9a2-4727-6593-e676a1b5b5e4
Minor Number : 1
VBIOS Version : 94.02.5C.00.03
MultiGPU Board : No
Board ID : 0xca00
GPU Part Number : 900-2G133-0100-030
Module ID : 0
Inforom Version
Image Version : G133.0200.00.05
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : Host VGPU
Host VGPU Mode : SR-IOV
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0xCA
Device : 0x00
Domain : 0x0000
Device Id : 0x223510DE
Bus Id : 00000000:CA:00.0
Sub System Id : 0x145A10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 0 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 48687 MiB
Used : 0 MiB
Free : 48687 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 192 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 31 C
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 88 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 32.84 W
Power Limit : 300.00 W
Default Power Limit : 300.00 W
Enforced Power Limit : 300.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 555 MHz
Applications Clocks
Graphics : 1740 MHz
Memory : 7251 MHz
Default Applications Clocks
Graphics : 1740 MHz
Memory : 7251 MHz
Max Clocks
Graphics : 1740 MHz
SM : 1740 MHz
Memory : 7251 MHz
Video : 1530 MHz
Max Customer Boost Clocks
Graphics : 1740 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 712.500 mV
Processes : None
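
For anyone comparing a dump like this against the steps already tried (ECC disabled, SR-IOV enabled), here is a minimal Python sketch that pulls fields out of the flattened text. The function name `parse_flat_nvsmi` and the "Section/Key" naming are my own assumptions, not part of nvidia-smi. Because the paste has lost its indentation, every key is prefixed with the most recent header line, so only keys that sit directly under their header (such as the ECC and virtualization modes) should be trusted.

```python
# Sketch only: parses nvidia-smi -q text whose indentation has been stripped,
# as in the paste above. Names here are hypothetical, not an NVIDIA API.

def parse_flat_nvsmi(text):
    """Return {gpu_bus_id: {"Section/Key": value}} from flattened output."""
    gpus = {}
    current = None  # dict for the GPU currently being read
    section = ""    # most recent header line (a line without " : ")
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("="):
            continue  # skip blanks and the NVSMI LOG banner
        if " : " in line:
            key, value = (part.strip() for part in line.split(" : ", 1))
            if current is not None:
                name = f"{section}/{key}" if section else key
                current[name] = value
        elif line.startswith("GPU ") and line.count(":") >= 2:
            # A line such as "GPU 00000000:17:00.0" starts a new GPU block.
            current = gpus.setdefault(line[4:], {})
            section = ""
        else:
            section = line  # header such as "Ecc Mode"
    return gpus


# Short excerpt copied from the output above.
SAMPLE = """\
Driver Version : 470.103.02
Attached GPUs : 2
GPU 00000000:17:00.0
Product Name : NVIDIA A40
GPU Virtualization Mode
Virtualization Mode : Host VGPU
Host VGPU Mode : SR-IOV
Ecc Mode
Current : Disabled
Pending : Disabled
"""

for bus_id, fields in parse_flat_nvsmi(SAMPLE).items():
    print(bus_id,
          fields["Ecc Mode/Current"],
          fields["GPU Virtualization Mode/Host VGPU Mode"])
```

On the excerpt this reports ECC as Disabled and the host vGPU mode as SR-IOV for the first card, which matches steps 1 and 2 in the original post.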

danwilnot123
Contributor

I tried. [This reply was marked as spam and has been removed.]

fabio1975
Commander

@danwilnot123 

How did you post the command output?
Try adding a screenshot instead.

Fabio

danwilnot123
Contributor

Thanks. I'm using the NVIDIA vGPU 13.2 drivers, and we are still on ESXi 6.7.
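
The version requirement from the accepted solution can be checked mechanically. The following is a rough Python sketch, assuming the host version is available as a dotted string such as the one printed by `vmware -v`; the sample strings, the placeholder build numbers, and the helper names `parse_esxi_version` and `meets_minimum` are hypothetical. The minimum of 7.0 U1 is taken from the reply in this thread and should be confirmed against the NVIDIA release notes for the driver in use. The sketch treats the third component as the update level, which holds for 7.0 releases (7.0.1 is U1) but is an assumption elsewhere.

```python
import re

# Assumption: minimum from the accepted solution in this thread (ESXi 7.0 U1).
MIN_ESXI = (7, 0, 1)


def parse_esxi_version(text):
    """Extract (major, minor, update) from a string like 'VMware ESXi 6.7.0 build-...'."""
    match = re.search(r"ESXi\s+(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no ESXi version found in {text!r}")
    major, minor, update = match.groups()
    return (int(major), int(minor), int(update or 0))


def meets_minimum(version, minimum=MIN_ESXI):
    """Tuple comparison: (6, 7, 0) sorts below (7, 0, 1)."""
    return version >= minimum


# Hypothetical sample string; the build number is a placeholder.
host = parse_esxi_version("VMware ESXi 6.7.0 build-00000000")
print(host, meets_minimum(host))  # (6, 7, 0) False
```

Only the dotted form is handled here; a string such as "VMware ESXi 7.0 Update 1" would need extra parsing.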

danwilnot123
Contributor

Here's some of the output

fabio1975
Commander

Hi,

If you haven't solved it yet, these two links might help:

https://forums.developer.nvidia.com/t/a6000-in-vgpu-13-0-and-esxi-7-0u2-failed-to-start-vgpu-instanc...

https://support.lenovo.com/my/en/solutions/ht512536-changing-display-modes-on-ampere-series-gpus-len...

I would check with the tool they mention (Display Mode Selector).

Was the A40 used elsewhere before, for example on a Windows system?

Fabio
