32 Replies Latest reply on Nov 28, 2016 2:23 PM by Fpuma

    Host stability Issues with nvidia drivers

    respodu Novice

  We have been having issues with the hosts where the NVIDIA cards are installed: the hostd process crashes and disconnects the host from vCenter until the process restarts. We have also had two servers hit the purple screen of death, and both times it seems to have been related to NVIDIA processes.



  We have two clusters: one has four hosts with the NVIDIA vSGA driver, and the other has three hosts with the vGPU driver.


  On the vSGA cluster we set the pool 3D renderer to Automatic, so that if hardware GPU resources are not available it can fall back to software. But it seems that if the host gets under load (65 to 75% CPU utilization) it becomes unstable to the point where processes like hostd start crashing, which makes the host unavailable in vCenter... then everything is in chaos mode, since View starts complaining and trying to perform operations on the disconnected host.
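  (If anyone needs it: hostd can also be restarted manually from the ESXi shell instead of waiting for it to come back on its own. These are the standard ESXi management agent commands, and only a temporary workaround, not a fix.)

  # restart the management agents on the affected host (ESXi shell / SSH)
  /etc/init.d/hostd restart
  /etc/init.d/vpxa restart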


       

  This is happening with both vGPU and vSGA on vSphere 6.0. The only way to alleviate the issue so far has been to change the 3D rendering setting on some of the vSGA pools from Automatic to Disabled, and to stop using vGPU on the vGPU-enabled cluster.

       

  So far, VMware tells us that every crash is related to an NVIDIA process causing errors.


  But getting support for this issue is an uphill battle, since it seems not many techs at VMware are familiar with NVIDIA.

        • 1. Re: Host stability Issues with nvidia drivers
          junew Novice

          Update: customer has shared server logs with Support Team, who is investigating.

          • 2. Re: Host stability Issues with nvidia drivers
            respodu Novice

            The gpuvm command is not reporting the GPU usage, but nvidia-smi is reporting VMs using GPU processes... The pools are set to 3D Renderer: Automatic.

             

            gpuvm

            Xserver unix:0, PCI ID 0:6:0:0, vSGA mode, GPU maximum memory 4173824KB

                    GPU memory left 4173824KB.

            Xserver unix:1, PCI ID 0:7:0:0, vSGA mode, GPU maximum memory 4173824KB

                    GPU memory left 4173824KB.

            Xserver unix:2, PCI ID 0:8:0:0, vSGA mode, GPU maximum memory 4173824KB

                    GPU memory left 4173824KB.

            Xserver unix:3, PCI ID 0:9:0:0, vSGA mode, GPU maximum memory 4173824KB

                    GPU memory left 4173824KB.

            Xserver unix:4, PCI ID 0:68:0:0, vSGA mode, GPU maximum memory 4173824KB

                    GPU memory left 4173824KB.

            Xserver unix:5, PCI ID 0:69:0:0, vSGA mode, GPU maximum memory 4173824KB

                    GPU memory left 4173824KB.

            Xserver unix:6, PCI ID 0:70:0:0, vSGA mode, GPU maximum memory 4173824KB

                    GPU memory left 4173824KB.

            Xserver unix:7, PCI ID 0:71:0:0, vSGA mode, GPU maximum memory 4173824KB

                    GPU memory left 4173824KB.

             

            +------------------------------------------------------+

            | NVIDIA-SMI 346.69     Driver Version: 346.69         |

            |-------------------------------+----------------------+----------------------+

            | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |

            | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |

            |===============================+======================+======================|

            |   0  GRID K1             Off  | 0000:06:00.0     Off |                  N/A |

            | N/A   33C    P8     8W /  31W |    123MiB /  4095MiB |      0%      Default |

            +-------------------------------+----------------------+----------------------+

            |   1  GRID K1             Off  | 0000:07:00.0     Off |                  N/A |

            | N/A   31C    P8     8W /  31W |     44MiB /  4095MiB |      0%      Default |

            +-------------------------------+----------------------+----------------------+

            |   2  GRID K1             Off  | 0000:08:00.0     Off |                  N/A |

            | N/A   27C    P8     8W /  31W |     35MiB /  4095MiB |      0%      Default |

            +-------------------------------+----------------------+----------------------+

            |   3  GRID K1             Off  | 0000:09:00.0     Off |                  N/A |

            | N/A   28C    P8     8W /  31W |     18MiB /  4095MiB |      0%      Default |

            +-------------------------------+----------------------+----------------------+

            |   4  GRID K1             Off  | 0000:44:00.0     Off |                  N/A |

            | N/A   33C    P8     8W /  31W |     18MiB /  4095MiB |      0%      Default |

            +-------------------------------+----------------------+----------------------+

            |   5  GRID K1             Off  | 0000:45:00.0     Off |                  N/A |

            | N/A   33C    P8     8W /  31W |     18MiB /  4095MiB |      0%      Default |

            +-------------------------------+----------------------+----------------------+

            |   6  GRID K1             Off  | 0000:46:00.0     Off |                  N/A |

            | N/A   27C    P8     8W /  31W |     18MiB /  4095MiB |      0%      Default |

            +-------------------------------+----------------------+----------------------+

            |   7  GRID K1             Off  | 0000:47:00.0     Off |                  N/A |

            | N/A   29C    P8     8W /  31W |     18MiB /  4095MiB |      0%      Default |

            +-------------------------------+----------------------+----------------------+

             

             

            +-----------------------------------------------------------------------------+

            | Processes:                                                       GPU Memory |

            |  GPU       PID  Type  Process name                               Usage      |

            |=============================================================================|

            |    0     35541    G                                                    6MiB |

            |    0     39491    G   EDUCReVM16                                       3MiB |

            |    0     39507    G   EDUCGASWVM-46                                    3MiB |

            |    0     40332    G   MSVEVMv1-12                                      3MiB |

            |    0     41946    G   EDUCGASWVM-31                                    3MiB |

            |    0     42448    G   LCVMv152                                         3MiB |

            |    0     42464    G   LCVMv145                                         3MiB |

            |    0     45548    G   PC224VMv1-1                                      3MiB |

            |    0     49299    G   MSVEVMv1-1                                       3MiB |

            |    0     62621    G   EDUCReVM29                                       3MiB |

            |    0     74852    G   BALKVMv1-12                                      3MiB |

            |    0     88087    G   BALKVMv1-3                                       3MiB |

            |    0     88596    G   EDUCGASWVM-11                                    3MiB |

            |    0     92832    G   LCVMv113                                         3MiB |

            |    0     99304    G   LCVMv115                                         3MiB |

            |    0     99663    G   EDUCTL1VM15                                      3MiB |

            |    0    102135    G   LCVMv117                                         3MiB |

            |    0    102497    G   LCVMv118                                         3MiB |

            |    0    107508    G   LCVMv12                                          3MiB |

            |    0    110270    G   EDUCReVM9                                       32MiB |

            |    1     35701    G                                                    6MiB |

            |    1     39518    G   EDUCGASWVM-32                                    3MiB |

            |    1    107092    G   BALKVMv1-4                                       3MiB |

            |    1    108078    G   LCVMv120                                         3MiB |

            |    1    110270    G   EDUCReVM9                                        3MiB |

            |    1    116644    G   LCVMv111                                         3MiB |

            |    1    118652    G   LCVMv124                                         3MiB |

            |    1    146080    G   EDUCReVM23                                       3MiB |

            |    2     35851    G                                                    6MiB |

            |    2     39491    G   EDUCReVM16                                       9MiB |

            |    2    103524    G   EDUCTL1VM13                                      3MiB |

            |    2    103961    G   EDUCReVM6                                        3MiB |

            |    3     36003    G                                                    6MiB |

            |    4     36278    G                                                    6MiB |

            |    5     36443    G                                                    6MiB |

            |    6     36591    G                                                    6MiB |

            |    7     36745    G                                                    6MiB |

            +-----------------------------------------------------------------------------+

            • 3. Re: Host stability Issues with nvidia drivers
              iforbes Hot Shot

              Hi. Did you get any resolution on this? I had an ESXi host with the NVIDIA VIB installed crash with a purple screen of death recently.

              • 4. Re: Host stability Issues with nvidia drivers
                dexterous2000 Lurker
                VMware Employees

                I have been seeing this issue as well. Have you contacted VMware support?

                • 5. Re: Host stability Issues with nvidia drivers
                  Ray_handels Master
                  Community Warriors, vExpert

                  Wanted to chime in here.

                  We also have 2 K1 cards per server; of about 24 servers, 8 have already had a PSOD.

                  VMware tells us it's a hardware issue; HP tells us we need some sort of patch for it. The thing is that it crashes with a "recursive panic on same CPU", whatever that may be. But 8 out of 24 is too much to call it a hardware issue, to be honest.

                   

                  What ESX drivers are you all using?
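                  If you're not sure, the installed NVIDIA VIB and its version show up from the ESXi shell (the grep pattern just assumes the package name contains "NVIDIA"):

                  # list installed VIBs and filter for the NVIDIA host driver
                  esxcli software vib list | grep -i nvidia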

                  For the record, we have had a ticket open for this issue with both VMware and HP for over 4 months now.

                  • 6. Re: Host stability Issues with nvidia drivers
                    TomMar Novice

                    We've been having a very similar issue with our hosts. We have 2 K1 cards in each server, and we're running them on Dell R720s. We'll have a PSOD cascade through the cluster: if one host crashes, the next will crash a few minutes later, and so on.

                     

                    We just started testing 352.70 on one of our hosts and have disabled the K1s on the rest. They've tried to tell us that it's a hardware issue as well, but 4 identical hosts all having the same hardware failure at the same time suggests that's not the case.
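                    One way to take the cards out of play without removing the VIB is to stop the Xorg service the driver uses for vSGA (standard ESXi service commands; this is just one option, not necessarily how everyone here disabled theirs):

                    # check whether the Xorg service is running, then stop it
                    /etc/init.d/xorg status
                    /etc/init.d/xorg stop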

                     

                    We also had an issue with pipe errors in the VMware console for certain desktop VMs. We'd have to kill their VM process to get them to restart. Ever since disabling the NVIDIA drivers on our hosts, that issue has gone away as well. It hasn't come back yet on the one host where we are testing 352.70.

                     

                    One other thing: I notice that some VMs only use 1 MB of GPU memory, while others are much higher, in the 128-500 MB range. All of the VMs are using the same desktop resolution and running similar apps (Office, etc.). I don't know if that points to anything, but it just doesn't seem right.

                    • 7. Re: Host stability Issues with nvidia drivers
                      Fpuma Novice

                      Hello

                       

                      Same problem here.

                      With the pool setting on "Hardware".

                       

                      Five servers in a pool, all with one K1 card, a mix of R720s and R730s.

                      Sooner or later they all crashed, most of the time in pairs, but we have 5 ;-)

                      They all crashed after 40 days or more of uptime.

                       

                      On ESXi 5.5 we ran for almost two years without a problem; then we updated the servers' BIOS and other firmware and went to ESXi 6.0.0, build 3073146.

                       

                      On advice from Dell we installed the Dell ISO and the drivers below, but we still have the same problem:

                       

                      1.    VMware-VMvisor-Installer-6.0.0.update01-3380124.x86_64-Dell_Customized-A04.iso
                      2.    SNDK_bootbank_scsi-iomemory-vsl_3.2.11.1585-1OEM.600.0.0.2159203.vib
                      3.    NVIDIA-kepler-VMware_ESXi_6.0_Host_Driver_352.70-1OEM.600.0.0.2494585.vib
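                      For anyone repeating this, the VIBs install the usual way with the host in maintenance mode (standard esxcli commands; the datastore path below is just an example, adjust to wherever you copy the files):

                      # copy the VIB to a datastore first, then install and reboot the host
                      esxcli software vib install -v /vmfs/volumes/datastore1/NVIDIA-kepler-VMware_ESXi_6.0_Host_Driver_352.70-1OEM.600.0.0.2494585.vib
                      reboot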
                      • 8. Re: Host stability Issues with nvidia drivers
                        respodu Novice

                        For some reason, I wasn't receiving notifications on this.

                         

                        To update on my status: VMware made a hot patch for us that alleviated one of the issues we had with vSGA.

                         

                        We haven't seen the problem on vGPU since updating to ESXi 6.0 U1b and driver NVIDIA-vGPU-kepler-VMware_ESXi_6.0_Host_Driver_352.70-1OEM.600.0.0.2494585.

                         

                        The vSGA problem is supposed to be fixed in Update 2:

                         

                        VMware ESXi 6.0 Update 2 Release Notes

                        Hostd randomly stops responding on hosts with 3D acceleration

                        Hostd might randomly stop responding on ESXi hosts with 3D acceleration. Messages similar to the following are logged in the hostd log:

                         

                        Crash Report build=2809209
                        Signal 11 received, si_code 1, si_errno 0
                        Bad access at 34
                        eip 0xnnnnnnn esp 0xnnnnnnnn ebp 0xnnnnnnnn
                        eax 0xnnnnnnnn ebx 0xnnnnnnn ecx 0x47 edx 0x0 esi 0x30 edi 0xnnnnnnnn


                        Hostd might stop responding when the ESXi host with 3D hardware is put on maintenance mode and vMotion is triggered

                        On an ESXi host that has 3D hardware, when hostd detects a power state change of a VM, a function is called to check the VM state. In the case of vMotion, the source VM is powered off before being unregistered on the source host; the ManagedObjectNotFound exception is displayed and hostd might stop responding.



                        Update 1b fixed another issue that we had, which was causing PSODs:


                        VMkernel log file is flooded with warnings in the VM page fault path and might result in the host failing

                        Attempts to power on VMs with higher display resolution or a multiple monitor setup might cause several warning messages similar to the following to be written to the vmkernel.log file and might cause the host to fail due to excessive load of logging:

                         

                        XXXX-XX-XXTXX:XX:XX.XXXZ cpuXX:XXXXXXX)WARNING: VmMemPf: vm XXXXXXX: 654: COW copy failed: pgNum=0xXXXXX, mpn=0x3fffffffff

                        XXXX-XX-XXTXX:XX:XX.XXXZ cpuXX:XXXXXXX)WARNING: VmMemPf: vm XXXXXXX: 654: COW copy failed: pgNum=0xXXXXX, mpn=0x3fffffffff
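                        If you want to see whether a host is heading toward that failure, the warnings are easy to spot in the vmkernel log (standard ESXi log location; adjust if you forward logs to a remote syslog server):

                        # count the COW copy failure warnings on the host
                        grep -c "COW copy failed" /var/log/vmkernel.log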

                        • 9. Re: Host stability Issues with nvidia drivers
                          Fpuma Novice

                          Unfortunately, we still have the PSOD errors...

                          Lampje went first at 1:20 AM, when nobody is working; 1 hour and 40 minutes later, Goofy followed.

                          Usually they follow each other faster than that.

                          • 10. Re: Host stability Issues with nvidia drivers
                            krd Hot Shot
                            VMware Employees

                            NMI PSODs often indicate a hardware issue.  I suggest you open a problem report to help identify the failing component.  The only time I've seen a similar issue related to graphics, the cards were overheating.  You can issue nvidia-smi periodically to verify the temperatures are ok.
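                            For example, something like this samples every GPU's temperature once a minute (standard nvidia-smi query options; redirect the output to a file if you want a history):

                            # log GPU index, name, and temperature every 60 seconds
                            nvidia-smi --query-gpu=index,name,temperature.gpu --format=csv -l 60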

                             

                            Regards,

                            Kurt

                            VMware graphics team

                            • 11. Re: Host stability Issues with nvidia drivers
                              Fpuma Novice

                              Hardware issues on multiple hosts at the same time are rare.

                              The DRAC points at the K1 card.

                              • 12. Re: Host stability Issues with nvidia drivers
                                krd Hot Shot
                                VMware Employees

                                A PCI bus error from an NVIDIA device is quite rare. I suggest you monitor the temperature. Perhaps the cards are not receiving sufficient airflow?

                                • 13. Re: Host stability Issues with nvidia drivers
                                  Fpuma Novice

                                  Hello

                                   

                                  Above, someone else posted:

                                  We also have 2 K1 cards per server and have had about 24 servers of which 8 already had a PSOD.

                                  VMware tells us it's a hardware issue; HP tells us we need some sort of patch for it. The thing is that it crashes with a "recursive panic on same CPU", whatever that may be. But 8 out of 24 is too much to call it a hardware issue.

                                   

                                  We have 6 servers on different sites; the temperature is between 31 and 38 degrees C.

                                  The one with passthrough is working fine, but isn't used much.

                                   

                                  Dell and VMware are working on it; maybe it's an NVIDIA driver or a VMware bug (releasing memory?). It's on R720s and R730s.

                                  • 14. Re: Host stability Issues with nvidia drivers
                                    krd Hot Shot
                                    VMware Employees

                                    Please post the PSOD info, or say whether it is a Bus Error, NMI, or something else. Typically, if it is a Bus Error, the problem is with the hardware or the driver.
