11 Replies Latest reply on Aug 16, 2019 11:05 AM by LukaszDziwisz

    Instant clones slow provisioning

    LukaszDziwisz Enthusiast

      Hello,

       

      I was wondering whether anyone has come across a similar issue and could offer advice on how to troubleshoot and resolve it. We are using Windows 1809 LTSC, Horizon 7.7, AppVolumes 2.16, UEM 9.7, and vCenter 6.7 U2, so pretty much all the newest versions. Our storage is all-flash Pure, yet an instant clone takes roughly 5 minutes to become available: about 1 min 30 sec in the provisioning state, then 3 min 30 sec in the customization state.

       

      I opened a ticket with VMware about it, and here is what we are seeing in the log:

       

      2019-05-22 16:07:14 (3444) info CSvmgaService - (svgaservice.cpp, 397 ) shutting down the service

       

      2019-05-22 16:11:54 (4268) debug Svmga:: setservicestatus - (svmga.cpp , 522) service stats set to 2

       

      As you can see, there is a gap of more than 4 minutes (16:07:14 to 16:11:54), which would account for the long wait before the machine becomes available.
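
      For anyone who wants to check their own agent log for the same kind of pause, here is a minimal Python sketch. It assumes every line starts with a `YYYY-MM-DD HH:MM:SS` timestamp like the two lines quoted above; the function names and the 60-second threshold are illustrative choices, not part of any VMware tool.

```python
from datetime import datetime, timedelta

# The two log lines quoted in the post above.
LOG_LINES = [
    "2019-05-22 16:07:14 (3444) info CSvmgaService - (svgaservice.cpp, 397 ) shutting down the service",
    "2019-05-22 16:11:54 (4268) debug Svmga:: setservicestatus - (svmga.cpp , 522) service stats set to 2",
]


def parse_timestamp(line: str) -> datetime:
    """Parse the leading 'YYYY-MM-DD HH:MM:SS' stamp (first 19 characters)."""
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")


def find_gaps(lines, threshold=timedelta(seconds=60)):
    """Return (previous_line, next_line, gap) for consecutive lines
    whose timestamps are further apart than the threshold."""
    stamps = [(parse_timestamp(line), line) for line in lines]
    gaps = []
    for (t0, l0), (t1, l1) in zip(stamps, stamps[1:]):
        if t1 - t0 > threshold:
            gaps.append((l0, l1, t1 - t0))
    return gaps


gaps = find_gaps(LOG_LINES)
for before, after, gap in gaps:
    print(f"{gap} between:\n  {before}\n  {after}")
```

      Run against a full log, this would list every silent stretch longer than the threshold, which makes it easier to see whether the delay always sits between the service shutdown and the next status change.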

      The next update from the VMware engineer is as follows:

       

      Furthermore, I discussed the case with my technical lead, and he mentioned that some script or program on the VM itself could be causing it to reboot. Because we are using ClonePrep, the machine should not reboot; in your case, however, it is doing exactly the opposite.

       

      The process C:\Program Files (x86)\Common Files\VMware\View Composer Guest Agent\vmware-svi-ga.exe (TESTIMRW5) has initiated the shutdown of computer TESTIMRW5 on behalf of user NT AUTHORITY\SYSTEM for the following reason: Operating System: Recovery (Planned)

       

      Reason Code: 0x80020002
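
      As a side note, the reason code can be split into its fields with a few bit masks. This is a sketch that assumes the standard Windows shutdown reason layout (planned flag in the top bit, major reason in bits 16 to 23, minor reason in the low 16 bits); the constant and function names here are illustrative.

```python
# Bit layout assumed from the Windows SHTDN_REASON_* constants.
FLAG_PLANNED = 0x80000000
FLAG_USER_DEFINED = 0x40000000
MAJOR_MASK = 0x00FF0000
MINOR_MASK = 0x0000FFFF


def decode_reason(code: int) -> dict:
    """Split a shutdown reason code into its flag, major, and minor fields."""
    return {
        "planned": bool(code & FLAG_PLANNED),
        "user_defined": bool(code & FLAG_USER_DEFINED),
        "major": (code & MAJOR_MASK) >> 16,
        "minor": code & MINOR_MASK,
    }


decoded = decode_reason(0x80020002)
print(decoded)
```

      For 0x80020002 this gives a planned shutdown with major reason 2 and minor reason 2, which lines up with the "Operating System: Recovery (Planned)" text in the event above.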

       

      So the next suggestion was to build a brand new image with plain Windows, install only VMware Tools and the Horizon Agent, and provision a pool. I did that and, bam, the instant clone takes only 30 seconds to become available.

       

      I don't really want to redo my images, as it will be very time consuming and I'm pretty sure it will break other things, so I was wondering if anyone has any idea how to resolve this?

        • 1. Re: Instant clones slow provisioning
          BenFB Expert

          As painful as it is, building a new image is the best way to identify the problematic software. We've had to do this on a few occasions.

           

          This would be a great time to at a minimum document and slim down the image. If possible look at using automation so you can produce a new image on demand within minutes.

          • 2. Re: Instant clones slow provisioning
            LukaszDziwisz Enthusiast

            Yeah, I understand, but unfortunately packaging applications through AppVolumes is not as reliable as putting them on the image, so we are trying to pack as much as we can onto the image and then just supplement it with AppStacks.

             

            Just curious, what do you use for automation and how would this work? Do you mean creating a template at some point in time and then adding applications?

             

            Also, back to the original issue: I was able to pinpoint the problem, and it appears to be adding a Shared PCI Device to the image for the NVIDIA GRID. No agent seems to break anything until I get to the point of adding the "Shared PCI Device" for the NVIDIA GRID. As soon as I do that, provisioning jumps from 30 seconds to 5 minutes. I removed the Shared PCI Device in the Master Image settings but left the drivers installed on the image, and I'm back to 30 seconds again. I'm not sure whether this is NVIDIA specific, a problem with adding a Shared PCI Device, or a problem with how Horizon deals with Shared PCI Devices in general.

            • 3. Re: Instant clones slow provisioning
              MrCheesecake Novice

              Has this always happened, or is it a recent issue with a pool that's been operating for a while?

               

              Do you have Windows Update disabled on your base image?  I'm wondering if there's a "rogue" update being pulled down during the provisioning process which could cause the reboot.

               

              Along the same lines, I assume you assign your App Volumes to users rather than machines?  If you're assigning them to machines, could one of the apps be kicking off a reboot?

              • 4. Re: Instant clones slow provisioning
                BenFB Expert

                While AppVolumes/FlexApp/etc. are options, I was thinking about automation with something like Ansible from Red Hat. You can get to the point where you deploy a new VM every time using the latest Microsoft WIM and script all of the application installations. This should give you a reproducible parent VM that can be created in under an hour.

                • 5. Re: Instant clones slow provisioning
                  cdickerson75 Novice

                  We are seeing the exact same problem. It appears to be an issue with Windows 10 1809 and NVIDIA with instant clones. If we switch to 1709, there is no issue. We opened a ticket with NVIDIA, but they aren't responding.

                   

                  -Craig

                  • 6. Re: Instant clones slow provisioning
                    LukaszDziwisz Enthusiast

                    We also opened a ticket with NVIDIA, and so far they have told me to upgrade to version 425.31 (GRID 8), as this is apparently the first version that supports Windows 1809. I will give it a try this weekend and post the results. We also have a ticket open with VMware to see whether it's a Shared PCI Device issue in general.

                    • 7. Re: Instant clones slow provisioning
                      LukaszDziwisz Enthusiast

                      It looks like it has been going on for a while. At first we had issues with VMFS6 and snapshots, which are apparently going to be fixed in future releases of vSphere; then we ran into other problems. Now we are finally getting to a stable state, and this seems to be the biggest remaining issue. I tried a brand new Master Image with no updates, customizations, etc., and it happens there as well. As for AppVolumes, we attach per user, not per machine. Updates and Module Installer are disabled as well.

                      • 8. Re: Instant clones slow provisioning
                        cdickerson75 Novice

                        Any update on your case with Nvidia?  We can't get them to do anything but "we will talk about it with VMware in our next weekly call"!

                         

                        -Craig

                        • 9. Re: Instant clones slow provisioning
                          LukaszDziwisz Enthusiast

                          Craig,

                           

                          After further testing we can confirm that the issue is caused by adding a Shared PCI Device, and it doesn't matter whether it's NVIDIA or not. To validate this, I removed the NVIDIA driver from my image but left the Shared PCI Device added in the VM settings. Provisioning was still slow. For the next test we left the NVIDIA drivers installed but removed the Shared PCI Device from the image VM, took a snapshot, and provisioned a pool, and, bam, it took barely 25 seconds for each clone to become available.

                           

                          I have a ticket open with VMware on this, and they have just created a separate Engineering ticket for the issue as well. From what I have been told so far, the issue appears to happen only with instant clones.

                           

                          I'll post more updates once I have any.

                          • 10. Re: Instant clones slow provisioning
                            Lackero Lurker

                            We are facing the same problem. Is there already a solution?

                            • 11. Re: Instant clones slow provisioning
                              LukaszDziwisz Enthusiast

                              Still working with VMware and NVIDIA on the problem. NVIDIA is having a weekly meeting with VMware on the issue. Here is what NVIDIA says about it:

                               

                              Per all the meetings between NVIDIA and VMware, the issue has nothing to do with a Shared PCI Device. VMware has determined it's a timing issue with ESXi 6.7 and Windows (1809).

                              They have already reproduced the issue using an M60 and originally wanted to verify with an M10, which we at NVIDIA feel should make no difference. They have also confirmed it is not reproducible with 6.7 U3, so they believe this will fix the issue with the M10 as well.

                              One thing I want to note is that I have not been able to confirm this information with VMware. I asked my case engineer to confirm it, but he is out and will be back next week. I'm crossing my fingers, as the slow provisioning is forcing us to keep many more machines built in the pools so that they can keep up as people log in in the morning.