VMware Horizon Community
LukaszDziwisz
Hot Shot

Instant clones slow provisioning

Hello,

I was wondering if anyone has come across a similar issue and might offer advice on how to troubleshoot and resolve it. We are using Windows 10 1809 LTSC, Horizon 7.7, App Volumes 2.16, UEM 9.7, and vCenter 6.7 U2, so pretty much all the newest versions. Our storage is all-flash Pure, yet an instant clone takes roughly 5 minutes to become available: about 1 min 30 sec in the provisioning state, then about 3 min 30 sec in the customization state.

I opened a ticket with VMware on it and here is what we are seeing from the log:

2019-05-22 16:07:14 (3444) info CSvmgaService - (svgaservice.cpp, 397 ) shutting down the service

2019-05-22 16:11:54 (4268) debug Svmga:: setservicestatus - (svmga.cpp , 522) service stats set to 2

As you can see, there is a gap of more than 4 minutes, which would account for the long time before the machine becomes available.

The next update from the VMware engineer is as follows:

Furthermore, I discussed the case with my technical lead, and he mentioned that it could be some script or some program on the VM itself that is causing it to reboot. Since you are using ClonePrep, the machine should not reboot; in your case, however, it is doing exactly the opposite.

The process C:\Program Files (x86)\Common Files\VMware\View Composer Guest Agent\vmware-svi-ga.exe (TESTIMRW5) has initiated the shutdown of computer TESTIMRW5 on behalf of user NT AUTHORITY\SYSTEM for the following reason: Operating System: Recovery (Planned)

Reason Code: 0x80020002

So the next suggestion was to build a brand-new image with plain Windows, install only VMware Tools and the Horizon Agent, and provision a pool. I did that and, bam, the instant clone takes only 30 seconds before it becomes available.

I don't really want to redo my images, as it would be very time-consuming and would almost certainly break other things, so I was wondering if anyone has an idea of how to resolve this?

30 Replies
BenFB
Virtuoso

As painful as it is, building a new image is the best way to identify the problematic software. We've had to do this on a few occasions.

This would be a great time to, at a minimum, document and slim down the image. If possible, look at using automation so you can produce a new image on demand within minutes.

LukaszDziwisz
Hot Shot

Yeah, I understand, but unfortunately packaging applications through App Volumes is not as reliable as putting them on the image, so we are trying to pack as much as we can into the image and then just supplement it with AppStacks.

Just curious, what do you use for automation, and how would this work? Do you mean something like creating a template at some point in time and then adding applications?

Also, back to the original issue: I was able to pinpoint the problem, and it appears to be adding a shared PCI device to the image for the NVIDIA GRID. No agent seems to break it until I get to the point of adding the shared PCI device; as soon as I do that, provisioning jumps from 30 seconds to 5 minutes. I removed the shared PCI device in the master image settings but left the drivers installed on the image, and I'm back to 30 seconds again. I'm not sure whether this is NVIDIA-specific, a problem with adding a shared PCI device, or a problem with how Horizon deals with shared PCI devices in general.
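If anyone wants to check a master image for this programmatically, below is a minimal pyVmomi (Python) sketch that looks for a shared PCI (vGPU) device on a VM. The vCenter address, credentials, and VM name are placeholders, and this is just an illustration of the idea, not an official tool:

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def find_vm(content, name):
    # Walk the whole inventory for a VM by name (simple, not optimized).
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    try:
        return next(vm for vm in view.view if vm.name == name)
    finally:
        view.Destroy()

# Lab use only: skips certificate verification.
ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.domain", user="administrator@vsphere.local",
                  pwd="changeme", sslContext=ctx)
try:
    vm = find_vm(si.RetrieveContent(), "Win10-1809-Master")  # placeholder name
    for dev in vm.config.hardware.device:
        if isinstance(dev, vim.vm.device.VirtualPCIPassthrough):
            # Shared PCI (vGPU) devices carry VmiopBackingInfo with a profile.
            if isinstance(dev.backing,
                          vim.vm.device.VirtualPCIPassthrough.VmiopBackingInfo):
                print("Shared PCI (vGPU) device found, profile:", dev.backing.vgpu)
            else:
                print("PCI passthrough device found (non-vGPU backing)")
finally:
    Disconnect(si)

If the vGPU line prints, the image is carrying the device that triggers the slow provisioning described above.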

MrCheesecake
Enthusiast

Has this always happened, or is it a recent issue with a pool that has been operating for a while?

Do you have Windows Update disabled on your base image?  I'm wondering if there's a "rogue" update being pulled down during the provisioning process which could cause the reboot.

Along the same lines, I assume you assign your App Volumes to users rather than machines?  If you're assigning them to machines, could one of the apps be kicking off a reboot?

BenFB
Virtuoso

While App Volumes/FlexApp/etc. are options, I was thinking about automation with something like Ansible from Red Hat. You can get to the point where you deploy a new VM every time from the latest Microsoft WIM and script all of the application installations. This should give you a reproducible parent VM that can be created in under an hour.
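To make "script all of the application installations" concrete, here is a rough Python sketch of a manifest-driven install step that could run inside the freshly deployed parent VM, whatever tool kicks it off. The installer paths and silent switches below are invented examples; check each vendor's documented silent-install options before using anything like this:

import subprocess
import sys

# Each entry: (description, installer path, silent-install arguments).
# Paths and switches are placeholders for illustration only.
MANIFEST = [
    ("VMware Tools",  r"\\fileserver\builds\vmtools\setup64.exe",
     ["/S", "/v", "/qn REBOOT=R"]),
    ("Horizon Agent", r"\\fileserver\builds\horizon\VMware-Horizon-Agent.exe",
     ["/s", "/v", "/qn"]),
]

def install_all():
    for name, path, args in MANIFEST:
        print(f"Installing {name} ...")
        result = subprocess.run([path, *args])
        # 3010 = ERROR_SUCCESS_REBOOT_REQUIRED, still a success for installers.
        if result.returncode not in (0, 3010):
            sys.exit(f"{name} failed with exit code {result.returncode}")

if __name__ == "__main__":
    install_all()

The point is that the whole parent VM becomes a build artifact you can throw away and regenerate, instead of a snowflake you're afraid to touch.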

cdickerson75
Enthusiast

We are seeing the exact same problem. It appears to be an issue with Windows 10 1809 and NVIDIA with instant clones. If we switch to 1709, there is no issue. We opened a ticket with NVIDIA, but they aren't responding.

-Craig

LukaszDziwisz
Hot Shot

We also opened a ticket with NVIDIA, and so far they have told me to upgrade to driver version 425.31 (GRID 8), as this is apparently the first version that supports Windows 1809. I will give it a try this weekend and post the results. We also have a ticket open with VMware to see whether it's a shared PCI device issue in general.

LukaszDziwisz
Hot Shot

It looks like it has been going on for a while. At first we had issues with VMFS6 and snapshots, which is apparently going to be fixed in future releases of vSphere; then we ran into other problems with it. Now we are finally getting into a stable state, and this seems to be the biggest remaining issue. I tried a brand-new master image with no updates, customizations, etc., and it happens as well. As for App Volumes, we attach per user, not per machine. Windows Update and the Windows Modules Installer service are disabled as well.
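Since a rogue Windows Update mid-provisioning was one theory above, here is a tiny Windows-only Python sketch to verify inside the image that those services really are disabled (Start=4 in the registry means Disabled; TrustedInstaller is the Windows Modules Installer service). The service list is just the two mentioned here:

import winreg

SERVICES = ["wuauserv", "TrustedInstaller"]  # Windows Update, Modules Installer

for svc in SERVICES:
    key_path = rf"SYSTEM\CurrentControlSet\Services\{svc}"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
        start, _ = winreg.QueryValueEx(key, "Start")
    state = {2: "Automatic", 3: "Manual", 4: "Disabled"}.get(start, start)
    print(f"{svc}: Start={start} ({state})")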

cdickerson75
Enthusiast

Any update on your case with NVIDIA? We can't get them to do anything but "we will talk about it with VMware in our next weekly call"!

-Craig

LukaszDziwisz
Hot Shot

Craig,

So after further testing we can confirm that the issue is adding the shared PCI device, and it doesn't matter whether NVIDIA is involved or not. To validate this, I removed the NVIDIA driver from my image but left the shared PCI device added in the VM settings: provisioning was still slow. Next, we left the NVIDIA drivers installed but removed the shared PCI device from the image VM, took a snapshot, and provisioned the pool, and bam, it took barely 25 seconds for each clone to become available.

I have a ticket open with VMware on it, and they have actually just created a separate engineering ticket for this issue as well. So far, from what I was told, the issue appears to happen only with instant clones.

I'll post more updates once I have any.

Lackero
Contributor

We are facing the same problem. Is there already a solution?

LukaszDziwisz
Hot Shot

Still working with VMware and NVIDIA on the problem. NVIDIA is having weekly meetings with VMware about the issue. Here is what NVIDIA says about it:

Per all the meetings between NVIDIA and VMware, the issue has nothing to do with a shared PCI device. VMware has determined it's a timing issue with ESXi 6.7 and Windows 1809.

They have already reproduced the issue using an M60 and originally wanted to verify with an M10, though we at NVIDIA feel there should be no difference. They have also confirmed that it is not reproducible with 6.7 U3, so they believe this will fix the issue with the M10 as well.

One thing I want to note is that I have not been able to confirm that information with VMware; I asked my case engineer to confirm it, but he is out and will be back next week. I'm crossing my fingers, as this issue is forcing us to build far more machines in our pools so they can keep up as people log in in the morning.

BBittner
Enthusiast

After a (really) short test, it seems that with 6.7 U3 it is faster than before.

Now our clone needs around 1 min 40 sec to be available.

LukaszDziwisz
Hot Shot

1:40 is pretty good. We are still at 2:40; it used to be 5 minutes before. What was your time before the upgrade?

LukaszDziwisz
Hot Shot

Here is the latest response from VMware:

After reviewing the provided logs and collaborating with the VMware R&D team, we have concluded the following:

From the agent logs, the VM starts customization at ~2019-09-19 13:18:36. It requires a shutdown to attach the vGPU device and is then powered back on. There is a gap of almost 2 minutes, as shown here:

2019-09-19 13:18:48,083 [3984] DEBUG Guest  - ["Guest.cpp", 1142] Clone Trust Verification Succeeded
2019-09-19 13:18:48,094 [3984] DEBUG Guest  - ["Guest.cpp", 2763] Set guestinfo.AgentCustomizationFlags to (3), CUSTOMIZATION_FLAG_SHUTDOWN_REQUIRED | CUSTOMIZATION_FLAG_SHUTDOWN_REQUESTED
2019-09-19 13:18:48,094 [3984] DEBUG Guest  - ["Guest.cpp", 1162] Set shutdown requested flag
2019-09-19 13:18:48,101 [3984] DEBUG Guest  - ["Guest.cpp", 1164] Total Clone Customization Time Pre-Shutdown: 17375ms
2019-09-19 13:18:48,101 [3984] DEBUG Guest  - ["Guest.cpp", 1168] Initiating required agent shutdown
........
2019-09-19 13:20:35,855 [4100] INFO  CSvmGaService  - ["svmGaService.cpp", 70] Initialized the Service <<<<--- 2mins gap
........
2019-09-19 13:20:40,180 [4100] DEBUG Guest  - ["Guest.cpp", 4877] Set guestinfo.clone.CustomizationState to success


From the server logs for the same VM: after the creation succeeds, we wait for the VM to shut down. That shutdown takes ~1.5 minutes and accounts for the bulk of the delay we see.

2019-09-19T13:18:18.085-06:00 DEBUG (1330-1C78) <WFE-02> [Task] CreateInstantCloneAction of Task=[CreateSingleCloneTask:7608c281-e606-46f7-857d-b27ba4d883c5:28620], requestId=0b1d3f21-5fd4-4d75-9728-8f5f22cc5f2a succeeded.
2019-09-19T13:18:18.087-06:00 DEBUG (1330-0260) <WFE-19> [Task] WaitThenPowerOffCloneAction of Task=[CreateSingleCloneTask:7608c281-e606-46f7-857d-b27ba4d883c5:28620], requestId=0b1d3f21-5fd4-4d75-9728-8f5f22cc5f2a is being executed.
.................
2019-09-19T13:19:52.282-06:00 DEBUG (1330-0260) <WFE-19> [VcSubsystem25] $Profile: @LEVEL = VC_API & @START_TIME = 2019-09-19T13:19:52.280-0600 & @END_TIME = 2019-09-19T13:19:52.282-0600 & @END2END_MS = 2 & @RUNNING_MS = 2 & @WAITING_MS = 0 & @DETAILS = {REQUEST_ID=0b1d3f21-5fd4-4d75-9728-8f5f22cc5f2a, API_NAME=retrieveProperties, API_VARIANT=RetrieveProperties, VC_URL=https://vcenter.domain:443/sdk}
2019-09-19T13:19:52.282-06:00 DEBUG (1330-0260) <WFE-19> [Action] $Profile: @LEVEL = ACTION & @START_TIME = 2019-09-19T13:18:18.087-0600 & @END_TIME = 2019-09-19T13:19:52.282-0600 & @END2END_MS = 94195 & @RUNNING_MS = 94195 & @WAITING_MS = 0 & @DETAILS = {RETRY_COUNT=0, REQUEST_ID=0b1d3f21-5fd4-4d75-9728-8f5f22cc5f2a, TASK_ID=7608c281-e606-46f7-857d-b27ba4d883c5, ACTION_NAME=WaitThenPowerOffCloneAction}
===========================
VMware R&D has concluded that with the current workflows, it takes longer to create a vGPU-enabled instant clone than a regular instant clone. The delays that we are seeing here are expected and shouldn't be concerning.
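Side note for anyone digging through their own Connection Server logs: those $Profile lines make it easy to see where the time goes. Here is a rough Python sketch that totals the ACTION-level timings (the log file name is just an example, and the regex assumes the line format shown above):

import re
from collections import defaultdict

PROFILE = re.compile(r"@END2END_MS = (?P<ms>\d+).*?ACTION_NAME=(?P<action>\w+)")

def summarize(path):
    timings = defaultdict(list)
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if "$Profile:" not in line or "@LEVEL = ACTION" not in line:
                continue
            m = PROFILE.search(line)
            if m:
                timings[m.group("action")].append(int(m.group("ms")))
    for action, ms in sorted(timings.items(), key=lambda kv: -max(kv[1])):
        print(f"{action}: n={len(ms)}, max={max(ms)} ms, avg={sum(ms)//len(ms)} ms")

summarize("debug-2019-09-19.txt")  # placeholder log file name

Run against the excerpt above, WaitThenPowerOffCloneAction dominates at 94,195 ms (~1.5 minutes), which matches VMware's explanation.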

BBittner
Enthusiast

I can't remember it exactly, but as far as I recall, it was between 4 and 5 minutes.

BBittner
Enthusiast

> "...The delays that we are seeing here are expected and shouldn't be concerning."

haha, yes not concerning, but disturbing and annoying...

LukaszDziwisz
Hot Shot

Are you running all your VMs on VMFS6 now, after the update? I will try that and see if I can get a better time. Due to slow boot times on VMFS6 with the previous version of vCenter, we provisioned everything on VMFS5.

BBittner
Enthusiast

All VMs are on VMFS6.

lainalex
Contributor

Slow loading with later releases of Windows 10 is a known issue with VMFS6. Move everything back to VMFS5 and it will be fine.
