VMware Horizon Community
w00005414
Enthusiast

SERVER_FAULT_FATAL error deploying instant clones in Horizon 8

Hi all. Here is our setup.

3 ESXi hosts running VMware ESXi 7.0.3, build 19193900

1 vCenter Server appliance running 7.0.3.00600

Windows 10 images on both build 1909 and 21H2

2 Horizon Connection Servers running v 8.4.0.19446835

4 simple pools, all floating instant-clone pools, 2 of them using FSLogix

We just upgraded to Horizon 8 a month ago, and we are moving from one SAN over to another and then back so we can encrypt the volumes. Even before the VMware environment upgrade it took an hour to recompose (now Publish) a pool, with building the storage accelerator index taking the longest time.

Since Storage vMotioning back and forth we've been getting "Published Failed" notices after about an hour and a half. We read that when moving the parent image to new storage you should delete the pool, delete any snapshots, take a new snapshot, and then Publish the pool again, but that doesn't seem to help. After about 90 minutes of the pool trying to publish we receive errors like this:

Error during Provisioning Initial publish failed: Fault type is SERVER_FAULT_FATAL - Failed to retrieve progress for request Id: b2e76f26-103d-40d0-aa0a-e31bb6b3e697

In some cases we can set the provisioning status to Disabled and then back to Enabled, and that kicks it in the pants and it finishes. Other times it will not.

We've been working with VMware, and they had us make sure that the View Agent is running the same version as the Horizon environment (we had the 8.5 Agent installed in an 8.4 environment, but down-revving it to 8.4 didn't really help). VMware also had us move from VMware Tools 11 to version 12; that didn't help either.

We've tried an ADSI edit on the Horizon broker servers that extends the provisioning timeout to 180 minutes (described here: https://kb.vmware.com/s/article/75019). We also thought it could be an AD Group Policy timeout issue, so we first tried paring down the AD GPOs being applied, then tried publishing a pool in an AD container with no GPOs applied, and then tried this, which stops AD Group Policy processing entirely (https://kb.vmware.com/s/article/76469). None of them seemed to help.
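In case it helps anyone reproduce the timeout change without clicking through ADSI Edit, here is roughly how it could be scripted. This is only a minimal Python/ldap3 sketch: the DN path, port, and credentials are assumptions, and the attribute name comes from KB 75019, so verify everything against the article (and back up the ADAM database) before changing anything.

```python
# Minimal sketch (Python + ldap3) of the ADSI edit described in KB 75019:
# raise the instant-clone publish timeout to 180 minutes in the Horizon
# ADAM database. DN, port, and credentials below are ASSUMPTIONS.
from ldap3 import Server, Connection, NTLM, MODIFY_REPLACE

ADAM_DN = "CN=Common,OU=Global,OU=Properties,DC=vdi,DC=vmware,DC=int"  # assumed path

conn = Connection(
    Server("localhost", port=389),     # run locally on a Connection Server
    user="DOMAIN\\horizon-admin",      # hypothetical admin account
    password="********",
    authentication=NTLM,
    auto_bind=True,
)
conn.modify(ADAM_DN, {"cs-PublishImageTimeoutMins": [(MODIFY_REPLACE, ["180"])]})
print(conn.result)
```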

We are seeing this with our older build 1909 Windows 10 pools and with the newer build 21H2 pools we are trying to deploy. We use the VMware OS Optimization Tool as part of our image prep, so the parent images should be fairly streamlined.

Since the pool does sometimes eventually build, we do not think it is a KMS licensing issue. We always keep more than 25 virtual desktop clones up and running and pointing at our KMS server.

I thought we might have made some headway when we vMotioned the broker servers and vCenter Server back onto the same SAN (maybe something about them being apart was causing an issue, and with them together maybe it can take advantage of something like VAAI), but we are still seeing the issue.

Anyone else seeing this?

 

6 Replies
McBarrette
Contributor

Hi w00005414,

I happen to have exactly the same problem, on more or less the same setup! I'm currently thinking that my golden image is too heavy, but I can't be sure...

Did you manage to resolve this issue?

 

Many Thanks

SurajRoy
Enthusiast

Are you using ClonePrep or SysPrep for Customization?

Is the master image optimized?

McBarrette
Contributor

Hi SurajRoy,

I am using ClonePrep for Customization and I used OSOT on the Master, plus an NVIDIA Grid GPU setup.

SurajRoy
Enthusiast

Thank you for the information.

Intermittent issues are always a challenge to troubleshoot 🙂

We may have to check the ClonePrep / Customization log inside the agent machine to find the cause of the issue.
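If it helps, the error lines can be pulled out of that log quickly with something like the sketch below. The log path is only my assumption of the usual instant-clone agent location, so adjust it for your Horizon Agent build.

```python
# Minimal sketch: scan the ClonePrep / customization log on a failed clone
# for error lines. The path is an ASSUMPTION (typical agent log location);
# verify it on your Horizon Agent build.
from pathlib import Path

LOG = Path(r"C:\Windows\Temp\vmware-viewcomposer-ga-new.log")  # assumed location

for line in LOG.read_text(errors="ignore").splitlines():
    if any(token in line for token in ("ERROR", "FATAL", "Timeout")):
        print(line)
```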

The OS Optimization Tool should have taken care of removing the AppX packages, which can cause customization timeout issues.

However, we still need to look at the logs for the actual reason.

Also check in vCenter whether the clone gets a valid IP address and DNS name. If not, you can follow this KB: https://kb.vmware.com/s/article/2147129
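To spot-check that from the vCenter side, a rough pyVmomi sketch like the one below will list what each clone is reporting for IP and DNS name. The vCenter address, credentials, and the pool-name prefix are placeholders I made up; swap in your own.

```python
# Minimal sketch (pyVmomi): list each clone's guest IP and DNS name as
# reported to vCenter. Host, credentials, and the name prefix are
# HYPOTHETICAL placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()           # lab only; use real certs in prod
si = SmartConnect(host="vcenter.example.local",  # hypothetical vCenter
                  user="administrator@vsphere.local",
                  pwd="********", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    for vm in view.view:
        if vm.name.startswith("W10-Pool"):       # hypothetical clone-name prefix
            g = vm.guest
            print(vm.name, g.ipAddress, g.hostName, g.toolsRunningStatus)
    view.Destroy()
finally:
    Disconnect(si)
```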

MIKEFREESTONE1
Contributor

We are seeing the same since updating the hosts to 7.0.3; on 6.7 they seemed to be fine. vCenter was already at 7.0.3.

 

We have tried our normal image and a clean build from the Windows ISO. Same issue: the publish gets to 94%, stalls, then fails after about 1 hour.

 

 

CASantiagoAPL
Contributor

I've experienced this a few times, primarily on what I assume would be considered a 'jumbo' VM (500+ GB disk). It happened to creep up again during our last maintenance, which brought me back to looking through everything.

Originally, I had made the change based on the article below, specifically the 'cs-PublishImageTimeoutMins' setting, and this seemed to resolve things. It does look like the article was updated this month (January 2023).

https://kb.vmware.com/s/article/75019 < "UNKNOWN_FAULT_FATAL - After waiting for 300 seconds internal template VM Instant Clone Creation Error (76469)"

One thing I did happen to notice a few months ago was that restarting the Connection Server(s) happened to kick things back into gear. Before I found that the article was updated to reflect another ADAM DB change to vCenter, I restarted the Connection Servers in my environment and then attempted to re-publish the image. It was successful...
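For anyone chasing the same thing, the timeout value can also be read back after a restart to confirm the override actually persisted. This is just a rough Python/ldap3 sketch with an assumed DN and credentials, so check it against KB 75019 first.

```python
# Minimal sketch: read back cs-PublishImageTimeoutMins from the Horizon ADAM
# database after a Connection Server restart. DN and credentials are ASSUMED.
from ldap3 import Server, Connection, NTLM

conn = Connection(Server("localhost", port=389),
                  user="DOMAIN\\horizon-admin", password="********",  # hypothetical
                  authentication=NTLM, auto_bind=True)
conn.search("CN=Common,OU=Global,OU=Properties,DC=vdi,DC=vmware,DC=int",
            "(objectClass=*)", attributes=["cs-PublishImageTimeoutMins"])
print(conn.entries)
```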
