johnsongrantr
Contributor

Composed clones taking hours to complete


When I compose clones, I have to raise the timeout cap to roughly 2 hours (HKLM\system\currentcontrolset\services\VMware-viewcomposer-ga\execscripttimeout set to 7,200,000) or else I get errors. The default cap is something like 20 minutes, so something is not normal. I will describe my environment, connections, and master image.

3 Dell PowerEdge R720 servers running ESXi 6.0 as hosts

1 host dedicated to 3 virtual Windows Server 2016 machines for vCenter 6, Microsoft SQL Server 2012, and Horizon View 6, with 10 GB RAM per server and plenty of local HDD space dedicated. No apparent bottlenecks in CPU utilization on the host.

2 hosts dedicated to composed clones and the master image. iSCSI storage with 2x 1TB LUNs is connected to all 3 hosts, and that is where the master image is stored. Performance when creating and maintaining the master image seems acceptable when it is stored on and run from the iSCSI storage, with only a marginal difference when it is moved to the local datastore on a single host.

The master image is a Windows 10 1709 image on a lazy-zeroed thick provisioned virtual IDE drive (we have weird internal security that prevents us from using virtual SCSI). The HDD is provisioned for 80GB, with approximately 25GB in use after the OS and applications are applied.

====================

I finish my updates, release the IP, shut down the VM, and take a snapshot. Very snappy, takes 2-3 seconds.

I generate the resource pool, set my settings for composed clones, and have them automatically join the domain with QuickPrep.

I choose my storage options and set one resource pool per host; each pool uses local storage on its own host for both the replica and the composed clones.

I enable provisioning and then watch vSphere for activity. One hour in, it is showing 'clone virtual machine' at 63%. This has been the situation with every image I make.

Performance is extremely lackluster even when only 3 composed clones are online and a single user is logged in through a zero client: 5-minute login times while generating profiles, etc.

Things I have tried seem to affect provisioning time, with mixed results. Moving the master from iSCSI to local storage makes it go slightly faster, as does thin provisioning. Cleaning out old snapshots of the master, so there is only a single recent snapshot for the clones, drastically cuts down the 'clone virtual machine' time, but it still takes over an hour, nowhere near the 20-minute mark.

What am I doing wrong here, when the timeout is supposed to cap at 20 minutes on the high end?


Accepted Solutions
BenFB
Commander

It sounds like you are saturating the local storage or the cache on the HBA. A successful VDI deployment really needs SSD. During a clone/provisioning operation, monitor the host with esxtop; I bet you will see high DAVG numbers.

http://www.running-system.com/vsphere-6-esxtop-quick-overview-for-troubleshooting/

I'd also recommend building your parent VMs with thin provisioned hard disks instead of thick. When the parent is cloned to the replica, the disks are created as thin anyway, and I expect the conversion from thick is adding some of the time you saw. Having a lot of snapshots will also increase the time.
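If watching esxtop interactively is awkward, batch mode (for example `esxtop -b -d 5 -n 60 > capture.csv`) writes CSV that can be summarized afterwards. The following is a minimal sketch, assuming the column headers follow the perfmon-style pattern `\\host\Physical Disk Adapter(vmhbaN)\Average Driver MilliSec/Command`. The sample data, host name, and the function name `summarize_davg` are made up for illustration, so check the header names in your own capture before relying on it.

```python
import csv
import io
import re
from statistics import mean

# Hypothetical two-sample excerpt of an `esxtop -b` capture. Real captures
# have thousands of columns; the header naming here is an assumption.
SAMPLE = (
    '"(PDH-CSV 4.0) (UTC)(0)",'
    '"\\\\esx01\\Physical Disk Adapter(vmhba33)\\Average Driver MilliSec/Command",'
    '"\\\\esx01\\Physical Disk Adapter(vmhba0)\\Average Driver MilliSec/Command"\n'
    '"01/01/2018 10:00:00","30.0","5.0"\n'
    '"01/01/2018 10:00:05","50.0","7.0"\n'
)

# Matches the per-adapter device latency (DAVG) columns and captures the
# adapter name, e.g. "vmhba33".
DAVG_COLUMN = re.compile(
    r"Physical Disk Adapter\((?P<adapter>[^)]+)\)\\Average Driver MilliSec/Command$"
)


def summarize_davg(csv_text):
    """Return {adapter: (mean DAVG, peak DAVG)} in milliseconds."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, samples = rows[0], rows[1:]
    summary = {}
    for index, name in enumerate(header):
        match = DAVG_COLUMN.search(name)
        if not match:
            continue
        # Skip empty cells, which can appear in truncated captures.
        values = [float(row[index]) for row in samples if row[index]]
        if values:
            summary[match.group("adapter")] = (mean(values), max(values))
    return summary


if __name__ == "__main__":
    for adapter, (avg, peak) in sorted(summarize_davg(SAMPLE).items()):
        print(f"{adapter}: avg {avg:.1f} ms, peak {peak:.1f} ms")
```

Capturing during a provisioning run and comparing the adapters side by side should show whether one path is clearly slower than the other.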

johnsongrantr
Contributor

Thank you for the reply. I will try running esxtop and report back next week. I'm not very familiar with the tool, although I've seen it referenced a couple of times when researching how to identify bottlenecks. I will probably need assistance to determine whether the numbers I see are something to worry about.

I'll change it back to thin. It does make the provisioning faster, but it's still taking an hour or more. I read that thin provisioned disks affect running performance because the file always has to be inflated. That might be the least of my worries at the moment.

Anyway, I'll collect some stats and post back. Thanks again.

BenFB
Commander

In general that is true, but all replicas and linked clones are created as thin, so you might as well build the parent the same way.

Look at the link I posted. The chart will give you some guidance on when a metric is too high and what the result of that is. Feel free to post screenshots and I can try to help.

johnsongrantr
Contributor

I ran esxtop on some of the hosts during provisioning. Under DISK, on the vmhba that carries the iSCSI, I'm seeing DAVG averaging 30 and sometimes peaking at 50, KAVG averaging 40 and peaking at 80, and GAVG (the sum of the two) averaging 70 and peaking at 130, which is consistent with the description of those fields in the document.
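To make the relationship between those three counters concrete, here is a small sketch that recomputes GAVG as DAVG + KAVG and flags readings against rule-of-thumb limits. The limits (25 ms for DAVG, 2 ms for KAVG) and the names `LatencySample` and `findings` are assumptions for illustration, not values taken from this thread; use the chart you are working from.

```python
from dataclasses import dataclass

# Rule-of-thumb limits in milliseconds. These are assumptions for this
# sketch; adjust them to the reference chart you trust.
DAVG_LIMIT_MS = 25.0
KAVG_LIMIT_MS = 2.0


@dataclass
class LatencySample:
    davg_ms: float  # latency attributed to the device/driver side
    kavg_ms: float  # latency added inside the VMkernel

    @property
    def gavg_ms(self):
        # Guest-observed latency is the sum of the two components.
        return self.davg_ms + self.kavg_ms

    def findings(self):
        notes = []
        if self.davg_ms > DAVG_LIMIT_MS:
            notes.append("DAVG high: look at the array, adapter, or path")
        if self.kavg_ms > KAVG_LIMIT_MS:
            notes.append("KAVG high: look at queuing inside the host")
        return notes


# Readings reported above for the iSCSI vmhba.
iscsi_average = LatencySample(davg_ms=30, kavg_ms=40)
iscsi_peak = LatencySample(davg_ms=50, kavg_ms=80)

if __name__ == "__main__":
    for label, sample in (("average", iscsi_average), ("peak", iscsi_peak)):
        print(label, "GAVG:", sample.gavg_ms, "ms")
        for note in sample.findings():
            print("  -", note)
```

With the numbers above, both components exceed the assumed limits, and the computed GAVG values (70 and 130) match what esxtop displayed.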

I'm going to try to keep everything local and thin provisioned, as you suggested, and see if there is any improvement. If there is, I'm going to use the iSCSI only as intermediate storage for transferring VMs between hosts and for archive purposes, rather than for production VDI clients and provisioning. Unfortunately I don't have any SSD SAS drives, but if I did, I would totally use them.

BenFB
Commander

I'm a little surprised to see the KAVG so high. Can you explain the disk layout, how the disks are connected, and what they are being used for?

johnsongrantr
Contributor

I migrated the VM to the local HDD and converted it to thin provisioning. DAVG is now around 25 with peaks at 30, KAVG is 0.01, and GAVG is 25 with peaks of 30, so "normal" based on the charts. Progress is also significantly faster. I think that was my problem.

The drive it's on now is local to the host. I think it's 6-7 1TB 7200rpm SAS drives in RAID 5 for 5TB of storage, with 1-2 hot spares.
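As a rough sanity check on that local array, the sketch below works out RAID 5 usable capacity and a ballpark random-write ceiling. The per-drive figure of 75 IOPS for 7200rpm disks and the RAID 5 write penalty of 4 are common rules of thumb used here as assumptions, not measurements from this host, and the function names are hypothetical.

```python
def raid5_usable_tb(drives, drive_tb):
    """Usable capacity of a RAID 5 set: one drive's worth goes to parity."""
    if drives < 3:
        raise ValueError("RAID 5 needs at least 3 drives")
    return (drives - 1) * drive_tb


def raid5_write_iops(drives, iops_per_drive=75, write_penalty=4):
    """Ballpark random-write IOPS for the set.

    iops_per_drive and write_penalty are rule-of-thumb assumptions:
    each logical write costs about 4 back-end I/Os on RAID 5.
    """
    return drives * iops_per_drive / write_penalty


if __name__ == "__main__":
    drives = 6  # active members, excluding hot spares
    print("usable TB:", raid5_usable_tb(drives, 1.0))
    print("approx. random-write IOPS:", raid5_write_iops(drives))
```

Six active 1TB members give the 5TB mentioned above. Under these assumptions the whole set sustains only on the order of a hundred random writes per second, which is why spinning disks in RAID 5 tend to struggle with clone and login storms.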

The setup with the iSCSI (where the latency is seen) is 2x 1TB LUNs connected through an SBA instead of an HBA, because the server doesn't have enough riser cards. I dedicated 1 of the 4 1Gig NICs to the SBA connection, and it runs to a switch in the rack and then down to another switch in that row, where our NetApp is. I'm not much of a SAN guy, so I'm speaking as generally as I can about it.
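For a sense of what a single 1Gig link can do at best, this sketch estimates the minimum time to move the image over it. It assumes decimal gigabytes and a fully utilized link with no protocol overhead, so real transfers will be slower; the function name is hypothetical.

```python
def min_transfer_seconds(size_gb, link_gbit_per_s=1.0):
    """Lower bound on transfer time for size_gb over the given link.

    Assumes decimal units (1 GB = 8e9 bits) and ignores iSCSI/TCP
    overhead, latency, and contention on the shared switches.
    """
    size_bits = size_gb * 8e9
    return size_bits / (link_gbit_per_s * 1e9)


if __name__ == "__main__":
    for label, size_gb in (("25GB in use", 25), ("80GB provisioned", 80)):
        seconds = min_transfer_seconds(size_gb)
        print(f"{label}: at least {seconds / 60:.1f} minutes")
```

Even the full 80GB comes out at roughly 11 minutes at line rate, so raw bandwidth alone would not explain an hour-plus clone. That fits with the latency figures pointing at queuing and path problems rather than link speed.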
