Hello everybody,
I'm opening this new discussion because I need help understanding how VM consolidation works in vCloud.
I'm coming from the Lab Manager world, where the linked clone, configuration and Library concepts were clear to me... but now I'm a bit confused by vCloud.
Just a quick explanation of how I use vCloud.
We have a VM (Red Hat Enterprise Linux 5 with a Sybase DBMS installed on it; the total size of all the VMDKs is 1.2 TB) in its own vApp (called the Catalog vApp). We create a vCloud template of this vApp and use that template to deploy multiple copies of the vApp into other Organizations, using the linked clone feature.
Every night the Sybase virtual machine inside the Catalog vApp has its database data updated from a production system through a database load (J-1), and every morning we capture this VM into a new template (named with today's date) and deploy it again into the Organizations mentioned above.
On Saturday (chain length 6), we delete all the templates created during the week and delete all the vApps except the Catalog one. After that we consolidate the Catalog vApp. This process used to take 10 hours in our Lab Manager environment, but in vCloud the consolidation of the Catalog vApp takes days (last time 96 hours).
My question is: how does the consolidation process work?
How has it changed since Lab Manager?
JFI: this vCloud farm (6x ESXi hosts, each with 2x Intel X5675 CPUs and 96 GB of RAM) is connected through redundant Fibre Channel links to a dedicated EMC NS480 array, using a datastore composed of several META-LUNs. Looking at the EMC Analyzer software, there is no evidence of a bottleneck on the storage side, nor, per the Cisco software, on the Fibre Channel switches.
I agree that consolidate is very slow. It rebuilds the VM, so it has to read all the data in each VMDK and rewrite it, as well as replaying all the snapshot changes. I am not sure why Lab Manager would have been faster at this.
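A rough back-of-envelope using the numbers mentioned in this thread (1.2 TB of VMDKs, the ~200 MB/s read/write rate seen during the promote, chain length 6) suggests why an iterative rebuild hurts so much. These are assumptions for illustration, not measurements of your environment:

```python
# Back-of-envelope consolidation time, using figures from this thread
# (assumed, not measured): 1.2 TB of disks, ~200 MB/s throughput,
# chain length 6. An iterative consolidate may reread and rewrite the
# data once per merge instead of in a single pass.

TB = 1024 ** 4
MB = 1024 ** 2

vmdk_bytes = 1.2 * TB          # total size of the vApp's disks
throughput = 200 * MB          # read+write rate observed during promote
chain_length = 6               # snapshot/linked-clone chain on Saturday

single_pass_hours = vmdk_bytes / throughput / 3600
iterative_hours = single_pass_hours * (chain_length - 1)

print(f"one pass:  {single_pass_hours:.1f} h")
print(f"iterative: {iterative_hours:.1f} h")
```

Even the iterative estimate comes out well under the 96 hours observed, which suggests the time is going to per-merge overhead or latency rather than raw bandwidth.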
One trick we have used in the past to "speed" this up (not fast, but faster) is to keep the catalog item in a VDC that does not allow fast provisioning, and have the deployed copies go to a VDC that does (using different datastores and storage profiles). This means the item in the catalog is not linked, and the first spin-up takes a while (you would need to test in your environment, but in ours a vApp of that size would take 30 minutes or so); the links created after that first deploy are fast, though. With this setup, when you place the vApp back in the catalog it has to do a clone operation that is also doing a consolidate, but for some reason it is much faster.
Another option: have you looked into using VAAI? If your storage vendor supports it, you might be able to get VAAI clones (on some arrays much the same as linked clones) handled by the storage layer. This can help keep the files a little cleaner.
Thank you for your answer.
I'll try to follow your tips and disable the fast provisioning feature.
Regarding VAAI, I have a supported array (EMC NS480) that works with FC VAAI (on vSphere), but it's funny that in vCloud I need to use my storage with NFS connections... why?
VAAI for NFS is a specific feature that lets vCloud talk directly to ESXi to initiate the VAAI commands as required. This is different from VAAI for VMFS-based volumes. Please don't confuse the two; they work quite differently.
vCloud Director can use any VMFS/NFS-based solution, generally speaking, but the VAAI integration differs between the two.
When you issue the 'Consolidate' action, vCloud Director makes a 'Promote Virtual Disk' API call to vCenter. vCenter finds a host and passes the promote command down to an ESXi host to perform the action.
You can find more details here: Linked Virtual Machines - Look at the Promote section.
In this sense, vCloud Director is handing the command off to vCenter completely. So I would focus on the host performing the consolidation (a.k.a. promotion) and see what can be tuned at that level.
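For the curious, the promote call described above corresponds to `PromoteDisks_Task` in the vSphere API. The sketch below shows the call shape only; the `_StubVM` class is a hypothetical stand-in so the example runs without a vCenter connection (with pyVmomi you would obtain a real `vim.VirtualMachine` object instead):

```python
# Sketch of the vSphere API call behind vCloud's "Consolidate".
# PromoteDisks_Task(unlink, disks) is the vCenter-level operation;
# _StubVM below is a made-up stand-in, not a real VM object.

def consolidate_vm(vm, disks=None):
    """Promote (consolidate) a linked-clone VM's disks.

    unlink=True copies parent-disk blocks into the child so the VM
    becomes independent of the linked-clone chain; an empty disk
    list means "promote all disks".
    """
    return vm.PromoteDisks_Task(unlink=True, disks=disks or [])


class _StubVM:
    """Illustrative stand-in for a pyVmomi vim.VirtualMachine."""
    def __init__(self):
        self.calls = []

    def PromoteDisks_Task(self, unlink, disks):
        self.calls.append((unlink, list(disks)))
        return "task-1"  # a real call returns a vim.Task to wait on


vm = _StubVM()
task = consolidate_vm(vm)
```

The task it returns is what you would monitor in vCenter while tuning the host performing the promotion.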
Thank you for this explanation.
There is something that is not clear to me (see the image below). For some reason, when the promote task starts, the disk latency spikes up, and it only stops once the process begins massive reads and writes (around 200 MB/s) on the disks. This behaviour repeats several times during the consolidation process.
JFI, the datastore is dedicated to this activity and no other jobs were running at the time.
If you want to really investigate the latency spikes, you would be better served by asking in the ESXi/Storage forum.
The number of latency spikes is probably about the same as the chain length of the specific machine you are consolidating. Consolidation, depending on many factors, can be iterative rather than a single shot.
Say the chain is A - B - D - F (I know these letters aren't sequential, but it'll help).
First step, consolidate A + B = C, now the chain is
C - D - F
Second is C + D = E
E - F
E + F = G
G is the final set of disks independent of the linked clone chain.
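The pairwise merging above can be sketched as a toy model. The disk names are arbitrary, and each merge stands in for one full read/rewrite pass (and, plausibly, one latency spike on the graph):

```python
# Toy model of iterative consolidation: the chain A - B - D - F is
# merged pairwise from the base until one independent disk remains.
# Merging is modelled as string concatenation for illustration only.

def consolidate(chain):
    """Merge a linked-clone chain pairwise; return (final_disk, n_merges)."""
    chain = list(chain)
    merges = 0
    while len(chain) > 1:
        merged = chain[0] + chain[1]   # e.g. A + B (the text calls it C)
        chain = [merged] + chain[2:]
        merges += 1
    return chain[0], merges


final, n = consolidate(["A", "B", "D", "F"])
print(final, n)  # all four layers folded into one disk, in 3 merges
```

A chain of length N needs N - 1 merges, which matches the repeated latency/throughput cycles seen during the 6-deep Saturday consolidation.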