VMware Cloud Community
davidcrowder
Enthusiast

Consolidation failure

We have been trialing Dell Rapid Recovery for ESXi backups and occasionally experience snapshot consolidation failures.  Any advice on how to track down the cause, so that we can fix it?

This server is running ESXi 5.5 build 2068190.

Here is some info from hostd.log:

2016-07-21T09:00:16.847Z [52E80B70 info 'Vimsvc.TaskManager' opID=hostd-854b user=root] Task Created : haTask-4-vim.VirtualMachine.consolidateDisks-344057184

2016-07-21T09:00:16.848Z [51080B70 info 'Vmsvc.vm:/vmfs/volumes/548c3c98-2367a4b4-9aa2-0025908c25f8/domain1/domain1.vmx' opID=hostd-854b user=root] State Transition (VM_STATE_ON -> VM_STATE_CONSOLIDATE_ALL_DISKS)

...

(Lots of verbose messages that do not appear to have anything to do with consolidation)

...

2016-07-21T09:00:18.740Z [4F5C1B70 verbose 'Vmsvc.vm:/vmfs/volumes/548c3c98-2367a4b4-9aa2-0025908c25f8/domain1/domain1.vmx'] Consolidate Disks translated error to vim.fault.FileLocked

2016-07-21T09:00:18.740Z [4F5C1B70 info 'Vmsvc.vm:/vmfs/volumes/548c3c98-2367a4b4-9aa2-0025908c25f8/domain1/domain1.vmx'] Consolidate Disks failed: vim.fault.FileLocked

2016-07-21T09:00:18.740Z [4F5C1B70 verbose 'Vmsvc.vm:/vmfs/volumes/548c3c98-2367a4b4-9aa2-0025908c25f8/domain1/domain1.vmx'] Consolidate Disks message: An error occurred while consolidating disks: Failed to lock the file.

-->

2016-07-21T09:00:18.740Z [4F181B70 info 'Vimsvc.ha-eventmgr'] Event 7495 : Virtual machine domain1 disks consolidation failed on vsphere1 in cluster vsphere1 in ha-datacenter.

2016-07-21T09:00:18.742Z [4F181B70 verbose 'Vmsvc.vm:/vmfs/volumes/548c3c98-2367a4b4-9aa2-0025908c25f8/domain1/domain1.vmx'] Time to gather Snapshot information ( read from disk,  build tree): 1 msecs. needConsolidate is true.

2016-07-21T09:00:18.742Z [4F181B70 verbose 'Vmsvc.vm:/vmfs/volumes/548c3c98-2367a4b4-9aa2-0025908c25f8/domain1/domain1.vmx'] Snapshot property update: Configure will be invalidated for:

2016-07-21T09:00:18.758Z [4F181B70 verbose 'Vmsvc.vm:/vmfs/volumes/548c3c98-2367a4b4-9aa2-0025908c25f8/domain1/domain1.vmx'] Time to gather config: 15 (msecs)
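
To pull just the consolidation failures out of a noisy hostd.log, a small filter along these lines may help. It is a minimal sketch that assumes every entry follows the layout of the lines quoted above (timestamp, then a bracketed thread ID, level and quoted source, then the message). The function name and the returned field names are illustrative and not part of any VMware tool.

```python
import re

# Assumed entry layout, based only on the excerpt above:
#   <timestamp> [<thread> <level> '<source>' <optional extras>] <message>
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+\[(?P<thread>[0-9A-Fa-f]+)\s+(?P<level>\w+)\s+"
    r"'(?P<source>[^']*)'[^\]]*\]\s+(?P<msg>.*)$"
)
FAIL_RE = re.compile(r"Consolidate Disks failed: (?P<fault>\S+)")


def find_consolidation_failures(lines):
    """Return one dict per 'Consolidate Disks failed' entry found in lines."""
    failures = []
    for line in lines:
        entry = LINE_RE.match(line.strip())
        if entry is None:
            continue  # continuation lines ("-->") and blanks are skipped
        failed = FAIL_RE.search(entry.group("msg"))
        if failed is None:
            continue
        source = entry.group("source")
        # Per-VM entries carry the .vmx path after the "Vmsvc.vm:" prefix.
        vmx = source.split(":", 1)[1] if source.startswith("Vmsvc.vm:") else None
        failures.append(
            {"time": entry.group("ts"), "vmx": vmx, "fault": failed.group("fault")}
        )
    return failures
```

Feeding it the lines of a copied hostd.log (for example `find_consolidation_failures(open("hostd.log"))`) gives the time, the affected .vmx path and the fault name for each failure, which makes it easier to line the failures up against the backup job schedule.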

This is the third time it has done this, each time with a different guest / vmdk.  I'm at a loss on how to proceed.  The only fix that has worked on the prior occasions was to reboot the ESXi host and do a manual consolidation.

The vSphere Client and the command-line tools all give similar errors when attempting to consolidate without a reboot, stating that they are unable to access the file because it is locked.  Even restarting the hostd daemon is not sufficient to allow consolidation to proceed; nothing short of a full host reboot works.

Any ideas how to proceed?

Thanks in advance

21 Replies
daphnissov
Immortal

This type of behavior is, unfortunately, an all-too-common problem, not with ESXi or VMware's logic, but with backup vendors writing poor software and not properly implementing steps to ensure their proxies release disks when they should. I have experienced this countless times with backup software vendors, and although they're quick to point fingers at VMware, the issue is actually on their side. The only vendor's product I've found to perform due diligence and clean up after itself, even when it has experienced a failure or interruption, is Veeam Backup & Replication, with a feature introduced a couple of versions ago called Snapshot Hunter.

What tends to happen in these cases is that the backup software requests a snapshot of a VM through the VDDK libraries that each product carries. Once the snapshot is confirmed, the software, via a proxy or directly, adds and mounts the base disk and begins to read the changed blocks via the CBT driver. If the software is interrupted prematurely, it aborts but fails to undo what it last did, or even to check whether that disk is still mounted. To other systems, and to any attempt to consolidate, this shows up as a held lock.

When another system has a lock on a disk, snapshots cannot be removed. When attempts are made to do so, the metadata descriptor is deleted but the delta files are not. At that point a consolidation is normally needed, yet because the lock is still held, even the consolidation fails. The only way forward is to identify which system holds the lock and release it, usually by removing the virtual disk from the configuration of a proxy (when hot-add mode is being employed).
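
As a starting point for identifying which system holds the lock, running `vmkfstools -D` against the locked .vmdk on an ESXi host dumps the file's lock metadata, including an `owner` field whose last 12 hex digits are the MAC address of the host holding the lock. The helper below is a rough sketch for extracting that MAC from the dumped text. It assumes the `owner xxxxxxxx-xxxxxxxx-xxxx-xxxxxxxxxxxx` layout seen on ESXi 5.x, and the function name is made up for illustration.

```python
import re

# Assumed layout of the owner field in `vmkfstools -D` output:
#   ... mode 1, owner 4655cd8b-3c4a19f2-17bc-00145e808070 mtime 114]
# The final 12 hex digits are the MAC address of the lock-holding host.
OWNER_RE = re.compile(
    r"owner\s+[0-9a-f]{8}-[0-9a-f]{8}-[0-9a-f]{4}-([0-9a-f]{12})",
    re.IGNORECASE,
)


def lock_owner_mac(dump_text):
    """Return the lock owner's MAC as aa:bb:cc:dd:ee:ff, or None if unowned."""
    match = OWNER_RE.search(dump_text)
    if match is None:
        return None
    mac_hex = match.group(1).lower()
    if set(mac_hex) == {"0"}:
        return None  # an all-zero owner means no host currently holds the lock
    return ":".join(mac_hex[i:i + 2] for i in range(0, 12, 2))
```

The resulting MAC can then be compared against the NICs of each host. If it points back at the same host the VM runs on, a hot-add proxy VM on that host that still has the disk attached is a likely suspect.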

Backup vendors need to feel pressure from customers to fix this broken and poor behavior of their products, because it can lead to serious issues including outages due to full datastores, degraded performance, and other unwanted effects. If vendors are loath to comply or continue to insist on finger pointing, it may be time to switch your product for a more reliable one.

davidcrowder
Enthusiast

Original poster here.  I had forgotten about this...

We had to spend several hours across many calls with Dell/Quest tier-2 / tier-3 support.  In the end, a combination of factors seems to have done the trick for us.

1)  We increased the time-outs in our backup software, giving it more time before it "gave up".

2)  We decreased the number of backups allowed to run simultaneously, and then scheduled them to run in a staggered fashion so that only one VM per host should be backing up at any one time.

3)  We updated both ESXi and Rapid Recovery to newer versions.

Somewhere along the way, between updating everything and fine-tuning the backup software, it seems to have smoothed out.  I believe all three of the above were necessary steps.
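
For anyone wanting to plan the staggering from step 2 before entering it into the backup product, the idea can be sketched roughly as below. This is only an illustration: the function, the host and VM names, and the fixed slot length are all hypothetical, and it assumes each backup finishes within its slot. It is not how Rapid Recovery itself schedules jobs.

```python
from datetime import datetime, timedelta


def stagger_backups(vms_by_host, first_start, slot):
    """Assign start times so each host backs up at most one VM per slot.

    vms_by_host: dict mapping a host name to a list of VM names (hypothetical).
    first_start: datetime of the first backup window.
    slot:        timedelta reserved per VM; must exceed the longest backup.
    """
    schedule = {}
    for host, vms in vms_by_host.items():
        for position, vm in enumerate(vms):
            # VMs on the same host get consecutive, non-overlapping slots.
            schedule[(host, vm)] = first_start + position * slot
    return schedule


if __name__ == "__main__":
    plan = stagger_backups(
        {"vsphere1": ["domain1", "domain2"], "vsphere2": ["web1"]},
        datetime(2016, 7, 21, 1, 0),
        timedelta(hours=2),
    )
    for (host, vm), start in sorted(plan.items(), key=lambda item: item[1]):
        print(start.isoformat(), host, vm)
```

VMs on different hosts may share a slot here; if the backup server or storage is the bottleneck rather than the host, the slots would need to be serialized globally instead.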

Hope that helps anyone else.

Thanks!
