VMware Cloud Community
davidcrowder
Enthusiast

Consolidation failure

We have been trialing Dell Rapid Recovery for ESXi backups and occasionally experience consolidation failures. Any advice on how to track down the cause so that we can fix it?

This server is running ESXi 5.5 build 2068190.

Here is some info from hostd.log:

2016-07-21T09:00:16.847Z [52E80B70 info 'Vimsvc.TaskManager' opID=hostd-854b user=root] Task Created : haTask-4-vim.VirtualMachine.consolidateDisks-344057184

2016-07-21T09:00:16.848Z [51080B70 info 'Vmsvc.vm:/vmfs/volumes/548c3c98-2367a4b4-9aa2-0025908c25f8/domain1/domain1.vmx' opID=hostd-854b user=root] State Transition (VM_STATE_ON -> VM_STATE_CONSOLIDATE_ALL_DISKS)

...

(Lots of verbose messages that do not appear to have anything to do with consolidation)

...

2016-07-21T09:00:18.740Z [4F5C1B70 verbose 'Vmsvc.vm:/vmfs/volumes/548c3c98-2367a4b4-9aa2-0025908c25f8/domain1/domain1.vmx'] Consolidate Disks translated error to vim.fault.FileLocked

2016-07-21T09:00:18.740Z [4F5C1B70 info 'Vmsvc.vm:/vmfs/volumes/548c3c98-2367a4b4-9aa2-0025908c25f8/domain1/domain1.vmx'] Consolidate Disks failed: vim.fault.FileLocked

2016-07-21T09:00:18.740Z [4F5C1B70 verbose 'Vmsvc.vm:/vmfs/volumes/548c3c98-2367a4b4-9aa2-0025908c25f8/domain1/domain1.vmx'] Consolidate Disks message: An error occurred while consolidating disks: Failed to lock the file.

-->

2016-07-21T09:00:18.740Z [4F181B70 info 'Vimsvc.ha-eventmgr'] Event 7495 : Virtual machine domain1 disks consolidation failed on vsphere1 in cluster vsphere1 in ha-datacenter.

2016-07-21T09:00:18.742Z [4F181B70 verbose 'Vmsvc.vm:/vmfs/volumes/548c3c98-2367a4b4-9aa2-0025908c25f8/domain1/domain1.vmx'] Time to gather Snapshot information ( read from disk,  build tree): 1 msecs. needConsolidate is true.

2016-07-21T09:00:18.742Z [4F181B70 verbose 'Vmsvc.vm:/vmfs/volumes/548c3c98-2367a4b4-9aa2-0025908c25f8/domain1/domain1.vmx'] Snapshot property update: Configure will be invalidated for:

2016-07-21T09:00:18.758Z [4F181B70 verbose 'Vmsvc.vm:/vmfs/volumes/548c3c98-2367a4b4-9aa2-0025908c25f8/domain1/domain1.vmx'] Time to gather config: 15 (msecs)
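
To tally failures like these across a longer log, something along the following lines might help. This is a minimal Python sketch, not an official tool: it assumes every failure line has the same shape as the "Consolidate Disks failed" line in the excerpt above, and the names FAILURE_RE, find_consolidation_failures, and failures_per_vm are made up for illustration.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpt above:
# <timestamp> [<thread> <level> 'Vmsvc.vm:<path to .vmx>' ...] Consolidate Disks failed: <fault>
FAILURE_RE = re.compile(
    r"^(?P<time>\S+) \[\S+ \w+ 'Vmsvc\.vm:(?P<vmx>[^']+)'[^\]]*\] "
    r"Consolidate Disks failed: (?P<fault>\S+)"
)


def find_consolidation_failures(lines):
    """Return (timestamp, vmx_path, fault) for each consolidation failure line."""
    failures = []
    for line in lines:
        match = FAILURE_RE.match(line.strip())
        if match:
            failures.append((match["time"], match["vmx"], match["fault"]))
    return failures


def failures_per_vm(failures):
    """Count failures by .vmx path to see which guests are affected."""
    return Counter(vmx for _, vmx, _ in failures)
```

Passing in the lines of a saved copy of hostd.log and printing the result of failures_per_vm would show whether the failures cluster on particular guests or move around from one VM to another.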

This is the third time it has done this, each time with a different guest / vmdk.  I'm at a loss on how to proceed.  The only fix that has worked on the prior occasions was to reboot the ESXi host and do a manual consolidation.

The vSphere Client and command-line tools all give similar errors when attempting to consolidate without a reboot, stating that they are unable to access the file because it is locked. Even restarting the hostd daemon is not sufficient to allow consolidation to proceed; nothing but a full host reboot works.

Any ideas how to proceed?

Thanks in advance

1 Solution

Accepted Solutions
davidcrowder
Enthusiast

Original poster here.  I had forgotten about this...

We had to spend several hours across many calls with Dell/Quest tier-2 / tier-3 support.  In the end, a combination of factors seems to have done the trick for us.

1)  We increased the time-outs in our backup software, giving it more time before it "gave up".

2)  We decreased the number of backups allowed to run simultaneously, and then scheduled them to run in a staggered fashion so that only one VM per host should be backing up at any one time.

3)  We updated both ESXi and Rapid Recovery to newer versions.

Somewhere along the way, between updating everything and fine-tuning the backup software, it seems to have smoothed out.  I believe all three of the above were necessary steps.
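
The staggering in step 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name staggered_schedule, the idea of a fixed slot per VM, and any host or VM names passed in are hypothetical, and nothing here touches Rapid Recovery's own scheduler.

```python
from datetime import datetime, timedelta


def staggered_schedule(vms_by_host, first_start, slot_length):
    """Give each VM on a host its own start time, one slot apart.

    vms_by_host: dict mapping host name -> list of VM names
    first_start: datetime at which the first VM on every host starts
    slot_length: timedelta reserved per VM (an estimate that should
                 cover the backup plus the snapshot consolidation)
    """
    schedule = {}
    for host, vms in vms_by_host.items():
        for index, vm in enumerate(vms):
            # VMs on the same host are spaced one slot apart;
            # different hosts may run in parallel.
            schedule[(host, vm)] = first_start + index * slot_length
    return schedule
```

A fixed slot only keeps backups apart if each job, including its consolidation, finishes within the slot, so the slot length would need to be generous.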

Hope that helps anyone else.

Thanks!

21 Replies
firestartah
Virtuoso

Error looks to be "Failed to lock the file."

Have a look at this: https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=10051

If you found this or other information useful, please consider awarding points for "Correct" or "Helpful". Gregg http://thesaffageek.co.uk
davidcrowder
Enthusiast

firestartah:

Thank you for the post.

I have looked into that article. Unfortunately, it largely does not apply. We are not a large datacenter; our ESXi servers are stand-alone. So the first three quarters of that article, which focuses on determining which vSphere server has the vmdk locked, does not apply.

After determining which server it is, the advice basically boils down to "restart the host".  I already know to do that...

My goal is to discover why this is happening and how to fix it so that I can trust Rapid Recovery & ESXi to always successfully consolidate after backups.

Thanks

virtualg_uk
Leadership

Try restarting all management agents instead of rebooting.

To restart all management agents on the host, run the command:
services.sh restart

Restarting the Management agents on an ESXi (1003490) | VMware KB

Typical troubleshooting steps I try when this happens:

  • Try to vMotion the VM to another host
  • If the backup software uses hot-add (vRanger etc.), look at the settings of the VM doing the backups and see whether the VM with the error has one of its disks attached to the backup VM. Detach it if required and try to consolidate again.
  • Try to create a new snapshot and then delete all snapshots
  • You can try to restart your backup server in case this has somehow locked the VMDKs

Graham | User Moderator | https://virtualg.uk
davidcrowder
Enthusiast

grba:

Thank you for the reply.  services.sh restart is a better method than a full host reboot.

Unfortunately, this is for a small shop. The license level is Essentials. There are not enough servers to have the spare capacity to do vMotion, even if the license level supported it; these guests are stuck where they are.

I have tried creating other snapshots, and then using Delete All.  It fails without restarting the host (or all the management services, at least).

I have verified that it is not the backup system locking the files.  It's something in ESXi, itself... although I haven't a clue how to track that one down.

So, while knowing I don't have to restart the host every time is a positive thing, it still leaves us in the situation where simply using our backup software can leave us in a state where our guests crash as they run out of disk space.  Not good.

The only real, permanent solution is to find out why ESXi is failing to clean up snapshots when told -- why they're in a locked state -- and fix that, so that we can move forward.

I'm considering updating the host to the latest build of 5.5, but I hate running host updates on otherwise perfectly functional systems without clearly knowing it's the necessary fix.

I appreciate your help.

Thank you

VMBoy79
Contributor

Hi David,

Please also consider the size of the disk when using VADP for backup. If a disk is larger than 1 TB and you use LAN for VADP, consolidation sometimes fails because of a timeout.

We clear the locked files by restarting the management agents on the host. Please let us know whether upgrading the build fixes the issue.

virtualg_uk
Leadership

Did you see this option:

  • If the backup software uses hot-add (vRanger etc.), look at the settings of the VM doing the backups and see whether the VM with the error has one of its disks attached to the backup VM. Detach it if required and try to consolidate again.

Hot-add can cause the vmdk to get locked and you will not be able to consolidate if another VM has the disk attached to it. Although it does not explain why a reboot resolved the problem.


Try the above and let us know how you get on.


Graham | User Moderator | https://virtualg.uk
davidcrowder
Enthusiast

VMBoy79:

Thank you for your reply.

Please also consider the size of the disk when using VADP for backup. If a disk is larger than 1 TB and you use LAN for VADP, consolidation sometimes fails because of a timeout.

Dell Rapid Recovery is a VADP solution, utilizing CBT.  Some of the vmdk's are more than 1 TB, while others are significantly less.  It fails to consolidate, randomly, on either.  Size does not appear to be an issue.

Decreasing the number of simultaneous backups being run on a single ESXi host seems to lower the likelihood of a consolidation failure. However, it has not outright eliminated the problem; even running just a single backup during off-peak hours can sometimes result in a "stuck" snapshot.

davidcrowder
Enthusiast

grba:

Thank you for your reply.

Hot-add can cause the vmdk to get locked and you will not be able to consolidate if another VM has the disk attached to it. Although it does not explain why a reboot resolved the problem.

Dell Rapid Recovery is a VADP backup solution.  It does not use hot-add.

davidcrowder
Enthusiast

I plan on using the ISO to upgrade to the latest version of 5.5 this weekend.

Before doing so, I'd like to ask:  Has anyone had any trouble with this?  Especially going from an early version of 5.5 all the way to Update 3?

Thanks

VMBoy79
Contributor

David,

We run ESXi 5.5 Update 3 in our environment and still see disk consolidation issues sometimes. However, the locked-file issue occurs only about once a week. We fix it by restarting the management agents.

davidcrowder
Enthusiast

VMBoy79,

That is unfortunate.  Because of the amount of data some of these VMs write, and that most of them are using thick provisioning, we could easily find ourselves running out of space on our datastores;  this is one bug we cannot leave unfixed.

If anyone has any ideas for a permanent fix, something where I'm not reacting to the problem, but a solution that will actually stop this consolidation issue from happening, I would appreciate it very very much.

Thanks

VMBoy79
Contributor

We have changed our backup strategy to work around the above issue. We shortlisted the VMs that have a hard disk larger than 1 TB; for those, we run the VADP backup for the OS drive only and a file-level backup for all the other drives.

This drastically reduced the consolidation issues.
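
As a rough illustration of that split, the selection rule could be written like this in Python. It is a sketch under assumptions: the function name plan_backups, the "os" label for the OS drive, the sizes in GB, and treating 1 TB as 1024 GB are all choices made for the example, not settings from any backup product.

```python
def plan_backups(vms, os_disk="os", threshold_gb=1024):
    """Choose a backup method per disk.

    vms: dict mapping VM name -> dict of disk label -> size in GB.
    A VM with any disk above the threshold gets an image-level (VADP)
    backup of its OS drive only and file-level backups of its other
    drives; all other VMs keep image-level backups for every disk.
    """
    plan = {}
    for vm, disks in vms.items():
        has_large_disk = any(size_gb > threshold_gb for size_gb in disks.values())
        plan[vm] = {
            disk: "vadp" if (disk == os_disk or not has_large_disk) else "file-level"
            for disk in disks
        }
    return plan
```

The effect is that the snapshot taken for the image-level backup only has to cover the OS drive on the large VMs, which is presumably why the consolidation step becomes less troublesome.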

PhoenixStores
Contributor

I had this exact same problem today and came across this post. I was able to resolve the issue, however, by performing a storage vMotion. If you have more than one datastore with the space to hold the VM, you can storage vMotion the VM to a different datastore, which automatically consolidates its disks. You can then storage vMotion the VM back to its original location.

CHTIOUI
Contributor

We had the same problem. We use NetVault as our backup tool, installed on a physical server. All attempts to consolidate the disks of a VM failed until we restarted the backup server, after which the consolidation ran successfully.

virtualDD
Enthusiast

Just to give my 2 cents here: we have seen this issue on many occasions, and it affects even the latest build of ESXi (6.5). One customer in particular still has this issue. Because the snapshots were not being deleted, they suffered severe performance issues in the applications running on those VMs. The customer used Commvault for backup, which uses "proxy" VMs to do the backup. In every case the lock was held by one of the backup servers.

Did you try to restart the backup server/vm and then try to consolidate again?

This issue seems to affect most backup solutions out there, and it does not seem to matter which method they use (hot-add or VADP).

We are still trying to find a permanent solution for the customer I mentioned, and the backup vendor is investigating as well, but so far there is no permanent fix.

dekoshal
Hot Shot

Please confirm whether you see snapshot-related .ctk files (for example, vmname-000001-ctk.vmdk) remaining in the VM folder on the datastore after a successful backup or consolidation.
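
One way to check a folder listing for such leftovers is a small filter like the Python sketch below. It assumes the naming pattern from the example above (VM name, a six-digit snapshot number, then -ctk.vmdk); the names CTK_SNAPSHOT_RE and leftover_ctk_files are hypothetical.

```python
import re

# Assumed pattern, based on the example name vmname-000001-ctk.vmdk:
# <vm name>-<six digits>-ctk.vmdk
CTK_SNAPSHOT_RE = re.compile(r"^(?P<vm>.+)-(?P<seq>\d{6})-ctk\.vmdk$")


def leftover_ctk_files(filenames):
    """Return the snapshot-related CTK files in a list of file names, sorted."""
    return sorted(name for name in filenames if CTK_SNAPSHOT_RE.match(name))
```

The input would be the file names from the VM's folder, for instance the result of os.listdir on a mounted copy of that folder. The base disk's own change-tracking file (vmname-ctk.vmdk) is deliberately not reported, since it has no snapshot number.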

Best Regards,

Deepak Koshal

CNE|CLA|CWMA|VCP4|VCP5|CCAH

DynegyIT
Contributor

We have the exact same issue using Rapid Recovery and ESXi. Restarting hostd generally clears the lock, but it also takes the VMPlayer consoles offline, which then have to be restarted as well. Not a great solution. Another issue is that occasionally, after restarting hostd, our vCenter box will not reconnect to the host, and the host has to be removed and added back in.

E98P4
Contributor

We had the very same issue. By sheer accident, and out of fear that failed snapshot consolidations would keep building up in the datastore, I paused protection in our backup solution (Rapid Recovery) so that it stopped backing up our database.

Once I paused protection on the backup server, I then went one more step and rebooted the backup server.  I then kept Rapid Recovery in a paused state overnight.

Again, I shut down the backup out of fear that consolidation would keep failing after every job. The unconsolidated snapshots kept building up to the point that I was losing about 100 GB per night, and each consolidation failed with an error message indicating that the file was locked.

Rapid Recovery apparently still had a hold on the file, so the file was locked when VMware attempted to consolidate. The behavior resembles a file being locked while an application has it open.

Once protection for the database was paused and there were no active jobs temporarily using datastore space, VMware ended up triggering consolidation on its regular maintenance schedule. Several unconsolidated snapshots had built up and were holding hundreds of gigabytes of unreclaimed space.

The VMware consolidation ran on its own, but it took about 5 hours to complete. When I came into my office the next morning, I found that all of the consolidations had run properly and we had recovered the lost space. VMware had successfully completed the process of releasing the disk lease, removing the snapshot, and reconfiguring the virtual machine. I attribute this to just turning off the protection overnight. Little did I know that this hold on the file was what ultimately caused the logjam and kept these jobs from consolidating.

Why this seems to happen randomly is another question, which I hope the developers can answer. It is unnerving to have to keep watch on the datastore when this occurs, but pausing the backup software, which broke the file lock and allowed VMware to run its normal consolidation process, helped us recover the lost space.
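
To put a number on how long a datastore can absorb this, a back-of-the-envelope estimate in Python might look like the following. It is a sketch only: the function name nights_until_full is made up, the 100 GB per night figure comes from the description above, and any free-space figure used with it is an assumption.

```python
def nights_until_full(free_gb, growth_gb_per_night):
    """Whole nights of failed consolidations the datastore can absorb.

    free_gb: free space currently left on the datastore, in GB
    growth_gb_per_night: space lost per night to unconsolidated snapshots
    """
    if growth_gb_per_night <= 0:
        raise ValueError("growth_gb_per_night must be positive")
    return int(free_gb // growth_gb_per_night)
```

With a hypothetical 750 GB free and 100 GB lost per night, nights_until_full(750, 100) gives 7, so roughly a week of unnoticed failures would be enough to fill the datastore.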

maxchilenet2017
Enthusiast

You need free space on the datastore for consolidation, at least twice the size of the machine.

Also try consolidating with the machine powered off.
