VMware Cloud Community
BTI_MRatcliffe
Contributor

Virtual Machine Consolidation Needed / Runaway snapshot files

I have a CentOS guest that reports "Virtual Machine Consolidation Needed". When I attempt to run consolidation, I receive the error "Unable to access file <unspecified filename>". When I browse the datastore, the folder for this server contains 64 vmdk files for each drive, and the server currently has only 2 drives. The 64 vmdk files range in size from a couple of hundred MB to several GB, but each is provisioned for the full size of the disk. In addition, each vmdk has a corresponding -ctk file. No snapshots are listed in the Snapshot Manager. I am unable to clone or clone to template. I am able to snapshot the machine, but this just creates more vmdk files, and even after the snapshot has been deleted the nab_web-0000xx.vmdk file remains and the disk stays attached to it (currently nab_web-000064.vmdk). The total number of files continues to grow any time a snapshot is taken or a clone is attempted. Any help with this would be greatly appreciated.

nab_web.jpg

nab_web2.jpg

nab_web3.jpg

16 Replies
a_p_
Leadership

Issues like this usually occur with image based backup applications in place, where the backup application didn't finish a job and still locks one of the snapshots. Which backup application do you use, and does it also leverage backup proxies?

You may want to run RVTools to find out whether another VM has one of the VM's virtual disk files still mounted.

André

BTI_MRatcliffe
Contributor

We use Backup Exec 2012; however, this machine is currently excluded from VM backups because the backups fail due to this problem. Not being able to back the VM up is one of our main concerns regarding this problem. I have confirmed that no other guest has these disks mounted.

JPM300
Commander

Was the VM ever in the VM backups in Backup Exec 2012? Backup Exec has a really bad habit of removing items from the GUI while they remain in the list when you look at what is actually being backed up in TEXT mode. I wonder whether it is still trying to back up this system, because if the backups are not creating the large number of snapshots, I'm not sure what else would be. The only other thing I can think of is Change Control implemented with VMware Configuration Manager; however, this requires VMware Tools to be installed as of 5.5, or a separate agent in 5.1 and lower, so I doubt this is the case.

You could look at the tasks/events on that system and see which user initiated the snapshots. Maybe that will lead you to the application that is causing it.

a_p_
Leadership

Please take a look at VMware KB: Unable to delete the virtual machine snapshot due to locked files and follow the steps to see whether this solves the issue.

André

BTI_MRatcliffe
Contributor

This is not a locked file issue, and the article states "Currently, there is no resolution" and suggests consolidation (which does not work). Was there something specific you suggest trying?

BTI_MRatcliffe
Contributor

No, it was never backed up using Backup Exec 2012. As mentioned above, attempting to clone or clone to template (both of which fail) causes additional vmdk files to be created. In addition, taking a snapshot (which I can successfully delete via the Snapshot Manager) creates additional vmdk files.

a_p_
Leadership

Well, at least one of the files seems to be either locked or otherwise inaccessible. Do you see any hints as to which file is causing the issue and why, i.e. which error is reported in either the VM's vmware.log or in the vmkernel log?

André

BTI_MRatcliffe
Contributor

Here are a few parts of the vmware.log at the time of a consolidation that may be related/relevant. 

2014-06-05T17:20:10.489Z| vcpu-0| DISKLIB-CTK   : Could not open change tracking file "/vmfs/volumes/4a8598db-085c22d9-9ca6-00219b91ddc8/nab_web/nab_web-ctk.vmdk": Change tracking invalid or disk in use.

2014-06-05T17:20:10.492Z| vcpu-0| DISKLIB-CTK   : Re-initializing change tracking.

2014-06-05T17:20:10.492Z| vcpu-0| DISKLIB-CTK   : Auto blocksize for size 54525952 is 128.

2014-06-05T17:20:10.498Z| vcpu-0| DISKLIB-CBT   : Initializing ESX kernel change tracking for fid 327999057.

2014-06-05T17:20:10.498Z| vcpu-0| DISKLIB-CBT   : Successfuly created cbt node 138cde51-cbt.

2014-06-05T17:20:10.498Z| vcpu-0| DISKLIB-CBT   : Opening cbt node /vmfs/devices/cbt/138cde51-cbt

2014-06-05T17:20:10.498Z| vcpu-0| DISKLIB-LIB   : Opened "/vmfs/volumes/4a8598db-085c22d9-9ca6-00219b91ddc8/nab_web/nab_web.vmdk" (flags 0x20a, type vmfs).

2014-06-05T17:20:10.499Z| vcpu-0| DISKLIB-CBT   : Shutting down change tracking for untracked fid 505961863.

2014-06-05T17:20:10.499Z| vcpu-0| DISKLIB-CBT   : Successfully disconnected CBT node.

2014-06-05T17:20:15.208Z| SnapshotVMXCombiner| DISKLIB-VMFS_SPARSE : VmfsSparseExtentCombine: failed: for 66 level and start 0 Input/output error.

2014-06-05T17:20:15.208Z| SnapshotVMXCombiner| DISKLIB-CTK   : End Combine

2014-06-05T17:20:15.209Z| SnapshotVMXCombiner| DISKLIB-CTK   : Attempting unlink of (null)

2014-06-05T17:20:15.209Z| SnapshotVMXCombiner| SnapshotVMXCombineFinalCb: Done with combine of 67 links, starting from 1 in 2759350 usec with error 0x50009: Input/output error

2014-06-05T17:20:15.330Z| vcpu-0| SnapshotVMXNeedConsolidateIteration : Invalid consolidateRate (4294967296.000000 MBps), not taking a helper snapshot

2014-06-05T17:20:15.330Z| vcpu-0| SnapshotVMXNeedConsolidateIteration: Another iteration of helper snapshot is not needed.

2014-06-05T17:20:16.405Z| vcpu-0| Foundry operation failed with system error: Input/output error (5), translated to 7

2014-06-05T17:20:16.405Z| vcpu-0| SnapshotVMXConsolidateOnlineCB: Destroying thread 5
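
For a longer vmware.log it can help to filter for the lines that mention a failure, as done by hand above. The following is a minimal sketch, assuming the "timestamp| thread| message" layout seen in this excerpt; the keyword list and the function name find_suspect_lines are hypothetical choices for illustration, not part of any VMware tool.

```python
import re

# Layout assumed from the excerpt above: "<timestamp>| <thread>| <message>".
LINE_RE = re.compile(r"^(?P<ts>\S+)\|\s*(?P<thread>[^|]+)\|\s*(?P<msg>.*)$")

# Hypothetical keyword list; extend it as needed.
KEYWORDS = ("error", "failed", "could not", "invalid")


def find_suspect_lines(log_text):
    """Return (timestamp, thread, message) for lines whose message hits a keyword."""
    hits = []
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # blank line or a line in some other layout
        msg = match.group("msg")
        if any(keyword in msg.lower() for keyword in KEYWORDS):
            hits.append((match.group("ts"), match.group("thread").strip(), msg))
    return hits


# Three lines copied from the excerpt above.
SAMPLE = "\n".join([
    '2014-06-05T17:20:10.498Z| vcpu-0| DISKLIB-LIB   : Opened "/vmfs/volumes/4a8598db-085c22d9-9ca6-00219b91ddc8/nab_web/nab_web.vmdk" (flags 0x20a, type vmfs).',
    '2014-06-05T17:20:15.208Z| SnapshotVMXCombiner| DISKLIB-VMFS_SPARSE : VmfsSparseExtentCombine: failed: for 66 level and start 0 Input/output error.',
    '2014-06-05T17:20:15.208Z| SnapshotVMXCombiner| DISKLIB-CTK   : End Combine',
])

for ts, thread, msg in find_suspect_lines(SAMPLE):
    print(ts, thread, msg, sep=" | ")
```

Keyword matching is deliberately crude; it only narrows down where to start reading and does not replace going through the full log.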

a_p_
Leadership

To rule out issues with changed block tracking files, you can simply delete all the ...-ctk.vmdk files. You don't need them for snapshot consolidation. Looking at the large number of snapshots you already have, you may run into issues with the maximum supported snapshot depth (see http://kb.vmware.com/kb/1004545).
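
To see how deep the chain really is, the parent links in the snapshot descriptor files can be followed from the current disk back to the base disk. This is a minimal sketch, assuming each snapshot descriptor carries a parentFileNameHint="..." line naming its parent; the function name snapshot_chain, the example file names, and the default limit of 32 are assumptions for illustration (check the KB article for the limit that applies to your version).

```python
import posixpath
import re

# Snapshot descriptors name their parent in a line such as:
#   parentFileNameHint="nab_web-000066.vmdk"
PARENT_RE = re.compile(r'^parentFileNameHint\s*=\s*"([^"]+)"', re.MULTILINE)


def snapshot_chain(descriptors, start):
    """Follow parent hints from `start` down to the base disk.

    `descriptors` maps a descriptor file name to its text content.
    Returns the file names from the current disk to the base disk.
    """
    chain = [start]
    current = start
    while True:
        text = descriptors.get(current)
        if text is None:
            break  # descriptor not available; the chain ends here
        match = PARENT_RE.search(text)
        if not match:
            break  # no parent hint: this is the base disk
        parent = posixpath.basename(match.group(1))  # hints may be full paths
        if parent in chain:
            raise ValueError("loop in snapshot chain at " + parent)
        chain.append(parent)
        current = parent
    return chain


def too_deep(chain, limit=32):
    """True if the number of snapshot links exceeds `limit` (assumed value)."""
    return len(chain) - 1 > limit


# Hypothetical descriptors, not taken from the VM in this thread.
example = {
    "vm-000002.vmdk": 'CID=aaaaaaaa\nparentFileNameHint="vm-000001.vmdk"\n',
    "vm-000001.vmdk": 'CID=bbbbbbbb\nparentFileNameHint="vm.vmdk"\n',
    "vm.vmdk": "CID=cccccccc\nparentCID=ffffffff\n",
}
print(snapshot_chain(example, "vm-000002.vmdk"))
```

On a real datastore the mapping would be filled by reading the small text descriptor files (not the -delta or -ctk files) from the VM's folder.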

Anyway, cloning reported an issue with snapshot nab_web-000059.vmdk, so maybe there's a problem with this file. Please post the complete vmware.log file (attach it using the Advanced Editor) so we can see whether it contains anything that helps solve the issue.

André

BTI_MRatcliffe
Contributor

Attached is the entire vmware.log.

a_p_
Leadership

Assuming you can afford some downtime and you have sufficient free disk space on the datastore to clone the virtual disks, we could try a few things to see what happens and how to work around the issues. Although you reported a cloning error in your initial post, I'd like you to try to manually clone the two virtual disks from the command line. The commands will not alter the original files; they only create clones, which we can rename later to match the original names.

  1. cleanly shut down the VM (do not suspend)
  2. open an SSH/putty session or the ESXi Shell and go to the VM's directory
  3. clone the virtual disks running: (assuming 000067 are still the current snapshot numbers for the disks in the VM's .vmx file)
    vmkfstools -i nab_web-000067.vmdk nab_web_clone.vmdk
    vmkfstools -i nab_web_1-000067.vmdk nab_web_1_clone.vmdk
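
Because the snapshot number changes with every snapshot or clone attempt, it is worth reading the current disk names from the .vmx file before typing the commands. The following is a minimal sketch, assuming .vmx disk entries of the form scsi0:0.fileName = "name.vmdk"; the function names and the sample .vmx content are hypothetical, and the clone naming simply mirrors the two commands above.

```python
import re

# Disk entries in a .vmx file look like:
#   scsi0:0.fileName = "nab_web-000067.vmdk"
DISK_RE = re.compile(
    r'^\s*([A-Za-z]+\d+:\d+)\.fileName\s*=\s*"([^"]+\.vmdk)"\s*$',
    re.IGNORECASE | re.MULTILINE,
)


def current_disks(vmx_text):
    """Map each device (e.g. 'scsi0:0') to the .vmdk it currently points at."""
    return {device: name for device, name in DISK_RE.findall(vmx_text)}


def clone_commands(vmx_text):
    """Build one 'vmkfstools -i' command line per virtual disk."""
    commands = []
    for name in current_disks(vmx_text).values():
        # Drop a trailing -NNNNNN snapshot suffix and append _clone.
        target = re.sub(r"(-\d{6})?\.vmdk$", "_clone.vmdk", name)
        commands.append("vmkfstools -i {} {}".format(name, target))
    return commands


# Hypothetical .vmx fragment for illustration.
sample_vmx = '''
scsi0:0.present = "TRUE"
scsi0:0.fileName = "nab_web-000067.vmdk"
scsi0:1.fileName = "nab_web_1-000067.vmdk"
ide1:0.fileName = "cdrom0"
'''

for line in clone_commands(sample_vmx):
    print(line)
```

The sketch only prints the command lines; running them is still a manual step in the VM's directory with the VM powered off.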


Once done please post the results.


André


BTI_MRatcliffe
Contributor

I followed your suggestions: shut down the server, connected to the host via SSH, and ran the following. I received the error "Failed to clone disk: Bad file descriptor (589833)".

/vmfs/volumes/4a8598db-085c22d9-9ca6-00219b91ddc8/nab_web # vmkfstools -i nab_web-000069.vmdk nab_web_clone.vmdk

Destination disk format: VMFS zeroedthick

Cloning disk 'nab_web-000069.vmdk'...

Clone: 100% done.

Failed to clone disk: Bad file descriptor (589833).

I also tried to clone from other various vmdks such as 000008.vmdk and as far back as 000001.vmdk with the same results.

/vmfs/volumes/4a8598db-085c22d9-9ca6-00219b91ddc8/nab_web # vmkfstools -i nab_web-000001.vmdk nab_web_clone.vmdk

Destination disk format: VMFS zeroedthick

Cloning disk 'nab_web-000001.vmdk'...

Clone: 100% done.

Failed to clone disk: Bad file descriptor (589833).

/vmfs/volumes/4a8598db-085c22d9-9ca6-00219b91ddc8/nab_web #

a_p_
Leadership

Maybe the verbose option gives some hints? Please run the vmkfstools command again appending -v 10 to the command line.

André

imfaisal87
Enthusiast

I have gone through this nightmare. The base .vmdk disk is being used somewhere else in your vCenter.

In my case it was a backup solution. You say this VM was never backed up, but check again: if Backup Exec runs as a virtual appliance, that appliance may still have the disk locked.

Also, if you have multiple datastores, check the Summary tab of the VM to see which datastores the VM is using. If an unexpected datastore is attached to the VM, it may help you identify where the base disk is locked.

Regards

bggb29
Expert

Can you migrate the machine while it is powered off, and if so, do all the snapshots follow?

Is your storage capable of cloning the system (not a vCenter clone)?

Did anyone ever use a VDP appliance to back up the system?

Which disk is the active disk according to the .vmx file?

With the system powered off, can you use the CLI to consolidate the snapshots?

gallycool
Enthusiast

Hello

Below is the process a backup operation follows to back up a virtual machine.

When a backup is triggered, the backup solution takes a snapshot of the disk. This creates a new delta file, and all I/O from that point on is written to the delta file.

The virtual machine then runs on the delta file, and the base disks are mounted on the backup proxy server, where the backup is taken.

Once the backup is complete, the backup solution has to remove the mounted disks from the proxy, delete the snapshot, and resume operations from the base disk.

If the backup operation fails or does not release the disk from the proxy, you may not be able to consolidate the disk, and it is reported as locked.

Please check for a lock on the file named in the error message, as follows:

vmkfstools -D file.vmdk

The output includes the MAC address of the host or system that holds a lock on the disk.

Identify the system that holds the lock and reboot it.

This releases the lock.
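
The MAC address is not printed on its own; it is the last group of the owner ID in the vmkfstools -D output. This is a minimal sketch for pulling it out, assuming the output contains a fragment of the form "owner xxxxxxxx-xxxxxxxx-xxxx-xxxxxxxxxxxx"; the function name lock_owner_mac and the sample line are illustrative and do not come from the VM in this thread.

```python
import re

# Assumed fragment in the "vmkfstools -D" output:
#   owner 4655cd8b-3c4a19f2-17bc-00145e808070
# The last 12 hex digits are the MAC address of the owning host.
OWNER_RE = re.compile(
    r"owner\s+[0-9a-fA-F]{8}-[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-([0-9a-fA-F]{12})"
)


def lock_owner_mac(output):
    """Return the owner's MAC as aa:bb:cc:dd:ee:ff, or None if no owner is shown."""
    match = OWNER_RE.search(output)
    if not match:
        return None
    raw = match.group(1).lower()
    if set(raw) == {"0"}:
        return None  # all zeros: no owning host recorded in this field
    return ":".join(raw[i:i + 2] for i in range(0, 12, 2))


# Illustrative line only; real output contains more fields.
sample_output = "Lock [type 10c00001 offset 54009856 v 11, owner 4655cd8b-3c4a19f2-17bc-00145e808070 mtime 114]"
print(lock_owner_mac(sample_output))
```

The resulting address can then be compared with the MAC addresses of the hosts' management interfaces to find the machine to look at.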

The second possibility is ctk file corruption.

The ctk files are change tracking files and are created after a full backup.

On subsequent backups these files are updated with the changes made since the full backup.

So when an incremental backup runs, it does not scan the disk but checks the ctk file for the changed blocks.

A ctk file can become corrupted if the disk is expanded beyond 128 GB or, in some cases, if a backup fails.

If the ctk files are corrupted, delete them and then consolidate. This can be done while the virtual machine is powered on.

If that doesn't work, power off the virtual machine and consolidate.

If this still doesn't work, the last option is to reboot the host on which the virtual machine runs.

Please let me know if you still have any queries.

Thanks

Sam
