EKroboter
Enthusiast

Cannot delete snapshots, but can create new ones.

First of all, I want to thank everyone here for all the help and resources, without all your help I wouldn't have been able to make the switch to ESXi.

Now, some background before I delve into my current issue. I'm the sysadmin for a medium-sized company. Our infrastructure uses two ESXi hosts (no vCenter, just two standalone hosts) with about half a dozen VMs on each (domain controllers, print server, UniFi controller, ESET Remote Administrator, SQL Server, etc.). Both hosts are identical (HPE ProLiant ML110 Gen9, 48 GB of RAM each, a RAID 1 array for the main datastore plus an SSD for the swap datastore).

I back up all the VMs about once a week using Veeam Backup & Replication Free Edition, VeeamZipping them into packages on external storage. This has worked without any problems so far.

I don't usually take snapshots of the VMs except when making important changes, such as prior to installing a new SQL instance. When I do take snapshots, though, I make sure to delete them and consolidate the disks after confirming that everything works. I don't want to store any cruft; I want clean, lean VMs.

Yesterday I upgraded both hosts to ESXi 6.5 U1 (from 6.5). Everything went OK, with no error messages whatsoever. There was one oversight, though: I forgot to delete some snapshots from one VM prior to the update.

All the VMs work except for one, on which I cannot remove any snapshots. Windows boots just fine and everything works, but I get errors when trying to delete all the snapshots and when trying to consolidate the disks. I read that a CD-ROM drive connected to an ISO image can interfere with the process. It was connected, but the problem persisted even after disconnecting the ISO.

I tried to create a new snapshot to see if it would work. It did, but I cannot delete that one either. I VeeamZipped the VM to see if I could, and I was able to back it up. Veeam did end the job with a warning, though, saying that it wasn't able to delete the temporary snapshot it created for the job, which I then confirmed in the Snapshot Manager in the ESXi UI.

The message I received when trying to delete all snapshots is the following:

Failed - A general system error occurred: vim.fault.GenericVmConfigFault

And the one I get when trying to consolidate:

Failed - Unable to access file since it is locked

An error occurred while consolidating disks: One of the disks in this virtual machine is already in use by a virtual machine or by a snapshot.

Things I tried so far:

  1. Tried removing the snapshots with the VM both powered on and powered off. No luck.
  2. Rebooting the ESXi host with the troubled VM. No luck.
  3. Restarting my workstation, which runs VeeamZip, to see if it had anything to do with it. No luck.

I've read several articles online about this problem, and quite frankly I'm a bit overwhelmed. I don't know where to start; everyone seems to hit this problem in different scenarios and setups, none of which apply to mine.

Here are some screenshots in case they help:

The VM in question:

VM details.png

The error popup:

Error popup.png

And the description:

Error description.png

The current snapshots:

Snapshots.png

Contents of the datastore:

Datastore ls.png

I'd appreciate any help I can get. Thank you! 🙂

29 Replies
daphnissov
Immortal

What is interesting about your screenshot is that you appear to have 2 disks on this VM, with 4 snapshots on one disk but only *3* on the other. First thing to check is to make sure neither of those drives (VMDKs) is mounted to any other VM on that host. Second, if you haven't already, update your ESXi Embedded Host Client to the latest version posted on the Flings site; you can update the client directly from a web browser using the instructions in the fling. Third, attempt a "delete all snapshots" once again and record the time you initiated that operation. Then, pull the vmware.log file from the VM's home directory and upload it to your thread.
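
If it helps, a couple of shell one-liners for those checks (the angle-bracketed names are placeholders; adjust them to your datastore layout). From an SSH session on the host,

grep -il "<vmdk file name>" /vmfs/volumes/<datastore>/*/*.vmx

will list any .vmx files that reference a given disk, and you can pull the log off the host with something like

scp root@<esxi host>:/vmfs/volumes/<datastore>/<vm folder>/vmware.log .

if you prefer that over the datastore browser.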

EKroboter
Enthusiast

Yes, I noticed that as well. Hard Disk 1 has four VMDKs and Hard Disk 2 only three. I can only assume this is because, at one point, I tried to increase the size of the disk from 80 to 120 GB. The VM was powered off at the time. No change was applied, so I figured it was some UI issue. Hard Disk 1 is still 80 GB; it's the C: drive in the VM containing the OS (thin provisioned), and Hard Disk 2 is a thick-provisioned 120 GB second drive that hosts the SQL databases.
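
(For reference, and I may be wrong here: as far as I understand, ESXi won't grow a virtual disk while it still has snapshots, which would explain why the size never changed. The shell equivalent would be something like vmkfstools -X 120G <base disk>.vmdk, with <base disk> being a placeholder, and I believe that is also unsupported while snapshot deltas exist.)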

I did take a snapshot prior to trying to expand the disk. I thought about reverting to it, but I assumed that would fail as well, so I didn't.

No other VM on the host has any VMDKs from this VM mounted.

The current Client version I have is 1.21.0. The Fling you mentioned is now at v1.24.

Client version.png

So, I downloaded the VIB file, renamed it to esxui.vib (for faster typing), put it in the /tmp directory, and then ran:

esxcli software vib install -v /tmp/esxui.vib

That returned the following result:

Screen Shot 2017-12-09 at 17.01.44.png

And now I'm on v1.24:

Screen Shot 2017-12-09 at 17.02.53.png
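
(For reference, the installed version can also be checked from the shell with something like esxcli software vib list | grep -i esx-ui, assuming the Host Client VIB is listed under the name esx-ui.)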

I then tried to delete all the snapshots one more time, and here's the vmware.log file that I downloaded immediately afterwards from the VM's home folder (/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL).

daphnissov
Immortal

In the VM's home directory, run ls -lah and post the output here.

daphnissov
Immortal

Please also run the following command from the VM's home directory and attach the output file, named "vmdkChain", which will appear in the same directory.

find . -maxdepth 1 -iname '*.vmdk' -not -name '*-sesparse.vmdk' -not -name '*-flat.vmdk' -not -name '*-ctk.vmdk' -exec cat {} +>vmdkChain

EKroboter
Enthusiast

Here's the output of ls -lah:

total 182033984

drwxr-xr-x    1 root     root       80.0K Dec  9 02:40 .

drwxr-xr-t    1 root     root       76.0K Nov 19 20:57 ..

-rw-------    1 root     root       31.7K Nov 18 13:35 EKR-SVR02-SQL-Snapshot23.vmsn

-rw-------    1 root     root       31.7K Dec  9 01:57 EKR-SVR02-SQL-Snapshot25.vmsn

-rw-------    1 root     root       31.7K Dec  9 02:05 EKR-SVR02-SQL-Snapshot26.vmsn

-rw-r--r--    1 root     root          13 Nov  3 19:20 EKR-SVR02-SQL-aux.xml

-rw-------    1 root     root        8.5K Dec  9 16:07 EKR-SVR02-SQL.nvram

-rw-------    1 root     root        1.6K Dec  9 02:05 EKR-SVR02-SQL.vmsd

-rwx------    1 root     root        4.0K Dec  9 14:46 EKR-SVR02-SQL.vmx

-rw-------    1 root     root           0 Dec  9 02:39 EKR-SVR02-SQL.vmx.lck

-rw-------    1 root     root        3.1K Dec  9 01:56 EKR-SVR02-SQL.vmxf

-rwx------    1 root     root        4.0K Dec  9 14:46 EKR-SVR02-SQL.vmx~

-rw-------    1 root     root        5.0M Dec  9 02:03 EKR-SVR02-SQL0-000001-ctk.vmdk

-rw-------    1 root     root        8.8G Dec  9 02:03 EKR-SVR02-SQL0-000001-sesparse.vmdk

-rw-------    1 root     root         481 Dec  9 01:58 EKR-SVR02-SQL0-000001.vmdk

-rw-------    1 root     root        5.0M Nov 18 13:34 EKR-SVR02-SQL0-000002-ctk.vmdk

-rw-------    1 root     root       80.0G Nov 18 13:34 EKR-SVR02-SQL0-000002-flat.vmdk

-rw-------    1 root     root         637 Nov 18 00:19 EKR-SVR02-SQL0-000002.vmdk

-rw-------    1 root     root        5.0M Dec  9 01:57 EKR-SVR02-SQL0-000003-ctk.vmdk

-rw-------    1 root     root      326.0M Dec  9 01:57 EKR-SVR02-SQL0-000003-sesparse.vmdk

-rw-------    1 root     root         427 Dec  9 01:57 EKR-SVR02-SQL0-000003.vmdk

-rw-------    1 root     root        5.0M Dec  9 02:40 EKR-SVR02-SQL0-000004-ctk.vmdk

-rw-------    1 root     root      537.2M Dec  9 22:06 EKR-SVR02-SQL0-000004-sesparse.vmdk

-rw-------    1 root     root         427 Dec  9 02:39 EKR-SVR02-SQL0-000004.vmdk

-rw-------    1 root     root        7.5M Dec  9 02:03 EKR-SVR02-SQL_1-000001-ctk.vmdk

-rw-------    1 root     root        2.2G Dec  9 02:03 EKR-SVR02-SQL_1-000001-sesparse.vmdk

-rw-------    1 root     root         477 Dec  9 01:58 EKR-SVR02-SQL_1-000001.vmdk

-rw-------    1 root     root        7.5M Dec  9 01:57 EKR-SVR02-SQL_1-000002-ctk.vmdk

-rw-------    1 root     root      487.0M Dec  9 01:57 EKR-SVR02-SQL_1-000002-sesparse.vmdk

-rw-------    1 root     root         430 Dec  9 01:57 EKR-SVR02-SQL_1-000002.vmdk

-rw-------    1 root     root        7.5M Dec  9 02:40 EKR-SVR02-SQL_1-000003-ctk.vmdk

-rw-------    1 root     root      536.0M Dec  9 22:06 EKR-SVR02-SQL_1-000003-sesparse.vmdk

-rw-------    1 root     root         430 Dec  9 02:40 EKR-SVR02-SQL_1-000003.vmdk

-rw-------    1 root     root        7.5M Nov 18 13:34 EKR-SVR02-SQL_1-ctk.vmdk

-rw-------    1 root     root      120.0G Nov 18 13:34 EKR-SVR02-SQL_1-flat.vmdk

-rw-------    1 root     root         599 Nov 18 00:19 EKR-SVR02-SQL_1.vmdk

-rw-------    1 root     root      462.8K Nov 17 23:41 vmware-27.log

-rw-------    1 root     root      318.8K Nov 18 13:34 vmware-28.log

-rw-------    1 root     root      540.8K Nov 25 12:20 vmware-29.log

-rw-------    1 root     root      431.0K Dec  8 14:53 vmware-30.log

-rw-------    1 root     root      397.1K Dec  9 01:56 vmware-31.log

-rw-------    1 root     root      324.5K Dec  9 02:03 vmware-32.log

-rw-------    1 root     root      526.9K Dec  9 20:28 vmware.log

And attached is the output of the find command.

Hope I did everything right.

daphnissov
Immortal

Ok, after looking at your log and your metadata chain, something has happened to the snapshot sequence IDs. It appears that for each disk you have, there is one snapshot that is not being actively referenced. For disk 0 this would be EKR-SVR02-SQL0-000003.vmdk (and the corresponding -sesparse and -ctk files), and for disk 1 this would be EKR-SVR02-SQL_1-000002.vmdk and its accompanying files. Both of these have date stamps of Dec 9 01:57. Looking at the VM's log, disklib is not invoking these files, but it is invoking all the others.

2017-12-09T20:12:51.309Z| vmx| I125: DISKLIB-VMFS  : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL0-000004-sesparse.vmdk" : open successful (21) size = 563277824, hd = 0. Type 19

2017-12-09T20:12:51.309Z| vmx| I125: DISKLIB-VMFS  : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL0-000004-sesparse.vmdk" : closed.

2017-12-09T20:12:51.310Z| vmx| I125: DISKLIB-VMFS  : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL0-000001-sesparse.vmdk" : open successful (21) size = 9495916544, hd = 0. Type 19

2017-12-09T20:12:51.310Z| vmx| I125: DISKLIB-VMFS  : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL0-000001-sesparse.vmdk" : closed.

2017-12-09T20:12:51.310Z| vmx| I125: DISKLIB-VMFS  : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL0-000002-flat.vmdk" : open successful (21) size = 85899345920, hd = 0. Type 3

2017-12-09T20:12:51.310Z| vmx| I125: DISKLIB-VMFS  : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL0-000002-flat.vmdk" : closed.

2017-12-09T20:12:51.311Z| vmx| I125: DISKLIB-VMFS  : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL_1-000003-sesparse.vmdk" : open successful (21) size = 562040832, hd = 0. Type 19

2017-12-09T20:12:51.311Z| vmx| I125: DISKLIB-VMFS  : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL_1-000003-sesparse.vmdk" : closed.

2017-12-09T20:12:51.311Z| vmx| I125: DISKLIB-VMFS  : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL_1-000001-sesparse.vmdk" : open successful (21) size = 2372390912, hd = 0. Type 19

2017-12-09T20:12:51.312Z| vmx| I125: DISKLIB-VMFS  : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL_1-000001-sesparse.vmdk" : closed.

2017-12-09T20:12:51.312Z| vmx| I125: DISKLIB-VMFS  : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL_1-flat.vmdk" : open successful (21) size = 128849018880, hd = 0. Type 3

2017-12-09T20:12:51.312Z| vmx| I125: DISKLIB-VMFS  : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL_1-flat.vmdk" : closed.

Normally, in a healthy snapshot chain, all disks should be invoked in the reverse sequence ending with the base -flat extent file, but we don't see that with yours.
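
(If you want to double-check that from the shell, something like grep DISKLIB vmware.log | grep -c EKR-SVR02-SQL0-000003-sesparse should come back as 0 if disklib never opens that extent, whereas the same check against EKR-SVR02-SQL0-000004-sesparse should not. The same goes for EKR-SVR02-SQL_1-000002-sesparse on the other disk.)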

When I look at the disk metadata I had you generate with that last command, I can see that these orphaned disks don't have valid references to anything else in the chain. What's also interesting is that they appear to have a forward reference to the next delta, one minute in the future.

I also see only three snapshot descriptors.

-rw-------    1 root     root       31.7K Nov 18 13:35 EKR-SVR02-SQL-Snapshot23.vmsn

-rw-------    1 root     root       31.7K Dec  9 01:57 EKR-SVR02-SQL-Snapshot25.vmsn

-rw-------    1 root     root       31.7K Dec  9 02:05 EKR-SVR02-SQL-Snapshot26.vmsn

And the Dec 9 01:57 time stamp appears on the errant descriptor as well. The following related error appears in the log file each time you try to commit:

2017-12-09T20:12:51.312Z| vmx| I125: SNAPSHOT: SnapshotDiskTreeAddFromSnapshot: Trying to add snapshot EKR-SVR02-SQL-Snapshot26.vmsn to disk /vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL0-000001.vmdk which already has snapshot EKR-SVR02-SQL-Snapshot25.vmsn.

So it seems, somehow, a snapshot got created but never was referenced by the chain and isn't referenced even now.

Before proceeding, I know you said you had a VeeamZIP, but anytime you start messing with disks and their extents, you need to be positive you have a good backup.

Do not pass go and do not collect $200 if you think in any way, shape, or form that you do not have a good, solid backup.

That said, if you do, let's see if it can correct itself. Delete EKR-SVR02-SQL-Snapshot25.vmsn first with rm -f EKR-SVR02-SQL-Snapshot25.vmsn.

The VMSN files are just metadata for the memory points, which, since you didn't capture the memory state in any of the snapshots, essentially have no data. Delete this file and attempt to delete all snapshots once again. If that fails, repeat the ls -lah and attach a new vmware.log file.

daphnissov
Immortal

Also, I should have asked earlier, but please attach vmware-30.log, vmware-31.log, and vmware-32.log. I'd like to see what led to this behavior.

EKroboter
Enthusiast

Wow, that was extremely thorough of you. Thank you. I will try your suggestions and post back but here are the logs you requested.

As for the backups, I do have a recent VeeamZip of the VM; the job ended with a warning, not an error, simply stating that Veeam wasn't able to delete its snapshot afterwards.

As a precaution, I also did a full bare metal backup from within Windows. Just to be safe.

EKroboter
Enthusiast

I deleted the Snapshot25.vmsn file, but the problem remains. Here's the new output and the vmware.log

total 182051328

drwxr-xr-x    1 root     root       80.0K Dec  9 23:13 .

drwxr-xr-t    1 root     root       76.0K Nov 19 20:57 ..

-rw-------    1 root     root       31.7K Nov 18 13:35 EKR-SVR02-SQL-Snapshot23.vmsn

-rw-------    1 root     root       31.7K Dec  9 02:05 EKR-SVR02-SQL-Snapshot26.vmsn

-rw-r--r--    1 root     root          13 Nov  3 19:20 EKR-SVR02-SQL-aux.xml

-rw-------    1 root     root        8.5K Dec  9 23:14 EKR-SVR02-SQL.nvram

-rw-------    1 root     root        1.6K Dec  9 02:05 EKR-SVR02-SQL.vmsd

-rwx------    1 root     root        4.0K Dec  9 23:13 EKR-SVR02-SQL.vmx

-rw-------    1 root     root           0 Dec  9 23:13 EKR-SVR02-SQL.vmx.lck

-rw-------    1 root     root        3.1K Dec  9 01:56 EKR-SVR02-SQL.vmxf

-rwx------    1 root     root        3.9K Dec  9 23:13 EKR-SVR02-SQL.vmx~

-rw-------    1 root     root        5.0M Dec  9 02:03 EKR-SVR02-SQL0-000001-ctk.vmdk

-rw-------    1 root     root        8.8G Dec  9 02:03 EKR-SVR02-SQL0-000001-sesparse.vmdk

-rw-------    1 root     root         481 Dec  9 01:58 EKR-SVR02-SQL0-000001.vmdk

-rw-------    1 root     root        5.0M Nov 18 13:34 EKR-SVR02-SQL0-000002-ctk.vmdk

-rw-------    1 root     root       80.0G Nov 18 13:34 EKR-SVR02-SQL0-000002-flat.vmdk

-rw-------    1 root     root         637 Nov 18 00:19 EKR-SVR02-SQL0-000002.vmdk

-rw-------    1 root     root        5.0M Dec  9 01:57 EKR-SVR02-SQL0-000003-ctk.vmdk

-rw-------    1 root     root      326.0M Dec  9 01:57 EKR-SVR02-SQL0-000003-sesparse.vmdk

-rw-------    1 root     root         427 Dec  9 01:57 EKR-SVR02-SQL0-000003.vmdk

-rw-------    1 root     root        5.0M Dec  9 23:14 EKR-SVR02-SQL0-000004-ctk.vmdk

-rw-------    1 root     root      537.2M Dec  9 23:15 EKR-SVR02-SQL0-000004-sesparse.vmdk

-rw-------    1 root     root         427 Dec  9 23:13 EKR-SVR02-SQL0-000004.vmdk

-rw-------    1 root     root        7.5M Dec  9 02:03 EKR-SVR02-SQL_1-000001-ctk.vmdk

-rw-------    1 root     root        2.2G Dec  9 02:03 EKR-SVR02-SQL_1-000001-sesparse.vmdk

-rw-------    1 root     root         477 Dec  9 01:58 EKR-SVR02-SQL_1-000001.vmdk

-rw-------    1 root     root        7.5M Dec  9 01:57 EKR-SVR02-SQL_1-000002-ctk.vmdk

-rw-------    1 root     root      487.0M Dec  9 01:57 EKR-SVR02-SQL_1-000002-sesparse.vmdk

-rw-------    1 root     root         430 Dec  9 01:57 EKR-SVR02-SQL_1-000002.vmdk

-rw-------    1 root     root        7.5M Dec  9 23:14 EKR-SVR02-SQL_1-000003-ctk.vmdk

-rw-------    1 root     root      552.0M Dec  9 23:15 EKR-SVR02-SQL_1-000003-sesparse.vmdk

-rw-------    1 root     root         430 Dec  9 23:13 EKR-SVR02-SQL_1-000003.vmdk

-rw-------    1 root     root        7.5M Nov 18 13:34 EKR-SVR02-SQL_1-ctk.vmdk

-rw-------    1 root     root      120.0G Nov 18 13:34 EKR-SVR02-SQL_1-flat.vmdk

-rw-------    1 root     root         599 Nov 18 00:19 EKR-SVR02-SQL_1.vmdk

-rw-------    1 root     root      318.8K Nov 18 13:34 vmware-28.log

-rw-------    1 root     root      540.8K Nov 25 12:20 vmware-29.log

-rw-------    1 root     root      431.0K Dec  8 14:53 vmware-30.log

-rw-------    1 root     root      397.1K Dec  9 01:56 vmware-31.log

-rw-------    1 root     root      324.5K Dec  9 02:03 vmware-32.log

-rw-------    1 root     root      559.3K Dec  9 23:03 vmware-33.log

-rw-------    1 root     root      268.9K Dec  9 23:14 vmware.log

daphnissov
Immortal

Could you please re-attach these logs in a file whose name doesn't have commas? It's failing to download, and I'm guessing it doesn't like the commas.

daphnissov
Immortal

Try creating a new directory in that VM's home directory (mkdir backup) and move EKR-SVR02-SQL-Snapshot26.vmsn into it with mv EKR-SVR02-SQL-Snapshot26.vmsn backup/EKR-SVR02-SQL-Snapshot26.vmsn. Try the delete all again and see if it likes that. VMSN files shouldn't affect the removal or consolidation process, but removing them means you can't revert to that state. I probably should have said to move Snapshot25.vmsn into a backup directory before, but I figured you're not going to revert.

EKroboter
Enthusiast

Sure, here it is without commas.

EKroboter
Enthusiast

I moved Snapshot26, but it still isn't able to delete all the snapshots. Consolidation fails as well. This is getting serious.

The VM still works perfectly fine though.

daphnissov
Immortal

Do a cat EKR-SVR02-SQL.vmsd and paste the output.

daphnissov
Immortal

I also see that you have the advanced option snapshot.redoNotWithParent = "TRUE" set on this VM. This option is used to specify an alternate location where snapshot delta files reside. In your case, it isn't specifying an alternate location (workingDir = "."). I don't think this is the cause of any trouble, but it's unusual to see.
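
(If you want to see exactly how it's set, something like grep -iE 'redoNotWithParent|workingDir' EKR-SVR02-SQL.vmx from the VM's home directory should show both values, assuming they're set in the .vmx.)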

EKroboter
Enthusiast

Here's the output of cat:

.encoding = "UTF-8"

snapshot.lastUID = "26"

snapshot.current = "26"

snapshot0.uid = "23"

snapshot0.filename = "EKR-SVR02-SQL-Snapshot23.vmsn"

snapshot0.displayName = "Instalacion BCM"

snapshot0.description = "Previo a la instalacion de la instancia SQLOutlookBCM"

snapshot0.createTimeHigh = "351809"

snapshot0.createTimeLow = "-328381373"

snapshot0.numDisks = "2"

snapshot0.disk0.fileName = "EKR-SVR02-SQL0-000002.vmdk"

snapshot0.disk0.node = "scsi0:0"

snapshot0.disk1.fileName = "EKR-SVR02-SQL_1.vmdk"

snapshot0.disk1.node = "scsi0:1"

snapshot.numSnapshots = "3"

snapshot1.uid = "25"

snapshot1.filename = "EKR-SVR02-SQL-Snapshot25.vmsn"

snapshot1.parent = "23"

snapshot1.displayName = "Expansión de disco C"

snapshot1.description = "Previo a la expansión del disco C de 80 a 120 GB"

snapshot1.createTimeHigh = "352222"

snapshot1.createTimeLow = "-1626735197"

snapshot1.numDisks = "2"

snapshot1.disk0.fileName = "EKR-SVR02-SQL0-000001.vmdk"

snapshot1.disk0.node = "scsi0:0"

snapshot1.disk1.fileName = "EKR-SVR02-SQL_1-000001.vmdk"

snapshot1.disk1.node = "scsi0:1"

snapshot2.uid = "26"

snapshot2.filename = "EKR-SVR02-SQL-Snapshot26.vmsn"

snapshot2.parent = "25"

snapshot2.displayName = "VEEAM BACKUP TEMPORARY SNAPSHOT"

snapshot2.description = "Please do not delete this snapshot. It is being used by Veeam Backup."

snapshot2.createTimeHigh = "352222"

snapshot2.createTimeLow = "-1157136709"

snapshot2.numDisks = "2"

snapshot2.disk0.fileName = "EKR-SVR02-SQL0-000001.vmdk"

snapshot2.disk0.node = "scsi0:0"

snapshot2.disk1.fileName = "EKR-SVR02-SQL_1-000001.vmdk"

snapshot2.disk1.node = "scsi0:1"

[root@EKR-ESXi01:/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL]

EKroboter
Enthusiast

I honestly don't know why this is set to TRUE. This is the first time I've had a broken snapshot chain since I started using ESXi almost two years ago. You're far more experienced than I am, so I truly appreciate all your help, and I'm learning a lot in the process.

One thing I'm planning, if nothing else works, is to power off the VM, unregister it from the host, and rename the directory to EKR-SVR02-SQL-bak. Then I'd restore the VeeamZip backup into the original location. It's my understanding that the VeeamZip file only contains the latest state and does not archive the old snapshots in any way (a sort of clone taken from a snapshot).
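
Roughly, assuming I do it from the shell rather than the UI (the VM id being whatever vim-cmd vmsvc/getallvms reports, and <datastore> a placeholder):

vim-cmd vmsvc/getallvms

vim-cmd vmsvc/unregister <vmid>

mv /vmfs/volumes/<datastore>/EKR-SVR02-SQL /vmfs/volumes/<datastore>/EKR-SVR02-SQL-bak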

daphnissov
Immortal

Ok, this is interesting and represents a problem.

snapshot1.uid = "25"

snapshot1.filename = "EKR-SVR02-SQL-Snapshot25.vmsn"

snapshot1.parent = "23"

snapshot1.displayName = "Expansión de disco C"

snapshot1.description = "Previo a la expansión del disco C de 80 a 120 GB"

snapshot1.createTimeHigh = "352222"

snapshot1.createTimeLow = "-1626735197"

snapshot1.numDisks = "2"

snapshot1.disk0.fileName = "EKR-SVR02-SQL0-000001.vmdk"

snapshot1.disk0.node = "scsi0:0"

snapshot1.disk1.fileName = "EKR-SVR02-SQL_1-000001.vmdk"

snapshot1.disk1.node = "scsi0:1"

snapshot2.uid = "26"

snapshot2.filename = "EKR-SVR02-SQL-Snapshot26.vmsn"

snapshot2.parent = "25"

snapshot2.displayName = "VEEAM BACKUP TEMPORARY SNAPSHOT"

snapshot2.description = "Please do not delete this snapshot. It is being used by Veeam Backup."

snapshot2.createTimeHigh = "352222"

snapshot2.createTimeLow = "-1157136709"

snapshot2.numDisks = "2"

snapshot2.disk0.fileName = "EKR-SVR02-SQL0-000001.vmdk"

snapshot2.disk0.node = "scsi0:0"

snapshot2.disk1.fileName = "EKR-SVR02-SQL_1-000001.vmdk"

snapshot2.disk1.node = "scsi0:1"

The disk0.fileName and disk1.fileName entries under snapshot1 and snapshot2 above point to the per-disk delta descriptors that are created when a snapshot takes place. You have two per snapshot instance because you have two disks. Normally, the files should be sequential, with the first snapshot referencing the base disk, as in the following example of a normal descriptor:

[root@localhost:/vmfs/volumes/5a206362-e1f90f81-dc4e-0050568f2f00/qvgtaaq] cat qvgtaaq.vmsd

.encoding = "UTF-8"

snapshot.lastUID = "3"

snapshot.current = "3"

snapshot0.uid = "1"

snapshot0.filename = "qvgtaaq-Snapshot1.vmsn"

snapshot0.displayName = "test1"

snapshot0.createTimeHigh = "352241"

snapshot0.createTimeLow = "133930234"

snapshot0.numDisks = "1"

snapshot0.disk0.fileName = "qvgtaaq.vmdk"

snapshot0.disk0.node = "scsi0:0"

snapshot.numSnapshots = "3"

snapshot1.uid = "2"

snapshot1.filename = "qvgtaaq-Snapshot2.vmsn"

snapshot1.parent = "1"

snapshot1.displayName = "test2"

snapshot1.createTimeHigh = "352241"

snapshot1.createTimeLow = "142093129"

snapshot1.numDisks = "1"

snapshot1.disk0.fileName = "qvgtaaq-000001.vmdk"

snapshot1.disk0.node = "scsi0:0"

snapshot2.uid = "3"

snapshot2.filename = "qvgtaaq-Snapshot3.vmsn"

snapshot2.parent = "2"

snapshot2.displayName = "test3"

snapshot2.createTimeHigh = "352241"

snapshot2.createTimeLow = "-509770998"

snapshot2.numDisks = "1"

snapshot2.disk0.fileName = "qvgtaaq-000002.vmdk"

snapshot2.disk0.node = "scsi0:0"

You can see what I mean in the disk0.fileName entries: qvgtaaq.vmdk for the first snapshot, then qvgtaaq-000001.vmdk, then qvgtaaq-000002.vmdk, each one moving forward in the sequence. In your case, you have two different snapshots yet they reference the same disks. That should not be possible. I don't know exactly what the cause was, since the logs you attached don't say, but it is apparent from the directory listing that something occurred on November 18 at 13:34 hours.

If we examine the collection of disk metadata I had you generate with the find command, we can confirm what the kernel knows about the disk chain. I have taken your output and reordered it according to the chaining sequence that your two disks use. You'll notice each of your disks has four VMDKs, but the chain is only valid through three of the four. I'll call out the oddball by indenting it to set it apart, and I delineate your two disks with rows of octothorpes (#) labelled "SQL0" and "SQL_1", because these are the names of the VMDKs associated with each disk. Pay attention to the CID and parentCID values in each disk descriptor file.

##############SQL0#######################

# Disk DescriptorFile

version=3

encoding="UTF-8"

CID=6ff46ce1

parentCID=ffffffff

isNativeSnapshot="no"

createType="vmfs"

# Extent description

RW 167772160 VMFS "EKR-SVR02-SQL0-000002-flat.vmdk"

# Change Tracking File

changeTrackPath="EKR-SVR02-SQL0-000002-ctk.vmdk"

# The Disk Data Base

#DDB

ddb.adapterType = "lsilogic"

ddb.geometry.cylinders = "10443"

ddb.geometry.heads = "255"

ddb.geometry.sectors = "63"

ddb.longContentID = "345ddc5328509426567b75216ff46ce1"

ddb.thinProvisioned = "1"

ddb.toolsInstallType = "1"

ddb.toolsVersion = "10272"

ddb.uuid = "60 00 C2 90 ef 73 45 bd-dd f5 8c 2e 9e a7 41 4e"

ddb.virtualHWVersion = "4"

# Disk DescriptorFile

version=3

encoding="UTF-8"

CID=91dadebe

parentCID=6ff46ce1

isNativeSnapshot="no"

createType="seSparse"

parentFileNameHint="EKR-SVR02-SQL0-000002.vmdk"

# Extent description

RW 167772160 SESPARSE "EKR-SVR02-SQL0-000001-sesparse.vmdk"

# Change Tracking File

changeTrackPath="EKR-SVR02-SQL0-000001-ctk.vmdk"

# The Disk Data Base

#DDB

ddb.grain = "8"

ddb.longContentID = "f178022005ee4fc6d6e2550491dadebe"

ddb.toolsInstallType = "1"

ddb.toolsVersion = "10279"

                            # Disk DescriptorFile

                            version=3

                            encoding="UTF-8"

                            CID=a36b113d

                            parentCID=a36b113d

                            isNativeSnapshot="no"

                            createType="seSparse"

                            parentFileNameHint="EKR-SVR02-SQL0-000001.vmdk"

                            # Extent description

                            RW 167772160 SESPARSE "EKR-SVR02-SQL0-000003-sesparse.vmdk"

                            # Change Tracking File

                            changeTrackPath="EKR-SVR02-SQL0-000003-ctk.vmdk"

                            # The Disk Data Base

                            #DDB

                            ddb.grain = "8"

                            ddb.longContentID = "08103fbe84947c11a9466c8aa36b113d"

# Disk DescriptorFile

version=3

encoding="UTF-8"

CID=b5f97da1

parentCID=91dadebe

isNativeSnapshot="no"

createType="seSparse"

parentFileNameHint="EKR-SVR02-SQL0-000001.vmdk"

# Extent description

RW 167772160 SESPARSE "EKR-SVR02-SQL0-000004-sesparse.vmdk"

# Change Tracking File

changeTrackPath="EKR-SVR02-SQL0-000004-ctk.vmdk"

# The Disk Data Base

#DDB

ddb.grain = "8"

ddb.longContentID = "ff3e1a80442ecccf1a13724ab5f97da1"

##############SQL_1#######################

# Disk DescriptorFile

version=3

encoding="UTF-8"

CID=54df96f8

parentCID=ffffffff

isNativeSnapshot="no"

createType="vmfs"

# Extent description

RW 251658240 VMFS "EKR-SVR02-SQL_1-flat.vmdk"

# Change Tracking File

changeTrackPath="EKR-SVR02-SQL_1-ctk.vmdk"

# The Disk Data Base

#DDB

ddb.adapterType = "lsilogic"

ddb.geometry.cylinders = "15665"

ddb.geometry.heads = "255"

ddb.geometry.sectors = "63"

ddb.longContentID = "3e40cf9631f190f71f4b192654df96f8"

ddb.toolsInstallType = "1"

ddb.toolsVersion = "10272"

ddb.uuid = "60 00 C2 92 96 61 04 e5-bd 1e ca 54 ad bd 89 3c"

ddb.virtualHWVersion = "4"

# Disk DescriptorFile

version=3

encoding="UTF-8"

CID=fe05af20

parentCID=54df96f8

isNativeSnapshot="no"

createType="seSparse"

parentFileNameHint="EKR-SVR02-SQL_1.vmdk"

# Extent description

RW 251658240 SESPARSE "EKR-SVR02-SQL_1-000001-sesparse.vmdk"

# Change Tracking File

changeTrackPath="EKR-SVR02-SQL_1-000001-ctk.vmdk"

# The Disk Data Base

#DDB

ddb.grain = "8"

ddb.longContentID = "1b6fb5b15bef96ffde00d424fe05af20"

ddb.toolsInstallType = "1"

ddb.toolsVersion = "10279"

                    # Disk DescriptorFile

                    version=3

                    encoding="UTF-8"

                    CID=85020c58

                    parentCID=85020c58

                    isNativeSnapshot="no"

                    createType="seSparse"

                    parentFileNameHint="EKR-SVR02-SQL_1-000001.vmdk"

                    # Extent description

                    RW 251658240 SESPARSE "EKR-SVR02-SQL_1-000002-sesparse.vmdk"

                    # Change Tracking File

                    changeTrackPath="EKR-SVR02-SQL_1-000002-ctk.vmdk"

                    # The Disk Data Base

                    #DDB

                    ddb.grain = "8"

                    ddb.longContentID = "3dfb354351db08de3d1d734f85020c58"

# Disk DescriptorFile

version=3

encoding="UTF-8"

CID=f18c47fc

parentCID=fe05af20

isNativeSnapshot="no"

createType="seSparse"

parentFileNameHint="EKR-SVR02-SQL_1-000001.vmdk"

# Extent description

RW 251658240 SESPARSE "EKR-SVR02-SQL_1-000003-sesparse.vmdk"

# Change Tracking File

changeTrackPath="EKR-SVR02-SQL_1-000003-ctk.vmdk"

# The Disk Data Base

#DDB

ddb.grain = "8"

ddb.longContentID = "913f8203de5e0ca33d2337cbf18c47fc"

Each VMDK has a CID and a parentCID associated with it. The CID identifies the delta VMDK itself, and the parentCID identifies the disk it descends from. For convenience, I've compared all the CIDs and ordered the chain for each disk in order of precedence so it's easier to follow. Let's take a look at the first one, for example.

# Disk DescriptorFile

version=3

encoding="UTF-8"

CID=6ff46ce1

parentCID=ffffffff

isNativeSnapshot="no"

createType="vmfs"

# Extent description

RW 167772160 VMFS "EKR-SVR02-SQL0-000002-flat.vmdk"

# Change Tracking File

changeTrackPath="EKR-SVR02-SQL0-000002-ctk.vmdk"

There is a CID value and a parentCID. For base disks, the parentCID value equals ffffffff. This just means there is no other parent; the chain begins here. The CID is an identifier that refers to this disk itself and is unique. If we look at the next delta in the chain we see this:

# Disk DescriptorFile

version=3

encoding="UTF-8"

CID=91dadebe

parentCID=6ff46ce1

isNativeSnapshot="no"

createType="seSparse"

parentFileNameHint="EKR-SVR02-SQL0-000002.vmdk"

# Extent description

RW 167772160 SESPARSE "EKR-SVR02-SQL0-000001-sesparse.vmdk"

# Change Tracking File

changeTrackPath="EKR-SVR02-SQL0-000001-ctk.vmdk"

You can see that the parentCID of this delta corresponds to the CID of the first disk, while the delta has its own CID. Continuing on to the third disk in the chain:

# Disk DescriptorFile

version=3

encoding="UTF-8"

CID=b5f97da1

parentCID=91dadebe

isNativeSnapshot="no"

createType="seSparse"

parentFileNameHint="EKR-SVR02-SQL0-000001.vmdk"

# Extent description

RW 167772160 SESPARSE "EKR-SVR02-SQL0-000004-sesparse.vmdk"

# Change Tracking File

changeTrackPath="EKR-SVR02-SQL0-000004-ctk.vmdk"

# The Disk Data Base

#DDB

ddb.grain = "8"

ddb.longContentID = "ff3e1a80442ecccf1a13724ab5f97da1"

This disk has a parentCID that corresponds to the CID of the previous disk. You can also see the "parentFileNameHint" key, whose value tells you which file this disk points to. This is how a snapshot chain is formed. In the case of the outlier for disk0, however, we have this:

# Disk DescriptorFile

version=3

encoding="UTF-8"

CID=a36b113d

parentCID=a36b113d

isNativeSnapshot="no"

createType="seSparse"

parentFileNameHint="EKR-SVR02-SQL0-000001.vmdk"

# Extent description

RW 167772160 SESPARSE "EKR-SVR02-SQL0-000003-sesparse.vmdk"

# Change Tracking File

changeTrackPath="EKR-SVR02-SQL0-000003-ctk.vmdk"

# The Disk Data Base

#DDB

ddb.grain = "8"

ddb.longContentID = "08103fbe84947c11a9466c8aa36b113d"

You notice that the CID and parentCID do not correspond to any other CIDs in the chain. Also, you see they are both identical. Looking back at the vmware.log file, we can see which disks are invoked as part of this unbroken chain.

2017-12-09T23:19:48.680Z| vmx| I125: DISKLIB-VMFS  : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL0-000004-sesparse.vmdk" : open successful (21) size = 580059136, hd = 0. Type 19

2017-12-09T23:19:48.680Z| vmx| I125: DISKLIB-VMFS  : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL0-000004-sesparse.vmdk" : closed.

2017-12-09T23:19:48.681Z| vmx| I125: DISKLIB-VMFS  : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL0-000001-sesparse.vmdk" : open successful (21) size = 9495916544, hd = 0. Type 19

2017-12-09T23:19:48.681Z| vmx| I125: DISKLIB-VMFS  : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL0-000001-sesparse.vmdk" : closed.

2017-12-09T23:19:48.682Z| vmx| I125: DISKLIB-VMFS  : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL0-000002-flat.vmdk" : open successful (21) size = 85899345920, hd = 0. Type 3

2017-12-09T23:19:48.682Z| vmx| I125: DISKLIB-VMFS  : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL0-000002-flat.vmdk" : closed.

The order is confirmed as 4 -> 1 -> 2. The SE Sparse disk types indicate you're either running this VM on a VMFS-6 datastore or the virtual disk itself is larger than 2 TB. The -flat disk is the base disk extent (where the actual data resides).
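
(If you're curious which case applies to you, vmkfstools -Ph /vmfs/volumes/<your datastore> will print the filesystem type and version for that datastore.)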

Anyhow, to get to the point: it's not letting you delete the snapshots because the snapshot descriptor file (the VMSD) has conflicting information about two of the three snapshots, and it won't let you clobber one with the other. That's why, even though you removed the VMSN file, it still complains; the descriptor already has a file registered for that position.

Now, what to do about it. I've not seen this exact situation before, so I cannot provide precise guidance. I do have a suggestion which I *think* will resolve the issue, but it's not something I can test in my lab because of how specific your issue is. Before you act on anything I'm about to suggest to you, I recommend that you test out that VeeamZIP before you find yourself actually needing it. You can do this by restoring that VM with a different name and disconnecting the vNIC before powering it on. When VeeamZIP runs, it should capture a consolidated view of the VM and not all those snapshot files. At least I believe that's the case. I haven't actually verified this. In any case, do a test restore before proceeding in order to validate your data. I'm not sure what else runs on this, but I can infer from the name that it's a SQL server. If so, for an additional level of protection I would personally do a stand-alone backup of the important databases on this machine and offload them somewhere else in your estate. Any other data that is of import should be treated similarly.

With all those precautions and caveats in mind, this is what I think will work.

First, a validation that the outlier disk is truly unused.

  1. Shutdown this VM.
  2. SSH to the host and change to the VM's home directory.
  3. Make a backup directory.
  4. Move (or copy if you have the space) the derelict snapshots to this backup directory. For disk0 these are the 000003 files, and for disk1 these are the 000002 files. You can probably figure this out, but for disk0 you can move them with mv EKR-SVR02-SQL0-000003* <backup_dir>; rinse and repeat for disk1. (A consolidated sketch of these commands follows this list.)
  5. Perform ls -lh in the VM's home directory to ensure those files have been moved out.
  6. Run vmkfstools -e EKR-SVR02-SQL0-000004.vmdk within the VM's home directory. This will validate the chain is good. If it is, you'll see Disk chain is consistent. Repeat for EKR-SVR02-SQL_1-000003.vmdk to validate both chains.
  7. Power on the VM.
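
A consolidated sketch of steps 2 through 6, using the same paths as above and an arbitrary name for the backup directory:

cd /vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL

mkdir snapshot_backup

mv EKR-SVR02-SQL0-000003* snapshot_backup/

mv EKR-SVR02-SQL_1-000002* snapshot_backup/

ls -lh

vmkfstools -e EKR-SVR02-SQL0-000004.vmdk

vmkfstools -e EKR-SVR02-SQL_1-000003.vmdk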

If those files are truly unused, the VM will power on just fine and return to normal operation. If it doesn't, and it complains about them, honestly I'd have no idea why that would be the case at that point; power it down, move those files back, power on, and open a case with VMware support. If it does *not* complain, that is validation that those files are indeed derelict and no longer participate in the VM's disk chain. Assuming that's true and it powers on and everything is good, there are two possible routes that eventually lead to the same destination.

Once again, out of an abundance of caution, understand I have not done this and cannot properly test it internally. Please triple check you have validated your backup data is good as I take no responsibility for any corruption or data loss here.

We need to either wipe out the VMSD file and let the kernel consolidate disks on its own or alter the VMSD file to manually point it at the extents in use. If my theory is correct, either one should work.

Option 1:  Remove the VMSD file.

Pretty simple. Delete the VMSD file from the VM's home directory after you've copied it to your backup directory. Once it's deleted, do a consolidation operation. If it's successful, you should see a consolidate operation kick off that will collapse those disks back into the base. You should be left with EKR-SVR02-SQL0-000002.vmdk and EKR-SVR02-SQL_1.vmdk (plus their -flat and -ctk files).
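
A minimal sketch of that, assuming the same backup directory as before:

cp EKR-SVR02-SQL.vmsd snapshot_backup/

rm EKR-SVR02-SQL.vmsd

then trigger the consolidate operation from the host client UI.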

Option 2: Alter the VMSD file to correct the snapshot chain manually.

This is more involved but allows you to perform a snapshot "delete all" from the GUI. It's possible the VM will need to be powered off then powered on to re-read the file. Of that I'm not absolutely certain.

In any case, edit the VMSD file (after taking a backup copy) and replace the disk entries for snapshot2 so that they point at the last snapshots in the chain for both disks. As a reminder, for disk0 this file is EKR-SVR02-SQL0-000004.vmdk, and for disk1 it is EKR-SVR02-SQL_1-000003.vmdk. Once saved, try a delete all operation. It should now succeed.
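
If I'm reading your chain correctly, the two entries you'd end up changing in the VMSD would look roughly like this afterwards (a sketch only, based on the chain above):

snapshot2.disk0.fileName = "EKR-SVR02-SQL0-000004.vmdk"

snapshot2.disk1.fileName = "EKR-SVR02-SQL_1-000003.vmdk"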

EKroboter
Enthusiast

Wow. This is absolutely the best support I've ever received on any forum. Your insight and advice have been very helpful. I've been reading through your reply and can say that I understand almost everything, yet I still can't figure out how the snapshot chain got broken in the first place.

Anyhow, I need to get this corrected by tomorrow, so I'm going to unregister the VM and restore the backup in its place. I'm keeping the old one because I want to try to learn how to fix it properly if I ever hit the same issue again (or if the backup restore fails).

I will have some time tomorrow, after I've checked that everything is working correctly, to move the broken VM to the other host and perform the fix you suggest. This is actually the first time I've messed with the VM files directly, so I wouldn't want to do it on a production server. This particular VM runs the SQL Server behind our company's ERP software, so you can imagine the outcry if everyone came in to work tomorrow and found our invoicing and management software not working.

Not to worry though: apart from the VeeamZip, I did a full bare-metal backup, plus three copies of every database on the server.

I won't lose data. I will lose some time, but that's something I can deal with.
