First of all, I want to thank everyone here for all the help and resources; without them I wouldn't have been able to make the switch to ESXi.
Now, some background before I delve into my current issue. I'm the sysadmin for a medium-sized company. Our infrastructure uses two ESXi hosts (no vCenter, just two standalone hosts) with about half a dozen VMs on each (domain controllers, print server, UniFi controller, ESET Remote Administrator, SQL server, etc.). Both hosts are identical (HPE ProLiant ML110 Gen9s, 48 GB of RAM each, a RAID1 for the main datastore plus an SSD for the swap datastore).
I back up all the VMs about once a week using Veeam Backup & Replication Free Edition, VeeamZipping them into packages on external storage. This has worked without any problems so far.
I don't usually take snapshots of the VMs except when making important changes, such as prior to installing a new SQL instance. When I do take snapshots though, I make sure to delete them and consolidate the disks after I confirm that everything is working. I don't want to store any cruft, I want clean and lean VMs.
Yesterday I upgraded both hosts from ESXi 6.5 to 6.5U1. Everything went OK, no error messages whatsoever. There was one oversight though: I forgot to delete some snapshots from one VM prior to the update.
All the VMs work, except for one on which I cannot remove any snapshots. Windows boots just fine and everything works, but I get errors when trying to delete all the snapshots and when trying to consolidate the disks. I read that a CD-ROM drive connected to an ISO image can interfere with the process. Mine was connected, but the problem persisted even after disconnecting the ISO.
I tried to create a new snapshot to see if that would work. It did, but I cannot delete it either. I then VeeamZipped the VM to see if a backup would succeed, and it did. Veeam did end the job with a warning, though, saying it wasn't able to delete the temporary snapshot it created for the job, which I then confirmed in the Snapshot Manager in the ESXi UI.
The message I received when trying to delete all snapshots is the following:
Failed - A general system error occurred: vim.fault.GenericVmConfigFault
And the one I get when trying to consolidate:
Failed - Unable to access file since it is locked
An error occurred while consolidating disks: One of the disks in this virtual machine is already in use by a virtual machine or by a snapshot.
As for things I've tried so far: I read several articles online about this problem and, quite frankly, I'm a bit overwhelmed. I don't know where to start; everyone seems to hit this problem in different scenarios and setups, none of which apply to mine.
Here are some screenshots, if they're of any help, showing the VM in question, the error popup and its description, the current snapshots, and the contents of the datastore.
I'll appreciate any help I can get. Thank you!
What is interesting about your screenshot is that you appear to have two disks on this VM, with four snapshots on one disk but only *three* on the other. First, check that neither of those drives (VMDKs) is mounted to any other VM on that host. Second, if you haven't already, update your ESXi embedded host client to the latest version posted on the fling site; you can update the client directly from a web browser using the instructions in the fling. Third, attempt a "delete all snapshots" once again and record the time you initiated the operation. Then pull the vmware.log file from the VM's home directory and upload it to your thread.
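For the first check, one way to do it from the ESXi shell is to grep every VMX on the host's datastores for references to this VM's disk files. This is only a sketch I'm improvising for the thread (find_vmdk_refs is a made-up helper name, and you'd adjust the glob to your own datastore layout):

```shell
# List every VMX file that mentions the given pattern, to confirm no other VM
# has these VMDKs attached. Helper name invented for this thread, not a tool.
find_vmdk_refs() {
  pattern="$1"; shift
  grep -l "$pattern" "$@" 2>/dev/null
}

# Example (run on the host; adjust the glob to your datastores):
# find_vmdk_refs 'EKR-SVR02-SQL' /vmfs/volumes/*/*/*.vmx
```

Any VMX other than this VM's own showing up in that output would mean the disk is cross-attached.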
Yes, I noticed that as well. Hard Disk 1 has four VMDKs and Hard Disk 2 only three. I can only assume this is because, at one point, I tried to increase the size of the disk from 80 to 120 GB. The VM was powered off at the time. No change was applied; I figured it was some UI issue. Hard Disk 1 is still 80 GB; it's the C: drive in the VM containing the OS (thin provisioned), and Hard Disk 2 is a thick-provisioned 120 GB second drive hosting the SQL databases.
I did take a snapshot prior to trying to expand the disk. I thought about reverting to it, but I assumed that would fail as well, so I didn't.
No other VM in the host has any vmdks from this VM mounted.
The current Client version I have is 1.21.0. The Fling you mentioned is now at v1.24.
So, I downloaded the VIB file, renamed it to esxui.vib (for faster typing), put it in the /tmp directory, and then ran:
esxcli software vib install -v /tmp/esxui.vib
That returned the following result:
And now I'm on v1.24:
I then tried to delete all the snapshots one more time, and here's the vmware.log file that I downloaded immediately afterward from the VM's home folder (/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL).
In the home directory, perform ls -lah and post the output here.
Please also run the following command from the VM's home directory and attach the output file named "vmdkChain", which will appear in the same directory.
find . -maxdepth 1 -iname '*.vmdk' -not -name '*-sesparse.vmdk' -not -name '*-flat.vmdk' -not -name '*-ctk.vmdk' -exec cat {} +>vmdkChain
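If you just want to eyeball the linkage fields rather than the full descriptors, the same idea can be trimmed down. This is only a convenience sketch (show_chain is a name I made up, not a VMware utility):

```shell
# Print only the chain-linkage fields (CID, parentCID, parentFileNameHint)
# from each descriptor VMDK in a directory, skipping extent and CBT files,
# which contain no descriptor text.
show_chain() {
  for d in "${1:-.}"/*.vmdk; do
    [ -e "$d" ] || continue
    case "$d" in
      *-sesparse.vmdk|*-flat.vmdk|*-ctk.vmdk) continue ;;
    esac
    echo "== ${d##*/} =="
    grep -E '^(CID|parentCID|parentFileNameHint)' "$d" || true
  done
}
```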
Here's the output of ls -lah:
total 182033984
drwxr-xr-x 1 root root 80.0K Dec 9 02:40 .
drwxr-xr-t 1 root root 76.0K Nov 19 20:57 ..
-rw------- 1 root root 31.7K Nov 18 13:35 EKR-SVR02-SQL-Snapshot23.vmsn
-rw------- 1 root root 31.7K Dec 9 01:57 EKR-SVR02-SQL-Snapshot25.vmsn
-rw------- 1 root root 31.7K Dec 9 02:05 EKR-SVR02-SQL-Snapshot26.vmsn
-rw-r--r-- 1 root root 13 Nov 3 19:20 EKR-SVR02-SQL-aux.xml
-rw------- 1 root root 8.5K Dec 9 16:07 EKR-SVR02-SQL.nvram
-rw------- 1 root root 1.6K Dec 9 02:05 EKR-SVR02-SQL.vmsd
-rwx------ 1 root root 4.0K Dec 9 14:46 EKR-SVR02-SQL.vmx
-rw------- 1 root root 0 Dec 9 02:39 EKR-SVR02-SQL.vmx.lck
-rw------- 1 root root 3.1K Dec 9 01:56 EKR-SVR02-SQL.vmxf
-rwx------ 1 root root 4.0K Dec 9 14:46 EKR-SVR02-SQL.vmx~
-rw------- 1 root root 5.0M Dec 9 02:03 EKR-SVR02-SQL0-000001-ctk.vmdk
-rw------- 1 root root 8.8G Dec 9 02:03 EKR-SVR02-SQL0-000001-sesparse.vmdk
-rw------- 1 root root 481 Dec 9 01:58 EKR-SVR02-SQL0-000001.vmdk
-rw------- 1 root root 5.0M Nov 18 13:34 EKR-SVR02-SQL0-000002-ctk.vmdk
-rw------- 1 root root 80.0G Nov 18 13:34 EKR-SVR02-SQL0-000002-flat.vmdk
-rw------- 1 root root 637 Nov 18 00:19 EKR-SVR02-SQL0-000002.vmdk
-rw------- 1 root root 5.0M Dec 9 01:57 EKR-SVR02-SQL0-000003-ctk.vmdk
-rw------- 1 root root 326.0M Dec 9 01:57 EKR-SVR02-SQL0-000003-sesparse.vmdk
-rw------- 1 root root 427 Dec 9 01:57 EKR-SVR02-SQL0-000003.vmdk
-rw------- 1 root root 5.0M Dec 9 02:40 EKR-SVR02-SQL0-000004-ctk.vmdk
-rw------- 1 root root 537.2M Dec 9 22:06 EKR-SVR02-SQL0-000004-sesparse.vmdk
-rw------- 1 root root 427 Dec 9 02:39 EKR-SVR02-SQL0-000004.vmdk
-rw------- 1 root root 7.5M Dec 9 02:03 EKR-SVR02-SQL_1-000001-ctk.vmdk
-rw------- 1 root root 2.2G Dec 9 02:03 EKR-SVR02-SQL_1-000001-sesparse.vmdk
-rw------- 1 root root 477 Dec 9 01:58 EKR-SVR02-SQL_1-000001.vmdk
-rw------- 1 root root 7.5M Dec 9 01:57 EKR-SVR02-SQL_1-000002-ctk.vmdk
-rw------- 1 root root 487.0M Dec 9 01:57 EKR-SVR02-SQL_1-000002-sesparse.vmdk
-rw------- 1 root root 430 Dec 9 01:57 EKR-SVR02-SQL_1-000002.vmdk
-rw------- 1 root root 7.5M Dec 9 02:40 EKR-SVR02-SQL_1-000003-ctk.vmdk
-rw------- 1 root root 536.0M Dec 9 22:06 EKR-SVR02-SQL_1-000003-sesparse.vmdk
-rw------- 1 root root 430 Dec 9 02:40 EKR-SVR02-SQL_1-000003.vmdk
-rw------- 1 root root 7.5M Nov 18 13:34 EKR-SVR02-SQL_1-ctk.vmdk
-rw------- 1 root root 120.0G Nov 18 13:34 EKR-SVR02-SQL_1-flat.vmdk
-rw------- 1 root root 599 Nov 18 00:19 EKR-SVR02-SQL_1.vmdk
-rw------- 1 root root 462.8K Nov 17 23:41 vmware-27.log
-rw------- 1 root root 318.8K Nov 18 13:34 vmware-28.log
-rw------- 1 root root 540.8K Nov 25 12:20 vmware-29.log
-rw------- 1 root root 431.0K Dec 8 14:53 vmware-30.log
-rw------- 1 root root 397.1K Dec 9 01:56 vmware-31.log
-rw------- 1 root root 324.5K Dec 9 02:03 vmware-32.log
-rw------- 1 root root 526.9K Dec 9 20:28 vmware.log
And attached is the output for find
Hope I did everything right.
Ok, after looking at your log and your metadata chain, something has happened to the snapshot sequence IDs. It appears that for each disk, there is one snapshot that is not being actively referenced. For disk 0 this is EKR-SVR02-SQL0-000003.vmdk (and its corresponding -sesparse and -ctk files), and for disk 1 it is EKR-SVR02-SQL_1-000002.vmdk and its accompanying files. Both of these have date stamps of Dec 9 01:57. Looking at the VM's log, disklib is not invoking these files, but it is invoking all the others.
2017-12-09T20:12:51.309Z| vmx| I125: DISKLIB-VMFS : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL0-000004-sesparse.vmdk" : open successful (21) size = 563277824, hd = 0. Type 19
2017-12-09T20:12:51.309Z| vmx| I125: DISKLIB-VMFS : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL0-000004-sesparse.vmdk" : closed.
2017-12-09T20:12:51.310Z| vmx| I125: DISKLIB-VMFS : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL0-000001-sesparse.vmdk" : open successful (21) size = 9495916544, hd = 0. Type 19
2017-12-09T20:12:51.310Z| vmx| I125: DISKLIB-VMFS : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL0-000001-sesparse.vmdk" : closed.
2017-12-09T20:12:51.310Z| vmx| I125: DISKLIB-VMFS : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL0-000002-flat.vmdk" : open successful (21) size = 85899345920, hd = 0. Type 3
2017-12-09T20:12:51.310Z| vmx| I125: DISKLIB-VMFS : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL0-000002-flat.vmdk" : closed.
2017-12-09T20:12:51.311Z| vmx| I125: DISKLIB-VMFS : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL_1-000003-sesparse.vmdk" : open successful (21) size = 562040832, hd = 0. Type 19
2017-12-09T20:12:51.311Z| vmx| I125: DISKLIB-VMFS : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL_1-000003-sesparse.vmdk" : closed.
2017-12-09T20:12:51.311Z| vmx| I125: DISKLIB-VMFS : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL_1-000001-sesparse.vmdk" : open successful (21) size = 2372390912, hd = 0. Type 19
2017-12-09T20:12:51.312Z| vmx| I125: DISKLIB-VMFS : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL_1-000001-sesparse.vmdk" : closed.
2017-12-09T20:12:51.312Z| vmx| I125: DISKLIB-VMFS : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL_1-flat.vmdk" : open successful (21) size = 128849018880, hd = 0. Type 3
2017-12-09T20:12:51.312Z| vmx| I125: DISKLIB-VMFS : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL_1-flat.vmdk" : closed.
Normally, in a healthy snapshot chain, all disks should be invoked in the reverse sequence ending with the base -flat extent file, but we don't see that with yours.
When I look at the disk metadata I had you generate with the find command, I can see these orphaned disks don't have valid references to anything else in the chain. What's also interesting is that they appear to have a forward reference to the next delta, created one minute in the future.
I also see only three snapshot descriptors.
-rw------- 1 root root 31.7K Nov 18 13:35 EKR-SVR02-SQL-Snapshot23.vmsn
-rw------- 1 root root 31.7K Dec 9 01:57 EKR-SVR02-SQL-Snapshot25.vmsn
-rw------- 1 root root 31.7K Dec 9 02:05 EKR-SVR02-SQL-Snapshot26.vmsn
And the Dec 9 01:57 timestamp appears on the errant descriptor as well. The following related error appears in the log file each time you try to commit:
2017-12-09T20:12:51.312Z| vmx| I125: SNAPSHOT: SnapshotDiskTreeAddFromSnapshot: Trying to add snapshot EKR-SVR02-SQL-Snapshot26.vmsn to disk /vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL0-000001.vmdk which already has snapshot EKR-SVR02-SQL-Snapshot25.vmsn.
So it seems that, somehow, a snapshot was created but never referenced by the chain, and it isn't referenced even now.
Before proceeding, I know you said you had a VeeamZIP, but anytime you start messing with disks and their extents, you need to be positive you have a good backup.
Do not pass go and do not collect $200 if you think in any way, shape, or form that you do not have a good, solid backup.
That said, if you do, let's see if it can correct itself. Delete EKR-SVR02-SQL-Snapshot25.vmsn first with rm -f EKR-SVR02-SQL-Snapshot25.vmsn.
The VMSN files are just metadata for the memory points; since you didn't capture the memory state in any of the snapshots, they essentially contain no data. Delete that file and attempt to delete all snapshots once again. If that fails, repeat the ls -lah and attach a new vmware.log file.
Also, I should have asked earlier, but please attach vmware-30, 31, and 32.log. I'd like to see what led to this behavior.
Wow, that was extremely thorough of you. Thank you. I will try your suggestions and post back but here are the logs you requested.
As for the backups, I do have a recent VeeamZip of the VM that ended with a warning, not an error. That warning was simply stating that Veeam wasn't able to delete the snapshot afterwards.
As a precaution, I also did a full bare metal backup from within Windows. Just to be safe.
I deleted the Snapshot25.vmsn file, but the problem remains. Here's the new output and the vmware.log
total 182051328
drwxr-xr-x 1 root root 80.0K Dec 9 23:13 .
drwxr-xr-t 1 root root 76.0K Nov 19 20:57 ..
-rw------- 1 root root 31.7K Nov 18 13:35 EKR-SVR02-SQL-Snapshot23.vmsn
-rw------- 1 root root 31.7K Dec 9 02:05 EKR-SVR02-SQL-Snapshot26.vmsn
-rw-r--r-- 1 root root 13 Nov 3 19:20 EKR-SVR02-SQL-aux.xml
-rw------- 1 root root 8.5K Dec 9 23:14 EKR-SVR02-SQL.nvram
-rw------- 1 root root 1.6K Dec 9 02:05 EKR-SVR02-SQL.vmsd
-rwx------ 1 root root 4.0K Dec 9 23:13 EKR-SVR02-SQL.vmx
-rw------- 1 root root 0 Dec 9 23:13 EKR-SVR02-SQL.vmx.lck
-rw------- 1 root root 3.1K Dec 9 01:56 EKR-SVR02-SQL.vmxf
-rwx------ 1 root root 3.9K Dec 9 23:13 EKR-SVR02-SQL.vmx~
-rw------- 1 root root 5.0M Dec 9 02:03 EKR-SVR02-SQL0-000001-ctk.vmdk
-rw------- 1 root root 8.8G Dec 9 02:03 EKR-SVR02-SQL0-000001-sesparse.vmdk
-rw------- 1 root root 481 Dec 9 01:58 EKR-SVR02-SQL0-000001.vmdk
-rw------- 1 root root 5.0M Nov 18 13:34 EKR-SVR02-SQL0-000002-ctk.vmdk
-rw------- 1 root root 80.0G Nov 18 13:34 EKR-SVR02-SQL0-000002-flat.vmdk
-rw------- 1 root root 637 Nov 18 00:19 EKR-SVR02-SQL0-000002.vmdk
-rw------- 1 root root 5.0M Dec 9 01:57 EKR-SVR02-SQL0-000003-ctk.vmdk
-rw------- 1 root root 326.0M Dec 9 01:57 EKR-SVR02-SQL0-000003-sesparse.vmdk
-rw------- 1 root root 427 Dec 9 01:57 EKR-SVR02-SQL0-000003.vmdk
-rw------- 1 root root 5.0M Dec 9 23:14 EKR-SVR02-SQL0-000004-ctk.vmdk
-rw------- 1 root root 537.2M Dec 9 23:15 EKR-SVR02-SQL0-000004-sesparse.vmdk
-rw------- 1 root root 427 Dec 9 23:13 EKR-SVR02-SQL0-000004.vmdk
-rw------- 1 root root 7.5M Dec 9 02:03 EKR-SVR02-SQL_1-000001-ctk.vmdk
-rw------- 1 root root 2.2G Dec 9 02:03 EKR-SVR02-SQL_1-000001-sesparse.vmdk
-rw------- 1 root root 477 Dec 9 01:58 EKR-SVR02-SQL_1-000001.vmdk
-rw------- 1 root root 7.5M Dec 9 01:57 EKR-SVR02-SQL_1-000002-ctk.vmdk
-rw------- 1 root root 487.0M Dec 9 01:57 EKR-SVR02-SQL_1-000002-sesparse.vmdk
-rw------- 1 root root 430 Dec 9 01:57 EKR-SVR02-SQL_1-000002.vmdk
-rw------- 1 root root 7.5M Dec 9 23:14 EKR-SVR02-SQL_1-000003-ctk.vmdk
-rw------- 1 root root 552.0M Dec 9 23:15 EKR-SVR02-SQL_1-000003-sesparse.vmdk
-rw------- 1 root root 430 Dec 9 23:13 EKR-SVR02-SQL_1-000003.vmdk
-rw------- 1 root root 7.5M Nov 18 13:34 EKR-SVR02-SQL_1-ctk.vmdk
-rw------- 1 root root 120.0G Nov 18 13:34 EKR-SVR02-SQL_1-flat.vmdk
-rw------- 1 root root 599 Nov 18 00:19 EKR-SVR02-SQL_1.vmdk
-rw------- 1 root root 318.8K Nov 18 13:34 vmware-28.log
-rw------- 1 root root 540.8K Nov 25 12:20 vmware-29.log
-rw------- 1 root root 431.0K Dec 8 14:53 vmware-30.log
-rw------- 1 root root 397.1K Dec 9 01:56 vmware-31.log
-rw------- 1 root root 324.5K Dec 9 02:03 vmware-32.log
-rw------- 1 root root 559.3K Dec 9 23:03 vmware-33.log
-rw------- 1 root root 268.9K Dec 9 23:14 vmware.log
Could you please re-attach these logs in a file that doesn't have commas? It's failing to download and I'm just guessing it doesn't like commas.
Try creating a new directory in that VM's home directory (mkdir backup) and move EKR-SVR02-SQL-Snapshot26.vmsn into it with mv EKR-SVR02-SQL-Snapshot26.vmsn backup/EKR-SVR02-SQL-Snapshot26.vmsn. Try the delete all again and see if it likes that. VMSN files shouldn't affect the removal or consolidation process, but without them you can't revert to that state. I probably should have told you to move Snapshot25.vmsn into that backup directory earlier instead of deleting it, but I figured you're not going to revert.
I moved Snapshot26, but it still isn't able to delete all the snapshots. Consolidation fails as well. This is getting serious.
The VM still works perfectly fine though.
Do a cat EKR-SVR02-SQL.vmsd and paste the output.
I also see that you have the advanced option snapshot.redoNotWithParent = "TRUE" set on this VM. This option is used to specify an alternate location where snapshot delta files reside. In your case, it isn't specifying an alternate location (workingDir = "."). I don't think this is the cause of any trouble, but it's unusual to see.
Here's the output of cat:
.encoding = "UTF-8"
snapshot.lastUID = "26"
snapshot.current = "26"
snapshot0.uid = "23"
snapshot0.filename = "EKR-SVR02-SQL-Snapshot23.vmsn"
snapshot0.displayName = "Instalacion BCM"
snapshot0.description = "Previo a la instalacion de la instancia SQLOutlookBCM"
snapshot0.createTimeHigh = "351809"
snapshot0.createTimeLow = "-328381373"
snapshot0.numDisks = "2"
snapshot0.disk0.fileName = "EKR-SVR02-SQL0-000002.vmdk"
snapshot0.disk0.node = "scsi0:0"
snapshot0.disk1.fileName = "EKR-SVR02-SQL_1.vmdk"
snapshot0.disk1.node = "scsi0:1"
snapshot.numSnapshots = "3"
snapshot1.uid = "25"
snapshot1.filename = "EKR-SVR02-SQL-Snapshot25.vmsn"
snapshot1.parent = "23"
snapshot1.displayName = "Expansión de disco C"
snapshot1.description = "Previo a la expansión del disco C de 80 a 120 GB"
snapshot1.createTimeHigh = "352222"
snapshot1.createTimeLow = "-1626735197"
snapshot1.numDisks = "2"
snapshot1.disk0.fileName = "EKR-SVR02-SQL0-000001.vmdk"
snapshot1.disk0.node = "scsi0:0"
snapshot1.disk1.fileName = "EKR-SVR02-SQL_1-000001.vmdk"
snapshot1.disk1.node = "scsi0:1"
snapshot2.uid = "26"
snapshot2.filename = "EKR-SVR02-SQL-Snapshot26.vmsn"
snapshot2.parent = "25"
snapshot2.displayName = "VEEAM BACKUP TEMPORARY SNAPSHOT"
snapshot2.description = "Please do not delete this snapshot. It is being used by Veeam Backup."
snapshot2.createTimeHigh = "352222"
snapshot2.createTimeLow = "-1157136709"
snapshot2.numDisks = "2"
snapshot2.disk0.fileName = "EKR-SVR02-SQL0-000001.vmdk"
snapshot2.disk0.node = "scsi0:0"
snapshot2.disk1.fileName = "EKR-SVR02-SQL_1-000001.vmdk"
snapshot2.disk1.node = "scsi0:1"
[root@EKR-ESXi01:/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL]
I honestly don't know why this is set to TRUE. This is the first time I've had a broken snapshot chain since I started using ESXi almost two years ago. You're far more experienced than I am, so I truly appreciate all your help; I'm learning a lot in the process.
One thing I'm planning, if nothing else works, is turning off the VM, unregistering it from the host, and renaming the directory to EKR-SVR02-SQL-bak. Then I'd restore the VeeamZip backup into the original location. It is my understanding that the VeeamZip file only contains the latest consolidated state and does not archive old snapshots in any way (a sort of clone from a snapshot).
Ok, this is interesting and represents a problem.
snapshot1.uid = "25"
snapshot1.filename = "EKR-SVR02-SQL-Snapshot25.vmsn"
snapshot1.parent = "23"
snapshot1.displayName = "Expansión de disco C"
snapshot1.description = "Previo a la expansión del disco C de 80 a 120 GB"
snapshot1.createTimeHigh = "352222"
snapshot1.createTimeLow = "-1626735197"
snapshot1.numDisks = "2"
snapshot1.disk0.fileName = "EKR-SVR02-SQL0-000001.vmdk"
snapshot1.disk0.node = "scsi0:0"
snapshot1.disk1.fileName = "EKR-SVR02-SQL_1-000001.vmdk"
snapshot1.disk1.node = "scsi0:1"
snapshot2.uid = "26"
snapshot2.filename = "EKR-SVR02-SQL-Snapshot26.vmsn"
snapshot2.parent = "25"
snapshot2.displayName = "VEEAM BACKUP TEMPORARY SNAPSHOT"
snapshot2.description = "Please do not delete this snapshot. It is being used by Veeam Backup."
snapshot2.createTimeHigh = "352222"
snapshot2.createTimeLow = "-1157136709"
snapshot2.numDisks = "2"
snapshot2.disk0.fileName = "EKR-SVR02-SQL0-000001.vmdk"
snapshot2.disk0.node = "scsi0:0"
snapshot2.disk1.fileName = "EKR-SVR02-SQL_1-000001.vmdk"
snapshot2.disk1.node = "scsi0:1"
Lines 9 and 11 (and again 21 and 23) are the snapshot disk entries created when a snapshot takes place; you have two per snapshot instance because you have two disks. Normally, the files should be sequential, with the first snapshot referencing the base disk, as in the following example of a normal descriptor:
[root@localhost:/vmfs/volumes/5a206362-e1f90f81-dc4e-0050568f2f00/qvgtaaq] cat qvgtaaq.vmsd
.encoding = "UTF-8"
snapshot.lastUID = "3"
snapshot.current = "3"
snapshot0.uid = "1"
snapshot0.filename = "qvgtaaq-Snapshot1.vmsn"
snapshot0.displayName = "test1"
snapshot0.createTimeHigh = "352241"
snapshot0.createTimeLow = "133930234"
snapshot0.numDisks = "1"
snapshot0.disk0.fileName = "qvgtaaq.vmdk"
snapshot0.disk0.node = "scsi0:0"
snapshot.numSnapshots = "3"
snapshot1.uid = "2"
snapshot1.filename = "qvgtaaq-Snapshot2.vmsn"
snapshot1.parent = "1"
snapshot1.displayName = "test2"
snapshot1.createTimeHigh = "352241"
snapshot1.createTimeLow = "142093129"
snapshot1.numDisks = "1"
snapshot1.disk0.fileName = "qvgtaaq-000001.vmdk"
snapshot1.disk0.node = "scsi0:0"
snapshot2.uid = "3"
snapshot2.filename = "qvgtaaq-Snapshot3.vmsn"
snapshot2.parent = "2"
snapshot2.displayName = "test3"
snapshot2.createTimeHigh = "352241"
snapshot2.createTimeLow = "-509770998"
snapshot2.numDisks = "1"
snapshot2.disk0.fileName = "qvgtaaq-000002.vmdk"
snapshot2.disk0.node = "scsi0:0"
You can see what I mean in lines 11, 21, and 30. In your case, you have two different snapshots, yet they reference the same disks. That should not be possible. I don't know exactly what the cause was, since the logs you attached don't say, but it's apparent from the directory listing that something occurred on November 18 at 13:34 hours.
If we examine the collection of disk metadata I had you generate with the find command, we can confirm what the kernel knows about the disk chain. I have taken your output and reordered it according to the chaining sequence your two disks use. You'll notice each of your disks has four VMDKs, but the chain is only valid through three of the four. I'll call out the oddball below. I delineate your two disks with octothorpes and either "SQL0" or "SQL_1", since these are the names of the VMDKs associated with each disk. Pay attention to the CID and parentCID values in each disk descriptor file.
##############SQL0#######################
# Disk DescriptorFile
version=3
encoding="UTF-8"
CID=6ff46ce1
parentCID=ffffffff
isNativeSnapshot="no"
createType="vmfs"
# Extent description
RW 167772160 VMFS "EKR-SVR02-SQL0-000002-flat.vmdk"
# Change Tracking File
changeTrackPath="EKR-SVR02-SQL0-000002-ctk.vmdk"
# The Disk Data Base
#DDB
ddb.adapterType = "lsilogic"
ddb.geometry.cylinders = "10443"
ddb.geometry.heads = "255"
ddb.geometry.sectors = "63"
ddb.longContentID = "345ddc5328509426567b75216ff46ce1"
ddb.thinProvisioned = "1"
ddb.toolsInstallType = "1"
ddb.toolsVersion = "10272"
ddb.uuid = "60 00 C2 90 ef 73 45 bd-dd f5 8c 2e 9e a7 41 4e"
ddb.virtualHWVersion = "4"
# Disk DescriptorFile
version=3
encoding="UTF-8"
CID=91dadebe
parentCID=6ff46ce1
isNativeSnapshot="no"
createType="seSparse"
parentFileNameHint="EKR-SVR02-SQL0-000002.vmdk"
# Extent description
RW 167772160 SESPARSE "EKR-SVR02-SQL0-000001-sesparse.vmdk"
# Change Tracking File
changeTrackPath="EKR-SVR02-SQL0-000001-ctk.vmdk"
# The Disk Data Base
#DDB
ddb.grain = "8"
ddb.longContentID = "f178022005ee4fc6d6e2550491dadebe"
ddb.toolsInstallType = "1"
ddb.toolsVersion = "10279"
# Disk DescriptorFile
version=3
encoding="UTF-8"
CID=a36b113d
parentCID=a36b113d
isNativeSnapshot="no"
createType="seSparse"
parentFileNameHint="EKR-SVR02-SQL0-000001.vmdk"
# Extent description
RW 167772160 SESPARSE "EKR-SVR02-SQL0-000003-sesparse.vmdk"
# Change Tracking File
changeTrackPath="EKR-SVR02-SQL0-000003-ctk.vmdk"
# The Disk Data Base
#DDB
ddb.grain = "8"
ddb.longContentID = "08103fbe84947c11a9466c8aa36b113d"
# Disk DescriptorFile
version=3
encoding="UTF-8"
CID=b5f97da1
parentCID=91dadebe
isNativeSnapshot="no"
createType="seSparse"
parentFileNameHint="EKR-SVR02-SQL0-000001.vmdk"
# Extent description
RW 167772160 SESPARSE "EKR-SVR02-SQL0-000004-sesparse.vmdk"
# Change Tracking File
changeTrackPath="EKR-SVR02-SQL0-000004-ctk.vmdk"
# The Disk Data Base
#DDB
ddb.grain = "8"
ddb.longContentID = "ff3e1a80442ecccf1a13724ab5f97da1"
##############SQL_1#######################
# Disk DescriptorFile
version=3
encoding="UTF-8"
CID=54df96f8
parentCID=ffffffff
isNativeSnapshot="no"
createType="vmfs"
# Extent description
RW 251658240 VMFS "EKR-SVR02-SQL_1-flat.vmdk"
# Change Tracking File
changeTrackPath="EKR-SVR02-SQL_1-ctk.vmdk"
# The Disk Data Base
#DDB
ddb.adapterType = "lsilogic"
ddb.geometry.cylinders = "15665"
ddb.geometry.heads = "255"
ddb.geometry.sectors = "63"
ddb.longContentID = "3e40cf9631f190f71f4b192654df96f8"
ddb.toolsInstallType = "1"
ddb.toolsVersion = "10272"
ddb.uuid = "60 00 C2 92 96 61 04 e5-bd 1e ca 54 ad bd 89 3c"
ddb.virtualHWVersion = "4"
# Disk DescriptorFile
version=3
encoding="UTF-8"
CID=fe05af20
parentCID=54df96f8
isNativeSnapshot="no"
createType="seSparse"
parentFileNameHint="EKR-SVR02-SQL_1.vmdk"
# Extent description
RW 251658240 SESPARSE "EKR-SVR02-SQL_1-000001-sesparse.vmdk"
# Change Tracking File
changeTrackPath="EKR-SVR02-SQL_1-000001-ctk.vmdk"
# The Disk Data Base
#DDB
ddb.grain = "8"
ddb.longContentID = "1b6fb5b15bef96ffde00d424fe05af20"
ddb.toolsInstallType = "1"
ddb.toolsVersion = "10279"
# Disk DescriptorFile
version=3
encoding="UTF-8"
CID=85020c58
parentCID=85020c58
isNativeSnapshot="no"
createType="seSparse"
parentFileNameHint="EKR-SVR02-SQL_1-000001.vmdk"
# Extent description
RW 251658240 SESPARSE "EKR-SVR02-SQL_1-000002-sesparse.vmdk"
# Change Tracking File
changeTrackPath="EKR-SVR02-SQL_1-000002-ctk.vmdk"
# The Disk Data Base
#DDB
ddb.grain = "8"
ddb.longContentID = "3dfb354351db08de3d1d734f85020c58"
# Disk DescriptorFile
version=3
encoding="UTF-8"
CID=f18c47fc
parentCID=fe05af20
isNativeSnapshot="no"
createType="seSparse"
parentFileNameHint="EKR-SVR02-SQL_1-000001.vmdk"
# Extent description
RW 251658240 SESPARSE "EKR-SVR02-SQL_1-000003-sesparse.vmdk"
# Change Tracking File
changeTrackPath="EKR-SVR02-SQL_1-000003-ctk.vmdk"
# The Disk Data Base
#DDB
ddb.grain = "8"
ddb.longContentID = "913f8203de5e0ca33d2337cbf18c47fc"
Each VMDK has a CID and a parentCID associated with it. These IDs identify the delta VMDK itself and the relationship it belongs to. For convenience, I've compared all the CIDs and ordered the chain for each disk in order of precedence so it's easier to follow. Let's take the first one as an example.
# Disk DescriptorFile
version=3
encoding="UTF-8"
CID=6ff46ce1
parentCID=ffffffff
isNativeSnapshot="no"
createType="vmfs"
# Extent description
RW 167772160 VMFS "EKR-SVR02-SQL0-000002-flat.vmdk"
# Change Tracking File
changeTrackPath="EKR-SVR02-SQL0-000002-ctk.vmdk"
There is a CID value and a parentCID. For base disks, the parentCID equals ffffffff, which just means there is no parent; the chain begins here. The CID is a unique identifier referring to this disk itself. If we look at the next delta in the chain, we see this:
# Disk DescriptorFile
version=3
encoding="UTF-8"
CID=91dadebe
parentCID=6ff46ce1
isNativeSnapshot="no"
createType="seSparse"
parentFileNameHint="EKR-SVR02-SQL0-000002.vmdk"
# Extent description
RW 167772160 SESPARSE "EKR-SVR02-SQL0-000001-sesparse.vmdk"
# Change Tracking File
changeTrackPath="EKR-SVR02-SQL0-000001-ctk.vmdk"
You can see on line 5 that the parentCID for this disk corresponds to the CID for the first disk. But this second disk has its own CID. Continuing on to the third disk in the chain:
# Disk DescriptorFile
version=3
encoding="UTF-8"
CID=b5f97da1
parentCID=91dadebe
isNativeSnapshot="no"
createType="seSparse"
parentFileNameHint="EKR-SVR02-SQL0-000001.vmdk"
# Extent description
RW 167772160 SESPARSE "EKR-SVR02-SQL0-000004-sesparse.vmdk"
# Change Tracking File
changeTrackPath="EKR-SVR02-SQL0-000004-ctk.vmdk"
# The Disk Data Base
#DDB
ddb.grain = "8"
ddb.longContentID = "ff3e1a80442ecccf1a13724ab5f97da1"
This disk, on line 5, has a parentCID that corresponds to the CID of the previous disk. You can also see the "parentFileNameHint" key, which tells you which file this disk points to. This is how a snapshot chain is formed. In the case of the outlier for disk0, however, we have this:
# Disk DescriptorFile
version=3
encoding="UTF-8"
CID=a36b113d
parentCID=a36b113d
isNativeSnapshot="no"
createType="seSparse"
parentFileNameHint="EKR-SVR02-SQL0-000001.vmdk"
# Extent description
RW 167772160 SESPARSE "EKR-SVR02-SQL0-000003-sesparse.vmdk"
# Change Tracking File
changeTrackPath="EKR-SVR02-SQL0-000003-ctk.vmdk"
# The Disk Data Base
#DDB
ddb.grain = "8"
ddb.longContentID = "08103fbe84947c11a9466c8aa36b113d"
Notice that the CID and parentCID do not correspond to any other CIDs in the chain, and that they are identical to each other. Looking back at the vmware.log file, we can see which disks are invoked as part of the unbroken chain.
2017-12-09T23:19:48.680Z| vmx| I125: DISKLIB-VMFS : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL0-000004-sesparse.vmdk" : open successful (21) size = 580059136, hd = 0. Type 19
2017-12-09T23:19:48.680Z| vmx| I125: DISKLIB-VMFS : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL0-000004-sesparse.vmdk" : closed.
2017-12-09T23:19:48.681Z| vmx| I125: DISKLIB-VMFS : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL0-000001-sesparse.vmdk" : open successful (21) size = 9495916544, hd = 0. Type 19
2017-12-09T23:19:48.681Z| vmx| I125: DISKLIB-VMFS : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL0-000001-sesparse.vmdk" : closed.
2017-12-09T23:19:48.682Z| vmx| I125: DISKLIB-VMFS : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL0-000002-flat.vmdk" : open successful (21) size = 85899345920, hd = 0. Type 3
2017-12-09T23:19:48.682Z| vmx| I125: DISKLIB-VMFS : "/vmfs/volumes/58da71a5-afc838b0-2fb7-1c98ec52f2f8/EKR-SVR02-SQL/EKR-SVR02-SQL0-000002-flat.vmdk" : closed.
The order is confirmed as 4 -> 1 -> 2. The SE sparse disk type indicates you're either running this VM on a VMFS-6 datastore or the virtual disk is larger than 2 TB. The -flat file is the base disk extent (where the actual data resides).
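All of this CID/parentCID reasoning can be machine-checked. Here's a rough sketch (walk_chain is a throwaway helper for this post, not anything shipped with ESXi) that starts from a leaf descriptor and follows parentFileNameHint downward, verifying each hop until it reaches a base disk with parentCID ffffffff:

```shell
# Walk a snapshot chain from a leaf descriptor down to the base disk,
# asserting at each hop that the child's parentCID equals the parent's CID.
walk_chain() {
  d="$1"
  while [ -f "$d" ]; do
    cid=$(sed -n 's/^CID=//p' "$d")
    pcid=$(sed -n 's/^parentCID=//p' "$d")
    echo "$d: CID=$cid parentCID=$pcid"
    if [ "$pcid" = "ffffffff" ]; then
      echo "base disk reached"
      return 0
    fi
    parent="$(dirname "$d")/$(sed -n 's/^parentFileNameHint="\(.*\)"/\1/p' "$d")"
    actual=$(sed -n 's/^CID=//p' "$parent")
    if [ "$pcid" != "$actual" ]; then
      echo "BROKEN LINK: $d expects parent CID $pcid but $parent has CID $actual"
      return 1
    fi
    d="$parent"
  done
  echo "missing descriptor: $d"
  return 1
}
```

Run against each disk's newest delta, it should reach the base for the live chain and fail on the orphans.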
Anyhow, to get to the point: it's not letting you delete the snapshots because the snapshot database file (VMSD) has conflicting information about two of the three snapshots, and it won't let you clobber one with the other. That's why, even though you removed the VMSN file, it still complains: the descriptor already has a file for that position.
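Before moving on to fixes: since the two stray descriptors each carry the telltale CID-equals-parentCID signature, a crude scan can flag them mechanically (find_self_parented is another helper name invented here):

```shell
# Flag any descriptor VMDK whose CID equals its own parentCID -- the
# signature of the two stray deltas in this thread.
find_self_parented() {
  for d in "$@"; do
    [ -f "$d" ] || continue
    cid=$(sed -n 's/^CID=//p' "$d")
    pcid=$(sed -n 's/^parentCID=//p' "$d")
    [ -n "$cid" ] && [ "$cid" = "$pcid" ] && echo "$d"
  done
  return 0
}

# Example: find_self_parented *.vmdk
```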
Now, what to do about it. I've not seen this exact situation before, so I can't provide precise guidance. I do have a suggestion that I *think* will resolve the issue, but it isn't something I can test in my lab given how specific your issue is. Before you act on anything I'm about to suggest, test that VeeamZIP before you find yourself actually needing it. You can do this by restoring the VM with a different name and disconnecting its vNIC before powering it on. When VeeamZIP runs, it should capture a consolidated view of the VM rather than all those snapshot files; at least I believe that's the case, though I haven't verified it. In any case, do a test restore before proceeding in order to validate your data. I'm not sure what else runs on this machine, but I can infer from the name that it's a SQL server. If so, for an additional level of protection I would also take stand-alone backups of the important databases and offload them somewhere else in your estate. Any other data of importance should be treated similarly.
With all those precautions and caveats in mind, this is what I think will work.
First, a validation that the outlier disk is truly unused.
If those files are truly unused, the VM will power on just fine and return to normal operation. If it does complain about them, honestly I'd have no idea why that would be the case at that point; power it down, move the files back, power it on, and open a case with VMware support. If it does *not* complain, that validates that those files are indeed derelict and no longer participate in the VM's disk chain. Assuming that's true and it powers on and everything is good, there are two possible routes that eventually lead to the same destination.
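To make that "move aside" test concrete, here's a minimal sketch of the mechanics. The delta file name below is purely hypothetical (substitute whichever files you actually identified as orphaned), and it runs against a scratch directory rather than the real datastore so you can see what happens before touching anything under /vmfs/volumes/.

```shell
#!/bin/sh
# Sketch: quarantine suspect delta files, then test-boot the VM.
# VMDIR and the -000003 name are stand-ins -- on the host you would use
# the real VM directory under /vmfs/volumes/ and the files you identified.
VMDIR="$(mktemp -d)"                  # stand-in for the VM's home directory
touch "$VMDIR/EKR-SVR02-SQL0-000003.vmdk" \
      "$VMDIR/EKR-SVR02-SQL0-000003-sesparse.vmdk"   # hypothetical orphans

mkdir -p "$VMDIR/quarantine"          # holding area on the same datastore
mv "$VMDIR"/EKR-SVR02-SQL0-000003*.vmdk "$VMDIR/quarantine/"
ls "$VMDIR/quarantine"

# On the real host you would then power the VM on and watch for complaints:
#   vim-cmd vmsvc/getallvms           # note the VM's ID
#   vim-cmd vmsvc/power.on <vmid>
```

Moving the files within the same datastore (rather than deleting them) means the test is fully reversible: if the VM complains, move them straight back.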
Once again, out of an abundance of caution: I have not done this and cannot properly test it internally. Please triple-check that your backup data is validated and good, as I take no responsibility for any corruption or data loss here.
We need to either wipe out the VMSD file and let the kernel consolidate the disks on its own, or alter the VMSD file to point it manually at the extents in use. If my theory is correct, either one should work.
Option 1: Remove the VMSD file.
Pretty simple: copy the VMSD file from the VM's home directory to your backup directory, then delete the original. Once it's deleted, trigger a consolidation; if it succeeds, you should see a consolidate operation kick off that collapses those disks back into the base. You should be left with EKR-SVR02-SQL0-000002.vmdk and EKR-SVR02-SQL_1.vmdk (plus their -flat and -ctk files).
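The backup-then-delete step looks something like this. Again a sketch against a scratch directory, with the dummy VMSD contents standing in for your real descriptor; on the host, VMDIR would be the VM's directory under /vmfs/volumes/.

```shell
#!/bin/sh
# Sketch of Option 1: back up the VMSD, then remove it.
VMDIR="$(mktemp -d)"                  # stand-in for the VM's home directory
printf '.encoding = "UTF-8"\nsnapshot.lastUID = "4"\n' \
    > "$VMDIR/EKR-SVR02-SQL.vmsd"     # dummy descriptor for the demo

cp "$VMDIR/EKR-SVR02-SQL.vmsd" "$VMDIR/EKR-SVR02-SQL.vmsd.bak"   # keep a copy
rm "$VMDIR/EKR-SVR02-SQL.vmsd"

# On the real host, trigger the consolidation afterwards from the Host Client
# (Actions > Snapshots > Consolidate disks).
```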
Option 2: Alter the VMSD file to correct the snapshot chain manually.
This is more involved but lets you perform a snapshot "delete all" from the GUI. The VM may need to be powered off and back on to re-read the file; of that I'm not absolutely certain.
In any case, edit the VMSD file (after taking a backup) and replace the last entries for snapshot2 with the last snapshots in the chain for both disks. As a reminder: for disk0 that file is EKR-SVR02-SQL0-000004.vmdk, and for disk1 it's EKR-SVR02-SQL_1-000003.vmdk. Once saved, try a "delete all" operation. It should now succeed.
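For reference, the relevant entries in a VMSD look roughly like the fragment below. This is a sketch of the file's general shape, not your exact contents (the display name and VMSN file name here are made up); the two disk fileName lines are the ones you'd point at the last links in each chain.

```
snapshot2.uid = "4"
snapshot2.filename = "EKR-SVR02-SQL-Snapshot4.vmsn"
snapshot2.displayName = "example snapshot name"
snapshot2.numDisks = "2"
snapshot2.disk0.fileName = "EKR-SVR02-SQL0-000004.vmdk"
snapshot2.disk0.node = "scsi0:0"
snapshot2.disk1.fileName = "EKR-SVR02-SQL_1-000003.vmdk"
snapshot2.disk1.node = "scsi0:1"
```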
Wow. This is absolutely the best support I’ve ever received in any forum. Your insight and advice have been very helpful. I’ve been reading up on your reply and can say that I understand almost everything, yet I still can’t understand how the snapshot chain got broken.
Anyhow, I need to get this corrected by tomorrow so I’m going to unregister the VM and restore the backup in its place. I’m keeping the old one as I want to try and learn how to fix it properly if I ever experience the same issue (or if the backup restore fails).
I will have some time tomorrow, after I’ve checked that everything is working correctly, to move the VM to another host and attempt the fix you suggest. This is actually the first time I’ve messed with the VM files, so I don’t want to do it on a production server. This particular VM runs the SQL server behind our company’s ERP software, so you can imagine the outcry if everyone came in to work tomorrow to find our invoicing and management software down.
Not to worry though: apart from the VeeamZip, I did a full bare-metal backup, plus three copies of every database on the server.
I won’t lose data; I will lose some time, but that’s something I can deal with.