LUN filled with snapshotted vmdk's. Can now only mount original file, so data shows as two weeks old.

Utterpiffle · ‎04-28-2016

Hi all.

Massive problem here, please help if possible!

We had a failed backup on a file server a few weeks ago. It created a separate vmdk when taking the backup snapshot, and a "server needs consolidation" message appeared. Consolidation failed, saying that a disconnect was more than 12 seconds, or something similar.

There are no flat files. On the datastore, we have:

xx.vmdk

xx-000001.vmdk

xx-ctk.vmdk

xx-000001-ctk.vmdk

Via SSH, the xx-000001.vmdk is tiny and there is a large associated xx-000001-sesparse.vmdk.
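For readers unsure why the numbered vmdk is tiny: on a VMFS datastore the small .vmdk files are normally text descriptors, and the data lives in separate extent files. Below is a minimal sketch that sorts file names like the ones listed above into roles. The suffix patterns are an assumption based on common VMFS naming, not something read from this system, and `classify` is a hypothetical helper.

```python
import re

# Assumed naming conventions on a VMFS datastore; order matters because
# the most specific suffixes have to be tested first.
PATTERNS = [
    (re.compile(r"-ctk\.vmdk$"), "changed block tracking map"),
    (re.compile(r"-flat\.vmdk$"), "base disk data extent"),
    (re.compile(r"-(delta|sesparse)\.vmdk$"), "snapshot data extent"),
    (re.compile(r"-\d{6}\.vmdk$"), "snapshot descriptor"),
    (re.compile(r"\.vmdk$"), "base disk descriptor"),
]


def classify(name: str) -> str:
    """Return the likely role of a datastore file, judged by its name only."""
    for pattern, role in PATTERNS:
        if pattern.search(name):
            return role
    return "not a vmdk file"


for name in [
    "xx.vmdk",
    "xx-000001.vmdk",
    "xx-ctk.vmdk",
    "xx-000001-ctk.vmdk",
    "xx-000001-sesparse.vmdk",
]:
    print(f"{name}: {classify(name)}")
```

This only looks at names. It says nothing about whether the files are intact.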

In the meantime the main file store grew, filled the LUN, and the server paused. We cannot extend the LUN. The server was powered down.

We created a larger LUN, then copied the three vmdk files over, including the ctk.vmdk files.

When trying to add the copied hard disk to the server via vCenter, it will only see the original xx.vmdk.

How do I force the server to connect to the xx-000001.vmdk?

We have no backup for the last two weeks. The backup server failed completely (new kit on order) and we are now panicking!

Any advice would be most welcome!

Thank you!

Utterpiffle · ‎04-28-2016

OK, I found the correct file via the web vCenter, but it now produces the following message:

Failed to add disk scsi0:4.

The parent virtual disk has been modified since the child was created. The content ID of the parent virtual disk does not match the corresponding parent content ID in the child

Cannot open the disk '/vmfs/volumes/56717d5d-70484310-15b7-001018b3eedc/xx/xx-000001.vmdk' or one of the snapshot disks it depends on.

Failed to power on scsi0:4.

Can I recover from this?

kermic · ‎04-28-2016

The error message says that you most likely have at least a CID mismatch.

Every vmdk, including the base (flat) disk and any snapshot, has a Content ID (CID) recorded in its disk descriptor file. Snapshots (sparse disks) are always in a parent-child relationship, so their descriptor also records a Parent CID, which points to the next element up the disk chain: the CID of the base disk if this is the first snapshot, or the CID of the parent snapshot if there is more than one snapshot in the chain.
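To make the parent-child check concrete, here is a minimal sketch that reads the `CID`, `parentCID` and `parentFileNameHint` fields from descriptor text and reports where a child's `parentCID` does not equal its parent's `CID`. The two descriptor snippets are trimmed and their CID values are invented for illustration. `parse_descriptor` and `find_mismatches` are hypothetical helper names, not part of any VMware tool.

```python
import re

FIELD = re.compile(
    r'^\s*(CID|parentCID|parentFileNameHint)\s*=\s*"?([^"]*)"?\s*$'
)


def parse_descriptor(text: str) -> dict:
    """Extract CID, parentCID and parentFileNameHint from descriptor text."""
    fields = {}
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def find_mismatches(chain):
    """chain: list of (name, fields), ordered from base disk to newest snapshot.

    Returns one tuple per child whose parentCID differs from its parent's CID.
    """
    problems = []
    for (parent_name, parent), (child_name, child) in zip(chain, chain[1:]):
        if child.get("parentCID", "").lower() != parent.get("CID", "").lower():
            problems.append(
                (child_name, child.get("parentCID"), parent_name, parent.get("CID"))
            )
    return problems


# Invented sample values. parentCID=ffffffff on the base disk means "no parent".
BASE_TEXT = """# Disk DescriptorFile
version=1
CID=2f3a9c11
parentCID=ffffffff
"""

SNAP_TEXT = """# Disk DescriptorFile
version=1
CID=7b01e4d0
parentCID=a1b2c3d4
parentFileNameHint="xx.vmdk"
"""

chain = [
    ("xx.vmdk", parse_descriptor(BASE_TEXT)),
    ("xx-000001.vmdk", parse_descriptor(SNAP_TEXT)),
]

for child, parent_cid, parent, cid in find_mismatches(chain):
    print(f"{child}: parentCID={parent_cid} but {parent} has CID={cid}")
```

The sketch only reads text and changes nothing. On a real system the descriptor contents would come from the small .vmdk files on the datastore.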

Check this one: https://kb.vmware.com/kb/1007969

The article describes the procedure for detecting and resolving CID mismatch issues. Unfortunately, this does not guarantee data consistency. It only repairs the chain so that you can mount and read the snapshotted disk.

From experience: apply extreme caution when performing the steps in the article, and do not work with the only copy of your files! Create a copy of your current files and perform the actions on the copy instead. If you mess things up, you can still create another copy from your "golden egg". The procedure itself is not overwhelmingly complex once you get through the logic of vmdk chains and CIDs, but a single tiny wrong move can result in losing or corrupting data.
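As an illustration of the "work on a copy" advice, the sketch below copies a snapshot descriptor into a separate working directory and rewrites `parentCID` only in that copy. It assumes the repair consists of setting the child's `parentCID` to the parent's `CID`. It is not the KB procedure verbatim, and `patch_parent_cid`, the file names and the CID values are made up for the demo. It touches only the small text descriptor. The large data files still need their own copy before anything is attempted.

```python
import re
import shutil
import tempfile
from pathlib import Path


def patch_parent_cid(descriptor: Path, new_parent_cid: str, work_dir: Path) -> Path:
    """Copy the descriptor into work_dir and rewrite parentCID in the copy only."""
    if not re.fullmatch(r"[0-9a-fA-F]{8}", new_parent_cid):
        raise ValueError("a CID is expected to be 8 hex digits")
    work_dir.mkdir(parents=True, exist_ok=True)
    copy = work_dir / descriptor.name
    shutil.copy2(descriptor, copy)  # the original file is never written to

    text = copy.read_text(encoding="utf-8")
    patched, count = re.subn(
        r"(?m)^parentCID\s*=.*$", f"parentCID={new_parent_cid.lower()}", text
    )
    if count != 1:
        raise ValueError(f"expected exactly one parentCID line, found {count}")
    copy.write_text(patched, encoding="utf-8")
    return copy


# Demo on throwaway files with invented values.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "xx-000001.vmdk"
    original.write_text("CID=7b01e4d0\nparentCID=a1b2c3d4\n", encoding="utf-8")

    patched_copy = patch_parent_cid(original, "2f3a9c11", Path(tmp) / "work")

    print("original:", original.read_text(encoding="utf-8").splitlines())
    print("patched copy:", patched_copy.read_text(encoding="utf-8").splitlines())
```

Keeping the edit in a separate directory means the untouched descriptor is always available to start over from.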

Hopefully I've scared you enough

Good luck!

continuum · ‎04-28-2016

> a single tiny wrong move can result in losing or corrupting data.
To add to that ...
The longer the base disk has been used without the snapshot, the greater the corruption of the damaged NTFS file system.
The damage can be so large that it may be worth copying the missing data twice:
1. Right after the snapshot has been reattached, create a new snapshot, then boot from a Windows PE recovery ISO and extract the missing files.
2. Still in the PE, run chkdsk /f /x /r against the partition, and when that is done extract the missing files again.

Either of these two sets of files may contain what you need.
I highly recommend always booting into a LiveCD the first time after a CID chain repair.
And if you create that additional snapshot before starting the VM, you can be on the safe side without extra backup copies.


Prakas · ‎05-03-2016

I would prefer to power off the original VM and do the disk consolidation. It might take a long time, but it should give a good result.

Utterpiffle · ‎05-04-2016

Well, thank you all for your thoughtful and very helpful replies.

The biggest trouble we have is the size of the VMDKs. The base file is 9TB and the snapshot is around 1.5TB.

A lack of time was the issue, and with over 1200 users screaming for their data, this was/is a serious problem!

I made a clone of the data first, mostly so that we couldn't damage the original data set further. This took 36 hours!

I then tried to match the CIDs. This worked. Sort of. The datastore was readable and all the files in the snapshot were present. Unfortunately, around 80% of the changed data in the snapshot is corrupt.

The current situation is that I have recovered as much as possible for the users, but the VMDKs are now in the capable hands of Kroll data recovery services. I think it will take them a few days to go through the data as there really is a lot of it! I will report back soon, hopefully with a story of success...

Regards.
