Where do I begin.
I feel like I am always a newbie with VMware, despite working with it for a few years. We are running a vSAN environment on 6.5 and performing backups with Veeam. About 2 weeks ago, some of the VMs in the backup started throwing errors. Due to some events outside of my control, I only started looking at this today. Veeam support said the error was caused by a corrupt VMX file; the recommended solution was to shut down the machine, remove it from inventory, create a new machine using the existing disks, and bring it back up. We performed this on a non-critical machine, and it worked great. Did it to a semi-critical machine, and it worked great again. Did it to our Exchange server and... it wasn't great.
The server came back up; however, after a few hours of operation, a large number of people reported missing about 2 weeks of email. We had the machine up for about 5 hours, poking around in logs, before I shut it down to focus on the VMware side of things. After a ton of digging on the guest as well as in the host environment, I figured out the root cause: despite there being no snapshots in the snapshot manager, the system had been running off of a snapshot due to the failed backup. I made the mistake of attaching the original vmdk files to the new machine rather than the 000001.vmdk files. That was my own mistake of making assumptions, thinking those files were somehow orphaned since the snapshot manager listed no snapshots. The previous, successful machines either didn't have a snapshot file, or historical data didn't matter on those guests.
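For anyone landing here with the same problem, the check that would have caught this can be sketched in a few lines. This is only an illustration, not a VMware tool: it assumes snapshot (delta) descriptors follow the usual `<base>-NNNNNN.vmdk` naming seen in the listing below, the function names `snapshot_descriptors` and `attached_below_snapshot` are made up for this example, and the list of attached disks is an assumption because the .vmx contents are not part of this post.

```python
import re

# Assumption: delta descriptors are named <base>-NNNNNN.vmdk (six digits).
# Change-tracking files such as <base>-000001-ctk.vmdk do not match this.
DELTA_RE = re.compile(r"^(?P<base>.+)-(?P<seq>\d{6})\.vmdk$")


def snapshot_descriptors(filenames):
    """Map each base descriptor name to the delta descriptors found for it."""
    deltas = {}
    for name in filenames:
        match = DELTA_RE.match(name)
        if match:
            base = match.group("base") + ".vmdk"
            deltas.setdefault(base, []).append(name)
    return {base: sorted(names) for base, names in deltas.items()}


def attached_below_snapshot(attached, filenames):
    """Return attached base disks for which a newer delta descriptor exists."""
    deltas = snapshot_descriptors(filenames)
    return {disk: deltas[disk][-1] for disk in attached if disk in deltas}


# File names taken from the directory listing in this post.
listing = [
    "CAKEXK01.vmsd", "CAKEXK01.vmx",
    "CAKEXK01_3-000001-ctk.vmdk", "CAKEXK01_3-000001.vmdk", "CAKEXK01_3.vmdk",
    "CAKEXK01_4-000001-ctk.vmdk", "CAKEXK01_4-000001.vmdk", "CAKEXK01_4.vmdk",
    "CAKEXK01_5-000001-ctk.vmdk", "CAKEXK01_5-000001.vmdk", "CAKEXK01_5.vmdk",
]

# Hypothetical: the disks the rebuilt VM was pointed at.
attached = ["CAKEXK01_3.vmdk", "CAKEXK01_4.vmdk", "CAKEXK01_5.vmdk"]

for disk, delta in sorted(attached_below_snapshot(attached, listing).items()):
    print(f"{disk} is attached, but {delta} exists in the same directory")
```

The point of checking file names rather than the snapshot manager is that the two can disagree, which is exactly what happened here: the manager showed nothing while the delta descriptors were sitting in the VM directory.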
After talking with VMware support, they basically said that since the original vmdks were booted, the damage is done and I should consider the data lost. They did say I can try to remove the drives from the guest and re-add the snapshot versions, but they had little faith that it would work, and warned of a high chance of corruption of both the vmdk and the snapshot vmdk. Since the last shutdown, I've kept the server powered off and have been looking for any option to get this machine back to life with its current data, and have run into a brick wall every time. Being cautious about every step from this point because of the corruption warnings, I've copied all files except the snapshot files from their original location on the datastore to a different location to mitigate the risk of further corruption. The snapshot files, however, will simply not budge. Web client copy, SSH copy, vmkfstools -i: nothing will get those files somewhere else at their original size (though I can download what looks to be the header with WinSCP).
I'm desperately trying to safeguard the snapshot data before doing something that may corrupt the whole guest, and to get this thing back into an up-to-date, running condition. Since this is an Exchange server, the files are quite large. Just copying out the files took 3 hours. I'm now attempting a clone, as I've read that a clone may merge snapshot files automatically, with the hope that it won't impact the original files. If the clone doesn't work, my last straw would be to try to boot off of the snapshots, knowing I may lose everything. Finally I've landed here, having seen some users get success thanks to some of you truly amazing experts. The final kick in the rear is that our management is getting ready to accept the data loss just to get the server back on and email flowing, so their patience is thin. Casting out a bottle in the sea here, hoping it comes back with some much needed help in time. Attaching relevant info that I've seen requested in other posts:
Directory ls -lh of original files:
-rw-r--r-- 1 root root 92 Oct 24 2018 CAKEXK01-8d4db6ef.hlog
-rw------- 1 root root 32.6K Nov 15 08:02 CAKEXK01-Snapshot557.vmsn
-rw-r--r-- 1 root root 13 May 8 2019 CAKEXK01-aux.xml
-rw------- 1 root root 8.5K Nov 14 08:12 CAKEXK01.nvram
-rw------- 1 root root 45 Nov 14 08:12 CAKEXK01.vmsd
-rwx------ 1 root root 4.6K Dec 6 21:22 CAKEXK01.vmx
-rw------- 1 root root 3.3K May 17 2018 CAKEXK01.vmxf
-rw------- 1 root root 5.0M Dec 6 21:22 CAKEXK01_3-000001-ctk.vmdk
-rw------- 1 root root 408 Nov 15 08:02 CAKEXK01_3-000001.vmdk
-rw------- 1 root root 600 Dec 7 04:12 CAKEXK01_3.vmdk
-rw------- 1 root root 5.9M Dec 6 21:22 CAKEXK01_4-000001-ctk.vmdk
-rw------- 1 root root 409 Nov 15 08:02 CAKEXK01_4-000001.vmdk
-rw------- 1 root root 576 Dec 7 04:12 CAKEXK01_4.vmdk
-rw------- 1 root root 2.0M Dec 6 21:22 CAKEXK01_5-000001-ctk.vmdk
-rw------- 1 root root 407 Nov 15 08:09 CAKEXK01_5-000001.vmdk
-rw------- 1 root root 598 Dec 7 04:12 CAKEXK01_5.vmdk
drwxr-xr-x 1 root root 280 Dec 7 06:38 bak
-rw------- 1 root root 299.5K May 17 2018 vmware-3.log
-rw------- 1 root root 15.2M Sep 21 2018 vmware-4.log
-rw------- 1 root root 3.0M Oct 18 2018 vmware-5.log
-rw------- 1 root root 393.2K Oct 22 2018 vmware-6.log
-rw------- 1 root root 467.3K Oct 24 2018 vmware-7.log
-rw------- 1 root root 244.0K Oct 24 2018 vmware-8.log
-rw------- 1 root root 45.4M Dec 6 21:22 vmware.log
Directory ls -lh of newly created machine that is pointing to the above vmdk's:
-rw-r--r-- 1 root root 295 Dec 6 21:35 CAKEXK01-35be335f.hlog
-rw------- 1 root root 8.5K Dec 7 05:25 CAKEXK01.nvram
-rw-r--r-- 1 root root 0 Dec 6 21:35 CAKEXK01.vmsd
-rwxr-xr-x 1 root root 3.8K Dec 7 05:25 CAKEXK01.vmx
-rw------- 1 root root 3.1K Dec 6 21:45 CAKEXK01.vmxf
-rw-r--r-- 1 root root 1.0M Dec 7 03:08 vmware-1.log
-rw-r--r-- 1 root root 322.3K Dec 7 05:25 vmware.log
Since the VM has been running on the base disks for 5 hours, you will definitely end up with corrupted data. How much depends on the changes made to the base disks, and the age/size of the snapshots.
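To illustrate why booting the base disks breaks the chain: as far as I understand it, every descriptor carries a `CID`, and a snapshot descriptor stores its parent's value as `parentCID` along with a `parentFileNameHint`. Once the base disk has been opened for writing on its own, its `CID` changes and no longer matches what the snapshot recorded. A minimal sketch of that comparison follows. The function names and the descriptor contents are hypothetical, since the real descriptors were not posted, and a matching pair says nothing about whether the data inside the disks is still consistent.

```python
def descriptor_fields(text):
    """Parse key=value lines of a vmdk descriptor into a dict.

    Comment lines and extent lines (which contain no '=') are skipped.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        fields[key.strip()] = value.strip().strip('"')
    return fields


def chain_is_intact(child_text, parent_text):
    """True if the child's parentCID equals the parent's current CID."""
    child = descriptor_fields(child_text)
    parent = descriptor_fields(parent_text)
    child_parent_cid = child.get("parentCID", "").lower()
    parent_cid = parent.get("CID", "").lower()
    return bool(parent_cid) and child_parent_cid == parent_cid


# Hypothetical descriptor contents; the CID values are invented.
base_after_boot = """# Disk DescriptorFile
version=1
CID=1a2b3c4d
parentCID=ffffffff
"""

snapshot = """# Disk DescriptorFile
version=1
CID=9f8e7d6c
parentCID=0badcafe
parentFileNameHint="CAKEXK01_3.vmdk"
"""

print("chain intact:", chain_is_intact(snapshot, base_after_boot))
```

Editing `parentCID` by hand to force a match only gets the disk to open again. It does not undo the 5 hours of writes to the base disk, which is why the amount of corruption depends on what changed in that time.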
What IMO may be the best option, not knowing the size of your environment or the number of mailboxes, is to continue production with the "reset" Exchange server (after some required recovery actions), because these virtual disks are at least in a consistent state, and to try to restore the missing data from the snapshots using e.g. Veeam Explorer for Exchange. What I have in mind is to clone the current state, including the snapshots, to a new, temporary VM, which can then be backed up once and accessed for the restores.
However, since you are running vSAN this can be tricky, so let me ask you whether you have a non vSAN datastore (VMFS, NFS) to which we could clone the VM?
André
Thank you for the reply Andre.
Unfortunately, we do not have a secondary datastore aside from the vSAN datastore. I have not yet attempted to boot from the snapshot files (due to the corruption risk). Another fun little tidbit, for reasons I can't disclose publicly, is that we likely cannot use Veeam Explorer for Exchange. I wish I could give more info here, but the basic gist is that our backup infrastructure is air-gapped from the production environment to such a degree that message-level backup/recovery would be near impossible.
I wish I had better answers to some of your ideas, but it sounds more and more like we may have to eat the data loss.
I'm not an Exchange expert, so I'm currently not aware of available tools to extract data from the Exchange DB, but at this point I'd at least create a backup/clone of the VM (including the snapshots) that can be used to try to extract/recover the missing data.
To achieve this in your environment - assuming sufficient free disk space - you could:
Once that is done, you could take the reverted VM back into production, and still have the chance to try and recover data from the clone.
As a side note: Veeam recently reported a known issue with CBT-based backups after reverting to snapshots, which, from what I understood, VMware is currently working on. To make sure that you have consistent backups of the production machine, I'd suggest you create an Active Full backup ASAP.
Please feel free to ask if something that I wrote is unclear.
André
Thanks again for your replies. I think your proposed path forward would have had a decent chance at some data recovery if we had better integration between our backup infrastructure and our production environment. Ultimately, our management cried uncle and needed to have email flowing again. We brought the server back up on the original vmdks, accepting the data loss. I really wish we could have tried some of the options suggested here and elsewhere; however, they deemed the timeframe and risk too great.