Recovering "inaccessible" VMs on datastore

ovollmer1 · ‎06-26-2017

In short, I am looking for help recovering "inaccessible" VMs.

Background:

My operating theory is that during maintenance on the UPS connected to our SAN rack, I accidentally tripped the power on one of three Fibre Channel switches. When this happened, roughly 30 VMs were in vMotion through that switch, and as a result of the power loss several datastores hosted on the SAN were disconnected and listed as inaccessible. The datastores did not reconnect to the affected hosts as expected (FC connections are redundant, so there was always a path from host to storage array). The roughly 30 VMs previously mentioned were unresponsive. Attempting to open VMRC for any of the affected VMs failed with error: "Unable to connect to the MKS: Could not connect to pipe \\.\pipe\vmware-authpipe within retry period." These VMs reported vmtools not running and 0 bytes allocated storage, but with static CPU load (seemingly stuck at whatever the last recorded load was). No commands to the affected VMs completed successfully; most timed out or continued to run indefinitely.

I migrated off unaffected VMs to unaffected hosts. This process completed without issue. Afterwards, all affected hosts were restarted and successfully reconnected to the datastores. At this point, I expected the inaccessible VMs would be identified on the datastores and return to normal operation. This did not happen, and the VMs are still listed as inaccessible.

Brief Version/Config Info:

ESXi 5.5.0 update 3 (VMKernel Release Build 3248547)

vCenter Server Appliance 5.5.0.30500 Build 4180648

All hosts are part of a single cluster which has vSphere DRS enabled (automated)

All datastores are part of a single datastore cluster with Storage DRS enabled (automated) and Storage I/O Control, VMFS5

Steps Taken:

Checked logs, noted "nvram write failed" and datastore timeout errors on affected hosts (read Powering on a virtual machine fails with the error: NVRAM write failure (2097213) | VMware KB but was unable to power off/on the affected VMs to allow creation of a new .nvram file)
Verified files still present on datastore, noted presence of *.vmx.lck file for affected VMs (tried steps listed here: can't register/add to inventory a vm because of locked file)
A colleague attempted removing the lock file but was unsuccessful. I believe they followed this: Investigating virtual machine file locks on ESXi (10051) | VMware KB
Attempted removing affected VM from inventory and re-registering with vSphere (tried steps listed here: How to register/add a VM to the Inventory in vCenter Server (1006160) | VMware KB)
Attempted creating a new VM and attaching a .vmdk from an affected VM; this was not successful

Questions:

What other actions can I take to restore these VMs? The files exist on the datastore, but something is stopping me at every step. I am not against retrying all steps taken previously to record more specific info on why they failed if this would help.
None of these were backed up--this was a failure in our deployment that I intended to correct later this year. What is the recommended method to back up VMs? I have researched several third-party products but I'm leaning towards VDP (really appreciated this paper: https://www.vmware.com/files/pdf/vsphere/vmware-vsphere-data-protection-overview.pdf ).

Thank you for taking the time out of your day to read this. I was fortunate that the failed VMs were not critical, but one of them is important and I would like to understand why I cannot restore it from the datastore.

- Oliver

TheBobkin · ‎06-26-2017

Hello Oliver,

Genuinely sorry to hear of your troubles! (and welcome to Communities!)

First off, you need to remove the main suspects that prevent a VM from being powered-on or registered:

- From vCenter, are you able to remove the invalid VMs from inventory?

- cd into one of the VM's namespace folders. Do you see a .vmx~ file? (Emphasis on the '~'.) If so, then you can safely delete it (it should not be present if the VM is down).

- Are there any .vswp files? If so, then delete these too (they are also not present when the VM is down).

If vmx~ and/or .vswp(s) were present then at this point you should try to register the VM directly on a host and try to power it on using the vim-cmd commands listed in the kb articles you mentioned.
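The check for leftover files described above could be scripted roughly as follows. This is only a sketch: the function name `leftover_files` is made up for illustration, and the folder path is a placeholder you would replace with the real VM folder under /vmfs/volumes.

```shell
# List files that should not exist while a VM is powered off
# (.vmx~ and .vswp, per the checklist above).
# Usage: leftover_files /vmfs/volumes/<datastore>/<vm-folder>
leftover_files() {
    dir=$1
    for f in "$dir"/*.vmx~ "$dir"/*.vswp; do
        # An unmatched glob stays literal, so check that the file exists.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```

If the function prints nothing, neither file type is present in that folder and the registration problem most likely lies elsewhere (file locks, for example).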

If it fails to register, try disconnecting the host from vCenter and test it then.

If still goosed then you should consider looking for potential file-locks against the .vmdk files from the destination hosts that were not rebooted (unlikely, but I think it is possible since these were in the process of vMotion). vmkfstools -D is your friend here; follow the steps in kb 10051.
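For the lock check, kb 10051 has you run vmkfstools -D against a file and read the "owner" field in the output. As far as I know, the last group of that owner ID is the MAC address of the host holding the lock, and an all-zero value means no host currently holds it. A small helper to make that value easier to compare against your hosts' NICs might look like this (the function name `owner_mac` and the sample owner string are illustrative only, not taken from this environment):

```shell
# Turn the last group of a vmkfstools -D "owner" ID into a
# colon-separated MAC address.
# Usage: owner_mac <owner-id copied from the vmkfstools -D output>
owner_mac() {
    # Keep the text after the last '-' and insert ':' after each byte pair,
    # then drop the trailing ':'.
    printf '%s\n' "${1##*-}" | sed 's/../&:/g; s/:$//'
}

# Example (sample value):
#   owner_mac 4655cd8b-3c4a19f2-17bc-00145e808070
```

Run vmkfstools -D against both the descriptor .vmdk and the -flat.vmdk, since either can be locked.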

Another thing you could do (assuming there are no locks against the vmdk descriptors or -flats) is simply create dummy VMs and attach the existing vmdk disks to them.

Bob

-o- If you found this comment useful please click the 'Helpful' button and/or select as 'Answer' if you consider it so, please ask follow-up questions if you have any -o-

ovollmer1 · ‎06-27-2017

Thank you for the welcome and reply!

Are you able to remove the invalid VMs from inventory within vCenter?
Yes. The unregister task completes successfully.
Are .vmx~ files present?
Yes, these files were present but attempts to delete them return an "Invalid argument" error. I looked into this and found the following blog post: Deleting problem files from VMFS data stores | bitpushr's blog, as well as several others detailing the same solution. Attempting the echo "a" > * failed with the same "Invalid argument" error.

It seems the only operation I am able to perform on this object is move (as you can see above I moved it to the dragons directory to mess with it). Perms are -rwxr-xr-x, owner is root, chmod fails with same "Invalid argument" error.
Are there any .vswp files?
Yes, one .vswp file for each inaccessible VM. Here is a representative ls -al output from an affected VM that I have not yet tampered with:
Attempt to register an affected VM directly on a host (may require removing host from vCenter).
Search for file locks against .vmdk files.
Assuming no locks, create dummy VM and attach existing vmdks.

Note: I'll get to the last three items above (registering directly on a host, searching for file locks, attaching the vmdks to a dummy VM) later today or tomorrow, time permitting. Other things came up that require my attention and I wasn't sure if this draft would save. Thank you for the help so far, I'm learning more and that's a good thing at least.

TheBobkin · ‎06-27-2017

Hello Oliver,

So from the screenshot there I can see a few reasons why this may not boot/re-register.

Regarding using 'rm', I think it may just be that you can't delete them without the force switch (or possibly not when they are in that namespace folder directly).

Easiest thing to do is just create a temp folder in each VM's namespace folder, use 'mv' to move the unneeded files in there, then force delete the sub-folder later.

The files that you can (and should) safely move for the VM in the screenshot are both .vswp files, .vmx~, .vmx.lck and .nvram (you probably won't need to move this last one, but there is no harm; that, or try without).
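A rough sketch of the move-then-delete approach, in case it helps. The function name `quarantine_leftovers` and the sub-folder name `quarantine` are made up for illustration; the file patterns are the ones listed above.

```shell
# Move leftover files into a temporary sub-folder instead of deleting them.
# Usage: quarantine_leftovers /vmfs/volumes/<datastore>/<vm-folder>
quarantine_leftovers() {
    dir=$1
    tmp="$dir/quarantine"    # hypothetical folder name
    mkdir -p "$tmp" || return 1
    for f in "$dir"/*.vswp "$dir"/*.vmx~ "$dir"/*.vmx.lck "$dir"/*.nvram; do
        # Skip patterns that matched nothing.
        [ -e "$f" ] && mv "$f" "$tmp"/
    done
    return 0
}

# Later, once the VM is confirmed working, force delete the sub-folder:
#   rm -rf /vmfs/volumes/<datastore>/<vm-folder>/quarantine
```

Moving rather than deleting keeps the files around until you know the VM boots without them.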

While still registered you can try reloading the VM (vim-cmd vmsvc/reload <VMs VMID>), if state doesn't change then un-register and re-register the VM.
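The reload and re-register steps could be wrapped like this. It is a sketch only: the function names and the `VIM_CMD` variable are my own additions so the sequence can be dry-run with `VIM_CMD=echo` before touching a real host. The vim-cmd subcommands themselves (vmsvc/getallvms, vmsvc/reload, vmsvc/unregister, solo/registervm) are the standard ones on ESXi.

```shell
VIM_CMD=${VIM_CMD:-vim-cmd}    # set VIM_CMD=echo for a dry run

# Find the VMID first with: vim-cmd vmsvc/getallvms

# Usage: reload_vm <vmid>
reload_vm() {
    $VIM_CMD vmsvc/reload "$1"
}

# Usage: reregister_vm <vmid> <full path to the .vmx file>
reregister_vm() {
    # Stop if the unregister step fails, so we never register twice.
    $VIM_CMD vmsvc/unregister "$1" || return 1
    $VIM_CMD solo/registervm "$2"
}
```

On a real host the register step should print the new VMID, which will usually differ from the old one.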

Old article but the different VM files and their function remains the same:

searchvmware.techtarget.com/tip/Understanding-the-files-that-make-up-a-VMware-virtual-machine

Curiously our only doc that I can find which covers all file types and functions is for Workstation VMs.

Bob


continuum · ‎06-27-2017

The VM from this screenshot is not startable because the virusscan10-flat.vmdk is missing.
That's typical for a hard poweroff of an ESXi host.
The next 2 urgent steps are
1. stop all activity on source and target datastore asap.
2. for both datastores create a VMFS-header-dump: see my site:
Create a VMFS-Header-dump using an ESXi-Host in production | VM-Sickbay

If you provide a download for the dump-files I can tell you in about an hour what your options are.
Typically in such a case I create a shell script that uses dd to extract the missing files fragment by fragment.
If the VMFS metadata is unusable - plan B is to scan for guest-os boot sectors and hope for unfragmented eager zeroed vmdks.
In case ESXi 5.1 or earlier was in use, an attempt to extract missing files via vmfs-fuse for Linux is worth a try.
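To illustrate the general idea of extracting a file fragment by fragment with dd (this is not the actual script, just a minimal sketch): each fragment is copied from a known offset on the source into the right position of the output file. The function name `copy_fragment`, the `BS` variable and the 1 MiB block size are assumptions; the real offsets have to come from an analysis of the VMFS metadata, which is not shown here.

```shell
BS=${BS:-1048576}    # block size in bytes; 1 MiB assumed

# Copy one fragment from the source into the output file.
# Usage: copy_fragment <source> <output> <skip_blocks> <count_blocks> <seek_blocks>
#   skip_blocks  = where the fragment starts on the source
#   count_blocks = length of the fragment
#   seek_blocks  = where the fragment belongs in the output file
copy_fragment() {
    # conv=notrunc keeps fragments already written to the output.
    dd if="$1" of="$2" bs="$BS" skip="$3" count="$4" seek="$5" conv=notrunc 2>/dev/null
}
```

A recovery script would then be one `copy_fragment` call per fragment, written to a different datastore than the one being recovered.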
Feel free to call me via skype if the matter is urgent or important.
Ulli

________________________________________________
Do you need support with a VMFS recovery problem ? - send a message via skype "sanbarrow"
I do not support Workstation 16 at this time ...