In short, I am looking for help recovering "inaccessible" VMs.
Background:
My operating theory is that during maintenance on the UPS connected to our SAN rack, I accidentally tripped the power on one of three Fibre Channel switches. When this happened, roughly 30 VMs were mid-vMotion through that switch, and as a result of the power loss several datastores hosted on the SAN were disconnected and listed as inaccessible. The datastores did not reconnect to the affected hosts as expected, even though the FC connections are redundant and there was always a path from host to storage array.
The roughly 30 VMs mentioned above were unresponsive. Attempting to open VMRC for any of the affected VMs failed with the error: "Unable to connect to the MKS: Could not connect to pipe \\.\pipe\vmware-authpipe within retry period." These VMs reported VMware Tools not running and 0 bytes of allocated storage, but showed a static CPU load (seemingly stuck at whatever the last recorded value was). No commands issued to the affected VMs completed successfully; most timed out or continued to run indefinitely.
I migrated the unaffected VMs to unaffected hosts, and that process completed without issue. Afterwards, all affected hosts were restarted and successfully reconnected to the datastores. At that point I expected the inaccessible VMs to be identified on the datastores and return to normal operation. That did not happen, and the VMs are still listed as inaccessible.
Brief Version/Config Info:
ESXi 5.5.0 update 3 (VMKernel Release Build 3248547)
vCenter Server Appliance 5.5.0.30500 Build 4180648
All hosts are part of a single cluster which has vSphere DRS enabled (automated)
All datastores are part of a single datastore cluster with Storage DRS enabled (automated) and Storage I/O Control, VMFS5
Steps Taken:
Questions:
Thank you for taking the time out of your day to read this. I was fortunate that the failed VMs were not critical, but one of them is important and I would like to understand why I cannot restore it from the datastore.
- Oliver
Hello Oliver,
Genuinely sorry to hear of your troubles! (and welcome to Communities!)
First off, you need to rule out the main suspects that prevent a VM from being powered on or registered:
- From vCenter, are you able to remove the invalid VMs from inventory?
- cd into one of the VM's namespace folders; do you see a .vmx~ file? (Note the '~'.) If so, you can safely delete it (it should not be present when the VM is powered off).
- Are there any .vswp files? If so, delete those too (they should also not be present when the VM is powered off).
If a .vmx~ and/or .vswp file(s) were present, then at this point you should try to register the VM directly on a host and power it on using the vim-cmd commands listed in the KB articles you mentioned.
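For reference, the vim-cmd sequence looks roughly like this (the datastore and VM names below are placeholders, substitute your own):

```shell
# Run from an ESXi shell. Paths/names below are placeholders.
# Register the VM directly on the host; this prints the new VMID.
vim-cmd solo/registervm /vmfs/volumes/<datastore>/<vmname>/<vmname>.vmx

# List registered VMs to confirm the VMID if you missed it.
vim-cmd vmsvc/getallvms

# Try to power it on, then check the resulting power state.
vim-cmd vmsvc/power.on <vmid>
vim-cmd vmsvc/power.getstate <vmid>
```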
If it fails to register, try disconnecting the host from vCenter and test it then.
If still goosed, then you should look for potential file locks against the .vmdk files held by the destination hosts that were not rebooted (unlikely, but I think it is possible since these VMs were mid-vMotion). vmkfstools -D is your friend here; follow the steps in KB 10051.
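The lock check is a quick one; the path here is a placeholder, and per the KB the output lands in the vmkernel log rather than on stdout:

```shell
# Run on an ESXi host that can see the datastore.
# Dump lock/ownership metadata for the flat disk. The "owner" field
# in the log contains the MAC address of the host holding the lock
# (all zeros means no lock is held).
vmkfstools -D /vmfs/volumes/<datastore>/<vmname>/<vmname>-flat.vmdk

# Inspect the tail of the vmkernel log for the owner MAC.
tail -n 30 /var/log/vmkernel.log
```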
Another thing you could do (assuming there are no locks against the vmdk descriptors or -flat files) is simply create dummy VMs and attach the existing vmdk disks to them.
Bob
-o- If you found this comment useful please click the 'Helpful' button and/or select as 'Answer' if you consider it so, please ask follow-up questions if you have any -o-
Thank you for the welcome and reply!
Note: I'll get to the bolded items later today or tomorrow, time permitting. Other things came up that require my attention, and I wasn't sure whether this draft would save. Thank you for the help so far; I'm learning more, and that's a good thing at least.
Hello Oliver,
So from the screenshot there I can see a few reasons why this may not boot/re-register.
Regarding using 'rm': I think it may just be that you can't delete them without the force switch (or possibly not while they are directly in that namespace folder).
The easiest thing to do is create a temp folder in each VM's namespace folder, use 'mv' to move the unneeded files into it, and then force-delete the sub-folder later.
The files that you can (and should) safely move for the VM in the screenshot are both .vswp files, the .vmx~, the .vmx.lck, and the .nvram (you probably won't need to move that last one, but there's no harm; or try without it first).
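The move-aside step might look like this (the VM path is a placeholder; double-check each filename against your screenshot before moving it):

```shell
# Run from an ESXi shell; adjust the path to your VM's folder.
cd /vmfs/volumes/<datastore>/<vmname>

# Park the suspect files in a sub-folder instead of deleting them.
mkdir -p stale-files
mv -- *.vswp *.vmx~ *.vmx.lck stale-files/

# Optionally move the NVRAM as well if the VM still won't start.
# mv -- *.nvram stale-files/

# Once the VM is confirmed healthy, remove the parked files:
# rm -rf stale-files
```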
While the VM is still registered you can try reloading it (vim-cmd vmsvc/reload <VMID>); if its state doesn't change, unregister and re-register the VM.
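Concretely, the reload-then-reregister sequence is (VMID and paths are placeholders):

```shell
# Find the VMID of the affected VM first.
vim-cmd vmsvc/getallvms

# Ask hostd to re-read the VM's on-disk configuration.
vim-cmd vmsvc/reload <vmid>

# If the VM still shows as invalid/inaccessible, unregister it and
# register it again from its .vmx file.
vim-cmd vmsvc/unregister <vmid>
vim-cmd solo/registervm /vmfs/volumes/<datastore>/<vmname>/<vmname>.vmx
```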
Old article but the different VM files and their function remains the same:
searchvmware.techtarget.com/tip/Understanding-the-files-that-make-up-a-VMware-virtual-machine
Curiously, the only doc of ours I can find that covers all the file types and their functions is for Workstation VMs.
Bob
-o- If you found this comment useful please click the 'Helpful' button and/or select as 'Answer' if you consider it so, please ask follow-up questions if you have any -o-
The VM in this screenshot is not startable because virusscan10-flat.vmdk is missing.
That's typical for a hard power-off of an ESXi host.
The next two urgent steps are:
1. Stop all activity on the source and target datastores ASAP.
2. Create a VMFS header dump for both datastores; see my site:
Create a VMFS-Header-dump using an ESXi-Host in production | VM-Sickbay
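In outline, the header dump is a dd of the start of the datastore's backing device; the device name, destination, and size below are illustrative placeholders, so follow the linked article for the exact values:

```shell
# Illustrative only; follow the linked article for exact values.
# Identify the device backing the datastore:
esxcli storage vmfs extent list

# Dump the start of that device (the VMFS metadata region) to a
# file on a DIFFERENT datastore. Device ID and size are placeholders.
dd if=/dev/disks/naa.<device-id> \
   of=/vmfs/volumes/<other-datastore>/vmfs-header.bin \
   bs=1M count=1536
```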
If you provide a download link for the dump files, I can tell you within about an hour what your options are.
Typically in such a case I create a shell script that uses dd to extract the missing files fragment by fragment.
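The fragment-by-fragment idea can be illustrated generically. The offsets below are invented purely for this demo; in a real recovery they would be read out of the VMFS metadata, and the input would be the datastore's backing device rather than a scratch file:

```shell
# Demo only: fragment offsets are invented for this illustration.
set -e

# Build a fake "device" with two known fragments at 1 MiB and 4 MiB.
dd if=/dev/zero of=device.bin bs=1M count=8 2>/dev/null
printf 'FRAG1' | dd of=device.bin bs=1M seek=1 conv=notrunc 2>/dev/null
printf 'FRAG2' | dd of=device.bin bs=1M seek=4 conv=notrunc 2>/dev/null

# Extract each 1 MiB fragment in order and append to the rebuilt file.
: > rebuilt-flat.vmdk
for off in 1 4; do
  dd if=device.bin bs=1M skip="$off" count=1 2>/dev/null >> rebuilt-flat.vmdk
done

# The rebuilt file now holds the fragments back to back.
head -c 5 rebuilt-flat.vmdk   # prints FRAG1
```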
If the VMFS metadata is unusable, plan B is to scan for guest-OS boot sectors and hope for unfragmented eager-zeroed vmdks.
In case ESXi 5.1 or earlier was in use, an attempt to extract the missing files via vmfs-fuse on Linux is worth a try.
Feel free to call me via skype if the matter is urgent or important.
Ulli