VMware Cloud Community
EdOfTheMountai2
Contributor

ESXi 6 VMs "inaccessible" after New Year's power failure?

My ESXi host power must have failed over the holidays. 

  • The VMs that were running over the holidays are grayed-out and display as "( inaccessible )".
  • The *.vmx and related files are still there.
  • The VM that was not running at time of power failure starts fine. 
  • My ESXi host has a single 2TB disk of local storage. 

I can browse to and locate the vmx files using the vSphere Datastore Browser.  I tried removing a VM from inventory in order to re-add it.  However, when I browse to its *.vmx file in the datastore and right-click it, the "Add to Inventory" option is disabled (grayed out).

I also tried ssh to the server and reloading without success:

vim-cmd vmsvc/getallvms

vim-cmd vmsvc/reload <invalidVmNumber>

Ideas and suggestions are much appreciated.

Thanks in advance,

-Ed

ESXi 6.0.0 Build 3029758

I am a non-expert ESXi user.


esxi-inaccesible.png

Message was edited by: Ed Sutton Added vSphere screenshot

24 Replies
virtualosa
Contributor

I see a lock file in the screenshot; it looks like the host thinks the VM is already registered. Have you looked at the vmkernel log? Do you see any mentions of locked files?

Maybe this article will help narrow it down. VMware KB: Investigating virtual machine file locks on ESXi/ESX

EdOfTheMountai2
Contributor

When I cat the vmx file of one of the invalid, inaccessible VMs, I get an Input/output error.  Is this a lock error?

[root@localhost:/vmfs/volumes/56325d60-4a871a28-ad1e-38c98618970d/subsite-jenkins] ls -al

total 841134104

drwxr-xr-x    1 root     root          2520 Nov 23 14:11 .

drwxr-xr-t    1 root     root          2100 Nov 10 21:42 ..

-rw-------    1 root     root     2147483648 Nov 23 14:11 subsite-jenkins-36234e19.vswp

-rw-------    1 root     root     858993459200 Dec 28 11:43 subsite-jenkins-flat.vmdk

-rw-------    1 root     root          8684 Dec 28 11:43 subsite-jenkins.nvram

-rw-------    1 root     root           507 Nov 23 14:11 subsite-jenkins.vmdk

-rw-r--r--    1 root     root             0 Nov  2 19:15 subsite-jenkins.vmsd

-rwxr-xr-x    1 root     root          2726 Nov 23 14:11 subsite-jenkins.vmx

-rw-------    1 root     root             0 Nov 23 14:11 subsite-jenkins.vmx.lck

-rwxr-xr-x    1 root     root          2641 Nov 23 14:11 subsite-jenkins.vmx~

-rw-r--r--    1 root     root         72424 Nov  2 21:47 vmware-3.log

-rw-r--r--    1 root     root        166649 Nov  2 22:09 vmware-4.log

-rw-r--r--    1 root     root        166665 Nov  2 22:14 vmware-5.log

-rw-r--r--    1 root     root        166499 Nov  2 22:23 vmware-6.log

-rw-r--r--    1 root     root        169750 Nov  5 20:56 vmware-7.log

-rw-r--r--    1 root     root        207988 Nov 20 20:33 vmware-8.log

-rw-r--r--    1 root     root        139993 Dec  8 19:57 vmware.log

[root@localhost:/vmfs/volumes/56325d60-4a871a28-ad1e-38c98618970d/subsite-jenkins] cat vmware.log

[root@localhost:/vmfs/volumes/56325d60-4a871a28-ad1e-38c98618970d/subsite-jenkins] cat subsite-jenkins.vmx

cat: can't open 'subsite-jenkins.vmx': Input/output error

EdOfTheMountai2
Contributor

Thank you for the link to VMware KB: Investigating virtual machine file locks on ESXi/ESX (http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=10051).

Yes there is a ".locker" folder.  I do not know what this means. Could it have locked up all three of the running VMs when the power failed?

What is the location of the vmkernel log for ESXi 6.0?

[root@localhost:/var/log] ls -al

total 116

drwxr-xr-x    1 root    root          512 Jan  4 16:57 .

drwxr-xr-x    1 root    root          512 Jan  4 16:55 ..

-rw-r--r--    1 root    root          7067 Jan  4 19:28 .vmsyslogd.err

-rw-rw-rw-    1 root    root        37965 Jan  4 16:55 boot.gz

-rw-r--r--    1 root    root        23724 Jan  4 16:55 configRP.log

-rw-r--r--    1 root    root          898 Jan  4 16:55 esxcli.log

drwxr-xr-x    1 root    root          512 Jan  4 16:57 ipmi

-rw-r--r--    1 root    root          2086 Jan  4 16:55 jumpstart-stdout.log

-rw-------    1 root    root          2752 Jan  4 16:57 smbios.bin

-rw-------    1 root    root          9361 Jan  4 16:55 sysboot.log

-rw-------    1 root    root            64 Jan  4 19:49 tallylog

drwxr-xr-x    1 root    root          512 Jan  4 16:55 vmware


[root@localhost:/var/log] vim-cmd vmsvc/getallvms

Skipping invalid VM '2'

Skipping invalid VM '4'

Vmid            Name                            File                      Guest OS      Version  Annotation

5      osx-10.11-clean-install  [hdd-2tb] OS X 10.11/OS X 10.11.vmx  darwin14_64Guest  vmx-11


Vmid 5 runs fine.  Vmid 5 was *not* running at time of power failure.


Vmid '2' and '4' were running at the time of the power failure.  I removed '3' (which was also running at the time) from inventory in a failed attempt to re-add it.
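
A minimal sketch of how the invalid IDs could be pulled out of the getallvms output above and passed to a reload, assuming the "Skipping invalid VM 'N'" message format shown in this post. The function name invalid_vm_ids is hypothetical; vim-cmd vmsvc/getallvms and vim-cmd vmsvc/reload are the commands already used earlier in the thread. The 2>&1 in the usage is there in case the "Skipping" messages are written to stderr.

```shell
#!/bin/sh
# Print the numeric ID from every "Skipping invalid VM 'N'" line on stdin.
# Assumes the message format shown in the getallvms output above.
invalid_vm_ids() {
    sed -n "s/^Skipping invalid VM '\([0-9][0-9]*\)'.*/\1/p"
}

# Usage on the host (commented out because vim-cmd exists only on ESXi):
# vim-cmd vmsvc/getallvms 2>&1 | invalid_vm_ids | while read -r id; do
#     vim-cmd vmsvc/reload "$id"
# done
```

A reload only helps when the VM's files are readable again, so a failing reload is itself a hint that the problem lies below the inventory layer.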

EdOfTheMountai2
Contributor

In addition to the ".locker" folder, all three of the VMs that were running at time of power fail have a *.lck file.

[root@localhost:/vmfs/volumes/56325d60-4a871a28-ad1e-38c98618970d] find . -name *.lck

./osx-build-01/osx-build-01.vmx.lck

./win7-x64-build-01/win7-x64-build-01.vmx.lck

./subsite-jenkins/subsite-jenkins.vmx.lck
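
One caveat about the find command above: an unquoted *.lck is expanded by the shell first if the current directory happens to contain a matching file, so quoting the pattern is the safer form. A minimal sketch (the function name list_lock_files is hypothetical):

```shell
#!/bin/sh
# List every *.lck file under the given directory.
# The pattern is quoted so that find, not the shell, expands it.
list_lock_files() {
    find "$1" -name '*.lck'
}

# Usage (datastore path taken from the prompt above):
# list_lock_files /vmfs/volumes/56325d60-4a871a28-ad1e-38c98618970d
```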


virtualosa
Contributor

The logs should live in /var/log, but I see vmkernel.log is not there. Have you rebooted your host since the outage? That might clear everything. Do you use vCenter, or is it just a standalone host? Are you redirecting your scratch partition anywhere?

If you run "less /var/log/vmkernel.log" what do you get?

The lock file is the file that ends in vmx.lck, not the .locker folder. The file lives in the same folder as your vmx and vmdk files for each VM.

Could you have been running backups during the outage?

virtualosa
Contributor

Perhaps you can try what this article mentions (VMware KB: Investigating hosted virtual machine lock files): delete the lock file and then try registering the VM again with the "Add to Inventory" feature. It also looks like you have snapshot files in that screenshot, so those could be getting in the way as well.

EdOfTheMountai2
Contributor

Thank you for the suggestions.


>Perhaps you can try what this article mentions, VMware KB: Investigating hosted virtual machine lock files (http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100385...), deleting the lock file and then try registering it again with the "add to inventory" feature.


An Input/output error prevents me from deleting any *.vmx.lck file.  The host was in "Maintenance mode" with no VMs running when I ssh'ed in as root and tried to delete the *.vmx.lck files under /vmfs.

ls -al

-rw-------    1 root     root             0 Nov 23 14:11 subsite-jenkins.vmx.lck

[root@localhost:/vmfs/volumes/56325d60-4a871a28-ad1e-38c98618970d/subsite-jenkins] rm subsite-jenkins.vmx.lck

rm: can't remove 'subsite-jenkins.vmx.lck': Input/output error


>Also looks like you have snapshot files on that screen shot, so that could be getting in the way as well.

Yes, 1 of the 3 VMs that were running at the time of the power failure has a snapshot.  The other two inaccessible VMs have no snapshots.

I do not use vCenter.  I have a single ESXi 6.0.0 host machine using vSphere.  I do not know how to automate backups, so no backup was running at the time of the power failure.

-Ed

EdOfTheMountai2
Contributor

>Are you redirecting your scratch partition anywhere?


Yes. I am booting ESXi 6 from a USB drive. I changed ScratchConfig to point to /vmfs/volumes/


When I boot ESXi, when it is close to 100% done, it displays a black screen with a single word "scratch"


>Have you rebooted your host since the outage? Perhaps that might clear everything.

Yes, several times. Unfortunately no change.

>Do you use vcenter? Or is it just a standalone host? Are you redirecting your scratch partition anywhere?

No.  Just a single standalone host.

>If you run "less /var/log/vmkernel.log" what do you get?

The file vmkernel.log does not exist and I get:

[root@localhost:~] less /var/log/vmkernel.log

WARNING: terminal is not fully functional

/var/log/vmkernel.log: No such file or directory


>The lock file is the file that ends in vmx.lck, not the .locker folder. The file lives in the same folder as your vmx and vmdk files for each VM.

There is one *.vmx.lck file in each of the VM folders.

>Could you have been running backups during the outage?

No. Unfortunately I have not figured out how to run automatic backups.

Sreejesh_D
Virtuoso

Can you try a reboot of the hypervisor?

Since it's a single-hypervisor environment, a reboot should fix the issues with the lock.

EdOfTheMountai2
Contributor

>can you try a reboot of the hypervisor?


I have rebooted using vSphere a half-dozen times without any change in the inaccessible VMs.


I am booting ESXi 6.0 from USB.


I changed the ScratchConfig to a new directory and rebooted.  Now I can cat the vmkernel.log file.


The log file is very large and I do not know what to look for.  I tried grep for "WARN".


[root@localhost:/vmfs/volumes/56325d60-4a871a28-ad1e-38c98618970d/scratch/.locker-esx-hv01-subsite.com/log] cat vmkernel.log|grep WARN

0:00:00:00.000 cpu0:1)WARNING: Serial: 813: Serial port com1 failed during initialization.

0:00:00:00.000 cpu0:1)WARNING: Serial: 814: Serial port com1 will be disabled.

0:00:00:00.000 cpu0:1)WARNING: Serial: 813: Serial port com2 failed during initialization.

0:00:00:00.000 cpu0:1)WARNING: Serial: 814: Serial port com2 will be disabled.

0:00:00:06.495 cpu0:32768)WARNING: VMKAcpi: 783: No IPMI PNP id found

2016-01-04T21:45:26.475Z cpu0:33094)WARNING: LinuxSignal: 541: ignored unexpected signal flags 0x2 (sig 17)

2016-01-04T21:45:27.028Z cpu2:33114)WARNING: VMK_PCI: 698: device 0000:00:14.0 failed to allocate 5 MSIX interrupts

2016-01-04T21:45:27.028Z cpu2:33114)WARNING: LinPCI: LinuxPCI_EnableMSIX:862: 0000:00:14.0: Interrupt allocation failed with Not supported

2016-01-04T21:45:27.979Z cpu2:33113)WARNING: LinScsiLLD: scsi_add_host:573: vmkAdapter (usb-storage) sgMaxEntries rounded to 255. Reported size was 65535

2016-01-04T21:45:31.090Z cpu2:33047)WARNING: NetDVS: 658: portAlias is NULL

2016-01-04T21:45:31.323Z cpu0:33211)WARNING: Tcpip: 927: Failed to unset the ip address (error = 49)

2016-01-04T21:45:40.713Z cpu3:33354)WARNING: Supported VMs 64, Max VSAN VMs 400, SystemMemoryInGB 16

2016-01-04T21:45:40.713Z cpu3:33354)WARNING: MaxFileHandles: 1920, Prealloc 1, Prealloc limit: 32 GB, Host scaling factor: 2

2016-01-04T21:45:40.713Z cpu3:33354)WARNING: DOM memory will be preallocated.

2016-01-04T21:46:12.261Z cpu0:33047)WARNING: HBX: 4837: Replay of journal <FB 1506800> on vol 'hdd-2tb' failed: I/O error

2016-01-04T21:46:12.663Z cpu2:33405)WARNING: FTCpt: 476: Using IPv6 address to start server listener

2016-01-04T21:46:22.289Z cpu1:32788)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "t10.ATA_____APPLE_HDD_ST2000LM003___________________S341J9CG700310______" state in doubt; requested fast path state update...

2016-01-04T21:46:22.289Z cpu1:34159)WARNING: HBX: 4837: Replay of journal <FB 1506800> on vol 'hdd-2tb' failed: I/O error

2016-01-04T21:46:33.650Z cpu0:34502)WARNING: HBX: 4837: Replay of journal <FB 1506800> on vol 'hdd-2tb' failed: IO was aborted by VMFS via a virt-reset on the device

2016-01-04T21:46:36.255Z cpu0:34502)WARNING: HBX: 4837: Replay of journal <FB 1506800> on vol 'hdd-2tb' failed: I/O error

2016-01-04T21:52:29.237Z cpu2:35491)WARNING: HBX: 4837: Replay of journal <FB 1506800> on vol 'hdd-2tb' failed: I/O error

2016-01-04T21:52:34.396Z cpu1:35492)WARNING: HBX: 4837: Replay of journal <FB 1506800> on vol 'hdd-2tb' failed: I/O error

2016-01-04T21:54:40.775Z cpu1:32788)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "t10.ATA_____APPLE_HDD_ST2000LM003___________________S341J9CG700310______" state in doubt; requested fast path state update...

2016-01-04T21:54:40.775Z cpu2:34651 opID=3431c904)WARNING: HBX: 4837: Replay of journal <FB 1506800> on vol 'hdd-2tb' failed: I/O error

2016-01-04T21:54:41.164Z cpu0:34651 opID=3431c904)WARNING: HBX: 4837: Replay of journal <FB 1506800> on vol 'hdd-2tb' failed: I/O error

2016-01-04T21:54:53.310Z cpu0:34503 opID=abb97213)WARNING: HBX: 4837: Replay of journal <FB 1506800> on vol 'hdd-2tb' failed: I/O error

2016-01-04T21:54:53.709Z cpu0:34503 opID=abb97213)WARNING: HBX: 4837: Replay of journal <FB 1506800> on vol 'hdd-2tb' failed: I/O error

2016-01-04T21:54:54.120Z cpu0:34503 opID=abb97213)WARNING: HBX: 4837: Replay of journal <FB 1506800> on vol 'hdd-2tb' failed: I/O error

2016-01-04T21:54:54.564Z cpu0:34503 opID=abb97213)WARNING: HBX: 4837: Replay of journal <FB 1506800> on vol 'hdd-2tb' failed: I/O error
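
For a log this size, narrowing the grep to storage-related messages tends to be more useful than matching every WARNING. A minimal sketch, assuming the HBX and NMP message tags seen in the output above are the interesting ones; the function names storage_warnings and count_io_errors are hypothetical, and the pattern is not an exhaustive list of storage-related tags.

```shell
#!/bin/sh
# Print only the WARNING lines tagged HBX or NMP from a vmkernel log.
storage_warnings() {
    grep -E 'WARNING: (HBX|NMP):' "$1"
}

# Count how many of those lines mention an I/O error.
count_io_errors() {
    storage_warnings "$1" | grep -c 'I/O error'
}

# Usage, from the directory that holds the log:
# storage_warnings vmkernel.log
# count_io_errors vmkernel.log
```

Many repeats of the same journal-replay failure on the same volume suggest a problem with the disk or file system and not a one-off glitch.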


-Ed

virtualosa
Contributor

OK, so you've rebooted, but it seems there's a problem with the connection to storage (at least to the scratch partition). You said this is all local storage, right? Can you browse to the location of the scratch files and see if the vmkernel.log file is in there?

EdOfTheMountai2
Contributor

>Ok. so you've rebooted but it seems there's a problem with the connection to storage (at least to the scratch partition), but you said this is all local storage, right?


Yes, it is all on a local spinning hard drive.


>Can you browse to the location of the scratch files? See if the vmkernel.log file is in there?

Yes.  I could see it in my old ScratchConfig location, but when I tried to cat the file I got an Input/output error.

After changing the ScratchConfig location and rebooting, I can now open the vmkernel.log file, which I attached to this post.


I did not see anything obvious, but then I do not know what to look for.


Thank you for your persistence,


-Ed


Attachment:

/vmfs/volumes/hdd-2tb/scratch/.locker-esx-hv01-subsite.com/log/vmkernel.log


virtualosa
Contributor

I can't download this log file for some reason. But now that this is better, do you still see the .lck files? Do you see the vmkernel.log file in the /var/log directory as well? These should be links to the scratch partition. If you don't see the lock files, try registering the VMs.

EdOfTheMountai2
Contributor

>I can't download this log file for some reason.


Weird.  I can download the log file from this post on my end.  Could some kind of security blocker be in the way?


>But now that this is better, do you still see the .lck files?


Yes.  The 3 VMs that I cannot get back into inventory all have *.vmx.lck and *.vswp files.


[root@localhost:/vmfs/volumes/56325d60-4a871a28-ad1e-38c98618970d/subsite-jenkins] ls -al

total 841134104

drwxr-xr-x    1 root     root          2520 Nov 23 14:11 .

drwxr-xr-t    1 root     root          2240 Jan  4 21:36 ..

-rw-------    1 root     root     2147483648 Nov 23 14:11 subsite-jenkins-36234e19.vswp

-rw-------    1 root     root     858993459200 Dec 28 11:43 subsite-jenkins-flat.vmdk

-rw-------    1 root     root          8684 Dec 28 11:43 subsite-jenkins.nvram

-rw-------    1 root     root           507 Nov 23 14:11 subsite-jenkins.vmdk

-rw-r--r--    1 root     root             0 Nov  2 19:15 subsite-jenkins.vmsd

-rwxr-xr-x    1 root     root          2726 Nov 23 14:11 subsite-jenkins.vmx

-rw-------    1 root     root             0 Nov 23 14:11 subsite-jenkins.vmx.lck

-rwxr-xr-x    1 root     root          2641 Nov 23 14:11 subsite-jenkins.vmx~

-rw-r--r--    1 root     root         72424 Nov  2 21:47 vmware-3.log

-rw-r--r--    1 root     root        166649 Nov  2 22:09 vmware-4.log

-rw-r--r--    1 root     root        166665 Nov  2 22:14 vmware-5.log

-rw-r--r--    1 root     root        166499 Nov  2 22:23 vmware-6.log

-rw-r--r--    1 root     root        169750 Nov  5 20:56 vmware-7.log

-rw-r--r--    1 root     root        207988 Nov 20 20:33 vmware-8.log

-rw-r--r--    1 root     root        139993 Dec  8 19:57 vmware.log

-rw-------    1 root     root     170917888 Nov 23 14:11 vmx-subsite-jenkins-908283417-1.vswp


I cannot remove the *.vmx.lck file.

rm subsite-jenkins.vmx.lck

rm: can't remove 'subsite-jenkins.vmx.lck': Input/output error

>Do you see the vmkernel.log file in the /var/log directory as well? These should be links to the scratch partition.


Yes I do see symbolic links now.

ls -al vmkernel*

lrwxrwxrwx    1 root     root            25 Jan  4 22:13 vmkernel.log -> /scratch/log/vmkernel.log


>If you don't see the lock files, try registering the VMs.

I still see *.vmx.lck files in the VM directory where the *.vmx is located.


Using vSphere DataStore Browser the "Add to Inventory" option is grayed-out.


add-to-inventory.png


-Ed


virtualosa
Contributor

OK now that you can see the vmkernel log, I would go back to the original article I sent and see if you can find out what's happening.

Also, open the vmware.log file for one of the VMs and see if that tells you anything.

Also, you can try creating a new VM and pointing it to the vmdk file of one of these VMs and see if it will boot up that way. I'd start with the one that doesn't have any snapshots.

Also, do you have any VMs in inventory that show as "unknown"? If so, remove those from inventory, these could be causing the lock files issue.

I'm sorry I'm not being more help, seems like you may have a few things going on.

virtualosa
Contributor

Hi

Did you make any progress here? I was able to download the vmkernel log today, not sure why yesterday I couldn't, but it's looking like you have some problems with your storage.

These lines:

WARNING: HBX: 4837: Replay of journal <FB 1506800> on vol 'hdd-2tb' failed: IO was aborted by VMFS via a virt-reset on the device

Check this out.

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=200948...

That would explain why you couldn't read the vmkernel log while it was being saved on this same storage.

Also, the repeated "WARNING: HBX: 4837: Replay of journal <FB 1506800> on vol 'hdd-2tb' failed: I/O error" lines point to storage errors.

I'd take a look at the disks, not sure what your configuration is.

EdOfTheMountai2
Contributor

When I use md5sum to test if files are readable I get a number of files that report Input/output errors.

Since these include critical files such as vmdk files, I think resurrecting any corrupted VMs  is not possible.  Do you agree?

All three running VMs report a similar pattern of file input/output errors on files that were likely opened during the power-failure.

[root@localhost:/vmfs/volumes/56325d60-4a871a28-ad1e-38c98618970d/subsite-jenkins] md5sum *

md5sum: can't open 'subsite-jenkins-36234e19.vswp': Input/output error

md5sum: can't open 'subsite-jenkins-flat.vmdk': Input/output error

md5sum: can't open 'subsite-jenkins.nvram': Input/output error

6bcd4a18e5a1dc0446e15f07ff5256fa  subsite-jenkins.vmdk

d41d8cd98f00b204e9800998ecf8427e  subsite-jenkins.vmsd

md5sum: can't open 'subsite-jenkins.vmx': Input/output error

md5sum: can't open 'subsite-jenkins.vmx.lck': Input/output error

md5sum: can't open 'subsite-jenkins.vmx~': Input/output error

aa4f4598ef8238872dfa3627e169bf2e  vmware-3.log

f2d441d27f534c601a0421803921cd7f  vmware-4.log

ab08a6e8448c23968697b2192fd0ba15  vmware-5.log

cc46ae866e1b68e0a6b9c65764f69062  vmware-6.log

38a061c4cc380d054edd11cda845c797  vmware-7.log

3b1e170646523a51a0a59fccda527ec8  vmware-8.log

md5sum: can't open 'vmware.log': Input/output error

md5sum: can't open 'vmx-subsite-jenkins-908283417-1.vswp': Input/output error
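
The md5sum run above doubles as a read test. A minimal sketch of the same idea that prints only the files that cannot be read in full, assuming a POSIX shell with cat available (cat is already used on this host earlier in the thread). The function name unreadable_files is hypothetical.

```shell
#!/bin/sh
# Try to read each given file completely and print the names of those
# that fail (for example with an Input/output error).
unreadable_files() {
    for f in "$@"; do
        if ! cat "$f" > /dev/null 2>&1; then
            printf '%s\n' "$f"
        fi
    done
}

# Usage, from inside a VM directory:
# unreadable_files *
```

Reading a large flat vmdk end to end can take a long time, so it may be worth running this on the small files first.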

I need to learn how to back up VMs if the VMFS file system is this easily corrupted when running VMs go down during a power failure.  I need to learn how to make backups regardless of the cause.  My Synology NAS log indicated the UPS tripped a half-dozen times over the holidays and went all the way down at least once.

I am hoping I can back up ESXi VMs to my Synology NAS.  I could justify a $500 purchase of ESXi 6 Essentials if it gives me VM backup capability. Does it?  I've already spent way too much time trying to recover from this power failure.


I also need to figure out how to configure my APS UPS to signal ESXi to shut down on power failure, since the VMFS file system is this easy to corrupt.  Until then, I will shut down ESXi every night when I go home.


Thank you for your help,


-Ed

virtualosa
Contributor

Yeah, bummer! There might be a way to recover, but I don't know it offhand. I also don't know if it's worth the time, since I don't know what the VMs are or how critical they are.

If you go with the Essentials Plus edition, you get the Data Protection product, which will let you back up, but I don't know how much it costs.

Good luck!

EdOfTheMountai2
Contributor

Yes, it is a bummer.  I cannot read or delete the input/output error files from vmfs.  I guess I  need to re-export all VMs and reinstall everything from scratch since there are no tools to fix the vmfs.

Fortunately I do have the original VMs that I exported to ESXi 6 still running on VMware Workstation and VMware Fusion.  Unfortunately I have probably lost some configuration changes.  I think step 1 will be to back up these VMs running on VMware Workstation and VMware Fusion.

>If you go with the Essentials Plus edition, you get the Data Protection product which will let you backup but I don't know how much it costs.

Essentials Plus is $5,439.  I was hoping that the $560 Essentials would give me an API or something to do backups.  Maybe not.

VMware vSphere Essentials Kits - Official VMware Store

I can see that I *really* need to find an affordable backup solution if I want to use ESXi.

Thanks again,

-Ed
