Re: ESXi 6 VMs "inaccessible" after New Year's power failure?

EdOfTheMountai2 · ‎01-04-2016

My ESXi host power must have failed over the holidays.

The VMs that were running over the holidays are grayed-out and display as "( inaccessible )".
The *.vmx and related files are still there.
The VM that was not running at time of power failure starts fine.
My ESXi host has a single 2TB disk of local storage.

I can browse and locate the vmx files using the vSphere Datastore Browser. I tried removing a VM from inventory. However, when I browse to the *.vmx file in the datastore and right-click it, the "Add to Inventory" option is disabled (grayed-out).

I also tried ssh to the server and reloading without success:

vim-cmd vmsvc/getallvms

vim-cmd vmsvc/reload <invalidVmNumber>

Ideas and suggestions are much appreciated.

Thanks in advance,

-Ed

ESXi 6.0.0 Build 3029758

I am a non-expert ESXi user.

Message was edited by: Ed Sutton Added vSphere screenshot

virtualosa · ‎01-04-2016

I see a lock file in the screen shot, looks like the host thinks it's already registered... have you looked at the vmkernel log? Do you see any mentions of locked files?

Maybe this article will help narrow it down: VMware KB: Investigating virtual machine file locks on ESXi/ESX

EdOfTheMountai2 · ‎01-04-2016

When I cat the vmx file of one of the invalid, inaccessible VMs, I get an Input/output error. Is this a lock error?

[root@localhost:/vmfs/volumes/56325d60-4a871a28-ad1e-38c98618970d/subsite-jenkins] ls -al

total 841134104

drwxr-xr-x 1 root root 2520 Nov 23 14:11 .

drwxr-xr-t 1 root root 2100 Nov 10 21:42 ..

-rw------- 1 root root 2147483648 Nov 23 14:11 subsite-jenkins-36234e19.vswp

-rw------- 1 root root 858993459200 Dec 28 11:43 subsite-jenkins-flat.vmdk

-rw------- 1 root root 8684 Dec 28 11:43 subsite-jenkins.nvram

-rw------- 1 root root 507 Nov 23 14:11 subsite-jenkins.vmdk

-rw-r--r-- 1 root root 0 Nov 2 19:15 subsite-jenkins.vmsd

-rwxr-xr-x 1 root root 2726 Nov 23 14:11 subsite-jenkins.vmx

-rw------- 1 root root 0 Nov 23 14:11 subsite-jenkins.vmx.lck

-rwxr-xr-x 1 root root 2641 Nov 23 14:11 subsite-jenkins.vmx~

-rw-r--r-- 1 root root 72424 Nov 2 21:47 vmware-3.log

-rw-r--r-- 1 root root 166649 Nov 2 22:09 vmware-4.log

-rw-r--r-- 1 root root 166665 Nov 2 22:14 vmware-5.log

-rw-r--r-- 1 root root 166499 Nov 2 22:23 vmware-6.log

-rw-r--r-- 1 root root 169750 Nov 5 20:56 vmware-7.log

-rw-r--r-- 1 root root 207988 Nov 20 20:33 vmware-8.log

-rw-r--r-- 1 root root 139993 Dec 8 19:57 vmware.log

[root@localhost:/vmfs/volumes/56325d60-4a871a28-ad1e-38c98618970d/subsite-jenkins] cat vmware.log

[root@localhost:/vmfs/volumes/56325d60-4a871a28-ad1e-38c98618970d/subsite-jenkins] cat subsite-jenkins.vmx

cat: can't open 'subsite-jenkins.vmx': Input/output error

EdOfTheMountai2 · ‎01-04-2016

Thank you for the link to http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=10051 (VMware KB: Investigating virtual machine file locks on ESXi/ESX).

Yes there is a ".locker" folder. I do not know what this means. Could it have locked up all three of the running VMs when the power failed?

What is the location of the vmkernel log for ESXi 6.0?

[root@localhost:/var/log] ls -al

total 116

drwxr-xr-x 1 root root 512 Jan 4 16:57 .

drwxr-xr-x 1 root root 512 Jan 4 16:55 ..

-rw-r--r-- 1 root root 7067 Jan 4 19:28 .vmsyslogd.err

-rw-rw-rw- 1 root root 37965 Jan 4 16:55 boot.gz

-rw-r--r-- 1 root root 23724 Jan 4 16:55 configRP.log

-rw-r--r-- 1 root root 898 Jan 4 16:55 esxcli.log

drwxr-xr-x 1 root root 512 Jan 4 16:57 ipmi

-rw-r--r-- 1 root root 2086 Jan 4 16:55 jumpstart-stdout.log

-rw------- 1 root root 2752 Jan 4 16:57 smbios.bin

-rw------- 1 root root 9361 Jan 4 16:55 sysboot.log

-rw------- 1 root root 64 Jan 4 19:49 tallylog

drwxr-xr-x 1 root root 512 Jan 4 16:55 vmware

[root@localhost:/var/log] vim-cmd vmsvc/getallvms

Skipping invalid VM '2'

Skipping invalid VM '4'

Vmid Name File Guest OS Version Annotation

5 osx-10.11-clean-install [hdd-2tb] OS X 10.11/OS X 10.11.vmx darwin14_64Guest vmx-11

Vmid 5 runs fine. Vmid 5 was *not* running at time of power failure.

Vmid '2' and '4' were running at time of power failure. I removed '3' (which was also running at time of power fail) in a failed attempt to re-add to inventory.
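For anyone following along, here is a small sketch for pulling the invalid VM IDs out of the `vim-cmd vmsvc/getallvms` output shown above, so each one can be passed to `vim-cmd vmsvc/reload`. The function name `invalid_vm_ids` is made up for illustration, and it assumes the "Skipping invalid VM" lines look exactly like the ones in this post.

```shell
# Print the numeric IDs from "Skipping invalid VM 'N'" lines read on stdin.
# Lines that do not match (header, valid VMs) are ignored.
invalid_vm_ids() {
    sed -n "s/^Skipping invalid VM '\([0-9][0-9]*\)'.*/\1/p"
}

# On the host (not run here). The 2>&1 is an assumption, in case the
# "Skipping" messages are written to stderr rather than stdout:
#   vim-cmd vmsvc/getallvms 2>&1 | invalid_vm_ids
```

With the output above this should print 2 and 4, one per line.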

EdOfTheMountai2 · ‎01-04-2016

In addition to the ".locker" folder, all three of the VMs that were running at time of power fail have a *.lck file.

[root@localhost:/vmfs/volumes/56325d60-4a871a28-ad1e-38c98618970d] find . -name *.lck

./osx-build-01/osx-build-01.vmx.lck

./win7-x64-build-01/win7-x64-build-01.vmx.lck

./subsite-jenkins/subsite-jenkins.vmx.lck
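A side note on the find command: with an unquoted `*.lck`, the shell can expand the pattern itself if a matching file happens to exist in the current directory, and find then searches only for that one name. A slightly safer sketch, with a hypothetical helper name `list_lock_files`:

```shell
# List lock files under a directory (default: the current directory).
# The pattern is quoted so the shell hands it to find unexpanded.
list_lock_files() {
    find "${1:-.}" -name '*.lck'
}

# On the host (not run here), using the datastore path from this thread:
#   list_lock_files /vmfs/volumes/56325d60-4a871a28-ad1e-38c98618970d
```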

virtualosa · ‎01-04-2016

The logs should live in /var/log, but I see vmkernel.log is not there. Have you rebooted your host since the outage? Perhaps that might clear everything. Do you use vCenter? Or is it just a standalone host? Are you redirecting your scratch partition anywhere?

If you run "less /var/log/vmkernel.log" what do you get?

The lock file is the file that ends in vmx.lck, not the .locker folder. The file lives in the same folder as your vmx and vmdk files for each VM.

Could you have been running backups during the outage?

virtualosa · ‎01-04-2016

Perhaps you can try what this article mentions (VMware KB: Investigating hosted virtual machine lock files): delete the lock file and then try registering the VM again with the "Add to Inventory" feature. It also looks like you have snapshot files in that screenshot, so that could be getting in the way as well.

EdOfTheMountai2 · ‎01-04-2016

Thank you for the suggestions.

>Perhaps you can try what this article mentions, http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100385... (VMware KB: Investigating hosted virtual machine lock files) deleting the lock file and then try registering it again with the "add to inventory" feature.

An Input/output error prevents me from deleting any *.vmx.lck file. The host was in "Maintenance mode" with no VMs running when I ssh'ed in as root to try to delete the *.vmx.lck files from /vmfs.

ls -al

-rw------- 1 root root 0 Nov 23 14:11 subsite-jenkins.vmx.lck

[root@localhost:/vmfs/volumes/56325d60-4a871a28-ad1e-38c98618970d/subsite-jenkins] rm subsite-jenkins.vmx.lck

rm: can't remove 'subsite-jenkins.vmx.lck': Input/output error

>Also looks like you have snapshot files on that screen shot, so that could be getting in the way as well.

Yes, 1 of the 3 VMs that were running at the time of the power failure has a snapshot. The other two inaccessible VMs have no snapshots.

I do not use vCenter. I have a single ESXi 6.0.0 host machine using vSphere. I do not know how to automate backups so no backup was running at time of power-fail.

-Ed

EdOfTheMountai2 · ‎01-04-2016

>Are you redirecting your scratch partition anywhere?

Yes. I am booting ESXi 6 from a USB drive. I changed ScratchConfig to point to /vmfs/volumes/

When I boot ESXi, when it is close to 100% done, it displays a black screen with a single word "scratch"

>Have you rebooted your host since the outage? Perhaps that might clear everything.

Yes, several times. Unfortunately no change.

>Do you use vcenter? Or is it just a standalone host? Are you redirecting your scratch partition anywhere?

No. Just a single standalone host.

>If you run "less /var/log/vmkernel.log" what do you get?

The file vmkernel.log does not exist and I get:

[root@localhost:~] less /var/log/vmkernel.log

WARNING: terminal is not fully functional

/var/log/vmkernel.log: No such file or directory

>The lock file is the file that ends in vmx.lck, not the .locker folder. The file lives in the same folder as your vmx and vmdk files for each VM.

There is one *.vmx.lck file in each of the VMs.

>Could you have been running backups during the outage?

No. Unfortunately I have not figured out how to run automatic backups.

Sreejesh_D · ‎01-04-2016

Can you try a reboot of the hypervisor?

Since it's a single-hypervisor environment, a reboot should fix the issues with the lock.

EdOfTheMountai2 · ‎01-04-2016

>can you try a reboot of the hypervisor?

I have rebooted using vSphere a half-dozen times without any change in the inaccessible VMs.

I am booting ESXi 6.0 from USB.

I changed the ScratchConfig to a new directory and rebooted. Now I can cat the vmkernel.log file.

The log file is very large and I do not know what to look for. I tried grep for "WARN".

[root@localhost:/vmfs/volumes/56325d60-4a871a28-ad1e-38c98618970d/scratch/.locker-esx-hv01-subsite.com/log] cat vmkernel.log|grep WARN

0:00:00:00.000 cpu0:1)WARNING: Serial: 813: Serial port com1 failed during initialization.

0:00:00:00.000 cpu0:1)WARNING: Serial: 814: Serial port com1 will be disabled.

0:00:00:00.000 cpu0:1)WARNING: Serial: 813: Serial port com2 failed during initialization.

0:00:00:00.000 cpu0:1)WARNING: Serial: 814: Serial port com2 will be disabled.

0:00:00:06.495 cpu0:32768)WARNING: VMKAcpi: 783: No IPMI PNP id found

2016-01-04T21:45:26.475Z cpu0:33094)WARNING: LinuxSignal: 541: ignored unexpected signal flags 0x2 (sig 17)

2016-01-04T21:45:27.028Z cpu2:33114)WARNING: VMK_PCI: 698: device 0000:00:14.0 failed to allocate 5 MSIX interrupts

2016-01-04T21:45:27.028Z cpu2:33114)WARNING: LinPCI: LinuxPCI_EnableMSIX:862: 0000:00:14.0: Interrupt allocation failed with Not supported

2016-01-04T21:45:27.979Z cpu2:33113)WARNING: LinScsiLLD: scsi_add_host:573: vmkAdapter (usb-storage) sgMaxEntries rounded to 255. Reported size was 65535

2016-01-04T21:45:31.090Z cpu2:33047)WARNING: NetDVS: 658: portAlias is NULL

2016-01-04T21:45:31.323Z cpu0:33211)WARNING: Tcpip: 927: Failed to unset the ip address (error = 49)

2016-01-04T21:45:40.713Z cpu3:33354)WARNING: Supported VMs 64, Max VSAN VMs 400, SystemMemoryInGB 16

2016-01-04T21:45:40.713Z cpu3:33354)WARNING: MaxFileHandles: 1920, Prealloc 1, Prealloc limit: 32 GB, Host scaling factor: 2

2016-01-04T21:45:40.713Z cpu3:33354)WARNING: DOM memory will be preallocated.

2016-01-04T21:46:12.261Z cpu0:33047)WARNING: HBX: 4837: Replay of journal <FB 1506800> on vol 'hdd-2tb' failed: I/O error

2016-01-04T21:46:12.663Z cpu2:33405)WARNING: FTCpt: 476: Using IPv6 address to start server listener

2016-01-04T21:46:22.289Z cpu1:32788)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "t10.ATA_____APPLE_HDD_ST2000LM003___________________S341J9CG700310______" state in doubt; requested fast path state update...

2016-01-04T21:46:22.289Z cpu1:34159)WARNING: HBX: 4837: Replay of journal <FB 1506800> on vol 'hdd-2tb' failed: I/O error

2016-01-04T21:46:33.650Z cpu0:34502)WARNING: HBX: 4837: Replay of journal <FB 1506800> on vol 'hdd-2tb' failed: IO was aborted by VMFS via a virt-reset on the device

2016-01-04T21:46:36.255Z cpu0:34502)WARNING: HBX: 4837: Replay of journal <FB 1506800> on vol 'hdd-2tb' failed: I/O error

2016-01-04T21:52:29.237Z cpu2:35491)WARNING: HBX: 4837: Replay of journal <FB 1506800> on vol 'hdd-2tb' failed: I/O error

2016-01-04T21:52:34.396Z cpu1:35492)WARNING: HBX: 4837: Replay of journal <FB 1506800> on vol 'hdd-2tb' failed: I/O error

2016-01-04T21:54:40.775Z cpu1:32788)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "t10.ATA_____APPLE_HDD_ST2000LM003___________________S341J9CG700310______" state in doubt; requested fast path state update...

2016-01-04T21:54:40.775Z cpu2:34651 opID=3431c904)WARNING: HBX: 4837: Replay of journal <FB 1506800> on vol 'hdd-2tb' failed: I/O error

2016-01-04T21:54:41.164Z cpu0:34651 opID=3431c904)WARNING: HBX: 4837: Replay of journal <FB 1506800> on vol 'hdd-2tb' failed: I/O error

2016-01-04T21:54:53.310Z cpu0:34503 opID=abb97213)WARNING: HBX: 4837: Replay of journal <FB 1506800> on vol 'hdd-2tb' failed: I/O error

2016-01-04T21:54:53.709Z cpu0:34503 opID=abb97213)WARNING: HBX: 4837: Replay of journal <FB 1506800> on vol 'hdd-2tb' failed: I/O error

2016-01-04T21:54:54.120Z cpu0:34503 opID=abb97213)WARNING: HBX: 4837: Replay of journal <FB 1506800> on vol 'hdd-2tb' failed: I/O error

2016-01-04T21:54:54.564Z cpu0:34503 opID=abb97213)WARNING: HBX: 4837: Replay of journal <FB 1506800> on vol 'hdd-2tb' failed: I/O error

-Ed

virtualosa · ‎01-04-2016

Ok. so you've rebooted but it seems there's a problem with the connection to storage (at least to the scratch partition), but you said this is all local storage, right? Can you browse to the location of the scratch files? See if the vmkernel.log file is in there?

EdOfTheMountai2 · ‎01-04-2016

>Ok. so you've rebooted but it seems there's a problem with the connection to storage (at least to the scratch partition), but you said this is all local storage, right?

Yes, all local spinning hard-drive.

>Can you browse to the location of the scratch files? See if the vmkernel.log file is in there?

Yes. I could see it in my old ScratchConfig location, but when I tried to cat the file, I would get an Input/output error.

After changing the ScratchConfig location and rebooting, I can now open the vmkernel.log file, which I attached to this post.

I did not see anything obvious, but then I do not know what to look for.

Thank you for your persistence,

-Ed

Attachment:

/vmfs/volumes/hdd-2tb/scratch/.locker-esx-hv01-subsite.com/log/vmkernel.log

virtualosa · ‎01-04-2016

I can't download this log file for some reason. But now that this is better, do you still see the .lck files? Do you see the vmkernel.log file in the /var/log directory as well? These should be links to the scratch partition. If you don't see the lock files, try registering the VMs.

EdOfTheMountai2 · ‎01-04-2016

>I can't download this log file for some reason.

Weird. I can download the log file from this posting on my end. Must be some kind of security blocker?

>But now that this is better, do you still see the .lck files?

Yes. The 3 VMs that I cannot get back into inventory all have *.vmx.lck and *.vswp files.

[root@localhost:/vmfs/volumes/56325d60-4a871a28-ad1e-38c98618970d/subsite-jenkins] ls -al

total 841134104

drwxr-xr-x 1 root root 2520 Nov 23 14:11 .

drwxr-xr-t 1 root root 2240 Jan 4 21:36 ..

-rw------- 1 root root 2147483648 Nov 23 14:11 subsite-jenkins-36234e19.vswp

-rw------- 1 root root 858993459200 Dec 28 11:43 subsite-jenkins-flat.vmdk

-rw------- 1 root root 8684 Dec 28 11:43 subsite-jenkins.nvram

-rw------- 1 root root 507 Nov 23 14:11 subsite-jenkins.vmdk

-rw-r--r-- 1 root root 0 Nov 2 19:15 subsite-jenkins.vmsd

-rwxr-xr-x 1 root root 2726 Nov 23 14:11 subsite-jenkins.vmx

-rw------- 1 root root 0 Nov 23 14:11 subsite-jenkins.vmx.lck

-rwxr-xr-x 1 root root 2641 Nov 23 14:11 subsite-jenkins.vmx~

-rw-r--r-- 1 root root 72424 Nov 2 21:47 vmware-3.log

-rw-r--r-- 1 root root 166649 Nov 2 22:09 vmware-4.log

-rw-r--r-- 1 root root 166665 Nov 2 22:14 vmware-5.log

-rw-r--r-- 1 root root 166499 Nov 2 22:23 vmware-6.log

-rw-r--r-- 1 root root 169750 Nov 5 20:56 vmware-7.log

-rw-r--r-- 1 root root 207988 Nov 20 20:33 vmware-8.log

-rw-r--r-- 1 root root 139993 Dec 8 19:57 vmware.log

-rw------- 1 root root 170917888 Nov 23 14:11 vmx-subsite-jenkins-908283417-1.vswp

I cannot remove the *.vmx.lck file.

rm subsite-jenkins.vmx.lck

rm: can't remove 'subsite-jenkins.vmx.lck': Input/output error

>Do you see the vmkernel.log file in the /var/log directory as well? These should be links to the scratch partition.

Yes I do see symbolic links now.

ls -al vmkernel*

lrwxrwxrwx 1 root root 25 Jan 4 22:13 vmkernel.log -> /scratch/log/vmkernel.log

>If you don't see the lock files, try registering the VMs.

I still see *.vmx.lck files in the VM directory where the *.vmx is located.

Using vSphere DataStore Browser the "Add to Inventory" option is grayed-out.

-Ed

virtualosa · ‎01-04-2016

OK now that you can see the vmkernel log, I would go back to the original article I sent and see if you can find out what's happening.

Also, open the vmware.log file for one of the VMs and see if that tells you anything.

Also, you can try creating a new VM and pointing it to the vmdk file of one of these VMs and see if it will boot up that way. I'd start with the one that doesn't have any snapshots.

Also, do you have any VMs in inventory that show as "unknown"? If so, remove those from inventory, these could be causing the lock files issue.

I'm sorry I'm not being more help, seems like you may have a few things going on.

virtualosa · ‎01-05-2016

Hi

Did you make any progress here? I was able to download the vmkernel log today, not sure why yesterday I couldn't, but it's looking like you have some problems with your storage.

These lines:

WARNING: HBX: 4837: Replay of journal <FB 1506800> on vol 'hdd-2tb' failed: IO was aborted by VMFS via a virt-reset on the device

Check this out.

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=200948...

That would explain why you couldn't see the vmkernel log when it was being saved on this same storage.

Also, "WARNING: HBX: 4837: Replay of journal <FB 1506800> on vol 'hdd-2tb' failed: I/O error" points to storage errors.

I'd take a look at the disks, not sure what your configuration is.
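To get a quick count of how often the journal replay failed, and with which reason, a filter along these lines could help. This is only a sketch: `journal_replay_failures` is a made-up helper name, and it assumes the warning lines have the same layout as the ones quoted in this thread.

```shell
# Summarise "Replay of journal ... failed" warnings from a vmkernel log
# read on stdin. Each output line is: count, volume name, failure reason.
journal_replay_failures() {
    sed -n "s/.*WARNING: HBX: .*Replay of journal .* on vol '\([^']*\)' failed: \(.*\)\$/\1: \2/p" |
        sort | uniq -c
}

# On the host (not run here):
#   journal_replay_failures < /var/log/vmkernel.log
```

If one volume accounts for every failure, that narrows the problem to that disk or its VMFS metadata rather than the host as a whole.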

EdOfTheMountai2 · ‎01-05-2016

When I use md5sum to test if files are readable I get a number of files that report Input/output errors.

Since these include critical files such as vmdk files, I think resurrecting any corrupted VMs is not possible. Do you agree?

All three VMs that were running report a similar pattern of input/output errors on files that were likely open during the power failure.

[root@localhost:/vmfs/volumes/56325d60-4a871a28-ad1e-38c98618970d/subsite-jenkins] md5sum *

md5sum: can't open 'subsite-jenkins-36234e19.vswp': Input/output error

md5sum: can't open 'subsite-jenkins-flat.vmdk': Input/output error

md5sum: can't open 'subsite-jenkins.nvram': Input/output error

6bcd4a18e5a1dc0446e15f07ff5256fa subsite-jenkins.vmdk

d41d8cd98f00b204e9800998ecf8427e subsite-jenkins.vmsd

md5sum: can't open 'subsite-jenkins.vmx': Input/output error

md5sum: can't open 'subsite-jenkins.vmx.lck': Input/output error

md5sum: can't open 'subsite-jenkins.vmx~': Input/output error

aa4f4598ef8238872dfa3627e169bf2e vmware-3.log

f2d441d27f534c601a0421803921cd7f vmware-4.log

ab08a6e8448c23968697b2192fd0ba15 vmware-5.log

cc46ae866e1b68e0a6b9c65764f69062 vmware-6.log

38a061c4cc380d054edd11cda845c797 vmware-7.log

3b1e170646523a51a0a59fccda527ec8 vmware-8.log

md5sum: can't open 'vmware.log': Input/output error

md5sum: can't open 'vmx-subsite-jenkins-908283417-1.vswp': Input/output error
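The md5sum pass above reads every file in full, which takes a while on an 800 GB flat vmdk. Since the failures here happen at open time, a quicker sketch is to read only the first 512 bytes of each file. The helper name `check_readable` is hypothetical, and a file that passes can still have unreadable blocks further in, so treat an "OK" as a hint only.

```shell
# Try to read the first 512 bytes of each file given as an argument and
# report whether that read succeeded. Empty files count as readable.
check_readable() {
    for f in "$@"; do
        if dd if="$f" of=/dev/null bs=512 count=1 2>/dev/null; then
            echo "OK   $f"
        else
            echo "FAIL $f"
        fi
    done
}

# On the host (not run here), from inside a VM directory:
#   check_readable *
```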

I need to learn how to back up VMs if the VMFS file system is this easily corrupted when running VMs go down during a power failure. I need to learn how to make backups regardless of the cause. My Synology NAS log indicated the UPS tripped a half-dozen times over the holidays and went all the way down at least once.

I am hoping I can back up ESXi VMs to my Synology NAS. I can justify a $500 purchase of ESXi 6 Essentials if it gives me VM backup capability. I've already spent way too much time trying to recover from this power failure.

I also need to figure out how to configure my APC UPS to signal ESXi to shut down on power failure if the VMFS file system is this easy to corrupt. Until then, I will shut down ESXi every night when I go home.
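One possible building block for the UPS idea, as a rough sketch only: turn the `vim-cmd vmsvc/getallvms` listing into one guest-shutdown command per VM. How the UPS event reaches the host is not covered here; the assumption is that something else (for example the NAS that already sees the UPS) would log in over SSH and run this. `shutdown_commands` is a made-up name, `vim-cmd vmsvc/power.shutdown` needs VMware Tools in the guest, and a multi-line annotation that begins with a number would be misread as a VM ID.

```shell
# Read `vim-cmd vmsvc/getallvms` output on stdin and print one guest
# shutdown command per VM ID (lines whose first field is all digits).
# Nothing is executed here; the function only prints the commands.
shutdown_commands() {
    awk '$1 ~ /^[0-9]+$/ { print "vim-cmd vmsvc/power.shutdown " $1 }'
}

# On the host (not run here). Preview first, then pipe to sh to execute:
#   vim-cmd vmsvc/getallvms | shutdown_commands
#   vim-cmd vmsvc/getallvms | shutdown_commands | sh
```

Printing instead of executing keeps the dry run and the real run on the same code path, which seems safer for something that powers off VMs.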

Thank you for your help,

-Ed

virtualosa · ‎01-05-2016

Yeah bummer! There might be a way to recover but I don't know it off hand. Also I don't know if it's worth the time, don't know what they are, how critical, etc.

If you go with the Essentials Plus edition, you get the Data Protection product, which will let you back up, but I don't know how much it costs.

Good luck!

EdOfTheMountai2 · ‎01-05-2016

Yes, it is a bummer. I cannot read or delete the files that report input/output errors on the VMFS volume. I guess I need to re-export all VMs and reinstall everything from scratch since there are no tools to fix the vmfs.

Fortunately I do have the original VMs that I exported to ESXi 6 still running on VMware Workstation and VMware Fusion. Unfortunately I have probably lost some configuration changes. I think step 1 will be to back up these VMs running on VMware Workstation and VMware Fusion.

>If you go with the Essentials Plus edition, you get the Data Protection product which will let you backup but I don't know how much it costs.

Essentials Plus is $5,439. I was hoping that the $560 Essentials would give me an API or something to do backups. Maybe not.

VMware vSphere Essentials Kits - Official VMware Store

I can see that I *really* need to find an affordable backup solution if I want to use ESXi.

Thanks again,

-Ed
