7007VM7007
Enthusiast

Locked files in VMs after storage failure

I had a storage failure today in my test environment and 4 of my VMs are showing as inaccessible, with .lck files in their folders on the datastore. Since this is a test environment I don't have backups of these VMs, so is there anything I can do to save them? Before anyone mentions it, I have tried the VMware KB articles on locked files and have Googled this to death, but clearly I am missing something!

This is how they look in vCenter:

[screenshot: the VMs showing as inaccessible in vCenter]

And this is the folder contents of the VM:

[screenshot: the VM's folder contents on the datastore]

Jitu211003
Hot Shot

Hi,

I had the same kind of issue in my environment, where the storage restarted suddenly and some of the LUNs became unallocated.

There is a workaround for that: take a graceful reboot of the storage once, and all your LUNs should become accessible as they were.

Unfortunately, I formatted one of my three LUNs before rebooting the storage. The other two LUNs were still intact, healthy and accessible after the storage reboot.

For more visit vmwarediary.com or vmwarediary.in

7007VM7007
Enthusiast

I've rebooted both hosts and the storage and this hasn't helped, so I am still stuck.

admin
Immortal

If the problem now is just a lock on the .vmx file, you can create a new virtual machine and point it to the old .vmdk.
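
If it helps, here is a rough sketch of that approach from the ESXi shell; the paths are only examples based on the folder shown above, and "NewVM" is just a placeholder name (the same thing can be done in the vSphere Client with the "Use an existing virtual disk" option):

# Sanity-check that the old descriptor still references its -flat extent
cat /vmfs/volumes/RAID10_SM863/NSX-Manager/NSX-Manager.vmdk

# After creating a new VM whose configuration points at the old .vmdk,
# register it on the host and confirm it is listed
vim-cmd solo/registervm /vmfs/volumes/RAID10_SM863/NewVM/NewVM.vmx
vim-cmd vmsvc/getallvms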

one_topsy
Enthusiast

I know that for VMware Workstation the solution was to delete the .lck file from the directory. When a virtual machine is powered off it removes the lock files it created; since you had a power outage, it might not have been able to remove them.

Perhaps you can back up the lock files for one of the VMs, then delete them and see if you can power it up? Something like the sketch below.
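
As a rough sketch only (the datastore path is an example taken from later in this thread, and on VMFS the lock files may be named .lck-XXXX rather than Workstation-style .lck entries):

# Move the lock files aside instead of deleting them, so they can be restored if needed
cd /vmfs/volumes/RAID10_SM863/NSX-Manager
mkdir -p /vmfs/volumes/RAID10_SM863/lck-backup
mv *.lck /vmfs/volumes/RAID10_SM863/lck-backup/ 2>/dev/null
mv .lck-* /vmfs/volumes/RAID10_SM863/lck-backup/ 2>/dev/null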

------ VCP6-DCV | Aspiring VMWare expert
7007VM7007
Enthusiast

I have tried deleting the lock files and creating a new VM and pointing it to the vmdk, but neither worked.

In the vmkernel.log I get these errors on one of the hosts:

2017-11-08T21:06:50.564Z cpu12:67859)FS3J: 3993: Replaying journal at <type 6 addr 15>, gen 43

2017-11-08T21:06:50.568Z cpu12:67859)WARNING: HBX: 5365: Replay of journal <type 6 addr 15> on vol 'RAID10_SM863' failed: Lost previously held disk lock

bhards4
Hot Shot

Hi,

To identify the lock issue, please follow the blog link below.

Failed to lock the file - Continue...

-Sachin

7007VM7007
Enthusiast

Thanks for the link. I ran both commands:

[root@esxi1:/var/log] vmfsfilelockinfo -p /vmfs/volumes/RAID10_SM863/NSX-Manager/NSX-Manager-flat.vmdk

vmfsfilelockinfo Version 2.0

Looking for lock owners on "NSX-Manager-flat.vmdk"

"NSX-Manager-flat.vmdk" is locked in Exclusive mode by host having mac address ['0c:c4:7a:c5:59:70']

Trying to use information from VMFS Heartbeat

Host owning the lock on file is 192.168.30.8, lockMode : Exclusive

Total time taken : 4.2072997899958864 seconds.

[root@esxi1:/var/log] vmkfstools -D /vmfs/volumes/RAID10_SM863/NSX-Manager/NSX-Manager-flat.vmdk

Lock [type 10c00001 offset 142090240 v 17, hb offset 3702784

gen 43, mode 1, owner 59fe47e7-5d8de2e2-2f7d-0cc47ac55970 mtime 388852

num 0 gblnum 0 gblgen 0 gblbrk 0]

Addr <4, 51, 1>, gen 2, links 1, type reg, flags 0x9, uid 0, gid 0, mode 600

len 64424509440, nb 14554 tbz 0, cow 0, newSinceEpoch 14554, zla 3, bs 1048576

I knew it was host 192.168.30.8 and have already rebooted that host a couple of times, but the lock remains, so there are a few VMs and their folders that I cannot delete.
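
(In case it helps anyone else reading this: the MAC address that vmfsfilelockinfo reports can be matched to a host by listing the vmkernel NICs on each host and comparing the MAC Address field.)

# Run on each host and compare the MAC Address field with the one in the lock
# (0c:c4:7a:c5:59:70 in the output above)
esxcli network ip interface list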

So after running the above commands what else can I do to force delete the VMs and their folders on the datastore?

On host 192.168.30.8 I have these entries in the vmkernel.log:

2017-11-09T08:40:52.124Z cpu10:67892)FS3J: 3993: Replaying journal at <type 6 addr 15>, gen 43

2017-11-09T08:40:52.131Z cpu10:67892)WARNING: HBX: 5365: Replay of journal <type 6 addr 15> on vol 'RAID10_SM863' failed: Lost previously held disk lock

Finikiez
Champion

It looks like a problem with the VMFS heartbeat region.

I would recommend checking the affected VMFS datastore with the VOMA utility (included in ESXi 6.0 and higher).

You need to power off all running VMs on that datastore or Storage vMotion them to other datastores.

Then unmount the datastore from all ESXi hosts and run:

voma -m vmfs -f fix -d /vmfs/devices/disks/naa.id:#

where naa.id is the NAA identifier of the device and # is the partition number.

Usually the partition number is 1 on shared storage.

Checking Metadata Consistency with VOMA
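
As a rough outline of the workflow (the datastore name is taken from your logs, the device ID is a placeholder, and I would run a read-only check before fix mode):

# Find the device (naa.id) and partition backing the datastore
esxcli storage vmfs extent list

# Unmount the datastore on every host that sees it (by label, or use -u with the UUID)
esxcli storage filesystem unmount -l RAID10_SM863

# Read-only check first; run fix mode only if problems are reported
voma -m vmfs -f check -d /vmfs/devices/disks/naa.xxxxxxxx:1
voma -m vmfs -f fix -d /vmfs/devices/disks/naa.xxxxxxxx:1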

7007VM7007
Enthusiast

Thanks, that's really helpful!

I have Storage vMotioned my VMs to another datastore, but there's just one problem: I can't unmount the datastore from one of the hosts. I get this error:

[screenshot: error when unmounting the datastore]

There are no VMs left on this datastore, but the corrupt files (with the lock files I can't get rid of) are still there, so how can I forcefully unmount this datastore so I can run VOMA on it?

Finikiez
Champion

Can you try rebooting this ESXi host, or powering it down for the time when you run VOMA?

7007VM7007
Enthusiast

I'm in a bit of a catch-22 here.

I need to run VOMA, but to do that I have to unmount the datastore, which I can't do because of the corrupt folders on it!

Can I forcefully unmount the datastore so I can run VOMA?

7007VM7007
Enthusiast

I shut down host two and ran:

voma -m vmfs -f fix -d /vmfs/devices/disks/naa.6589cfc0000006f22a5c1eb41598028b:1

but I got:

Checking if device is actively used by other hosts

Scanning for VMFS-6 host activity (4096 bytes/HB, 1024 HBs).

Running VMFS Checker version 2.1 in fix mode

Initializing LVM metadata, Basic Checks will be done

         ERROR: Fix Mode is not yet supported for VMFS-6

   VOMA failed to check device : Not Supported

Total Errors Found:           0

Total Errors Fixed:           0

Total Partially Fixed errors: 0

   Kindly Consult VMware Support for further assistance

OK, how do I just delete the entire datastore now? I can't even do that...

Finikiez
Champion

It's OK to run VOMA if this host is powered off.

The error says that the filesystem is busy. Probably there is an active process that is using this datastore. You can try to find it using this command:

esxcli storage core device world list -d naa.id
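
If that shows worlds still holding the device, you can cross-check them against the running VM processes, roughly like this (the device ID is a placeholder):

# Worlds that still reference the device
esxcli storage core device world list -d naa.xxxxxxxx

# Running VM processes with their world IDs, to see whether any of them match
esxcli vm process list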

mysticknight
Enthusiast

Did you try performing a LUN reset, if it's an iSCSI storage device? It should reset the locks. Normally this solves it for me in most cases: vmkfstools -L lunreset naa.xxxxx

From past experience, if a locking issue is detected and the VMs are vMotioned or the ESXi host is rebooted, that would be the end of the VM.

I always do the LUN reset first, before anything else.
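
For reference, the syntax I use is roughly the following; the device path is only an example, so substitute your own naa ID:

# Reset the LUN backing the datastore to clear stale SCSI reservations/locks
vmkfstools -L lunreset /vmfs/devices/disks/naa.xxxxxxxxxxxx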

7007VM7007
Enthusiast

No, I didn't, but in the end I managed to detach the LUN and then unmount the datastore. I then formatted the datastore and started fresh.
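
For anyone searching later, the CLI equivalent is roughly the following (I did parts of it from the vSphere Client, and the device ID here is a placeholder):

# Unmount the datastore from each host, then detach (turn off) the backing device
esxcli storage filesystem unmount -l RAID10_SM863
esxcli storage core device set --state=off -d naa.xxxxxxxx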

Next time I shall try the LUN reset option.

So a warning and FYI: if you are using VMFS6 and you have file system corruption, VOMA (fix mode) can't help you!

mysticknight
Enthusiast

Yes, I have seen VMFS corruption even on good local disks, and VOMA did nothing.

VMFS6 has major locking issues, so we reverted to VMFS5. I already have an SR open with VMware and they are working on a fix for the next release or so.

Finikiez
Champion

VOMA can only fix locking issues. If other regions (metadata/LVM) or something else is corrupted, VOMA can't help.

VOMA has helped me several times previously with a corrupted heartbeat region.

Can you shed some light on the issue with VMFS6 and locking you observed?
