Hi all,
we have a VMware farm with six ESXi 6.5.0 hosts, build 9298722 (5 x ProLiant BL460c Gen9 and 1 x ProLiant DL380 Gen10).
In October we had a problem with two virtual machines that got stuck while consolidating snapshots.
VMware sends this message:
ServerXXX on VCenterXXX needs snapshot consolidation, but all automatic snapshot consolidation attempts have failed.
Most likely reason is a virtual disk being locked by some external process. Please troubleshoot the locking issue, and initiate snapshot consolidation manually in vSphere Client.
Veeam Backup & Replication 9.5.4.2866
Consolidating snapshots constantly failed on one disk of both machines. One of the machines was the file server, and trying to consolidate disconnected the affected disk. We tried to completely restore the faulty disk from Veeam, but after the next backup the disk got the same error. Our workaround was a completely new virtual disk and a copy of the data from the old disk to the new one. After that the machines ran smoothly, until Saturday two weeks ago the same problem occurred again.
Attached you find the VMware log of one of the machines. We tried different things, but now we are clueless.
If you need any further information, don't hesitate to contact me.
Hi,
try to follow the resolution from this KB
ARomeo
But if it is the SAN Transport Virtual Disk API, shouldn't the problem then occur on every system? The problem seems to be pinned to these two machines, and it occurs when the backup process is ending and Veeam tries to consolidate the snapshot.
And Veeam Backup is THE global player in backing up VMware. Why is this problem still not solved if it is a problem in Veeam?
Hi,
From the log file you can see that Veeam is locking the disk (as you have highlighted). I guess you already tried to stop the backup from Veeam, restart the problem server, and manually consolidate the snapshot.
If you can, try to do a Storage vMotion of the VM and then consolidate the snapshot.
ARomeo
When the Veeam job failed, we tried to restart the machine and do a complete vMotion to another host and another storage. A Storage vMotion won't work: it fails with the message that the disk is locked.
Hi,
so you are in this case here:
KB1775: Removing Stuck VMDK’s from the Veeam Proxy
ARomeo
In the log file, the first entry showing the problem is here, I think:
2020-01-30T20:09:17.301Z| Worker#0| I125: OBJLIB-FILEBE : FileBEIoctl: ioctl operation IOCTLCMD_VMFS_DELTADISKS(3033) failed on '/vmfs/devices/cbt/c0c249b6-cbt' : Bad file descriptor (589826)
If I read the log correctly, the backup is completed at that point and Veeam gave VMware the command to consolidate the snapshot, but it fails. I don't know why, or where this message has its source.
Bad file descriptor is an error message that apparently means that the VMFS metadata for at least one fragment of the file is missing, bad, or inaccessible.
As you got this error for a cbt file, you may get away with discarding all CBT vmdks.
If this works you need to create a full backup next time you run Veeam.
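Discarding the CBT sidecars can be sketched like this; the datastore path is a placeholder, and the VM should be powered off while you do it:

```shell
#!/bin/sh
# Sketch: discard the CBT sidecar files (*-ctk.vmdk) of a powered-off VM.
# VMDIR is a placeholder -- substitute your real VM directory.
VMDIR="/vmfs/volumes/datastore1/SRV-FILE"

# Review the sidecar files first ...
find "$VMDIR" -maxdepth 1 -name '*-ctk.vmdk' -print

# ... then remove them. Valid tracking files are recreated on the next
# power-on with CBT enabled, and Veeam then needs a new full backup.
find "$VMDIR" -maxdepth 1 -name '*-ctk.vmdk' -exec rm -f {} \;
```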
What can cause this problem with the VMFS metadata?
Today we restored both faulty machines onto a completely new datastore. We hope that the backup this evening will run without an error. But we still want to know the source of the problem.
I see this error with (not listed according to probability!):
- the disk is in a physically bad state - dying disks
- you accidentally filled up the datastore once
- power failure
- I/O error on the disk (can be caused by bad disks AND by VMFS corruption)
For files with valuable content I try to recover by creating a dd script for each fragment - that sometimes helps, or is at least good enough.
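The per-fragment dd idea can be sketched as follows; the device path and the offset/length table are made-up placeholders, since a real fragment map has to be pulled from the VMFS metadata of the damaged file:

```shell
#!/bin/sh
# Sketch: rebuild a file fragment by fragment with dd.
# DEVICE and the offset/length table below are placeholders -- the real
# fragment map must come from the VMFS metadata of the damaged file.
DEVICE="/dev/sdb"
OUT="recovered-flat.vmdk"
: > "$OUT"

# Each input line: byte offset and byte length of one fragment,
# both assumed to be multiples of 1 MiB here.
while read -r OFFSET LENGTH; do
    dd if="$DEVICE" of="$OUT" bs=1048576 \
       skip=$((OFFSET / 1048576)) count=$((LENGTH / 1048576)) \
       oflag=append conv=notrunc
done <<EOF
1048576 2097152
5242880 1048576
EOF
```

Each fragment is appended in order, so the output file ends up as the concatenation of the listed ranges.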
When I encounter Bad file descriptor for cbt files I usually don't try to recover the cbt vmdk but just reset the change block tracking.
Can someone explain to me what is happening at this point?
2020-02-20T15:14:32.188Z| Worker#0| I125: DISKLIB-VMFS : "/vmfs/volumes/5e367362-5487fa49-fa80-6cc2173ba4d0/SRV-FILE/SRV-FILE_2-000001-sesparse.vmdk" : open successful (10) size = 12636745728, hd = 16734001. Type 19
2020-02-20T15:14:32.188Z| Worker#0| I125: DISKLIB-DSCPTR: Opened [0]: "SRV-FILE_2-000001-sesparse.vmdk" (0xa)
2020-02-20T15:14:32.188Z| Worker#0| I125: DISKLIB-LINK : Opened '/vmfs/volumes/5e367362-5487fa49-fa80-6cc2173ba4d0/SRV-FILE/SRV-FILE_2-000001.vmdk' (0xa): seSparse, 5872025600 sectors / 2.7 TB.
2020-02-20T15:14:32.189Z| Worker#0| I125: DISKLIB-VMFS : "/vmfs/volumes/5e367362-5487fa49-fa80-6cc2173ba4d0/SRV-FILE/SRV-FILE_2-flat.vmdk" : open successful (14) size = 3006477107200, hd = 33969992. Type 3
2020-02-20T15:14:32.189Z| Worker#0| I125: DISKLIB-DSCPTR: Opened [0]: "SRV-FILE_2-flat.vmdk" (0xe)
2020-02-20T15:14:32.189Z| Worker#0| I125: DISKLIB-LINK : Opened '/vmfs/volumes/5e367362-5487fa49-fa80-6cc2173ba4d0/SRV-FILE/SRV-FILE_2.vmdk' (0xe): vmfs, 5872025600 sectors / 2.7 TB.
2020-02-20T15:14:32.189Z| Worker#0| I125: DISKLIB-CHAINESX : ChainESXOpenSubChain:(0) fid = 33969992, extentType = 2
2020-02-20T15:14:32.189Z| Worker#0| I125: DISKLIB-CHAINESX : ChainESXOpenSubChain:(1) fid = 16734001, extentType = 1
2020-02-20T15:14:32.190Z| Worker#0| I125: DISKLIB-LIB_BLOCKTRACK : Resuming change tracking.
2020-02-20T15:14:32.191Z| Worker#0| I125: DISKLIB-CTK : Could not open change tracking file "/vmfs/volumes/5e367362-5487fa49-fa80-6cc2173ba4d0/SRV-FILE/SRV-FILE_2-000001-ctk.vmdk": Change tracking invalid or disk in use.
2020-02-20T15:14:32.191Z| Worker#0| I125: DISKLIB-CTK : Change tracking invalid; reinitializing.
2020-02-20T15:14:32.191Z| Worker#0| I125: DISKLIB-CTK : Auto blocksize for size 5872025600 is 4096.
2020-02-20T15:14:32.192Z| Worker#0| I125: OBJLIB-FILEBE : Error creating file '/vmfs/volumes/5e367362-5487fa49-fa80-6cc2173ba4d0/SRV-FILE/SRV-FILE_2-000001-ctk.vmdk': 3 (The file already exists).
2020-02-20T15:14:32.192Z| Worker#0| I125: DISKLIB-CBT : Initializing ESX kernel change tracking for fid 20404041.
2020-02-20T15:14:32.193Z| Worker#0| I125: DISKLIB-CBT : Successfuly created cbt node 1375749-cbt.
2020-02-20T15:14:32.193Z| Worker#0| I125: DISKLIB-CBT : Opening cbt node /vmfs/devices/cbt/1375749-cbt
2020-02-20T15:14:34.938Z| Worker#0| I125: OBJLIB-FILEBE : FileBEIoctl: ioctl operation IOCTLCMD_VMFS_DELTADISKS(3033) failed on '/vmfs/devices/cbt/1375749-cbt' : Bad file descriptor (589826)
2020-02-20T15:14:34.938Z| Worker#0| I125: DISKLIB-VMFS :VmfsSparseExtentCommonGetAllocatedSectorChunks: ObjLib_Ioctl failed 0x90009
2020-02-20T15:14:34.938Z| Worker#0| I125: DISKLIB-LIB_MISC : DiskLibGetAllocatedSectorChunksInRangeInt: failed to get allocated sector bitmap with 'Bad file descriptor' (589833).
2020-02-20T15:14:34.938Z| Worker#0| W115: DISKLIB-CBT : ChangeTrackerESX_MarkAllUsedAreas: Failed to get allocated sectors: Bad file descriptor.
2020-02-20T15:14:34.938Z| Worker#0| I125: DISKLIB-LIB : Opened "/vmfs/volumes/5e367362-5487fa49-fa80-6cc2173ba4d0/SRV-FILE/SRV-FILE_2-000001.vmdk" (flags 0xa, type vmfs).
2020-02-20T15:14:34.938Z| Worker#0| I125: DISK: Disk '/vmfs/volumes/5e367362-5487fa49-fa80-6cc2173ba4d0/SRV-FILE/SRV-FILE_2-000001.vmdk' has UUID '60 00 c2 9f fc ea ac c5-17 3d 67 51 a7 90 97 0c'
2020-02-20T15:14:34.938Z| Worker#0| I125: DISK: OPEN '/vmfs/volumes/5e367362-5487fa49-fa80-6cc2173ba4d0/SRV-FILE/SRV-FILE_2-000001.vmdk' Geo (365516/255/63) BIOS Geo (0/0/0)
2020-02-20T15:14:34.951Z| Worker#0| I125: UTIL: Change file descriptor limit from soft 4096,hard 4096 to soft 4211,hard 4211.
Bad file descriptor means that you need to switch to Linux if you want to continue using that file.
With the client OS? What does the client OS have to do with that?
Not the guest OS, of course.
I am talking about the host OS.
Bad file descriptor = ESXi tells you that he does not want to read the specified file.
Either give up that file or try to read it with Linux.
When a change block tracking vmdk is affected, you simply discard the cbt vmdks.
Nobody tries to recover cbt vmdks - it makes no sense ...
What does Bad file descriptor mean?
To read a file you need to look up its properties like size, name, permissions, owner, type and so on.
That info is written into the inode; for every file there is an inode.
Once you have that info you also need to look up the location of every fragment of the file on disk.
Here ESXi is very picky.
He checks the list of fragments, and if that list is not 100% OK, ESXi rather decides not to read the file at all than to risk a small error.
So Bad file descriptor is ESXi's way of telling you: I know the file you want to have, but something may be wrong with it.
So I decided to give you nothing at all.
If we could tell ESXi to be less stubborn, we could decide ourselves whether the file is still good enough.
But we can't do that, so the option is to switch to Linux and use a procedure that tries to read the file regardless of issues with the table of fragments.
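On Linux, the "read it anyway" part is typically just dd with conv=noerror,sync, which keeps going after read errors and pads unreadable blocks with zeros. The paths below are placeholders (e.g. assuming the VMFS volume is mounted read-only with vmfs-fuse from the vmfs-tools package):

```shell
#!/bin/sh
# Sketch: copy a file on Linux even if parts of it are unreadable.
# conv=noerror keeps dd going after read errors; conv=sync pads the
# failed blocks with zeros so the offsets of later data stay intact.
# SRC/DST are placeholder paths.
SRC="/mnt/vmfs/SRV-FILE/SRV-FILE_2-flat.vmdk"
DST="/recovery/SRV-FILE_2-flat.vmdk"

dd if="$SRC" of="$DST" bs=1048576 conv=noerror,sync
```

Note that with conv=sync a final partial block gets padded up to the block size, so the copy can be slightly larger than the source.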
Real life analogy:
A mother is cooking a meal while she notices that some ingredients are missing.
So she takes a piece of paper and writes a list:
- 1liter milk
- garlic
- 3 sausages
- 12 potatoes
- 5 noodles
Accidentally, a drop of coffee finds its way onto that sheet of paper, and the number 12 in front of potatoes becomes unreadable.
Then she sends ESXi to the shop to buy the items.
ESXi comes back with nothing at all and excuses himself with: the shopping list is badly formatted.
Next she sends Linux and he comes back with all the stuff.
As he also did not know the number of potatoes, he may bring no potatoes or a hundred ...
Thanks for your patience and the explanation. This brings me back to our original problem.
Again and again this error comes up on the same virtual disk of the same server. We moved the server to another host, moved the disk to another datastore, and even created another virtual disk and moved all data to the new one.
You get Bad file descriptor for a cbt vmdk?
Then discard the cbt vmdks.
Consolidate snapshots and create a new full backup.
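Resetting CBT boils down to flipping the ctkEnabled flags in the vmx while the VM is powered off (together with deleting the -ctk.vmdk files); a sketch with a placeholder vmx path:

```shell
#!/bin/sh
# Sketch: disable the CBT flags in a vmx file (VM must be powered off).
# VMX is a placeholder path. Power the VM back on, re-enable CBT (or
# let Veeam do it), and the tracking files are recreated from scratch.
VMX="/vmfs/volumes/datastore1/SRV-FILE/SRV-FILE.vmx"

# One substitution covers both the global ctkEnabled flag and the
# per-disk scsiX:Y.ctkEnabled lines.
sed -i 's/ctkEnabled = "TRUE"/ctkEnabled = "FALSE"/' "$VMX"
```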
But what can cause this error? The problem has occurred four times since October. The machine runs for weeks, but then it suddenly happens - every time with this vmdk.