Hi all,
we have a VMware farm with six ESXi 6.5.0 hosts, build 9298722 (5 x ProLiant BL460c Gen9 and 1 x ProLiant DL380 Gen10).
In October we had a problem with two virtual machines that got stuck while consolidating snapshots.
VMware sends this message:
ServerXXX on VCenterXXX needs snapshot consolidation, but all automatic snapshot consolidation attempts have failed.
Most likely reason is a virtual disk being locked by some external process. Please troubleshoot the locking issue, and initiate snapshot consolidation manually in vSphere Client.
Veeam Backup & Replication 9.5.4.2866
Consolidating snapshots constantly failed on one disk of both machines. One of the machines was the file server, and trying to consolidate disconnected the affected disk. We tried to completely restore the faulty disk from Veeam, but after the next backup the disk got the same error. Our workaround was a completely new virtual disk and a copy of the data from the old disk to the new one. After that the machines ran smoothly, until Saturday two weeks ago the same problem occurred again.
Attached you find the VMware log of one of the machines. We tried different things, but now we are clueless.
If you need any further information, don't hesitate to contact me.
Hi,
try to follow the resolution from this KB
ARomeo
But if it is the SAN Transport Virtual Disk API, shouldn't the problem then occur on every system? The problem seems to be pinned to these two machines, and it occurs when the backup process is ending and Veeam tries to consolidate the snapshot.
And Veeam Backup is THE global player in backing up VMware. Why is this problem still not solved if it is a problem in Veeam?
Hi,
From the log file you can see that Veeam is locking the disk (as you have highlighted). I guess you already tried to stop the backup from Veeam, restart the problem server, and manually consolidate the snapshot.
If you can, try to do a Storage vMotion of the VM and then consolidate the snapshot.
ARomeo
When the Veeam job failed, we tried to restart the machine and do a complete vMotion to another host and another storage. A Storage vMotion won't work: it fails with the message that the disk is locked.
Hi,
so you are in this case here:
KB1775: Removing Stuck VMDK’s from the Veeam Proxy
ARomeo
In the log file, the first entry showing the problem is here, I think:
2020-01-30T20:09:17.301Z| Worker#0| I125: OBJLIB-FILEBE : FileBEIoctl: ioctl operation IOCTLCMD_VMFS_DELTADISKS(3033) failed on '/vmfs/devices/cbt/c0c249b6-cbt' : Bad file descriptor (589826)
If I read the log correctly, the backup is completed at that point and Veeam gave VMware the command to consolidate the snapshot, but it fails. I don't know why, or where this message has its source.
Bad file descriptor is an error message that apparently means that the VMFS metadata for at least one fragment of the file is missing, bad, or inaccessible.
As you got this error for a cbt file, you may get away with discarding all CBT vmdks.
If this works you need to create a full backup next time you run Veeam.
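Discarding the CBT sidecars can be sketched like this; the datastore path is a placeholder, and the VM should be powered off while you do it:

```shell
#!/bin/sh
# Sketch: discard the CBT sidecar files (*-ctk.vmdk) of a powered-off VM.
# VMDIR is a placeholder -- substitute your real VM directory.
VMDIR="/vmfs/volumes/datastore1/SRV-FILE"

# Review the sidecar files first ...
find "$VMDIR" -maxdepth 1 -name '*-ctk.vmdk' -print

# ... then remove them. Valid tracking files are recreated on the next
# power-on with CBT enabled, and Veeam then needs a new full backup.
find "$VMDIR" -maxdepth 1 -name '*-ctk.vmdk' -exec rm -f {} \;
```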
What can cause this problem with the VMFS metadata?
Today we restored both faulty machines onto a completely new datastore. We hope that the backup this evening will run without an error. But we still want to know the source of the problem.
I see this error with (not listed according to probability!):
- the disk is in a physically bad state - dying disks
- you accidentally filled up the datastore once
- power failure
- I/O error on the disk (can be caused by bad disks AND by VMFS corruption)
For files with valuable content I try to recover by creating a dd script for each fragment - that sometimes helps, or is at least good enough.
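The per-fragment dd idea can be sketched as follows; the device path and the offset/length table are made-up placeholders, since a real fragment map has to be pulled from the VMFS metadata of the damaged file:

```shell
#!/bin/sh
# Sketch: rebuild a file fragment by fragment with dd.
# DEVICE and the offset/length table below are placeholders -- the real
# fragment map must come from the VMFS metadata of the damaged file.
DEVICE="/dev/sdb"
OUT="recovered-flat.vmdk"
: > "$OUT"

# Each input line: byte offset and byte length of one fragment,
# both assumed to be multiples of 1 MiB here.
while read -r OFFSET LENGTH; do
    dd if="$DEVICE" of="$OUT" bs=1048576 \
       skip=$((OFFSET / 1048576)) count=$((LENGTH / 1048576)) \
       oflag=append conv=notrunc
done <<EOF
1048576 2097152
5242880 1048576
EOF
```

Each fragment is appended in order, so the output file ends up as the concatenation of the listed ranges.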
When I encounter Bad file descriptor for cbt files I usually don't try to recover the cbt vmdk but just reset the change block tracking.
Can someone explain to me what is happening at this point?
2020-02-20T15:14:32.188Z| Worker#0| I125: DISKLIB-VMFS : "/vmfs/volumes/5e367362-5487fa49-fa80-6cc2173ba4d0/SRV-FILE/SRV-FILE_2-000001-sesparse.vmdk" : open successful (10) size = 12636745728, hd = 16734001. Type 19
2020-02-20T15:14:32.188Z| Worker#0| I125: DISKLIB-DSCPTR: Opened [0]: "SRV-FILE_2-000001-sesparse.vmdk" (0xa)
2020-02-20T15:14:32.188Z| Worker#0| I125: DISKLIB-LINK : Opened '/vmfs/volumes/5e367362-5487fa49-fa80-6cc2173ba4d0/SRV-FILE/SRV-FILE_2-000001.vmdk' (0xa): seSparse, 5872025600 sectors / 2.7 TB.
2020-02-20T15:14:32.189Z| Worker#0| I125: DISKLIB-VMFS : "/vmfs/volumes/5e367362-5487fa49-fa80-6cc2173ba4d0/SRV-FILE/SRV-FILE_2-flat.vmdk" : open successful (14) size = 3006477107200, hd = 33969992. Type 3
2020-02-20T15:14:32.189Z| Worker#0| I125: DISKLIB-DSCPTR: Opened [0]: "SRV-FILE_2-flat.vmdk" (0xe)
2020-02-20T15:14:32.189Z| Worker#0| I125: DISKLIB-LINK : Opened '/vmfs/volumes/5e367362-5487fa49-fa80-6cc2173ba4d0/SRV-FILE/SRV-FILE_2.vmdk' (0xe): vmfs, 5872025600 sectors / 2.7 TB.
2020-02-20T15:14:32.189Z| Worker#0| I125: DISKLIB-CHAINESX : ChainESXOpenSubChain:(0) fid = 33969992, extentType = 2
2020-02-20T15:14:32.189Z| Worker#0| I125: DISKLIB-CHAINESX : ChainESXOpenSubChain:(1) fid = 16734001, extentType = 1
2020-02-20T15:14:32.190Z| Worker#0| I125: DISKLIB-LIB_BLOCKTRACK : Resuming change tracking.
2020-02-20T15:14:32.191Z| Worker#0| I125: DISKLIB-CTK : Could not open change tracking file "/vmfs/volumes/5e367362-5487fa49-fa80-6cc2173ba4d0/SRV-FILE/SRV-FILE_2-000001-ctk.vmdk": Change tracking invalid or disk in use.
2020-02-20T15:14:32.191Z| Worker#0| I125: DISKLIB-CTK : Change tracking invalid; reinitializing.
2020-02-20T15:14:32.191Z| Worker#0| I125: DISKLIB-CTK : Auto blocksize for size 5872025600 is 4096.
2020-02-20T15:14:32.192Z| Worker#0| I125: OBJLIB-FILEBE : Error creating file '/vmfs/volumes/5e367362-5487fa49-fa80-6cc2173ba4d0/SRV-FILE/SRV-FILE_2-000001-ctk.vmdk': 3 (The file already exists).
2020-02-20T15:14:32.192Z| Worker#0| I125: DISKLIB-CBT : Initializing ESX kernel change tracking for fid 20404041.
2020-02-20T15:14:32.193Z| Worker#0| I125: DISKLIB-CBT : Successfuly created cbt node 1375749-cbt.
2020-02-20T15:14:32.193Z| Worker#0| I125: DISKLIB-CBT : Opening cbt node /vmfs/devices/cbt/1375749-cbt
2020-02-20T15:14:34.938Z| Worker#0| I125: OBJLIB-FILEBE : FileBEIoctl: ioctl operation IOCTLCMD_VMFS_DELTADISKS(3033) failed on '/vmfs/devices/cbt/1375749-cbt' : Bad file descriptor (589826)
2020-02-20T15:14:34.938Z| Worker#0| I125: DISKLIB-VMFS :VmfsSparseExtentCommonGetAllocatedSectorChunks: ObjLib_Ioctl failed 0x90009
2020-02-20T15:14:34.938Z| Worker#0| I125: DISKLIB-LIB_MISC : DiskLibGetAllocatedSectorChunksInRangeInt: failed to get allocated sector bitmap with 'Bad file descriptor' (589833).
2020-02-20T15:14:34.938Z| Worker#0| W115: DISKLIB-CBT : ChangeTrackerESX_MarkAllUsedAreas: Failed to get allocated sectors: Bad file descriptor.
2020-02-20T15:14:34.938Z| Worker#0| I125: DISKLIB-LIB : Opened "/vmfs/volumes/5e367362-5487fa49-fa80-6cc2173ba4d0/SRV-FILE/SRV-FILE_2-000001.vmdk" (flags 0xa, type vmfs).
2020-02-20T15:14:34.938Z| Worker#0| I125: DISK: Disk '/vmfs/volumes/5e367362-5487fa49-fa80-6cc2173ba4d0/SRV-FILE/SRV-FILE_2-000001.vmdk' has UUID '60 00 c2 9f fc ea ac c5-17 3d 67 51 a7 90 97 0c'
2020-02-20T15:14:34.938Z| Worker#0| I125: DISK: OPEN '/vmfs/volumes/5e367362-5487fa49-fa80-6cc2173ba4d0/SRV-FILE/SRV-FILE_2-000001.vmdk' Geo (365516/255/63) BIOS Geo (0/0/0)
2020-02-20T15:14:34.951Z| Worker#0| I125: UTIL: Change file descriptor limit from soft 4096,hard 4096 to soft 4211,hard 4211.
Bad file descriptor means that you need to switch to Linux if you want to continue using that file.
With the client OS? What does the client OS have to do with that?
Not the guest OS, of course.
I am talking about the host OS.
Bad file descriptor = ESXi tells you that he does not want to read the specified file.
Either give up that file or try to read it with Linux.
When a change block tracking vmdk is affected, you simply discard the cbt vmdks.
Nobody tries to recover cbt vmdks - it makes no sense ...
What does Bad file descriptor mean?
To read a file you need to look up its properties like size, name, permissions, owner, type and so on.
That info is written into the inode; for every file there is an inode.
Once you have that info you also need to look up the location of every fragment of the file on disk.
Here ESXi is very picky.
He checks the list of fragments, and if that list is not 100% OK, ESXi rather decides not to read the file at all than to risk a small error.
So Bad file descriptor is ESXi's way of telling you: I know the file you want to have, but something may be wrong with it.
So I decided to give you nothing at all.
If we could tell ESXi to be less stubborn, we could decide ourselves whether the file is still good enough.
But we can't do that, so the option is to switch to Linux and use a procedure that tries to read the file regardless of issues with the table of fragments.
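On Linux, the "read it anyway" part is typically just dd with conv=noerror,sync, which keeps going after read errors and pads unreadable blocks with zeros. The paths below are placeholders (e.g. assuming the VMFS volume is mounted read-only with vmfs-fuse from the vmfs-tools package):

```shell
#!/bin/sh
# Sketch: copy a file on Linux even if parts of it are unreadable.
# conv=noerror keeps dd going after read errors; conv=sync pads the
# failed blocks with zeros so the offsets of later data stay intact.
# SRC/DST are placeholder paths.
SRC="/mnt/vmfs/SRV-FILE/SRV-FILE_2-flat.vmdk"
DST="/recovery/SRV-FILE_2-flat.vmdk"

dd if="$SRC" of="$DST" bs=1048576 conv=noerror,sync
```

Note that with conv=sync a final partial block gets padded up to the block size, so the copy can be slightly larger than the source.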
Real life analogy:
A mother is cooking a meal while she notices that some ingredients are missing.
So she takes a piece of paper and writes a list:
- 1liter milk
- garlic
- 3 sausages
- 12 potatoes
- 5 noodles
Accidentally, a drop of coffee finds its way onto that sheet of paper, and the number 12 in front of potatoes becomes unreadable.
Then she sends ESXi to the shop to buy the items.
ESXi comes back with nothing at all and excuses himself with: the shopping list is badly formatted.
Next she sends Linux and he comes back with all the stuff.
As he also did not know the number of potatoes, he may bring no potatoes or a hundred ...
Thanks for your patience and the explanation. This brings me back to our original problem.
Again and again this error comes up on the same virtual disk of the same server. We moved the server to another host, moved the disk to another datastore, and even created another virtual disk and moved all data to the new one.
You get Bad file descriptor for a cbt vmdk?
Then discard the cbt vmdks.
Consolidate snapshots and create a new full backup.
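Resetting CBT boils down to flipping the ctkEnabled flags in the vmx while the VM is powered off (together with deleting the -ctk.vmdk files); a sketch with a placeholder vmx path:

```shell
#!/bin/sh
# Sketch: disable the CBT flags in a vmx file (VM must be powered off).
# VMX is a placeholder path. Power the VM back on, re-enable CBT (or
# let Veeam do it), and the tracking files are recreated from scratch.
VMX="/vmfs/volumes/datastore1/SRV-FILE/SRV-FILE.vmx"

# One substitution covers both the global ctkEnabled flag and the
# per-disk scsiX:Y.ctkEnabled lines.
sed -i 's/ctkEnabled = "TRUE"/ctkEnabled = "FALSE"/' "$VMX"
```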
But what can cause this error? The problem has occurred four times since October. The machine runs for weeks, but then it suddenly happens - every time with this vmdk.