We are running ESXi 7.0 Update 2 on a Dell server. We have a hardware RAID10 datastore which ESXi says has a reported capacity of 14.43 TB, of which 9.79 TB is provisioned and 4.64 TB is free.
This datastore contains a single VM (used for making backups) with two virtual disks. Disk #1 is 16GB thick provisioned and disk #2 is 14TB thin provisioned. The VM reports under "resource consumption" that 9.56TB of the provisioned 14.02TB is used.
So plenty of space on the datastore, right? Wrong. Every day, within a pretty specific time frame when this VM makes new backups, the VM locks up completely with this error message/question:
"There is no more space for virtual disk 'xxxxx_1.vmdk'. You might be able to continue this session by freeing disk space on the relevant volume, and clicking Retry. Click Cancel to terminate this session."
If you click "Retry" the VM comes back up again for a while, and the question will reappear - sometimes immediately, sometimes later. If you do nothing or are not around to click anything, the VM will come back online and unfreeze itself after a few seconds or minutes as well. Once the backups are finished (and the VM is no longer writing anything to its virtual disk) the problem stops.
I have read other posts that suggest the problem might have to do with thin provisioning, and that the virtual disk commitment might in theory be too big for the datastore. But it still seems very buggy that I would be allowed to overcommit with a thin provisioned disk, and that this error is reported with an actual 4.64TB free on the datastore. Also, I don't think it's really overcommitted: the datastore should be 14.43TB, while the provisioned space of the VM is 14.02TB.
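For what it's worth, the back-of-envelope math (using the UI numbers above, whatever flavor of "TB" the ESXi UI means by them) agrees that it should not be overcommitted:

```shell
# Worst case: every thin disk fully inflates to its provisioned size.
# Numbers are the ones the ESXi UI reports for this datastore/VM.
capacity_tb="14.43"
provisioned_tb="14.02"
awk -v cap="$capacity_tb" -v prov="$provisioned_tb" 'BEGIN {
    headroom = cap - prov
    printf "headroom if all disks fully inflate: %.2f TB\n", headroom
    exit (headroom < 0)   # non-zero exit would mean genuine overcommit
}'
```

So even fully inflated, the provisioned disks leave about 0.41 TB of headroom (ignoring VMFS metadata and swap files, which are much smaller than that).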
This started happening immediately after the setup of this VM and ESXi host which had brand new disks. So I'm kind of doubting it's a hardware issue.
Since this VM is too large to move, my only option is to start over with a thick provisioned disk and see what happens, unless on the off chance somebody can solve this issue for us.
Also posting this in case it may help somebody else dealing with this issue in the future, because I have a suspicion this is actually a bug in ESXi itself.
It may be interesting to see whether the VM's vmware.log contains more details that help explain what's causing this issue.
André
Be very careful when answering that message box.
With thin provisioned or sesparse vmdks, either answer can end in a corrupted vmdk.
Please show the vmware.log and the vmkernel.log for more details.
Ulli
As requested, I attached the vmware.log for the VM and the vmkernel.log.
I concentrated on the period from May 11th ~ 20:00 UTC until about May 12th 02:30 UTC. This is the time frame in which the errors occur. After that it's mostly crickets in the logs because the VM and ESX host are pretty much idling.
Thanks, replied with the logs! They clearly show the VM is offline for about 4 minutes when nobody answers the question, and that the dialog times out after that.
I don't know if it shows much else. The VMkernel seems to think there really is no space, but "df -h" shows:
# df -h
Filesystem Size Used Available Use% Mounted on
VMFS-6 14.4T 9.9T 4.5T 69% /vmfs/volumes/datastore1-ssd-raid10
VMFS-6 1.8T 1.7T 96.4G 95% /vmfs/volumes/datastore3-nvme-bare
VMFS-6 1.8T 1.8T 37.6G 98% /vmfs/volumes/datastore2-nvme-bare
VMFS-L 119.8G 4.6G 115.2G 4% /vmfs/volumes/OSDATA-606e0d61-c556fc30-c3a2-ecf4bbf16b84
vfat 4.0G 201.6M 3.8G 5% /vmfs/volumes/BOOTBANK1
vfat 4.0G 199.2M 3.8G 5% /vmfs/volumes/BOOTBANK2
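One thing worth noting is that this `df` was taken after the fact. Something I may try: sampling free space during the backup window itself, in case it transiently dips to zero while the thin disk grows. A rough sketch (the path is this datastore; sample count and interval are toy values for illustration - for the real window you'd want something like SAMPLES=480 INTERVAL=60):

```shell
# Log the datastore's available space once per interval; a transient dip
# to zero during the backups would point at real (if short-lived) exhaustion.
DATASTORE="${DATASTORE:-/vmfs/volumes/datastore1-ssd-raid10}"
SAMPLES="${SAMPLES:-3}"
INTERVAL="${INTERVAL:-1}"
i=0
while [ "$i" -lt "$SAMPLES" ]; do
    # $4 is the "Available" column of `df -h <mountpoint>`
    free=$(df -h "$DATASTORE" 2>/dev/null | awk 'NR==2 {print $4}')
    echo "$(date -u '+%Y-%m-%d %H:%M:%S') free=${free:-unknown}"
    i=$((i+1))
    sleep "$INTERVAL"
done
```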
Hi,
Have you tried the command from VMware KB 1007638 to check whether you are running out of free inodes there?
" stat -f /vmfs/volumes/datastore1-ssd-raid10 "
You should see output similar to:
File: "/"
ID: 0 Namelen: 255 Type: ext2/ext3
Blocks: Total: 1259079 Free: 898253 Available: 834295 Size: 4096
Inodes: Total: 640000 Free: 580065
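If you want to script that check, a rough helper along these lines could work (the 5% threshold is just a suggestion, and the output field positions assume the `stat -f` layout shown above):

```shell
# Pull the inode totals out of `stat -f` and warn when few inodes are free.
check_inodes() {
    stat -f "$1" | awk -v path="$1" '/Inodes/ {
        total = $3; free = $5            # "Inodes: Total: N Free: M"
        pct = (total > 0) ? free * 100 / total : 0
        printf "%s: %.0f of %.0f inodes free (%.1f%%)\n", path, free, total, pct
        if (pct < 5) print "WARNING: inode exhaustion likely"
    }'
}
check_inodes /
```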
Your log files show that the datastore is running out of space!
Please show a file listing of the affected VM's directory so that we can think about your options.
Ulli
Good tip, didn't even think of it! Here is the output:
# stat -f /
File: "/"
ID: 100000000 Namelen: 127 Type: visorfs
Block size: 4096
Blocks: Total: 1127155 Free: 917028 Available: 917028
Inodes: Total: 655360 Free: 647438
# stat -f /vmfs/volumes/datastore1-ssd-raid10/
File: "/vmfs/volumes/datastore1-ssd-raid10/"
ID: ff5b3d5beecf7490 Namelen: 127 Type: vmfs
Block size: 1048576
Blocks: Total: 15128320 Free: 4754339 Available: 4754339
Inodes: Total: 2147483647 Free: 2147483647
Not quite sure why it says that every single inode is free on the datastore, but yeah at least they are not exhausted.
If a VMFS 6 volume runs out of free inodes, it will expand the metafiles - in particular the file .sbc.sf.
It does this without any mercy, which results in a fragmented .sbc.sf - and if it happens too often, sooner or later the datastore browser can no longer keep up and will fail to enumerate the files in a directory.
Try
vmkfstools -P -v10 /vmfs/volumes/datastore-name
to see the current state.
There you go! I replaced the VM name with 'xxxxx' in the output because it contains an FQDN identifying the company.
# ls -la /vmfs/volumes/datastore1-ssd-raid10/
total 2216064
drwxr-xr-t 1 root root 73728 Nov 5 2021 .
drwxr-xr-x 1 root root 512 May 12 14:55 ..
-r-------- 1 root root 257261568 Apr 7 2021 .fbb.sf
-r-------- 1 root root 134807552 Apr 7 2021 .fdc.sf
-r-------- 1 root root 268632064 Apr 7 2021 .jbc.sf
-r-------- 1 root root 16908288 Apr 7 2021 .pb2.sf
-r-------- 1 root root 65536 Apr 7 2021 .pbc.sf
-r-------- 1 root root 1577910272 Apr 7 2021 .sbc.sf
drwx------ 1 root root 69632 Apr 7 2021 .sdd.sf
-r-------- 1 root root 7340032 Apr 7 2021 .vh.sf
drwxr-xr-x 1 root root 77824 Jan 11 23:36 xxxxx
# ls -la /vmfs/volumes/datastore1-ssd-raid10/xxxxx/
total 10620741952
drwxr-xr-x 1 root root 77824 Jan 11 23:36 .
drwxr-xr-t 1 root root 73728 Nov 5 2021 ..
-rw------- 1 root root 0 Dec 20 22:37 xxxxx-0fe6471e.vswp
-rw------- 1 root root 17179869184 May 12 14:55 xxxxx-flat.vmdk
-rw------- 1 root root 8684 May 11 02:34 xxxxx.nvram
-rw------- 1 root root 535 Jan 11 23:36 xxxxx.vmdk
-rw-r--r-- 1 root root 0 Apr 12 2021 xxxxx.vmsd
-rwxr-xr-x 1 root root 4556 Jan 31 08:42 xxxxx.vmx
-rw------- 1 root root 0 Dec 20 22:37 xxxxx.vmx.lck
-rw------- 1 root root 150 May 11 01:46 xxxxx.vmxf
-rwxr-xr-x 1 root root 4556 Jan 31 08:42 xxxxx.vmx~
-rw------- 1 root root 15393162788864 May 12 14:30 xxxxx_1-flat.vmdk
-rw------- 1 root root 543 Jan 11 23:36 xxxxx_1.vmdk
-rw-r--r-- 1 root root 186929 Nov 5 2021 vmware-34.log
-rw-r--r-- 1 root root 186219 Nov 5 2021 vmware-35.log
-rw-r--r-- 1 root root 187332 Nov 5 2021 vmware-36.log
-rw-r--r-- 1 root root 187081 Nov 5 2021 vmware-37.log
-rw-r--r-- 1 root root 238229 Nov 5 2021 vmware-38.log
-rw-r--r-- 1 root root 489908 Dec 20 22:15 vmware-39.log
-rw-r--r-- 1 root root 1972760 May 12 14:25 vmware.log
-rw------- 1 root root 121634816 Dec 20 22:37 vmx-xxxxx-635e6b16f8f79f51adce7b899475c1ca2d52ba0e-1.vswp
# ls -lah /vmfs/volumes/datastore1-ssd-raid10/xxxxx/
total 10620741952
drwxr-xr-x 1 root root 76.0K Jan 11 23:36 .
drwxr-xr-t 1 root root 72.0K Nov 5 2021 ..
-rw------- 1 root root 0 Dec 20 22:37 xxxxx-0fe6471e.vswp
-rw------- 1 root root 16.0G May 12 14:55 xxxxx-flat.vmdk
-rw------- 1 root root 8.5K May 11 02:34 xxxxx.nvram
-rw------- 1 root root 535 Jan 11 23:36 xxxxx.vmdk
-rw-r--r-- 1 root root 0 Apr 12 2021 xxxxx.vmsd
-rwxr-xr-x 1 root root 4.4K Jan 31 08:42 xxxxx.vmx
-rw------- 1 root root 0 Dec 20 22:37 xxxxx.vmx.lck
-rw------- 1 root root 150 May 11 01:46 xxxxx.vmxf
-rwxr-xr-x 1 root root 4.4K Jan 31 08:42 xxxxx.vmx~
-rw------- 1 root root 14.0T May 12 14:30 xxxxx_1-flat.vmdk
-rw------- 1 root root 543 Jan 11 23:36 xxxxx_1.vmdk
-rw-r--r-- 1 root root 182.5K Nov 5 2021 vmware-34.log
-rw-r--r-- 1 root root 181.9K Nov 5 2021 vmware-35.log
-rw-r--r-- 1 root root 182.9K Nov 5 2021 vmware-36.log
-rw-r--r-- 1 root root 182.7K Nov 5 2021 vmware-37.log
-rw-r--r-- 1 root root 232.6K Nov 5 2021 vmware-38.log
-rw-r--r-- 1 root root 478.4K Dec 20 22:15 vmware-39.log
-rw-r--r-- 1 root root 1.9M May 12 14:25 vmware.log
-rw------- 1 root root 116.0M Dec 20 22:37 vmx-xxxxx-635e6b16f8f79f51adce7b899475c1ca2d52ba0e-1.vswp
# df -h
Filesystem Size Used Available Use% Mounted on
VMFS-6 14.4T 9.9T 4.5T 69% /vmfs/volumes/datastore1-ssd-raid10
VMFS-6 1.8T 1.7T 96.4G 95% /vmfs/volumes/datastore3-nvme-bare
VMFS-6 1.8T 1.8T 37.6G 98% /vmfs/volumes/datastore2-nvme-bare
VMFS-L 119.8G 4.6G 115.2G 4% /vmfs/volumes/OSDATA-606e0d61-c556fc30-c3a2-ecf4bbf16b84
vfat 4.0G 201.6M 3.8G 5% /vmfs/volumes/BOOTBANK1
vfat 4.0G 199.2M 3.8G 5% /vmfs/volumes/BOOTBANK2
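Side note on reading those listings: for a thin disk, the size `ls -l` shows for the -flat.vmdk (14.0T here) is the provisioned size; the actual allocation only shows up via `du` or the datastore's used space. Same idea as an ordinary sparse file - a quick demo on a throwaway temp file, not the real datastore:

```shell
# Create a sparse file with a 1 GiB apparent size but ~0 allocated blocks.
f=$(mktemp)
dd if=/dev/zero of="$f" bs=1 count=0 seek=1G 2>/dev/null
ls -l "$f" | awk '{print "apparent size: " $5 " bytes"}'
du -k "$f"  | awk '{print "allocated:     " $1 " KiB"}'
rm -f "$f"
```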
Output from vmkfstools command. Unsure how to interpret this at first glance:
# vmkfstools -P -v10 /vmfs/volumes/datastore1-ssd-raid10/
VMFS-6.82 (Raw Major Version: 24) file system spanning 1 partitions.
File system label (if any): datastore1-ssd-raid10
Mode: public
Capacity 15863193272320 (15128320 file blocks * 1048576), 4985285771264 (4754339 blocks) avail, max supported file size 70368744177664
Volume Creation Time: Wed Apr 7 19:52:01 2021
Files (max/free): 16384/16355
Ptr Blocks (max/free): 0/0
Sub Blocks (max/free): 24064/22264
Secondary Ptr Blocks (max/free): 256/255
File Blocks (overcommit/used/overcommit %): 0/10373981/0
Ptr Blocks (overcommit/used/overcommit %): 0/0/0
Sub Blocks (overcommit/used/overcommit %): 0/1800/0
Large File Blocks (total/used/file block clusters): 29548/0/29548
Volume Metadata size: 2262925312
Disk Block Size: 512/512/0
UUID: 606e0d61-d4dfee10-c4c5-ecf4bbf16b84
Logical device: 606e0d61-cfbf898e-4eff-ecf4bbf16b84
Partitions spanned (on "lvm"):
naa.6b8ca3a0f9e39a002800c59e22cf946f:8
Unable to connect to vaai-nasd socket [No such file or directory]
Is Native Snapshot Capable: NO
OBJLIB-LIB: ObjLib cleanup done.
WORKER: asyncOps=0 maxActiveOps=0 maxPending=0 maxCompleted=0
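One sanity check on those numbers (using the 1048576-byte VMFS-6 file block size shown in the output above) - they line up with what `df` reports:

```shell
# Convert the vmkfstools block counts to TiB.
awk 'BEGIN {
    bs = 1048576                             # VMFS-6 file block size in bytes
    printf "capacity: %.2f TiB\n", 15128320 * bs / 1024^4
    printf "free:     %.2f TiB\n", 4754339 * bs / 1024^4
}'
```

So ~14.43 TiB capacity and ~4.53 TiB free, and the inode/sub-block/pointer-block counters all look far from exhausted.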
Do you really need the full 14 TB for xxxxx_1-flat.vmdk?
If reducing the partition size is an option and you could manage to free up some space at the end of the disk, we could consider taking a scissor and cutting the vmdk off at the end.
Thanks for the help by the way, appreciate it!
You are correct that we don't need the full 14TB, and at some earlier point I wanted to shrink the disk to 13TB to maybe solve this issue, in case it was due to overcommitment of virtual disk space. I managed to shrink the ext4 filesystem in the VM to 13TB (see output below), but then it came down to using a hack (editing the vmdk by hand) to shrink the disk, and I chickened out. I can't risk breaking this VM if I don't have to, because we regularly use the data on it, and I don't really know what I'm doing when editing the vmdk.
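For context on what that hand-edit would involve (shown only to make the discussion concrete - do not actually do this to a disk you care about): the vmdk descriptor stores the disk size as a sector count on its extent line, e.g. `RW 30064771072 VMFS "xxxxx_1-flat.vmdk"` for the current 14 TiB. Shrinking to just cover the 13 TiB partition would mean recalculating that count, and since the disk is GPT, leaving room for the backup GPT at the end (34 sectors is the usual reserved size - an assumption here):

```shell
# Minimum sector count for a shrunken disk that still covers /dev/sdb1
# and a relocated backup GPT. Arithmetic only - nothing is modified.
awk 'BEGIN {
    last_used  = 27917289471   # last sector of /dev/sdb1 (fdisk output)
    gpt_backup = 34            # sectors typically reserved for the backup GPT
    printf "minimum new sector count: %.0f\n", last_used + 1 + gpt_backup
}'
```

Even with the right number, the backup GPT would still have to be rewritten (e.g. with gdisk) after truncating the flat file, which is exactly the unsupported, easy-to-get-wrong part.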
By the way, I'm already putting together a plan to stop using this VM, delete it, and start over. This post was a last effort to maybe save myself that trouble and, of course, to understand what the issue is.
Output fdisk within VM:
Disk /dev/sdb: 14 TiB, 15393162788864 bytes, 30064771072 sectors
Disk model: Virtual disk
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: DDF7CC2F-8DD8-4911-8602-58BAD05A5389
Device Start End Sectors Size Type
/dev/sdb1 2048 27917289471 27917287424 13T Linux filesystem
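The sector count from that fdisk output confirms the new partition size:

```shell
# /dev/sdb1 spans 27917287424 sectors of 512 bytes.
awk 'BEGIN {
    sectors = 27917287424
    bytes = sectors * 512
    printf "partition: %.0f bytes = %.0f GiB = %.2f TiB\n",
           bytes, bytes / 1024^3, bytes / 1024^4
}'
```

So exactly 13 TiB (13312 GiB), down from the disk's 14 TiB.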
Don't use fdisk for GPT disks - use gdisk.
Can you run gparted inside the VM to resize the partition?
Currently the partition has a size of 13312 GB.
Let me add that, at the very least, I will NOT use thin provisioning ever again; I assume the issue is related to that. I also regret making a 14TB VM like this, because its size means I can't migrate it anywhere, which limits my options. I might as well have made this system bare metal (skipping ESX) or used smaller virtual disks and multiple VMs. But that is a side note and doesn't help the issue! 🙂
Yes, the partition and ext4 filesystem are already 13TB. But the VMDK remains at 14TB, and shrinking a virtual disk is, as far as I know, not officially supported in ESXi.
I can cut the 14 TB vmdk - but I would like to see the disk in gparted first. According to the fdisk output the partition now has a size of 13312 GB.
I cut vmdks with dd, and that is quite destructive - which is why I want to verify the layout in gparted before touching anything.
Use a live CD if your VM has no X.
I strongly believe that during the backup jobs you are running out of available inodes. Follow this KB step by step; it may fix your issue: https://kb.vmware.com/s/article/1007638
Try to find out whether there have recently been more small files than usual, such as files ending in .txt or .dmp, as well as other files on this datastore. If they are present in large numbers, try to clean up or move old files, including any unregistered VM files, and then check the available inodes again.
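A quick way to run that hunt (hypothetical helper; adjust the path and the size threshold to taste):

```shell
# Count files under a size threshold in a directory tree.
count_small_files() {    # usage: count_small_files <dir> <find -size arg, e.g. 1024k>
    find "$1" -type f -size "-$2" 2>/dev/null | wc -l
}
# Demo on a throwaway directory standing in for the datastore path:
d=$(mktemp -d)
printf 'x' > "$d/a.txt"
printf 'x' > "$d/b.dmp"
count_small_files "$d" 1024k
rm -rf "$d"
```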
Thanks for the offer, but I can't risk it - I would have a hard time justifying the move if it somehow went wrong. I'd rather just set up a new backup system and rotate this one out until I can delete it. Also, I guess it wouldn't really resolve the underlying issue, namely that this is perhaps actually a bug in ESXi?!