VMware Cloud Community
srodenburg
Expert

Major issue with creating snapshots: Insufficient disk space on datastore

Spoiler alert:  there is PLENTY of disk space available on the vsanDatastore...

Hello,

The VSAN system in our lab is running into more and more problems when it comes to creating snapshots. At the moment, there are several VMs for which it has become completely impossible to create snapshots at all.

It started out a few weeks ago when I noticed strange retries in our backup software (Veeam v8, latest version). It needed 1 or sometimes 2 retries because it could not create a snapshot due to the mentioned error.

Then, over time, it took 2 or 3 retries. So it would fail to create a snapshot once, twice, but the third time it would work just fine. Job retries are 5 minutes apart, so all of this happens within a timeframe of, say, 15 to 20 minutes.

At the moment, several VMs cannot be backed up anymore because it has become impossible to create snapshots. The reason is always the same: "Insufficient disk space on datastore", which is total bollocks as we have over 5 TB free.

At first, I thought it was due to a problem in Veeam, but as it turns out, it has nothing to do with Veeam because I can't create snapshots in the VI / Web Client either. The problem lies in vSphere and Veeam is merely a victim.

Looking at logfiles, I always see this error when trying to consolidate snapshots:  "An error occurred while consolidating disks: msg.disklib.NOSPACE"

The error "msg.disklib.NOSPACE" is a generic error which I also see when I simply want to make a snapshot.
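To see how often this generic error actually fires, you can count its occurrences in the VM's vmware.log. A minimal sketch, assuming a Linux-ish shell; the log path and the fabricated sample lines below are purely illustrative stand-ins for the real log in the VM's directory on the vsanDatastore:

```shell
# Stand-in for e.g. /vmfs/volumes/vsanDatastore/vm01/vmware.log
log="./vmware.log"

# Fabricated sample lines so the sketch runs end to end:
printf '%s\n' \
  'vmx| DISKLIB-LIB : Failed to create snapshot: msg.disklib.NOSPACE' \
  'vmx| SnapshotVMX_TakeSnapshot: an unrelated, normal log line' > "$log"

# Count how many times the NOSPACE error appears:
grep -c 'msg.disklib.NOSPACE' "$log"
```

Against a real vmware.log, the lines surrounding each hit usually name the exact operation (snapshot create vs. consolidate) that ran out of "space".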

This led me to the KB article VMware KB: Consolidating disks in VMware Virtual SAN fails with the error: An error occurred while c...

None of the conditions, like "When the disk being consolidated has more than 255GB data, consolidation fails.", apply to my environment.

Each and every blog-post or KB article I found has brought me nothing.

When I look inside such a VM's directory, there is carnage. A VM with two disks started out with the usual names like "vm01.vmdk" and "vm01_1.vmdk" (for the second disk). After a couple of days, it can look like "vm01.vmdk" and "vm01_17.vmdk", and there are a lot of snapshots lying around. It also leads to the usual "VM needs consolidation" messages, which always fail due to locking problems ("Consolidation failed for disk node 'scsi0:1': msg.fileio.lock").

Consolidating such a VM requires some manual work, but I can always get them back on track. The weird _17.vmdk names (or similar) I can fix by renaming the .vmdk and -flat.vmdk files back to _1 and editing the .vmdk descriptor file to change the _17 to _1, etc.
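For reference, a minimal sketch of that rename-and-edit repair, using the example names from this thread; the descriptor below is fabricated so the sketch runs end to end, and you'd run this against copies first. (On ESXi itself, `vmkfstools -E old.vmdk new.vmdk` should rename descriptor and -flat file together, which is safer than doing it by hand.)

```shell
# Fabricated minimal descriptor + empty flat file, purely for illustration:
cat > vm01_17.vmdk <<'EOF'
# Disk DescriptorFile
version=1
createType="vmfs"
RW 16777216 VMFS "vm01_17-flat.vmdk"
EOF
: > vm01_17-flat.vmdk

# Rename both files back to _1:
mv vm01_17-flat.vmdk vm01_1-flat.vmdk
mv vm01_17.vmdk vm01_1.vmdk

# Point the descriptor's extent line at the renamed flat file:
sed -i 's/vm01_17-flat\.vmdk/vm01_1-flat.vmdk/' vm01_1.vmdk

grep 'flat.vmdk' vm01_1.vmdk
```

After this, the descriptor's extent line references vm01_1-flat.vmdk again, matching the on-disk file name.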

So I can always "repair" a VM and we never have data loss. But it's a pain in the bum and it requires downtime for the VM in question.

I can repair a VM and it will be able to make snapshots just fine, but within 3 or 4 days it will get into trouble with the exact same problem. Sometimes I go as far as deleting the VM from the inventory, repairing the VMDK names (when it has gone cuckoo again and calls a disk _15.vmdk or whatever instead of the original _1.vmdk), creating a brand new VM and attaching the old VMDKs. That "new" VM will then be quiet and fine and dandy for a few days before the same shit starts happening all over again, getting worse over time.


I desperately need your help. The VSAN runs fine, but snapshots, and with that the backups, are getting more and more problematic as ever more VMs start suffering from this issue. I could move some low-I/O VMs to a cheap NFS NAS and presto, all problems are gone. Move them back, and they are fine for a few days before the misery starts all over again, getting worse over time. It IS a VSAN problem.

VMware is not going to help me. Reason: this is not a production environment and there is no data loss, so they don't care. Period. I would end up with a level-1 support employee who starts asking all the basic questions I am already way beyond, and they will never escalate and allocate level-2 personnel to my case, making the entire exercise frustrating, lengthy (weeks) and as useless as tits on a fish (pardon my French).

Help me, Obi-Wan Kenobi.

4 Replies
zdickinson
Expert

Good afternoon, my first thought would be the storage policies on a VM getting out of whack.  I believe if you have FTT = 1 and a host is down, you won't be able to take a snapshot, and you get a misleading error like "out of space".  This may or may not be the case, depending on your version of vSAN.

Maybe look into a VM being non-compliant at the storage-policy level: a snapshot is tried, re-tried, finally successful, but then it's falling all over itself.  Any chance the VMs in question have a policy compliance status of anything other than green?

Thank you, Zach.

srodenburg
Expert

Hi Zach,

Neeeh, that would be too simple. Come on, give it some effort 😉

All green everywhere. All 5 identical nodes are up and all VMs have storage policies as green as O Tannenbaum.

At the moment, one of the VMs I just "rebuilt" two days ago (throw it away but keep the two VMDKs, create a new VM, re-attach the VMDKs, etc.) has major snapshot-consolidation issues once again.

The reason is an old poltergeist: I have two Veeam proxies in this lab, and one of them has 7 disks hot-added to it, leftovers from last night's backup. All those attached disks are the same snapshot of the second VMDK of the VM I mentioned above.

Disk 1 = C: drive of the Veeam Proxy itself

Disk 2 = VLAN41DC01_1-000002.vmdk

Disk 3 = VLAN41DC01_1-000002.vmdk

Disk 4 = VLAN41DC01_1-000002.vmdk

Disk 5 = VLAN41DC01_1-000002.vmdk

Disk 6 = VLAN41DC01_1-000002.vmdk

Disk 7 = VLAN41DC01_1-000002.vmdk

Disk 8 = VLAN41DC01_1-000002.vmdk

Yes, you are not drunk. This proxy really attached the same snapshot 7 times. Something I've never seen Veeam do before. And it doesn't detach them anymore either, due to locking errors during the job.

My VSAN is such a mess by now. I migrated all 40-ish VMs from an FC array over to the VSAN, which ran wonderfully for the first few weeks. But in the last two weeks, this "snapshot disease" is spreading more and more. I'm on the latest version of bloody everything and the VSAN cluster is fully HCL compliant. WTF is going on!?

Any other ideas maybe?

srodenburg
Expert

Update:

I had to move a VM off the vsanDatastore onto a NAS to be able to repair it. After the repair, I fired up the VM and she ran fine. But when I wanted to do a live storage-vMotion back to the vsanDatastore, I could not:

_______________________________________________________________________

Failed to create one or more destination disks.

A fatal internal error occurred. See the virtual machine's log for more details.

Failed waiting for data. Error 195887107. Not found.

_______________________________________________________________________

This error number on the 3rd line varies, by the way.

So I dove into the VM's vmware.log file, and the error there was that there was not enough disk space on the vsanDatastore (which is far from the truth).

This VM is a SQL Server and has a cache-reservation policy on the 2nd VMDK where the databases are stored.

I then got a hunch, and instead of selecting the "policy with a cache-reservation" I selected a "normal" policy (FTT=1, stripes=3) without any cache reservation and voilà, the storage-vMotion went through without a hitch. Just like that.

I have now removed storage policies with cache reservations from all VMs and will see how it goes.

What I think happens is that the SSD read caches on all nodes are totally consumed, and when the storage-vMotion from external storage starts, VSAN says "nope, I don't have any cache available anywhere" (because it would have to evacuate some other VM's cached data to make room for the cache reserved for this VM). Instead of evacuating cached data from other VMs to make room, it simply bombs out with the error mentioned above (the 3 lines).
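A quick back-of-envelope check of that theory, since the read-cache reservation is expressed as a percentage of the VMDK's logical size and is carved out of the cache-tier SSDs. All numbers here are hypothetical, just to show how fast a few large disks can pin the whole cache tier:

```shell
# Hypothetical SQL Server data disk with a cache-reservation policy:
vmdk_gb=1024      # 1 TB database VMDK (example size)
reserve_pct=10    # 10% read-cache reservation (example policy value)

# GB of SSD cache this one disk would pin, regardless of actual working set:
echo $(( vmdk_gb * reserve_pct / 100 ))
```

With several such VMs, the summed reservations can exceed the free cache tier even while the capacity tier still shows terabytes free, which would line up with the misleading "insufficient disk space" error.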

shanceaylown
Hot Shot

Hi srodenburg, what is the size of the disks and SSDs? What RAID controller model is installed in each server? How much space are you using?

--- If you find this post useful, please consider awarding points for "Correct" or "Helpful" Leonardo Nicolini | VCP6-DCV | VCP5-DCV | MCP @shanceaylown | https://it.linkedin.com/in/leonardonicolini