Slow snapshot CREATION. Stun + 8 seconds

Seve_CH · 03-30-2016

Dear all,

When I create a snapshot for VMs with many disks, the machines are paused (stunned) for up to 10 seconds depending on how many disks they have.

I know that consolidating a snapshot may take time depending on its size and data distribution, but our problem is not snapshot consolidation but snapshot creation, something that seemed to me almost straightforward and really quick to do (pause IO, stop CBT, divert IO to another set of files, start CBT, resume IO). But...

ESXi Enterprise 6 Update 2 (Build 3620759, but the same happens with other recent, older builds), a VM with HW Version 11 and 10 SCSI HDs, Thick Lazy Zeroed (6.8TB in total), spread over 7 datastores, CBT active, shows this in the log when CREATING a snapshot from vSphere Client:

2016-03-30T06:34:48.000Z| vmx| I120: SnapshotVMX_TakeSnapshot start: 'test', deviceState=0, lazy=0, logging=0, quiesced=0, forceNative=0, tryNative=1, saveAllocMaps=0 cb=21A6BC00, cbData=3236ABD0

.....

2016-03-30T06:34:56.212Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 8164504 us

So 8 pings lost, more than 1,000 file handles closed (it is a file server), and clients complaining.
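
For anyone who wants to check their own logs for the same symptom, here is a minimal Python sketch that pulls the stun duration out of `Checkpoint_Unstun` lines. It assumes the log lines have exactly the format quoted above; the function name `stun_times` and the 1000 ms threshold are made up for illustration and are not part of any VMware tool.

```python
import re

# Matches lines such as:
# 2016-03-30T06:34:56.212Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 8164504 us
# (format assumed from the two log lines quoted in this post)
UNSTUN_RE = re.compile(
    r"^(?P<ts>\S+?)\|.*Checkpoint_Unstun: vm stopped for (?P<us>\d+) us"
)


def stun_times(lines, threshold_ms=1000.0):
    """Return (timestamp, stun in ms) for every stun of at least threshold_ms."""
    hits = []
    for line in lines:
        m = UNSTUN_RE.search(line)
        if m:
            ms = int(m.group("us")) / 1000.0  # the log reports microseconds
            if ms >= threshold_ms:
                hits.append((m.group("ts"), ms))
    return hits


sample = [
    "2016-03-30T06:34:48.000Z| vmx| I120: SnapshotVMX_TakeSnapshot start: 'test', "
    "deviceState=0, lazy=0, logging=0, quiesced=0, forceNative=0, tryNative=1, "
    "saveAllocMaps=0 cb=21A6BC00, cbData=3236ABD0",
    "2016-03-30T06:34:56.212Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 8164504 us",
]

for ts, ms in stun_times(sample):
    print(f"{ts}: stunned for {ms:.0f} ms")
```

To run it against a real log, pass an open file instead of `sample`, for example `stun_times(open("vmware.log"))` on a copy of the VM's log.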

Snapshot removal stuns the VM for more or less the same total time, but it is spread over several pauses, one per vdisk, which doesn't affect production (an 800 ms pause is high but acceptable; 8,000 ms is not).

Our goal was to back up several VMs quite frequently (every 30 min to 1 h) with Veeam Backup and Recovery, but a pause that long is really disruptive for servers: applications time out, clusters risk failing over, etc.

The host hardware is powerful enough: with only the file server running on the host, 1 GHz of 12 x 3.4 GHz is used, and the VM uses 6GB of 512GB memory @2.1GT/s.

Regarding disks, the datastores seem OK. They have a latency ranging from 0 to 3ms (read and write, they are 2 IBM XIV with around 200 disks each) and once the snapshot is made, the Veeam server is able to pull a mean of 700MB/s from them (direct SAN mode) without a significant performance penalty.

I have not seen any difference if the disks are on VMFS3 or VMFS5.

It seems as if ESXi takes each disk snapshot sequentially, even if the disks are on different datastores, and needs almost 0.75 seconds per disk. 10 disks = 7.5 seconds of pause. Not good. Our tests were done with an older version of ESXi 6 and they were OK, so I feel something changed (CBT bug patch? But I cannot rule out that other settings were modified in the meantime).
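
To make that arithmetic explicit, here is a small Python sketch comparing the two hypotheses. The 0.75 s per-disk figure is only my estimate from the log, the "parallel" case is a guess at what one might expect rather than documented ESXi behaviour, and `estimated_stun_seconds` is a made-up helper name.

```python
def estimated_stun_seconds(per_disk_seconds, sequential=True):
    """Estimate total stun time from per-disk snapshot times.

    sequential=True  -> disks are handled one after another (sum of times)
    sequential=False -> disks are handled at once (slowest disk dominates)
    """
    if not per_disk_seconds:
        return 0.0
    return sum(per_disk_seconds) if sequential else max(per_disk_seconds)


# Assumption: each of the 10 disks costs roughly 0.75 s, as estimated above.
disks = [0.75] * 10

print("sequential:", estimated_stun_seconds(disks, sequential=True), "s")
print("parallel:  ", estimated_stun_seconds(disks, sequential=False), "s")

# Per-disk cost implied by the stun actually logged (8164504 us over 10 disks).
print("measured per disk:", 8164504 / 1_000_000 / len(disks), "s")
```

The sequential estimate is close to the 8.16 s stun in the log, which is why I suspect the disks are processed one at a time.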

Do you know how to speed up the snapshot creation?

What is that "tryNative=1"? I'd rather not wait while it tries things.

Thanks!

RobbieG1010 · ‎02-19-2018

Hi, I came across your post and I am experiencing the same thing you describe. Did you ever find a resolution?

I had some success with a VMware Tools update (which seemed to be missing from Update Manager), but a week later the problem had returned. VMware recommended clearing the CBT data, which I tested on one VM; it made a slight improvement, but not as significant as I would have hoped.