VMware Cloud Community
cypherx
Hot Shot

Snapshot consolidation hangs VMs for a minute

We use Veeam Backup and Replication 8.0.0.917, but according to Veeam they just use the VMware storage API to take snapshots so their engine can back up properly.  We've also seen this issue when taking snapshots manually or using another product (Acronis vmProtect).  The issue is that when a snapshot is consolidated, the virtual machine temporarily freezes right at the end of consolidation.  The freeze is brief and the VM recovers by itself, but it is enough to make our monitoring systems go haywire and our automated processes attempt to restart services and such.  Is there any reason why this happens?  vMotion is non-disruptive, which is an amazing technical feat, so I'm not sure why snapshots are temporarily disruptive.  Forget backing up anything during the day; we have to run backups late at night so this disruption goes unnoticed.  In the wee morning hours one might see website and web app timeout errors or SQL connection errors as those particular VMs consolidate for a minute or two.

Infrastructure

EMC VNX5200 storage array, 3 storage pools carved out and presented via NFS to 6 ESXi 5.0.0 (build 2312428) hosts.

QLogic 3200/3400 series 10gb network cards with mtu 9000 on all vmk interfaces.

Brocade TurboIron 24x 10gbe switch

10gbe active twinax cables back to the VNX5200 10gbe data mover cards.

Each storage pool is on its own subnet with its own vmk interface.

The 10gb networking uses a dvSwitch: 3 vmkernel ports, one per subnet / storage pool, with NFS traffic at High shares (value 100), and 1 vmkernel port for vMotion, with vMotion traffic at Normal physical adapter shares (value 50).  Network I/O Control is enabled.

Storage I/O Control enabled, congestion threshold 30ms.

EMC VNX VAAI plugin installed and Hardware Acceleration reported as "Supported".

Average datastore latency is between 0.606 and 3.294

7 Replies
Sharath_BN
Enthusiast

Yes, most backup applications just use the API to talk to vCenter and create/delete snapshots.

Can we confirm that this VM freeze is not seen during snapshot creation?

Consolidation of a VM reaches a point where it has to stun the VM in order to complete the switch from the delta disk to the base disk. The stun time can grow depending on several factors. The one seen most often is the time taken to switch from the delta disk to the base disk.

For a high-I/O VM such as a DB server, if the stun time is long, then an application that depends on this DB can time out, causing the application to fail.

Below are a few things we want to check to confirm the issue.

1. Since when have we seen the issue?

2. Was it working fine earlier, or has the issue been present since deployment?

3. The vmware.log file for this VM has a timestamp on each task performed during consolidation. Where is the most time spent?

The article below has information on long stun times during snapshot consolidation.

http://kb.vmware.com/kb/1002836

Regards

Sharath BN

cypherx
Hot Shot

1. Since when have we seen the issue?

We see this issue right as the snapshot begins the consolidation process.  Taking a snapshot does not cause any issue at all.

2. Was it working fine earlier, or has the issue been present since deployment?

We've always had this issue.  It's random; sometimes we receive alerts that our website is down, slow or not responding.  We have just gotten used to the alerts, and if we see it's around 1-2 AM we ignore them because we know they have to do with these backups completing.  Just recently the vendor put in an update so that if a service is not responsive, it tries to restart it.  So we've begun to look into this again.

3. The vmware.log file for this VM has a timestamp on each task performed during consolidation. Where is the most time spent?

According to the Veeam logs, the backup started at 1:17:20 AM and ended at 1:28:46 AM.  Using a UTC to EST time converter, I found the matching portion of the vmware.log file and am attaching it to this thread.
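For anyone matching backup times against vmware.log, the conversion can be scripted rather than done by hand. This is only a minimal sketch: it assumes the Veeam times are EST (a fixed UTC-5 offset), and the date is an assumption taken from the timestamps in the log excerpt quoted later in the thread. The helper names `est_to_utc` and `to_log_stamp` are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# EST is a fixed UTC-5 offset (no daylight saving applies in January).
EST = timezone(timedelta(hours=-5), "EST")


def est_to_utc(local: datetime) -> datetime:
    """Attach the EST offset to a naive local time and convert it to UTC."""
    return local.replace(tzinfo=EST).astimezone(timezone.utc)


def to_log_stamp(moment: datetime) -> str:
    """Format a UTC datetime in the same style as the vmware.log prefix."""
    millis = moment.microsecond // 1000
    return moment.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis:03d}Z"


# Assumed date; the clock times are the Veeam start and end times quoted above.
start = est_to_utc(datetime(2015, 1, 20, 1, 17, 20))
end = est_to_utc(datetime(2015, 1, 20, 1, 28, 46))
print("search vmware.log between", to_log_stamp(start), "and", to_log_stamp(end))
```

Any log line whose timestamp falls between the two printed values belongs to the backup window.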

Sharath_BN
Enthusiast

From the logs I see three stun times:

2015-01-20T06:18:03.473Z| vcpu-0| Checkpoint_Unstun: vm stopped for 1249990 us ---> 1.24999 seconds

2015-01-20T06:26:50.926Z| vcpu-0| Checkpoint_Unstun: vm stopped for 748057 us ---> 0.748057 seconds

2015-01-20T06:27:37.049Z| vcpu-0| Checkpoint_Unstun: vm stopped for 733033 us ---> 0.733033 seconds

You may need to check with the application team on how sensitive the application is to DB connection issues.
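The stun times above can also be pulled out of a vmware.log programmatically. The following is a rough sketch that assumes every stun is reported in the same `Checkpoint_Unstun: vm stopped for N us` form as the three lines quoted here; the function name `stun_times` is hypothetical, and other log formats are not handled.

```python
import re

# Matches lines such as:
# 2015-01-20T06:18:03.473Z| vcpu-0| Checkpoint_Unstun: vm stopped for 1249990 us
UNSTUN = re.compile(
    r"^(?P<stamp>[^|\s]+)\|.*Checkpoint_Unstun: vm stopped for (?P<us>\d+) us"
)


def stun_times(lines):
    """Return a list of (timestamp, seconds) for every Checkpoint_Unstun line."""
    results = []
    for line in lines:
        match = UNSTUN.search(line)
        if match:
            # The log reports microseconds; convert to seconds.
            results.append((match.group("stamp"), int(match.group("us")) / 1_000_000))
    return results


# The three lines quoted in this thread, used as sample input.
sample = [
    "2015-01-20T06:18:03.473Z| vcpu-0| Checkpoint_Unstun: vm stopped for 1249990 us",
    "2015-01-20T06:26:50.926Z| vcpu-0| Checkpoint_Unstun: vm stopped for 748057 us",
    "2015-01-20T06:27:37.049Z| vcpu-0| Checkpoint_Unstun: vm stopped for 733033 us",
]

for stamp, seconds in stun_times(sample):
    print(f"{stamp}  stunned for {seconds:.3f} s")
```

To run it against a real file, pass an open file handle instead of `sample`, since the function only iterates over lines.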

cypherx
Hot Shot

Do we know why it takes so long? I mean, 1.25 seconds is not long to you and me, but to a computer system it is a different story.

Are there any tweaks we can make to lower that time?

Also some people are in the iSCSI vs. NFS wars, not looking to start another one of those threads... but would iSCSI see the same performance issue during consolidation?

Sharath_BN
Enthusiast

To know more about what is happening on the host, we would need to analyse the host logs. This is done by the VMware support team.

It requires checking other logs on the host as well, along with the virtual machine logs.

To confirm whether iSCSI would help over NFS, we first have to isolate whether the issue is due to network latency caused by the protocol or to other parameters.

If you have a support contract with VMware, go ahead and involve them, since this might need to be looked into further.

cypherx
Hot Shot

Yes we have active support.  Good idea, I will open a case.

cypherx
Hot Shot

Our stun times are considered normal by VMware.  Whether the interface is FC, iSCSI, local storage, or NFS, this happens no matter what.  Our performance with NFS appears to be in check.  Our stun times of 1-2 seconds are actually pretty good, as there are some customers with performance issues where the stun times are much longer, 30+ seconds.

It turns out our SQL-based web application is highly sensitive and intolerant of even the slightest delay.  It is very I/O intensive, so the 20-30 minutes it takes for Veeam to complete the backup and send the API call to consolidate the snapshot results in more changes to the delta disk than a non-database application would produce.
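To get a feel for why a busy database leaves more to consolidate, here is a back-of-the-envelope sketch. It gives only an upper bound, assuming every write during the backup lands on a block not already in the delta disk, and the 5 MB/s write rate is a made-up figure, not something measured in this environment. The function name `delta_upper_bound_gb` is hypothetical.

```python
def delta_upper_bound_gb(write_mb_per_s: float, backup_minutes: float) -> float:
    """Upper bound on delta disk growth while a snapshot is open.

    Assumes every write goes to a block not yet in the delta disk,
    so repeated writes to the same block would make the real size smaller.
    """
    total_mb = write_mb_per_s * backup_minutes * 60
    return total_mb / 1024


# 5.0 MB/s is an assumed sustained write rate, for illustration only.
# 20 and 30 minutes are the backup durations mentioned above.
for minutes in (20, 30):
    size = delta_upper_bound_gb(5.0, minutes)
    print(f"{minutes} min backup: up to {size:.2f} GB in the delta disk")
```

Swapping in a write rate taken from real datastore statistics would give a more useful number for a given VM.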

They did provide two KB articles and suggested we spin up a test machine and play with these tweaks to see what kind of performance numbers we can achieve.  Another suggestion is to get with our application developer and see if they can put in a window of time during which we know that a snapshot will be consolidating due to backups.  During this window the application would temporarily silence alerts and not try to automatically restart services.
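The quiet-window idea could look something like the sketch below. Everything in it is hypothetical: the window boundaries are assumed from the 1-2 AM range mentioned earlier in the thread, and the function names are invented for illustration, not taken from the vendor's monitoring product.

```python
from datetime import datetime, time

# Hypothetical window; the thread only says the alerts show up around 1-2 AM.
WINDOW_START = time(1, 0)
WINDOW_END = time(2, 30)


def in_backup_window(now: datetime,
                     start: time = WINDOW_START,
                     end: time = WINDOW_END) -> bool:
    """True when `now` falls inside the window, including windows that cross midnight."""
    current = now.time()
    if start <= end:
        return start <= current < end
    # Window wraps past midnight, e.g. 23:30 to 00:30.
    return current >= start or current < end


def should_restart_service(service_healthy: bool, now: datetime) -> bool:
    """Restart only when the service is unhealthy outside the backup window."""
    return (not service_healthy) and not in_backup_window(now)


# Example: a failed health check at 1:18 AM local time is ignored,
# while the same failure at 2:00 PM triggers a restart.
print(should_restart_service(False, datetime(2015, 1, 20, 1, 18, 3)))
print(should_restart_service(False, datetime(2015, 1, 20, 14, 0, 0)))
```

A fixed window is the simplest option. A check that asks the backup job whether it is still running would be tighter, but it depends on what the backup product exposes.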

Two KB articles of interest:

KB# 2039754

VMware KB: Windows virtual machines become unresponsive for over 30 minutes when removing a snapshot...

KB# 1002836

VMware KB: A snapshot removal can stop a virtual machine for long time
