Solved: Re: Virtual Machine unavailable during Snapshot

Western0 · ‎12-16-2009

Hello,

I wanted to check in with others to see how snapshots affect your VMs. I work in several small VMware environments and have been seeing similar effects when snapshotting VMs.

Today I snapshotted a SQL server (Windows 2K8 running SQL 2008) that powers a SharePoint site. The snapshot took roughly 30 seconds, and during that time the SharePoint site was unavailable. In this case the host is running ESX 3.5 on a 1.5-year-old HP DL380 G5 with an HP MSA2324 G2 SAN on the back end.

After noticing the service outage, I tested snapshotting a file and print server at another location, in this case running ESX 4.0 on an HP DL380 G6 with an EMC CX300 on the back end. I experienced a service outage there too: I was unable to access file shares while the snapshot was being taken. This snapshot took a few minutes to complete; I'm guessing this is because the data drives are on 7.2K disks. During the first minute of the snapshot I could not access shares. Once it got close to 100% I was able to access file shares again.

In both scenarios I only lost one ping, right at the beginning of the snap. When deleting snapshots I haven't experienced any loss of service.

Is this a normal experience or could it be caused by a misconfiguration or lack of resources?

Thanks,

DSTAVERT · ‎12-16-2009

I would have a look at Understanding Snapshots. Depending on which options you have chosen for your snapshot (quiesce or memory), the OS will need to freeze activities until the snapshot has completed. Also have a look at http://kb.vmware.com/kb/5962168

-- David -- VMware Communities Moderator

krismcewan · ‎01-20-2010

We have a similar problem and I have tested this several ways.

1. VM snapshot with NO memory and NO quiesce. Drops ping during snapshot and removal of snapshot.

2. VM snapshot INCLUDING memory, NOT quiesced. Drops ping during snapshot and removal of snapshot.

3. VM snapshot INCLUDING memory AND quiesce. Drops ping during snapshot and removal of snapshot.

4. VM snapshot with NO memory BUT INCLUDING quiesce. Drops ping during snapshot and removal of snapshot.

I have used different VMs, different hosts, different networks (vSwitch, dvSwitch, Nexus), different LUNs on the iSCSI SAN, and local storage.

From what I can see, in the environment I am in it happens every time.

A VMware Consultant in Scotland for Taupo Consulting

http://www.taupoconsulting.co.uk

If you think I'm right or helpful award me some points please


lshelton · ‎03-08-2010

I'm seeing the same thing on ESXi 3.5 update 5. We typically use NFS. We have VMs stored on two different NFS servers as well as a local datastore.

If I do not include the memory in the snapshot, the "outage" is very short, usually only 1 ping when taking the snapshot, and sometimes none. Including memory increases the ping loss to 5-6. Removing the snapshot always results in a minimum of 5-6 lost pings.
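To compare ping-loss counts like these between hosts or snapshot options, a small sketch can turn a recorded run of ping results into the longest consecutive loss and an approximate outage time. It assumes one ping per second (the usual default interval) and that each reply has already been recorded as True (answered) or False (lost); the function name and input format are invented for illustration.

```python
def longest_outage(replies, interval_s=1.0):
    """Return (lost_pings, approx_seconds) for the longest run of lost pings.

    replies    -- iterable of booleans, True = reply received, False = lost
    interval_s -- assumed time between pings, in seconds
    """
    longest = current = 0
    for answered in replies:
        # Reset the counter on a reply, otherwise extend the current loss run.
        current = 0 if answered else current + 1
        longest = max(longest, current)
    return longest, longest * interval_s
```

For example, a capture with five lost pings in a row at the default interval corresponds to an outage of roughly five seconds.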

This is making backups a real nightmare.

Thank you,

Lewis Shelton

ggbailey23 · ‎05-28-2010

I am having the exact same issue. We are running ESX 4.0 U1. Our organization requires a 1-hour RPO for business-critical applications. This requires that I snap my SQL VMs every hour during business hours and replicate to our DR facility. We are currently trying to use EMC's Replication Manager utility to facilitate the VM snaps.

During snapshot creation we are losing anywhere from 8 to 10 pings per guest. More importantly, even after ping connectivity is restored, we lose all SQL application connectivity to these servers, including SQL Management Studio, until the snapshot has reached almost 100% completion. On some servers this takes nearly 10 minutes. Due to these issues, we are unable to create snapshots during business hours.

Has anyone come up with any other solutions to create application-consistent snapshots of SQL / Exchange VMs and replicate off-site? I am at my wits' end with this solution. I have had a case open with VMware and their response is that it is working as designed. I have also had a case open with EMC, since they are the ones that sold the solution, for 6 weeks now with no resolution.

Any help is appreciated.

obsidian009 · ‎07-15-2010

bump

We're seeing similar issues with certain VMs losing pings during the snapshot process...particularly at the end of the removal. First, this is annoying for monitoring as we'll get alerts for servers when they're backing up. Secondly, we have some applications that are very sensitive to any kind of network connection disruption and they don't always recover so well....they also throw a lot of errors in event logs when these outages occur.

What can be done to mitigate or eliminate any network connection outages during snapshots? We're on ESX 4.0.0, 244038. I know ESX 4.1 was just released...we'll probably try to go to that soon but need some things to become certified first. One suggestion I read elsewhere was to verify that the Service Console memory allocation is set to the maximum of 800MB. I checked and we had some set at 500 or 600MB, so I'm going to bump those up....not sure how much that will help. My only other thought is to isolate some of the more sensitive VMs that are dropping into their own datastores (FC LUNs)....again I'm not sure how much that will help though.

Is it really normal to lose some pings during snapshot of VMs? Any other thoughts, comments or recommendations are appreciated.

Thx

singy2002 · ‎07-22-2010

Hi,

We are having the same issue. We have had 3 calls open with VMware on and off for months, and what they eventually come back with is that it's working as intended.

One of the things we are beginning to investigate is SQL. It seems clear that VMware triggers this behaviour, but I cannot be certain that it is not partly down to SQL. After all, the stun process in VMware reports the stun duration as 1-3 seconds, which ties in perfectly with the number of pings lost.

If you want to check the stun durations from the console (PuTTY) of the ESX host running the VM, browse to the VM's directory on the datastore and run the following command.

egrep Checkpoint_Unstun vmware.log

This will give you the stun durations in microseconds (1000000 = 1 second).

I would be very interested to see if you experience the same stun durations/ping timeouts as we do.
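As a rough sketch, the same lines can be converted to seconds with a few lines of Python once vmware.log has been copied off the host. This assumes each matching line carries the stun time as an integer followed by "us" (something like "... Checkpoint_Unstun: vm stopped for 1234567 us"); that line shape is an assumption, so adjust the pattern to whatever your log actually shows. The function name is made up for illustration.

```python
import re

# Assumed line shape: "... Checkpoint_Unstun: vm stopped for 1234567 us"
# Adjust the pattern if your vmware.log looks different.
STUN_RE = re.compile(r"Checkpoint_Unstun.*?(\d+)\s*us\b")

def stun_seconds(lines):
    """Return the stun duration in seconds for every matching log line."""
    durations = []
    for line in lines:
        match = STUN_RE.search(line)
        if match:
            # The log reports microseconds; 1000000 us = 1 second.
            durations.append(int(match.group(1)) / 1_000_000)
    return durations
```

To use it, open the log file and pass the file object to stun_seconds(); it returns one value per stun event, which can then be lined up against the number of pings lost.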

On the note of increasing the service console memory, I have done this and it has had no effect.
