VMware Cloud Community
chrgloor
Contributor

vmware-cmd removesnapshots hangs the VM

Hello,

I have a heavily loaded Windows 2003 R2 file server running under VMware ESX 3.0.2.

Every hour I take a snapshot of the VM, and that runs fine. However, 15 minutes later (after SnapMirroring my NetApps), I try to remove the snapshot using the command "vmware-cmd xxx.vmx removesnapshots".

This commits the delta file, which has grown to a few (1-10) GB by the time I try to remove it. During the commit, a new delta file is created to store all the data changed while the first delta file is being removed. When it comes to removing that new delta (which is itself a few GB by then), the VM hangs for a few minutes.
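For reference, here is a minimal sketch of how such an hourly cycle could be driven from cron on the service console, reusing the vmware-cmd forms quoted in this thread; the .vmx path and snapshot name are placeholders, not the poster's actual configuration:

# /etc/crontab sketch (placeholder path and snapshot name).
# Take a quiesced snapshot at the top of every hour.
0 * * * * root vmware-cmd /vmfs/volumes/datastore1/fileserver/fileserver.vmx createsnapshot hourly NetApp quiesce
# Commit (remove) it 15 minutes later, after the NetApp SnapMirror has completed.
15 * * * * root vmware-cmd /vmfs/volumes/datastore1/fileserver/fileserver.vmx removesnapshots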

From the Windows event logs, I can see the machine is still running, but it no longer responds on the network (not even to ping). The CPU performance graph in VirtualCenter also shows the VM as not responding, and I can't even open a console on the VM.

The VM is built with a 20 GB vmdk for the system and two other vmdks for data (300 GB and 600 GB) located on other LUNs of the SAN.

Is this behaviour normal? I saw a few posts describing the same kind of problem when using multiple vmdks, but none of them gives an explanation or a workaround.

Any ideas?

Best regards, and thanks for this forum, which has been my best source of information so far.

Chris_S_UK
Expert

Some loss of connectivity is, I think, inevitable, and the larger the snapshot, the longer it will take to commit.

I usually see 1-3 pings lost on VMs when I commit a snapshot of the minimum size of 16 MB (i.e. one with virtually nothing in it).

As you have almost 1 TB of disk space in this VM, have you considered using RDMs instead of VMFS-based vmdks? That would offer some benefits in a SAN/snapshot environment.
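For what it's worth, an RDM mapping file can be created from the service console with vmkfstools; a rough sketch, with placeholder device and datastore paths (this is not the poster's configuration):

# Create a virtual-compatibility-mode RDM mapping file for a SAN LUN (placeholder paths).
vmkfstools -r /vmfs/devices/disks/vmhba1:0:3:0 /vmfs/volumes/datastore1/fileserver/data_rdm.vmdk
# Use -z instead of -r for a physical-compatibility (passthrough) RDM.
# The resulting data_rdm.vmdk is then attached to the VM like any other virtual disk.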

Chris

chrgloor
Contributor

Thanks for your answer, Chris.

I didn't mention it in my post, but I have another VM running Exchange 2003 with two RDMs (virtual compatibility mode) attached for storage.

It's not as bad because the data doesn't change as fast, but the machine still hangs for a few minutes when committing the changes.

Of course, I could use physical-mode RDMs, but I'm not sure my database would be consistent then.

Christian

Karun
Contributor
Accepted Solution

In ESX 3.0.x, the VM is quiesced while committing the snapshot(s), so everything is frozen until the snapshot commit operation completes. Once the operation completes, everything is restored, including the network. This is how the feature works in 3.0.x.

In ESX 3.5, the VM is not quiesced, and you will not lose network or any other I/O during snapshot consolidation (deletion).
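The stun can be confirmed from the VM's vmware.log; a quick check along the lines of the one fletch00 posts further down (the datastore path is a placeholder):

# Show how long the VM was stunned, in microseconds, during recent snapshot operations.
grep Checkpoint_Unstun /vmfs/volumes/datastore1/fileserver/vmware.log | tail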

Thanks,

Karun

chrgloor
Contributor

That's great news, Karun.

I'm just upgrading to 3.5 these days.

Thanks for your feedback.

Christian

fletch00
Enthusiast

I'm running 3.5 (fully remediated, i.e. patched) and still see loss of network connectivity when doing snapshot removals.

I regularly see the VM logs showing 30-60 second "vm stopped" messages.

I have had a case open on this for a few weeks. We tried inserting a delay in the snapshot-removal loop and increasing the COS memory from 272 MB to 800 MB; it seemed to help but did not eliminate the issue.

I am using the script from NetApp's official documentation:

http://media.netapp.com/documents/tr_3428.pdf

Karun
Contributor

fletch00,

How big are your redo logs and how many snapshots did you try to delete?

Did you lose network during the entire duration of snapshot deletion?
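For reference, the redo-log (delta) sizes can be checked directly on the datastore; a minimal sketch with a placeholder path:

# List the snapshot delta files for a VM and their current sizes.
ls -lh /vmfs/volumes/datastore1/fileserver/*-delta.vmdk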

- Karun

fletch00
Enthusiast

We lose connectivity during the "vm stopped" periods of the snapshot removal (see the logs from last night below). What is the source of your information that 3.5 does not lose connectivity during snapshot removal?

# egrep Checkpoint_Unstun /vmfs/volumes/vmachines65net/*/vmware.log | egrep "Apr 09" | sort +8nr | head | sed s/irt-//g
/vmfs/volumes/vmachines65net/windev-01/vmware.log:Apr 09 17:48:30.390: vmx| Checkpoint_Unstun: vm stopped for 61473032 us
/vmfs/volumes/vmachines65net/desktop/vmware.log:Apr 09 17:39:35.565: vmx| Checkpoint_Unstun: vm stopped for 31526095 us
/vmfs/volumes/vmachines65net/lane-dev/vmware.log:Apr 09 17:34:29.269: vmx| Checkpoint_Unstun: vm stopped for 31337618 us
/vmfs/volumes/vmachines65net/marigold/vmware.log:Apr 09 17:51:08.694: vmx| Checkpoint_Unstun: vm stopped for 31022770 us
/vmfs/volumes/vmachines65net/medcomm/vmware.log:Apr 09 17:37:15.747: vmx| Checkpoint_Unstun: vm stopped for 30704538 us
/vmfs/volumes/vmachines65net/bfrelayfeb2808/vmware.log:Apr 09 17:29:46.273: vmx| Checkpoint_Unstun: vm stopped for 30662199 us
/vmfs/volumes/vmachines65net/IRT-PROJECT/vmware.log:Apr 09 15:47:57.024: vmx| Checkpoint_Unstun: vm stopped for 6201388 us
/vmfs/volumes/vmachines65net/hyperic-02/vmware.log:Apr 09 15:43:50.285: vmx| Checkpoint_Unstun: vm stopped for 3024249 us
/vmfs/volumes/vmachines65net/stagehand0918/vmware.log:Apr 09 17:31:56.360: vmx| Checkpoint_Unstun: vm stopped for 1327609 us
/vmfs/volumes/vmachines65net/windev-01/vmware.log:Apr 09 17:46:28.755: vmx| Checkpoint_Unstun: vm stopped for 1323707 us

The delta files are not huge; the snapshots only exist long enough to take the NetApp volume snapshot (< 10 mins).

This is the script from the NetApp paper:

#!/bin/bash

# Step 1: Enumerate all VMs on an individual ESX Server and put each VM in hot backup mode.
for i in `vmware-cmd -l`
do
  echo putting $i into hot backup mode
  vmware-cmd $i createsnapshot backup NetApp quiesce
done

# Step 2: Rotate NetApp Snapshot copies: delete the oldest and create a new one, maintaining 7.
echo rotating, deleting oldest and creating new snapshot
ssh na-ccsr02-v65 -l root snap delete vm65net vmsnap.esx65-02.7
ssh na-ccsr02-v65 -l root snap rename vm65net vmsnap.esx65-02.6 vmsnap.esx65-02.7
ssh na-ccsr02-v65 -l root snap rename vm65net vmsnap.esx65-02.5 vmsnap.esx65-02.6
ssh na-ccsr02-v65 -l root snap rename vm65net vmsnap.esx65-02.4 vmsnap.esx65-02.5
ssh na-ccsr02-v65 -l root snap rename vm65net vmsnap.esx65-02.3 vmsnap.esx65-02.4
ssh na-ccsr02-v65 -l root snap rename vm65net vmsnap.esx65-02.2 vmsnap.esx65-02.3
ssh na-ccsr02-v65 -l root snap rename vm65net vmsnap.esx65-02.1 vmsnap.esx65-02.2
ssh na-ccsr02-v65 -l root snap create vm65net vmsnap.esx65-02.1

# Step 3: Bring all VMs out of hot backup mode.
for i in `vmware-cmd -l`
do
  echo bringing $i out of hot backup mode
  vmware-cmd $i removesnapshots
  echo sleeping for 30 to prevent IO contention
  sleep 30
done

jungblpe
Contributor

I am experiencing very similar outages in my environment. I was just wondering if you ever heard more about this?

jungblpe
Contributor

This might be a dead issue since there has been so much publicity about it, but I thought I would post my experience since it has changed. My issues were fixed after applying patch ESX350-200808401-BG and making the necessary change to the VMware config file: edit /etc/vmware/config and add the line

prefvmx.ConsolidateDeleteNFSLocks = "TRUE"
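For anyone applying the same change, a rough sketch of the edit on the service console, covering only the setting quoted above (back up the file first):

# Back up /etc/vmware/config, append the setting, and verify it is present.
cp /etc/vmware/config /etc/vmware/config.bak
echo 'prefvmx.ConsolidateDeleteNFSLocks = "TRUE"' >> /etc/vmware/config
grep ConsolidateDeleteNFSLocks /etc/vmware/config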

Ajay_Nabh
Enthusiast

I have applied the ESX350-200808401... patch and Update 4 on the NetApps, and I still get the same issue: I lose pings and cannot delete snapshots during working hours. Could people comment on whether the mentioned patch fixed the issue for them?

thanks
