Solved: Re: Deleting snapshots hangs VM?

mattjk · ‎02-19-2009

Hi all,

Whenever we delete a VM snapshot via VIC, it always seems to make the VM's guest OS pause a few times during the process.

The size of the snapshot seems to affect the number / length of these pauses, but as an example I just deleted a ~500MB snapshot on a VM with 4 disks and it caused the VM to pause 3-4 times for 5-10 seconds at a time. The pauses seem to occur no matter what the guest OS is - W2K3, W2K8, 32 or 64-bit. All VMs have the VMware Tools installed, and we don't have any particular resource constraints.

Is this normal? Is there any way to prevent these pauses?

Cheers,

Matt

Cheers, Matt

Lightbulb · ‎02-19-2009

Local Storage or SAN?

Check your /var/log/vmkernel log for SCSI errors around the time when you delete the snapshots.
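One way to narrow the log down to the time of a snapshot delete is sketched below. It is only an illustration, not a VMware tool: the function name, the 5-minute window and the optional keyword filter are assumptions, and the sample line is the vmkernel entry posted further down this thread. Syslog timestamps carry no year, so the year is passed in.

```python
from datetime import datetime, timedelta


def lines_near(log_lines, when, window_minutes=5, year=2009, keyword=None):
    """Return log lines stamped within +/- window_minutes of `when`.

    If `keyword` is given (for example "SCSI"), only lines containing it
    are returned.
    """
    window = timedelta(minutes=window_minutes)
    hits = []
    for line in log_lines:
        # Classic syslog prefix: "Feb 20 10:48:26 host message..."
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        try:
            stamp = datetime.strptime(
                f"{year} {parts[0]} {parts[1]} {parts[2]}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue  # not a syslog-style line
        if abs(stamp - when) > window:
            continue
        if keyword is not None and keyword not in line:
            continue
        hits.append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    sample = [
        "Feb 20 10:48:26 esx-02 vmkernel: 20:11:02:27.255 cpu3:1303)DevFS: 2222: "
        "Unable to find device: 10c5-xxx.xxx.xxx-000007-delta.vmdk"
    ]
    for hit in lines_near(sample, datetime(2009, 2, 20, 10, 50)):
        print(hit)
```

To run it against the real log, read a copy of /var/log/vmkernel into a list of lines and pass the time you started the delete as `when`.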

mattjk · ‎02-19-2009

Duh, trust me to leave out vital info:

- Shared storage - NetApp FAS2050 active/active using NFS. Disk is lightly utilised at present.

- ESX is 3.5u3

Hmm, /var/log/vmkernel shows a bunch of errors like this:

Feb 20 10:48:26 esx-02 vmkernel: 20:11:02:27.255 cpu3:1303)DevFS: 2222: Unable to find device: 10c5-xxx.xxx.xxx-000007-delta.vmdk

Are these a concern? They don't all appear to be from the time I deleted the snapshot, though, and the VM appears to be OK after the snapshot deletion.

Cheers,

Matt

Lightbulb · ‎02-19-2009

Well you are not alone

http://communities.vmware.com/thread/188876

I think there is a good chance the errors in vmkernel are related to your issue. As to the cause, hmm, I don't have that yet.

Might want to look at this

http://virtrix.blogspot.com/2007/06/vmware-dreadful-sticky-snapshot.html

mattjk · ‎02-19-2009

Thanks for that... concerning-looking errors though :-S

It looks like similar errors are being thrown when I delete snapshots on any VM - although our problem differs from most of the other similar ones in that the snapshot /appears/ to be committed fine and the delta files removed at the end of the process (lots of others with the same error have left-over delta files).

Time to open a case with VMware methinks.

Cheers, Matt

Lightbulb · ‎02-20-2009

Let us know how it turns out. Good luck and sorry I could not be more help.

glynnd1 · ‎02-20-2009

Matt, what kind of disks do you have in your NetApp?

I saw this at a previous job where we had a FAS3020 with SATA, though it was limited to VMs that were doing a decent amount of disk writes during the snapshot period.

We found that while the snapshot was being merged in, a second delta file would be created to handle current writes, and during this time we saw no issues. But when this second delta file was being merged in, we'd see the VM enter a paused state from internal or external monitoring. Moving these VMs to FC disk resolved the problem as the snapshots were smaller - faster VCB backup, the resulting delta file created during merge was smaller and the merge of the delta did not cause a noticeable interruption to the OS or application.

mattjk · ‎02-20-2009

> Matt, what kind of disks do you have in your NetApp?

15k SAS. We only have the base unit (no extra shelves) though so there's only 20 disks, and the active/active setup means our main aggregate only has one RAID-DP group with 16 disks in it too.

> I saw this at a previous job where we had a FAS3020 with SATA, though it was limited to VMs that were doing a decent amount of disk writes during the snapshot period.
>
> We found that while the snapshot was being merged in, a second delta file would be created to handle current writes, and during this time we saw no issues. But when this second delta file was being merged in, we'd see the VM enter a paused state from internal or external monitoring. Moving these VMs to FC disk resolved the problem as the snapshots were smaller - faster VCB backup, the resulting delta file created during merge was smaller and the merge of the delta did not cause a noticeable interruption to the OS or application.

Did you get the same/similar errors in your vmkernel log as the one I posted above?

I don't think the problem is the same as the one you described though - our "normal" disk workload is very low (maybe 1-2 MB/s read/write and <10% disk utilisation from sysstat on the filer), and the VM whose snapshot commit led to me making this post had almost no disk I/O happening - especially writes - when I committed the snapshot.

Really do appreciate the input though, thanks. I opened a SR with VMware yesterday about the issue, no response yet though - will post back here with the outcome for the benefit of others.

Cheers,

Matt

titaniumlegs · ‎02-21-2009

You probably need to apply and activate patch ESX350-200808401-BG. You said you have ESX3.5u3, so the patch is built in. You activate it by inserting

prefvmx.consolidateDeleteNFSLocks = "TRUE"

in /etc/vmware/config and reboot ESX. (VMotion, shutdown or suspend VMs first)

Details in http://media.netapp.com/documents/tr-3428.pdf p.12-13.

Enjoy!

Share and enjoy! Peter If this helped you, please award points! Or beer. Or jump tickets.

mattjk · ‎02-22-2009

> You probably need to apply and activate patch ESX350-200808401-BG. You said you have ESX3.5u3, so the patch is built in. You activate it by inserting prefvmx.consolidateDeleteNFSLocks = "TRUE" in /etc/vmware/config and reboot ESX. (VMotion, shutdown or suspend VMs first)
> Details in http://media.netapp.com/documents/tr-3428.pdf p.12-13.

That document seems to change every time I blink! Thanks very much for that, that'll almost certainly be the problem - will apply the settings change ASAP.

Do you have any idea if creating/committing snapshots without this patch / setting change poses any risk of damage to the VMDK files or data within? Everything appears to be OK with the VMs we've snapshotted but thought it was worth asking.

Also, as an aside, what's happened to VMware's tech support? :-S I opened a SR with them about this on Thursday (US time) and I haven't heard a thing back from them. I should've asked NetApp about it instead - "brilliant" doesn't do their tech support justice!

Thanks again.

Cheers,

Matt

titaniumlegs · ‎02-22-2009

TR3428 gets updates every 2-3 months because it covers a lot of material, and some of it changes or we find new information to include.

AFAIK, we've never seen or heard of data loss as a result of not having the patch - just the long VM hangs during snap commit.

I can't answer for VMware's tech support, but thanks for the props on ours!

CYa

Peter

mattjk · ‎02-22-2009

> TR3428 gets updates every 2-3 months because it covers a lot of material, and some of it changes or we find new information to include.

Guess I'll have to start paying attention to the Version History in the document then

> AFAIK, we've never seen or heard of data loss as a result of not having the patch - just the long VM hangs during snap commit.

OK, thanks for the info.

> I can't answer for VMware's tech support, but thanks for the props on ours!

Didn't realise you were from NetApp - the VMware support question was rhetorical... they have finally responded though. With regards to you guys, well, props where props are due!

Thanks for all your help.

Cheers,

Matt Kilham

RKCCruiser · ‎02-28-2009

Hello Peter,

Thanks for the information. We are having the same problems with a setup. We are also using NFS with two ESX 3.5u3 servers. We are using a product by Vizioncore that basically does a lot of underlying scripting for snapshotting from one storage box to another. We have the same issue whenever it states that it is "deleting snapshots". We applied the fix as outlined in the documentation you recommended and rebooted. But whenever our snapshot process completes and deletes the old snapshot, the VM is hung up for about 1-2 minutes and no one can access that VM from the network. Any ideas?

titaniumlegs · ‎03-01-2009

Hi!

It sounds like the patch isn't activated. The 1-2 minute hang on VM snapshot delete is the basic symptom.

Check to make sure the quotes around true are double quotes, there's a space on each side of the = (not sure that matters, but that's how I have it and the rest of the options in that file), and no other quotes.

We had some problems with TR3428 and Word "helping" us by converting quotes to "smart quotes" and dashes to long dashes, so when you cut+paste, it wasn't always what you expected.
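A quick way to catch a bad paste is sketched below. It is only an illustration: the function name and messages are made up, and the strict comparison simply mirrors the formatting described above (straight double quotes, one space on each side of the =), which may be stricter than ESX actually requires. It is meant to be run with Python 3 on a copy of /etc/vmware/config, not necessarily on the ESX host itself.

```python
# The exact line from this thread, with straight double quotes.
EXPECTED = 'prefvmx.consolidateDeleteNFSLocks = "TRUE"'

# Curly quotes and long dashes that a word processor may substitute.
SMART_CHARS = "\u201c\u201d\u2018\u2019\u2013\u2014"


def check_config(text):
    """Return a list of problems with the option in the given config text."""
    problems = []
    found = False
    for number, line in enumerate(text.splitlines(), start=1):
        if "consolidateDeleteNFSLocks" not in line:
            continue
        found = True
        if any(ch in line for ch in SMART_CHARS):
            problems.append(f"line {number}: contains smart quotes or long dashes")
        elif line.strip() != EXPECTED:
            problems.append(
                f"line {number}: expected {EXPECTED!r}, got {line.strip()!r}"
            )
    if not found:
        problems.append("option not present")
    return problems


if __name__ == "__main__":
    # Assumes a copy of /etc/vmware/config saved as ./config
    with open("config", encoding="utf-8") as handle:
        for problem in check_config(handle.read()) or ["looks OK"]:
            print(problem)
```

An empty result only says the line is spelled as expected; it says nothing about whether the host is at a patch level that honours the setting, or whether it has been rebooted since the change.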

Post or PM me a copy of /etc/vmware/config and I'll take a look if you like.

Peter

RKCCruiser · ‎03-02-2009

Thank you, Peter.

My co-worker who has been working with me discovered that we had only installed the Update 2 for this customer and we upgraded their ESX servers to Update 3 yesterday and all is resolved. We now have the snapshot deletions taking place with only momentary connection loss for about 1 ping, which should be more than acceptable for our apps. Thank you again so much for the information.

James

mattjk · ‎03-03-2009

> You probably need to apply and activate patch ESX350-200808401-BG. You said you have ESX3.5u3, so the patch is built in. You activate it by inserting
>
> prefvmx.consolidateDeleteNFSLocks = "TRUE"

We applied this setting (and a couple of others from TR-3428 that we hadn't applied) last night and rebooted our ESX servers.

I've been testing snapshot commits this morning and it seems much faster, and I/O pauses much smaller / gone - it's hard to be 100% sure though as the problem was variable in its severity before.

That being said, we're still getting similar errors to before showing up in our vmkernel logs, e.g.:

Mar 4 13:38:41 xxx vmkernel: 0:18:58:40.358 cpu2:1076)DevFS: 2222: Unable to find device: 14051-yyy_2-000002-delta.vmdk

Mar 4 13:38:41 xxx vmkernel: 0:18:58:40.362 cpu2:1076)DevFS: 2222: Unable to find device: 1056-yyy-000002-delta.vmdk

Mar 4 13:38:41 xxx vmkernel: 0:18:58:40.480 cpu2:1076)DevFS: 2222: Unable to find device: 1a05a-yyy_2-000002-delta.vmdk

Mar 4 13:38:42 xxx vmkernel: 0:18:58:40.665 cpu2:1076)DevFS: 2222: Unable to find device: 1060-yyy-000002-delta.vmdk

Mar 4 13:38:42 xxx vmkernel: 0:18:58:40.718 cpu2:1076)DevFS: 2222: Unable to find device: 1062-yyy_2-000002-delta.vmdk

:-S

Cheers,

Matt Kilham

titaniumlegs · ‎03-03-2009

I think that's "normal". If you watch both the vmkernel logs and the VM-specific vmware.log (say, with tail -f each in a separate screen), or just compare the two afterwards, you can see where it actually creates another snapshot as part of the delete process. Note that in your example, it's complaining about 000002-delta, but you probably only had one snapshot, right? (It probably also complained about 000001 earlier, which did exist.)

The vmware.log shows it creating the extra snapshot and a bunch of manipulation of the snapshots and delta files. There are a couple entries like

Mar 03 22:37:00.943: vmx| Virtual Device for scsi0:0 was already successfully destroyed

Mar 03 22:37:02.024: vmx| Virtual Device for scsi0:0 was already successfully destroyed

Which correspond to the time stamps of the complaints in the vmkernel log

Mar 3 22:37:00 esx1 vmkernel: 13:02:01:31.632 cpu3:1103)DevFS: 2222: Unable to find device: 2504a-DFM-000002-delta.vmdk

Mar 3 22:37:01 esx1 vmkernel: 13:02:01:32.045 cpu0:1103)VSCSI: 4059: Creating Virtual Device for world 1104 vscsi0:0 (handle 8361)

Mar 3 22:37:01 esx1 vmkernel: 13:02:01:32.047 cpu3:1106)World: vm 1885: 900: Starting world vmware-vmx with flags 44

Mar 3 22:37:02 esx1 vmkernel: 13:02:01:32.712 cpu1:1103)DevFS: 2222: Unable to find device: 1d051-DFM-000003-delta.vmdk

Mar 3 22:37:02 esx1 vmkernel: 13:02:01:32.834 cpu2:1103)DevFS: 2222: Unable to find device: 2a058-DFM-000003-delta.vmdk

Mar 3 22:37:02 esx1 vmkernel: 13:02:01:32.951 cpu2:1103)DevFS: 2222: Unable to find device: 205e-DFM-000003-delta.vmdk

Mar 3 22:37:02 esx1 vmkernel: 13:02:01:33.196 cpu2:1103)VSCSI: 4059: Creating Virtual Device for world 1104 vscsi0:0 (handle 8362)

Edit: In the example here there are two snapshots, and I'm using vmware-cmd <vmx> removesnapshots which deletes them both.
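To see which snapshot numbers the DevFS complaints refer to, the vmkernel lines can be tallied by the six-digit number in the delta file name. This is a rough sketch: the regular expression is inferred from the log lines quoted in this thread, not from any documented log format, and the function name is made up.

```python
import re
from collections import Counter

# Matches e.g. "DevFS: 2222: Unable to find device: 2504a-DFM-000002-delta.vmdk"
# and captures the six-digit snapshot number (pattern inferred from this thread).
DEVFS = re.compile(r"DevFS: \d+: Unable to find device: \S+-(\d{6})-delta\.vmdk")


def complaints_by_snapshot(log_lines):
    """Count 'Unable to find device' complaints per snapshot number."""
    counts = Counter()
    for line in log_lines:
        match = DEVFS.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    excerpt = [
        "Mar 3 22:37:00 esx1 vmkernel: 13:02:01:31.632 cpu3:1103)DevFS: 2222: Unable to find device: 2504a-DFM-000002-delta.vmdk",
        "Mar 3 22:37:01 esx1 vmkernel: 13:02:01:32.045 cpu0:1103)VSCSI: 4059: Creating Virtual Device for world 1104 vscsi0:0 (handle 8361)",
        "Mar 3 22:37:02 esx1 vmkernel: 13:02:01:32.712 cpu1:1103)DevFS: 2222: Unable to find device: 1d051-DFM-000003-delta.vmdk",
    ]
    for number, count in sorted(complaints_by_snapshot(excerpt).items()):
        print(f"snapshot {number:06d}: {count} complaint(s)")
```

If most of the complaints land on a number one higher than the snapshots that existed before the delete, that fits the temporary extra snapshot described above.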

I'm not too worried about it. I think it's "normal". (But what do I know about normal?!:p )

Share and enjoy!

Peter

mattjk · ‎03-03-2009

> I think that's "normal". If you watch both the vmkernel logs and the VM-specific vmware.log (say, with tail -f each in a separate screen), or just compare the two afterwards, you can see where it actually creates another snapshot as part of the delete process.

Why do you say it's normal? Even though the events in the two logs match up, and the VM is creating a second delta disk as part of the commit, I still don't see why these sorts of errors should be generated?

> Note that in your example, it's complaining about 000002-delta, but you probably only had one snapshot, right? (It probably also complained about 000001 earlier, which did exist.)

Yes, only one snapshot.

> Mar 3 22:37:00 esx1 vmkernel: 13:02:01:31.632 cpu3:1103)DevFS: 2222: Unable to find device: 2504a-DFM-000002-delta.vmdk

Just to confirm - you see these same errors too when you try committing a snapshot?

Cheers,

Matt Kilham

titaniumlegs · ‎03-03-2009

Yeah, errors when there's nothing really wrong are misleading, and that's why I put "normal" in quotes.

I see the same thing you do. I tried it with 1 snapshot twice, and 2 snapshots as well, just to confirm that most of the complaints were for snapshot (n+1).

Just for you, I tried an unpatched ESX server with a VM on NFS and a VM on VMFS via FC and got the exact same behaviour with both VMs.

This is all ESX 3.5 u3.

Peter

mattjk · ‎03-03-2009

Thanks titanium, appreciate all your input and testing. Shall assume everything is working properly now - hopefully the errors will go away in a future release.

Cheers,

Matt Kilham
