VMware Cloud Community
GPB2012
Contributor

Remove all snapshots, 16+ hours still not done

Hello,

I know this post might go unanswered, but it has already been a very long 28-hour day and I am becoming quite worried and frustrated.

I have a VM with a 100GB disk (200GB provisioned) on an ESXi 5 server. It had one snapshot, about 16GB in size. I now know I made the mistake of letting this snapshot run for 2 months. When I tried to commit this snapshot to the parent disk, the operation started and has now been running for 16+ hours.

There is disk activity, the log gets written to every so often, and I can see that it is going through a cycle of writing to two files:

000001-delta.vmdk and 000002-delta.vmdk

Around 1.5GB gets written to the first one, then it moves on to the second and does it all over again; it has completed this cycle at least 3 times already. The progress bar in the vSphere Client shows 85% complete, and it gains a few percentage points on every cycle.
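In case it helps anyone reading this later, here is roughly how I have been watching the delta files grow from the ESXi shell. The datastore path is the one from my log below, so adjust it for your own VM; the 60-second interval is arbitrary, and this only reads file sizes and times, it does not touch the consolidation itself.

# Watch the snapshot delta files grow (read-only, run in the ESXi shell)
cd "/vmfs/volumes/4dc2024d-f7148cb4-d760-9c8e9916a9fa/Ubuntu Server 10 - Zimbra"
while true; do
    date
    ls -lh *-delta.vmdk    # size and modification time of each delta file
    sleep 60
done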

Some more information: CPU usage is high and has been since the process started. I had to shut down two other VMs for the night because they were being affected by it. There is enough space on the datastore, around 140GB free.

Here are some of the last lines written to the log:

SnapshotVMXConsolidateOnlineCB: Starting combine of scsi0:0 /vmfs/volumes/4dc2024d-f7148cb4-d760-9c8e9916a9fa/Ubuntu Server 10 - Zimbra/Ubuntu Server 10 - Zimbra-000002.vmdk. 2 links, starting from 1
2012-06-22T05:54:22.056Z| vcpu-0| SnapshotVMXConsolidateOnlineCB: Created thread 10 for scsi0:0
2012-06-22T05:54:22.079Z| SnapshotVMXCombiner| DISKLIB-LIB   : Upward Combine 2 links at 1. Need 0 MB of free space (163155 MB available)
2012-06-22T05:54:22.079Z| SnapshotVMXCombiner| DISKLIB-DDB   : "longContentID" = "9af46be9fbc154aa1ea42132f1c38dbc" (was "76214bdf1195fd40dac07a99a21e7b84")
2012-06-22T05:54:22.330Z| vcpu-1| HBACommon: First write on scsi0:0.fileName='/vmfs/volumes/4dc2024d-f7148cb4-d760-9c8e9916a9fa/Ubuntu Server 10 - Zimbra/Ubuntu Server 10 - Zimbra-000002.vmdk'
2012-06-22T05:54:22.330Z| vcpu-1| DISKLIB-DDB   : "longContentID" = "edebe0011137a528aad6303433d05e3c" (was "9af46be9fbc154aa1ea42132f1c38dbc")
2012-06-22T05:54:22.444Z| vcpu-1| DISKLIB-CHAIN : DiskChainUpdateContentID: old=0xf1c38dbc, new=0x33d05e3c (edebe0011137a528aad6303433d05e3c)
2012-06-22T05:55:22.906Z| vmx| GuestRpcSendTimedOut: message to toolbox-dnd timed out.
2012-06-22T06:38:30.376Z| vcpu-0| VIDE: (0x170) Switching active drive from 0 to 1 while command 0xa0 is in progress (status 0x58).
2012-06-22T06:38:30.377Z| vcpu-0| VIDE: IDE1:0: Aborting command a0, I/O was not in progress
2012-06-22T06:38:31.284Z| vcpu-0| VIDE: Already in check condition 02 3a 00
2012-06-22T06:38:31.302Z| vcpu-0| VIDE: Already in check condition 02 3a 00
2012-06-22T07:50:53.884Z| vcpu-0| VIDE: (0x170) Switching active drive from 0 to 1 while command 0xa0 is in progress (status 0x58).
2012-06-22T07:50:53.884Z| vcpu-0| VIDE: IDE1:0: Aborting command a0, I/O was not in progress
2012-06-22T07:50:54.874Z| vcpu-0| VIDE: Already in check condition 02 3a 00
2012-06-22T07:50:54.875Z| vcpu-0| VIDE: Already in check condition 02 3a 00
2012-06-22T09:12:33.985Z| vcpu-5| VIDE: (0x170) Switching active drive from 0 to 1 while command 0xa0 is in progress (status 0x58).
2012-06-22T09:12:33.985Z| vcpu-5| VIDE: IDE1:0: Aborting command a0, I/O was not in progress
2012-06-22T09:12:34.846Z| vcpu-5| VIDE: Already in check condition 02 3a 00
2012-06-22T09:12:37.112Z| vcpu-5| VIDE: Already in check condition 02 3a 00
2012-06-22T09:12:49.794Z| vcpu-5| VIDE: (0x170) Switching active drive from 0 to 1 while command 0xa0 is in progress (status 0x58).
2012-06-22T09:12:49.794Z| vcpu-5| VIDE: IDE1:0: Aborting command a0, I/O was not in progress
2012-06-22T09:12:50.618Z| vcpu-5| VIDE: Already in check condition 02 3a 00
2012-06-22T09:12:50.619Z| vcpu-5| VIDE: Already in check condition 02 3a 00
2012-06-22T09:15:20.652Z| vcpu-3| VIDE: (0x170) Switching active drive from 0 to 1 while command 0xa0 is in progress (status 0x58).
2012-06-22T09:15:20.652Z| vcpu-3| VIDE: IDE1:0: Aborting command a0, I/O was not in progress
2012-06-22T09:15:49.177Z| vcpu-3| VIDE: Already in check condition 02 3a 00
2012-06-22T09:15:49.262Z| vcpu-3| VIDE: Already in check condition 02 3a 00
2012-06-22T09:29:25.838Z| vcpu-0| VIDE: (0x170) Switching active drive from 0 to 1 while command 0xa0 is in progress (status 0x58).
2012-06-22T09:29:25.838Z| vcpu-0| VIDE: IDE1:0: Aborting command a0, I/O was not in progress
2012-06-22T09:29:36.733Z| vcpu-0| VIDE: Already in check condition 02 3a 00
2012-06-22T09:29:36.793Z| vcpu-0| VIDE: Already in check condition 02 3a 00

Any advice would be greatly appreciated. If anyone has gone through this ordeal, please let me know how you got out of this mess.

Thank you in advance.

GPB

GPB2012
Contributor

A small update...

I just noticed there is a file called [VM name]Snapshot75.vmsn in the VM folder. Earlier today there was a similar file, but it was numbered 72. I'm not sure if there were others before with smaller numbers.

There was only one snapshot for this VM, so where is it getting these numbers from?

GPB

GPB2012
Contributor

Another thing I have noticed is that the Recent Tasks window shows, every so often, a "Manipulate file paths" task that completes successfully.

I have seen this show up at least 6 or 7 times now.

GPB2012
Contributor

Still going (just like the Energizer Bunny). It's going to get slower now that users have started to come in to work and I had to power the other VMs back on.

Lesson learned out of all this mess: NEVER EVER leave snapshots running for too long, and do not rely on them too much, since they can be a bit unstable.
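For anyone wanting to avoid the same trap, this is the kind of check I plan to run on the host from now on. It is only a rough sketch: <Vmid> is a placeholder for the ID shown by getallvms, and these are the vim-cmd calls as I know them on ESXi 5, so double-check them on your own build.

# List VMs and see which ones still carry snapshots (run on the ESXi host)
vim-cmd vmsvc/getallvms            # note the Vmid of each VM
vim-cmd vmsvc/snapshot.get <Vmid>  # shows any snapshots on that VM
# Or simply look for delta files that have been left to grow:
find /vmfs/volumes/ -name "*-delta.vmdk"    # then check their sizes with ls -lh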

Has anyone experienced this before?  I am really amazed at how long this is taking.

By the way, the VM in question is the mail server: 10 hours of downtime yesterday, and looking at another 12 today. Good thing it is Friday!!!

GPB

GPB2012
Contributor

Believe it or not, it is still going. I already brought up a copy of the server so email is back online, but I just want to see the end of this: 25.5 hours and 88%!!!

This cannot be normal. A 16GB snapshot, a 100GB parent. What is going on?!

GPB2012
Contributor

Last update... 42 hours later, the consolidation process finished, and I am glad to say the VM works normally.

It is very strange that it took this long; the host is a less-than-a-year-old ProLiant DL360 G7 server with two Intel Xeon 2.4GHz processors, 24GB RAM, and 15K RPM drives, running 6 VMs.

Word to the wise: if you use snapshots, commit them soon... or you could be in for a very long and painful consolidation process.

GPB

munishpalmakhij
Contributor

I agree; that's why VMware recommends not keeping many snapshots, and committing/deleting them before they grow very old.

I can definitely understand your mindset. I had a similar case where a snapshot was almost 1 year old, and it took almost 48 hours to delete; luckily for me it didn't impact the VM.

a_p_
Leadership

Although I agree that snapshots should not be kept longer than really needed, the time needed to delete a 16GB snapshot should not be that long. Even without BBWC/FBWC (which makes a HUGE difference in disk performance), it should not take longer than ~1 hour to complete this job.

For HP hosts, there are basically two major things you need for good performance: the first is BBWC/FBWC for disk performance, and the second is to set "Static High Performance" in the BIOS power settings for processing performance.
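As a quick sanity check while such a consolidation is running, esxtop on the host shows whether the storage is the bottleneck. This is only a rough outline, and the latency figure is a ballpark value rather than a hard rule.

# On the ESXi host, while the snapshot removal is running:
esxtop
# press 'u' for the disk device view (or 'd' for the adapter view)
# and watch DAVG/cmd (device latency) and KAVG/cmd (kernel latency):
# DAVG staying above roughly 20-25 ms points at the array/controller
# (e.g. a missing or disabled BBWC/FBWC) as the limiting factor.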

André

munishpalmakhij
Contributor

That's true, but I never figured out why it took so long for me (I was using a Dell M910 and NetApp storage). I didn't want to cancel, as it was my Exchange server and I was afraid it might cause some impact.

delineator1980
Contributor

I am now going through this dreadful experience. I started my consolidation of VM snapshots yesterday at 6pm, and now it is 1:30pm the next day and the progress bar still shows 99%. I do not want to stop the process, since that could result in a corrupted file. This just bugs me because this is a common function in VMware, and now I look like an idiot in front of our customer.

I logged onto the ESX host and changed directories to volume\lun\vm, and when I run an ls -l command I can see that the second hard drive's file is always updated with the current time. This is the hard disk I need to get to in order to restore our customer's database, but every time I try to manipulate the .vmdk file it says it's locked.

It's going on 24 hours now; how long should this take? I just started, but I think they had snapshots going back a couple of months.
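For what it's worth, this is roughly how I am checking the lock without touching anything. The path below is just an example for my layout (the second disk of the VM); substitute your own datastore, folder, and flat file names.

# Ask VMFS who currently holds the lock on the disk (read-only)
vmkfstools -D "/vmfs/volumes/<datastore>/<vm folder>/<vm>_1-flat.vmdk"
# The owner field in the output (it also shows up in /var/log/vmkernel.log)
# ends with the MAC address of the host holding the lock; all zeros means
# no other host currently holds it.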
