VMware Cloud Community
raoulst
Contributor

100% CPU of one of 2 vCPUs hanging VM after snapshot removal

Hi,

One of our Windows Server VMs went to 100% CPU usage after a VCB snapshot was removed. The machine was also completely unresponsive and the VMware Tools status changed to "not running", so we had to reset the VM via the reset button. Since we have a log that gets an entry every 5 seconds, we know that the machine became unresponsive about 1 second after the snapshot removal completed. The VM runs Server 2003 as an Active Directory domain controller and also runs a SQL server.

Any idea what might have caused one process to go to 100% CPU usage after a snapshot removal, and whether there is a way to prevent this in the future?

raoulst

8 Replies
kjb007
Immortal

There have been numerous threads about VCB on a server running SQL, Exchange, or AD. When the snapshot is removed, the delta data is merged back into the running disk, which may have caused an I/O spike and/or caused SQL to have problems. The safest approach when backing up SQL Server database VMs is either to stop the SQL services or to pause I/O via SQL commands before the backup, and to re-enable them afterwards.
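As a rough illustration of the "stop the services before, start them after" approach, here is a minimal Python sketch. It assumes a default SQL Server instance whose Windows service is named MSSQLSERVER; the phase names and the `run_hook` helper are hypothetical, and how the script gets hooked into the backup job is not covered here. It defaults to a dry run that only returns the command it would execute.

```python
import subprocess

# Hypothetical mapping of backup phases to service actions.
PHASE_TO_ACTION = {"pre-freeze": "stop", "post-thaw": "start"}


def service_command(action, service="MSSQLSERVER"):
    """Build the Windows 'net' command that stops or starts a service."""
    if action not in ("stop", "start"):
        raise ValueError("action must be 'stop' or 'start'")
    return ["net", action, service]


def run_hook(phase, service="MSSQLSERVER", dry_run=True):
    """Return the command for the given phase; execute it unless dry_run is set.

    Assumption: dependent services (for example SQL Server Agent) have
    already been dealt with, otherwise stopping the service may not succeed.
    """
    cmd = service_command(PHASE_TO_ACTION[phase], service)
    if not dry_run:
        # check=True raises CalledProcessError if the service command fails.
        subprocess.run(cmd, check=True)
    return cmd


print(run_hook("pre-freeze"))
print(run_hook("post-thaw"))
```

Stopping the service makes the database unavailable for the duration of the snapshot, so this trades availability for a quiet disk; pausing I/O from within SQL Server would be the less disruptive variant.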

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
raoulst
Contributor

Hello kjb007,

thanks a lot for your answer.

What I don't quite understand is what happens when the snapshot is created and when it is removed.

Is the whole VM halted, as when a VMotion takes place, or is just the I/O halted? If only the I/O were halted, I could imagine that removing a snapshot would be more I/O-intensive and therefore more likely to cause problems with a SQL server.

Could you also be so kind as to provide some links where problems with snapshot removals are mentioned?

regards, raoulst

kjb007
Immortal

Remember, a snapshot is of the active data that is on disk and, if memory is included, of the memory as well. So, basically, at the point of the snapshot, the filesystem is quiesced and a new file called a delta file is created. All new data to be written to disk is now written to this delta disk instead of the original VM disk file. If you include memory, the memory contents as of the time the snapshot was taken are also written to disk.

When you delete the snapshot, the contents of the delta files are merged back into the original VM disk files and the delta files are removed.
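To make the mechanics concrete, here is a toy Python model of a base disk with a single delta file. It is only a sketch of the redirect-on-write idea described above, not VMware's actual on-disk format; the class and method names are made up for illustration.

```python
class SnapshottedDisk:
    """Toy model: a base disk plus at most one delta file (illustrative only)."""

    def __init__(self, blocks):
        self.base = dict(blocks)  # block number -> data in the original disk file
        self.delta = None         # None means no snapshot is open

    def take_snapshot(self):
        # From now on, writes are redirected to the delta instead of the base.
        self.delta = {}

    def write(self, block, data):
        target = self.base if self.delta is None else self.delta
        target[block] = data

    def read(self, block):
        # The delta holds the newest copy of any block written since the snapshot.
        if self.delta is not None and block in self.delta:
            return self.delta[block]
        return self.base.get(block)

    def remove_snapshot(self):
        """Merge the delta into the base and return how many blocks were copied."""
        if self.delta is None:
            return 0
        merged = len(self.delta)
        self.base.update(self.delta)
        self.delta = None
        return merged


disk = SnapshottedDisk({0: "boot", 1: "old"})
disk.take_snapshot()
disk.write(1, "new")
disk.write(2, "log")
frozen_base = dict(disk.base)          # base is unchanged while the snapshot is open
merged_blocks = disk.remove_snapshot() # every block written since the snapshot is copied
```

The point of the model is that the work done at removal time grows with the amount of data written while the snapshot was open, and in a real VM that merge competes with the guest's own I/O.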

In the case of a database or other I/O-intensive application, the quiescing process can lead to a large amount of data being written to disk while more I/O is still occurring or in flight. This is why a database under medium or high load can have problems when a snapshot is taken, unless the I/O is frozen before and thawed after the snapshot process. For SQL, there are stored procedures that will perform those tasks and allow you to freeze I/O, take the snapshot, and write the memory data to disk more safely, so that these actions are less error-prone.

-KjB

Berniebgf
Enthusiast

As a matter of interest,

Has anyone created standard pre-freeze/post-thaw scripts for SQL 2000 / 2005 and made them available on the Net?

Thanks

Bernie

http://sanmelody.blogspot.com

raoulst
Contributor

I thought that the quiescing was done by the lgto-sync driver. Since I had problems CREATING snapshots on this VM before, I removed it and thought that the problems would be gone (which was also implied by VMware support). So are you saying that lgto-sync (if it's installed) quiesces not only the filesystem but also the disk, once at the creation and again at the removal of the snapshot?

raoulst

PS: I would still be very happy with links to other posts where snapshot removal has caused problems with SQL or AD, since the only post I could find (http://communities.vmware.com/message/806535) doesn't seem to describe exactly the same problem.

raoulst
Contributor

Hello KjB,

Thank you for the links. Unfortunately, they all seem to deal with different problems. So far, I couldn't find any hint of VMs freezing permanently at snapshot removal. VMware Support told me that there is a very rare known issue under high I/O load that has been fixed with an ESX 3.5 patch and will be fixed in ESX 3.0.3. But in that case the VMs usually only freeze for some minutes and don't have to be reset. Also, I had very low I/O at the time the error occurred.

raoulst

kjb007
Immortal

The commonality is still the database. High I/O is not necessarily required; what matters is having I/O in flight at a moment when the server is not ready for it.

-KjB
