hbowlin
Contributor

Issues with SQL Server Availability During Snapshot Removal

I use VMware VCB to back up a virtual machine running SQL Server 2005. Quiescing works fine when the snapshot is taken, but my .NET web applications throw errors at exactly the time the snapshot is being removed. The errors indicate that SQL Server is not available. So either the network stops responding during the snapshot commit, or SQL Server I/O is halted (perhaps by another quiesce?) while the snapshot is removed. Does anyone know why this would happen? Does VMware quiesce the OS during snapshot removal as well?

8 Replies
RParker
Immortal

while the snapshot is being removed. Anyone know why this would happen? Does VMware quiesce the OS during snapshot removal as well?

There is a good reason why this happens. For one, SQL Server is very sensitive to I/O latency. Snapshots are also inherently time-consuming and disk-intensive.

So first you need to find out what disks and RAID configuration you have. Committing a snapshot needs a LOT of disk I/O, and if you don't have SAS disks or RAID 10 you will experience brief timeouts because the VM simply can't respond. This is normal.

If your VM is 100 GB and there is a LOT of activity during the backup (which there is with SQL), then over a backup window of 2 hours or more that's a LOT of changes, and the snapshot grows with every one of them. Committing those changes means merging the snapshot back into the original disk, hence the timeout. So I am not surprised.

I have a laundry list of reasons why it's a bad idea to Virtualize SQL, and this is one more....
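
To put rough numbers on the growth-then-merge reasoning above, here is a minimal back-of-the-envelope sketch. The function names, the write rate, and the merge throughput are illustrative assumptions, not measurements from this environment, and the estimate treats every write as landing on a new block, so it is an upper bound.

```python
def estimate_delta_gb(write_rate_mb_per_s: float, backup_window_hours: float) -> float:
    """Rough upper bound on snapshot delta growth, in GB.

    Assumes every write touches a new block; rewrites of the same
    block make the real delta smaller.
    """
    seconds = backup_window_hours * 3600
    return write_rate_mb_per_s * seconds / 1024


def estimate_merge_minutes(delta_gb: float, merge_rate_mb_per_s: float) -> float:
    """Minutes needed to copy the delta back into the base disk."""
    return delta_gb * 1024 / merge_rate_mb_per_s / 60


if __name__ == "__main__":
    # Hypothetical figures: 2 MB/s of sustained writes over a 2 hour backup.
    delta = estimate_delta_gb(write_rate_mb_per_s=2.0, backup_window_hours=2.0)
    print(f"delta ~ {delta:.1f} GB")
    print(f"merge ~ {estimate_merge_minutes(delta, merge_rate_mb_per_s=40.0):.1f} min")
```

The merge figure is the total commit time, not the length of the pause the guest sees; it only illustrates why a larger delta means more disk work at removal time.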

hbowlin
Contributor

I see your point; however, I don't see any delayed write failure entries in the logs, which is what I would expect to see if this were the case.

RParker
Immortal

I see your point; however, I don't see any delayed write failure entries in the logs, which is what I would expect to see if this were the case.

Those logs are inside the SAME VM that the host cannot service because of heavy disk I/O, right? Of course you can't see anything in the logs: the VM itself is having the problem, not the OS. If I were to yank the battery and power from your laptop, it would shut off. How many log entries would you expect to see in Windows for that time? None, because the OS has no clue the machine was turned off.

That's what we are talking about. If the VM is unresponsive at the hardware level, the OS inside it is also unable to detect or log any problems during that period.

Or, to put it another way, the VM is 'frozen' in time, so to speak.

Chuck8773
Hot Shot

We virtualize SQL and Exchange with much success. A few tips. When a snapshot is removed, the VM is effectively paused during the last stage of the disk merge. That is why the OS doesn't log anything for that period: it is paused and doesn't know that time is passing.

You can check the VMDK mode. If you set a disk to Independent Persistent mode, snapshots of that disk cannot be taken. If the VM has several disks (system, data, and possibly logs), you can configure it so that snapshots only affect the system disk. That limits the size of your snapshots to the changes on the system disk. It also restricts your ability to back up the data through something like VCB, as you will not be able to create snapshots of that VMDK. Often you never want to take snapshots of the data VMDK anyway, since a "revert to snapshot" would throw away the data gathered between the time the snapshot was taken and the time you revert. Think Exchange.
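
As a way to check the disk mode described above without clicking through each VM, here is a small sketch that reads the text of a .vmx file and reports which disks are excluded from snapshots. It assumes the mode is stored as a `scsiX:Y.mode` entry with a value such as `independent-persistent`; the sample file contents and disk names are hypothetical, and the mode itself would normally be set through the disk's settings in the VI Client rather than by hand.

```python
import re

# Matches lines such as: scsi0:1.mode = "independent-persistent"
_DISK_KEY = re.compile(r'^\s*((?:scsi|ide)\d+:\d+)\.(\w+)\s*=\s*"(.*)"\s*$')


def disk_modes(vmx_text: str) -> dict:
    """Map each virtual device (e.g. 'scsi0:1') to its file name and mode."""
    disks = {}
    for line in vmx_text.splitlines():
        m = _DISK_KEY.match(line)
        if not m:
            continue
        device, key, value = m.groups()
        if key in ("fileName", "mode"):
            disks.setdefault(device, {})[key] = value
    return disks


def snapshot_excluded(disks: dict) -> list:
    """Devices whose mode starts with 'independent' are skipped by snapshots."""
    return sorted(d for d, props in disks.items()
                  if props.get("mode", "").startswith("independent"))


# Hypothetical .vmx excerpt: a system disk plus an independent data disk.
SAMPLE_VMX = '''
scsi0:0.present = "true"
scsi0:0.fileName = "sqlvm.vmdk"
scsi0:1.present = "true"
scsi0:1.fileName = "sqlvm_data.vmdk"
scsi0:1.mode = "independent-persistent"
'''

if __name__ == "__main__":
    print(snapshot_excluded(disk_modes(SAMPLE_VMX)))
```

A disk with no mode entry is treated here as snapshot-capable, which matches the default behaviour as far as I know.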

What we currently do is this: a single VMDK in the VM holds the system OS, and the VM connects to SAN volumes over software iSCSI from within the guest. Snapshots of the VM do not contain data, so they stay small and are fast to merge. Snapshots of the SAN volume do not contain the system OS.

We may change how this is done in ESX 4, as we will be able to use MPIO to the ESX volumes. ESX 3.5 limits LUN bandwidth to 1 Gbps over iSCSI. By going to the SAN from within the VM we can guarantee 2 Gbps from the VM to its SAN data volume.

Charles Killmer, VCP4 If you found this or other information useful, please consider awarding points for "Correct" or "Helpful".
hbowlin
Contributor

I think this is exactly what we are facing. And yes, RParker, I am with you. So basically, during the last part of the snapshot commit, VMware pauses the VM so that the disk can be fully merged, and what I hear you saying is that this pause is proportional to the size of the snapshot delta disk. So my next question is this: has an update been released recently that would make this pause longer than normal? I haven't had a problem with this in the past.

Chuck8773
Hot Shot

I don't know of any patch that would cause snapshot removal to pause the VM for longer. 3.5 U2 introduced VSS components into the VM, but that should only matter when a snapshot is created. Creating a snapshot that includes the VM's memory also pauses the VM considerably longer, but again that only affects snapshot creation, not removal.

hbowlin
Contributor

It looks as if the problem is due to multiple snapshots being present on the VM. Unknown to me, a snapshot was taken on 6/15 and I hadn't seen it. That is exactly when the problem started happening, so it seems logical that this is the cause.
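
For anyone wanting to catch a forgotten snapshot like this one earlier, here is a minimal sketch that scans a listing of the VM's datastore folder for snapshot disk descriptors. It assumes the usual `<disk>-000001.vmdk` naming for snapshot disks; the file names below are hypothetical, and the listing itself would have to come from the datastore browser or the service console.

```python
import re

# Snapshot disk descriptors are assumed to look like "<disk>-000001.vmdk".
_SNAP_DESCRIPTOR = re.compile(r"^(?P<disk>.+)-(?P<level>\d{6})\.vmdk$")


def snapshot_levels(filenames):
    """Return {base disk name: sorted list of snapshot numbers found}."""
    levels = {}
    for name in filenames:
        m = _SNAP_DESCRIPTOR.match(name)
        if m:
            levels.setdefault(m.group("disk"), []).append(int(m.group("level")))
    return {disk: sorted(found) for disk, found in levels.items()}


def has_stacked_snapshots(filenames):
    """True when any disk has more than one snapshot descriptor."""
    return any(len(found) > 1 for found in snapshot_levels(filenames).values())


# Hypothetical folder listing: a base disk with two snapshots on top of it.
LISTING = [
    "sqlvm.vmdk",
    "sqlvm-flat.vmdk",
    "sqlvm-000001.vmdk",
    "sqlvm-000001-delta.vmdk",
    "sqlvm-000002.vmdk",
    "sqlvm-000002-delta.vmdk",
]

if __name__ == "__main__":
    print(snapshot_levels(LISTING))
    print(has_stacked_snapshots(LISTING))
```

Running a check like this before the nightly VCB job would flag a leftover snapshot before the backup adds another one on top of it.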

Chuck8773
Hot Shot

That would do it. Glad you found it.

Thanks
