I've a Windows 2008 R2 virtual server running inside VMware ESXi 5.0. The server is also running Microsoft SQL Server 2008 R2. Backups are being done by Acronis vmProtect 7.0.
I'm experiencing a problem where after a backup is completed by Acronis, all the clients that are connected to the database(s) on the SQL server lose their connections and require a restart. An example timeline goes as follows (all times taken from the "Recent Tasks" window in vSphere):
It is at the instant the snapshot removal completes (07:49:08) that the connection to the SQL server is momentarily lost (I'm going to call this a "hiccup" in the rest of this narrative).
I have a couple of other Windows 2008 R2 virtual servers running SQL Server 2008 R2 that are backed up using the same Acronis application, but they don't experience any "hiccup". I've tried moving the virtual machine to a different host, changing the time that the backup runs, and moving the virtual drives to a different datastore, but nothing made any difference.
The server in question is mission-critical - we're a Public Safety Answering Point (PSAP) and must be up 24/7/365. Having the emergency dispatch client lose its connection to the database in the middle of a 911 call is unacceptable - but it has happened.
Anyone have any ideas as to what might be causing the "hiccup"?
Message was edited by: John Burski
This question is, for all intents and purposes, answered (I just can't figure out how to mark it as "answered").
I have seen this behavior with SnapManager for SQL and VSS, especially when the number of VMs per volume exceeds best practice. I also used to see it with vRanger. The snapshot takes too long while the guest OS is quiesced, so by the time it is done, the gap in SQL communication with the guest is longer than what is normally acceptable. I don't have the exact numbers handy for what triggers the disconnect behavior, but when this is the cause, the error message in the SQL logs pretty much spells it out.
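If you want to measure how long the gap actually is from the client side, a rough sketch like the following could help. It only checks that the TCP port answers (1433 is the SQL Server default; yours may differ), not that a SQL session survives, and the function names `probe` and `find_gaps` are made up for illustration.

```python
import socket
from typing import List, Sequence, Tuple


def probe(host: str, port: int = 1433, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def find_gaps(ok_times: Sequence[float], max_gap: float) -> List[Tuple[float, float]]:
    """Return (start, length) for each interval between successful probes
    that is longer than max_gap seconds."""
    gaps = []
    for prev, cur in zip(ok_times, ok_times[1:]):
        if cur - prev > max_gap:
            gaps.append((prev, cur - prev))
    return gaps
```

Call `probe` in a loop around the backup window, record `time.time()` for every success, then pass that list to `find_gaps` and compare the gap start times with the snapshot removal time shown in vSphere.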
I've spent extensive time working on this issue. We began using SQL mirroring in a vSphere 5.1 environment, also with a new backup product that utilizes snapshots. Of dozens of SQL mirroring paired servers, a few random ones would generate mirroring errors during snapshot removal, and occasionally the mirror would even fail over. I've gone through both VMware and Microsoft support cases without a great solution or workaround (besides not using SQL mirroring or using traditional in-guest backups).
The root cause seems to center around the VM being "stunned" during the final stage of the snapshot removal, probably when I/O writes are transferred from the delta disk to the base disk. The bottom section of VMware KB: A snapshot removal can stop a virtual machine for long time more or less describes the behavior, and it shows how you can find the stun times in the vmkernel logs. I'm not talking about long stun times, though. Stuns of a few seconds, occurring a couple of times (presumably once per VM disk as they're consolidated), have triggered this behavior.
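For anyone who wants to pull the stun times out of a log in bulk, here is a minimal sketch. It assumes the entries look like `... Checkpoint_Unstun: vm stopped for <N> us` with a timestamp followed by `|` at the start of the line; that line format is my assumption and is not quoted from this thread, so check it against what the KB shows for your version and adjust the pattern. The name `find_stuns` is just illustrative.

```python
import re
from typing import Iterable, List, Tuple

# Assumed line shape (hypothetical sample, adjust to your logs):
# 2013-03-28T17:37:34.842Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 2500000 us
STUN_RE = re.compile(
    r"^(?P<ts>\S+?)\|.*Checkpoint_Unstun: vm stopped for (?P<us>\d+) us"
)


def find_stuns(lines: Iterable[str], min_seconds: float = 0.0) -> List[Tuple[str, float]]:
    """Return (timestamp, stun_seconds) for each matching log line whose
    stun duration is at least min_seconds."""
    results = []
    for line in lines:
        match = STUN_RE.search(line)
        if match:
            seconds = int(match.group("us")) / 1_000_000  # microseconds to seconds
            if seconds >= min_seconds:
                results.append((match.group("ts"), seconds))
    return results
```

Open the log file and pass the file object straight in, for example `find_stuns(open(path, errors="replace"), min_seconds=1.0)`, then line the timestamps up against the times of the SQL disconnects or mirroring errors.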
This problem occurred for us on new, relatively empty hosts connected to an EMC VMAX array over Fibre Channel. Disk performance and latency are monitored and very healthy.
After fighting with this internally and with the various support organizations for a few months, we gave up and reverted to traditional in-guest agent backups for prod SQL servers.