Microsoft does not support snapshot type backups for Domain Controllers - http://support.microsoft.com/kb/888794.
In particular, Active Directory does not support any method that restores a snapshot of the operating system or the volume the operating system resides on. This kind of method causes an update sequence number (USN) rollback. When a USN rollback occurs, the replication partners of the incorrectly restored domain controller may have inconsistent objects in their Active Directory databases. In this situation, you cannot make these objects consistent.
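For anyone wondering why a USN rollback cannot be repaired afterwards, here is a toy model in Python. It is only a sketch of the idea, not how Active Directory is implemented: the class names, the single watermark per partner, and the object names are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DomainController:
    """Toy DC: every local write gets the next update sequence number."""
    usn: int = 0
    changes: dict = field(default_factory=dict)  # usn -> object name

    def write(self, obj):
        self.usn += 1
        self.changes[self.usn] = obj

    def changes_after(self, watermark):
        return {u: o for u, o in self.changes.items() if u > watermark}

    def snapshot(self):
        return self.usn, dict(self.changes)

    def restore(self, snap):
        # A snapshot restore silently rewinds the USN counter.
        self.usn, self.changes = snap[0], dict(snap[1])


@dataclass
class Partner:
    """Toy replication partner: remembers the highest USN it has pulled."""
    watermark: int = 0
    objects: set = field(default_factory=set)

    def pull(self, source):
        new = source.changes_after(self.watermark)
        self.objects.update(new.values())
        if new:
            self.watermark = max(new)


def simulate_rollback():
    dc, partner = DomainController(), Partner()
    for name in ("user0", "user1", "user2"):
        dc.write(name)
    snap = dc.snapshot()          # snapshot taken at USN 3
    dc.write("user3")
    dc.write("user4")
    partner.pull(dc)              # partner has now seen USNs 1..5
    dc.restore(snap)              # DC rewinds to USN 3
    dc.write("printer1")          # reuses USN 4
    dc.write("printer2")          # reuses USN 5
    partner.pull(dc)              # partner asks for USN > 5 and gets nothing
    on_dc = set(dc.changes.values())
    return on_dc - partner.objects, partner.objects - on_dc


missing_on_partner, missing_on_dc = simulate_rollback()
print("Never replicated out:", sorted(missing_on_partner))
print("Lost on restored DC:", sorted(missing_on_dc))
```

In this model the partner never asks for the reused USNs again, so the two sides stay different with no error raised on either one.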
For apps like SQL Server, you should ideally do something to quiesce the file system. That might involve stopping SQL Server just before you create the snapshot and then restarting it after the snapshot is created. That will get the DB files into a consistent state before you start to back up the vmdk. Otherwise it's a bit like pulling the plug on the server. See http://communities.vmware.com/thread/115035. If the SQL DBs are critical, you might consider using the native SQL backup tools to back up the DB and transaction logs to a network drive, separate vmdk, etc.
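A rough sketch of the stop, snapshot, restart sequence in Python, in case it helps. Assumptions: the service is the default SQL Server instance (`MSSQLSERVER`), `net stop` / `net start` run on the guest, and `take_snapshot` is a hypothetical callable that you supply for whatever actually triggers the snapshot. Dependent services (SQL Server Agent, for example) may need extra handling that is not shown here.

```python
import subprocess
from contextlib import contextmanager


def run_command(args):
    """Run a command and raise if it exits with a non-zero status."""
    subprocess.run(args, check=True)


@contextmanager
def service_stopped(service, run=run_command):
    """Stop a Windows service for the duration of the block, then restart it."""
    run(["net", "stop", service])
    try:
        yield
    finally:
        # Restart even if the snapshot step fails, so the DB is not left down.
        run(["net", "start", service])


def snapshot_with_service_stopped(service, take_snapshot, run=run_command):
    """take_snapshot is a hypothetical callable supplied by the caller."""
    with service_stopped(service, run):
        take_snapshot()
```

Usage would be something like `snapshot_with_service_stopped("MSSQLSERVER", my_snapshot_function)`. The `run` parameter is only there so the sequence can be dry-run without touching a real service.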
While it is true that snapshots on databases are not a best practice, this should work without corrupting the system because in theory the system is suspended at a point in time.
Now if you were restoring to that point in time, I would say that all bets are off, because transactions that are in progress may or may not be complete.
To protect Active Directory, simply add an NTBackup of AD before the snapshot.
I would start looking at one of three possible tracks here.
1) It could be a bug in the snapshot code; the paint is still wet on ESX 3.5. (Snapshots are quite mature, so this is not very likely.)
2) You may have a hardware/firmware problem. (SAS firmware and drivers are not fully matured, very very likely)
3) Resource starvation, too much I/O at the time of snapshots. (Seen it happen but not very likely)
Hope this helps.
BTW I have been snapshotting DCs and DBs every day for 2.5 years and have never seen a corruption to date. (Hmmm, now that I said that ....... I hope I don't eat my words)
Thanks for your help - I think I may have found the cause of the problem. Our ESX Server is currently massively under-utilised, so I/O contention is probably not an issue, but the text: "The supplied user buffer is not valid for the requested operation" in the event description got me wondering if some other system event was causing the IO operation to fail. Looking through the event log, I noticed that at the exact same time as the NTDS error occurred, the following error was logged in the System log:
Event Type: Information
Event Source: LGTO_Sync
Event Category: None
Event ID: 1
Time: 6:14:24 AM
The description for Event ID ( 1 ) in Source ( LGTO_Sync ) cannot be found. The local computer may not have the necessary registry information or message DLL files to display messages from a remote computer. You may be able to use the /AUXSOURCE= flag to retrieve this description; see Help and Support for details. The following information is part of the event: , Sync Stop done.
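Matching the two log entries by timestamp can be automated. Below is a minimal Python sketch under some assumptions: the events have already been exported into simple records, the `Event` fields and the five-second window are my own choices, and the date and the NTDS record in the example are placeholders (only the 6:14:24 AM time and the LGTO_Sync source come from the log above).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    log: str        # which event log the record came from
    source: str     # event source name
    time: datetime  # when the event was logged


def events_near(target, events, window=timedelta(seconds=5)):
    """Return the other events logged within `window` of the target event."""
    return [
        e for e in events
        if e is not target and abs(e.time - target.time) <= window
    ]


# Placeholder date; only the time of day is taken from the event above.
ntds_error = Event("Directory Service", "NTDS", datetime(2008, 1, 1, 6, 14, 24))
all_events = [
    ntds_error,
    Event("System", "LGTO_Sync", datetime(2008, 1, 1, 6, 14, 24)),
    Event("System", "SomeOtherSource", datetime(2008, 1, 1, 7, 30, 0)),
]

for event in events_near(ntds_error, all_events):
    print(event.log, event.source, event.time.time())
```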
LGTO_Sync appears to be the VMware Tools file system sync driver, which (if my understanding is correct) is supposed to quiesce the filesystem before the snapshot is taken. After a bit of googling, it looks like this driver is responsible for quite a bit of grief where databases are concerned:
Here someone is having exactly the same problems as me:
Here it would appear the Sync driver is causing an Oracle database dismount:
And here's a VM KB article detailing issues with the sync driver and MS Exchange:
So it would seem that VMware Tools Sync driver + Databases = Trouble. Effective immediately I'm removing the sync driver from all our VMs which host databases of any type.
Could this be described as a bug, or is it just bad practice to install the filesystem sync driver in VMs which host databases?
A quick "me too" - I had a similar problem with my domain controller VMs, and removed the sync driver from those VMs. I haven't had a problem with them since, and wholeheartedly recommend it for DC VMs. I haven't had such problems yet with the few SQL VMs I have, but I would think it to be a good idea as well if you have any concerns at all about it.
I'll also toss in my own "me too". I've removed the sync driver by default from all of my VMs at this point ... they snap much, much faster that way anyway. Ironically, the sync driver is supposed to prevent that problem (basically verifying all queued writes have occurred before the final pause), but it seems to cause more trouble than it solves in my case.
In an Active Directory environment I would recommend using the combination of an NTBackup followed by a file-level VCB backup.
Aside from the quiescing issues mentioned above, restoring a domain controller from a VCB snapshot could result in corruption within AD. Corruption could occur when the domain controller appears back in your forest with a clock time totally different from the other domain controllers.
We have set up a nightly NTBackup on each domain controller, and this gets backed up using a VCB file-level backup. The AD restore process is then very straightforward and well documented/supported by Microsoft.
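For reference, here is a small Python sketch of how the nightly NTBackup command line could be put together. The `ntbackup backup systemstate /J ... /F ...` form is the standard command-line syntax. The folder, host name, file naming scheme and job name are hypothetical placeholders, and scheduling the job and pointing VCB at the folder are not shown.

```python
import ntpath
from datetime import date


def backup_file_path(folder, hostname, day):
    """Build a dated .bkf path (naming scheme is an assumption)."""
    filename = f"{hostname}-systemstate-{day:%Y%m%d}.bkf"
    return ntpath.join(folder, filename)


def ntbackup_systemstate_command(backup_file, job_name="AD system state"):
    """Argument list for a system state backup written to a .bkf file."""
    return ["ntbackup", "backup", "systemstate", "/J", job_name, "/F", backup_file]


# Hypothetical folder and host name, for illustration only.
target = backup_file_path(r"D:\Backups", "DC1", date.today())
print(" ".join(ntbackup_systemstate_command(target)))
```

The resulting .bkf file sits on the DC's disk, where the file-level VCB job can then pick it up.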
Hope that helps.
Yes, at least with versions 3 and above of ESX Server, the sync driver is included as part of the default VMware Tools install. Following the problems I had with Active Directory, I don't really trust it anymore and make a point of uninstalling it from any VM that will be running a database of any description.
I mentioned yesterday that I had never even heard of this sync driver prior to reading this thread. I did some searches on VMware's support site and really didn't pull up anything.
Is there a document from VMware regarding best practices for snapshotting VMs that explains this and other "gotchas" we should know about? If not, how would I have even known about this?
I have not seen any official KB either. I would tend to think that if this were a serious issue it would be much more widespread. The corruption is more likely caused by something other than the mere presence of the LGTO_Sync driver.
I would like to look at it more closely in the lab.
And if you turn it off you have another more serious issue.
You will not be crash consistent. Then you will see DB corruption at the time you need to recover the VM. Ugly.
Best bet: run an NTBackup of AD locally and then snapshot it.
My understanding of this driver is that it was made to directly address the fact that you may snap at a time when not all writes have been committed to disk, so it does a sort of "pause" of the VM to allow it to commit all writes before you snap it. So yes, it is supposed to address that and keep you from having a non-crash-consistent backup.
However, from my experience, unless you're using some odd DB this is a non-issue anyway, as databases already have built-in resilience to such problems and have had for a long time. Not only that, but the length of time the entire OS is "paused" seems to cause more problems than it solves (from personal experience at least), so I've resorted to removing it in most cases.
In the context of AD, I'm not worried. I don't run only one AD server, so in the event of a localized failure I can easily force FSMO onto one of the other DCs, deploy from template, run DCPromo, and be back up in 20 minutes. I wouldn't even bother with a restore. If the problem is widespread and I am forced to do an authoritative restore ... having an inconsistent backup that forces it to be another 15 minutes older (as it rolls back the change logs) is the least of my worries.