This Question is Answered

25 Replies Last post: Aug 6, 2009 8:09 AM by ODOCChuck

ESX Server 3.5 - Corrupted Active Directory after taking a snapshot

Jan 28, 2008 10:27 PM

alastairc (Novice), 4 posts since Apr 23, 2006

Hi there,

I've just had an ESE database corruption, which is pretty concerning, and I'm wondering if anyone else has seen this or can advise what may be happening here:

One of our Windows Server 2003 R2 domain controllers runs as a VM guest under ESX Server 3.5 build 64607. The host hardware is an HP ProLiant DL380 G5 with 16GB RAM, running two quad-core Intel Xeon E5345 processors. Hyper-Threading is disabled, so we have a total of 8 HECs from ESX's perspective. We don't have a SAN, so all VM storage is on the locally attached disks (6 HP 15K RPM SAS drives in a RAID-5 configuration, attached to a Smart Array P400 controller).

Anyway, our domain started acting strangely one Monday morning, so I attempted to log in to our ESX-hosted domain controller, only to find it was rejecting the login credentials. After forcing a restart of the guest VM I was able to log in, and from the event logs I could see that the Active Directory database had become read-only after the following error was logged:

Event Type: Error
Event Source: NTDS ISAM
Event Category: General
Event ID: 482
Date: 20/01/2008
Time: 6:14:24 AM
User: N/A
Computer: MELDC1
Description:
NTDS (384) NTDSA: An attempt to write to the file "C:\WINDOWS\NTDS\edb.log" at offset 3230720 (0x0000000000314c00) for 512 (0x00000200) bytes failed after 0 seconds with system error 1784 (0x000006f8): "The supplied user buffer is not valid for the requested operation. ". The write operation will fail with error -1011 (0xfffffc0d). If this error persists then the file may be damaged and may need to be restored from a previous backup.
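
The numbers in that event can be sanity-checked with a little arithmetic. The sketch below is an illustration added here, not part of the original event log. It confirms that the decimal and hexadecimal values quoted in the message agree, and that the failed write was a single 512-byte block at a 512-byte-aligned offset in edb.log. Treating 512 bytes as one disk sector is an assumption.

```python
# Values copied from the NTDS ISAM event 482 above.
offset = 3230720        # byte offset into edb.log
length = 512            # bytes in the failed write
win32_error = 1784      # system error reported by Windows
ese_error = -1011       # ESE error returned to the caller

# The decimal and hex forms quoted in the event should match.
assert offset == 0x314C00
assert length == 0x200
assert win32_error == 0x6F8

# The event prints the negative ESE error as an unsigned 32-bit value.
ese_error_hex = format(ese_error & 0xFFFFFFFF, "#010x")

# Assumption: 512 bytes is one disk sector, so the offset maps to a whole block index.
block_index, remainder = divmod(offset, length)

print(ese_error_hex, block_index, remainder)
```

A remainder of zero means the write started exactly on a block boundary, which is consistent with an ordinary sequential log write rather than an obviously malformed request.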


The timing of this error coincides with a cron job which executes vcbSnapAll to back up our VMs. This leads me to suspect that taking the snapshot somehow caused a write operation to the AD database to fail. Perhaps Windows tried to write during the file system quiesce?

My concern here is that this could happen again. We also host production MS SQL Databases on the same ESX host, and I would hate for these to become corrupted in a similar fashion. Can anyone advise, is this problem likely to have been caused by vcbSnapAll taking a snapshot of the VM? If so, are there best practices for snapshotting domain controllers or database servers?

Re: ESX Server 3.5 - Corrupted Active Directory after taking a snapshot Jan 28, 2008 10:52 PM
Dave.Mishchenko (Guru, Moderator), 8,863 posts since Nov 15, 2005
Microsoft does not support snapshot type backups for Domain Controllers - http://support.microsoft.com/kb/888794.

In particular, Active Directory does not support any method that restores a snapshot of the operating system or the volume the operating system resides on. This kind of method causes an update sequence number (USN) rollback. When a USN rollback occurs, the replication partners of the incorrectly restored domain controller may have inconsistent objects in their Active Directory databases. In this situation, you cannot make these objects consistent.
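
To make the USN rollback described above concrete, here is a deliberately simplified toy model in Python. It is not how Active Directory is implemented; the class names, the single "high watermark" per partner, and the change descriptions are all assumptions made for illustration. It only shows why a partner that has already seen USN 3 will ignore new changes that reuse USN 2 after a snapshot restore.

```python
class DomainController:
    """Toy model of a DC: a local USN counter plus the changes it originated."""

    def __init__(self, name):
        self.name = name
        self.usn = 0
        self.changes = {}  # usn -> description of the change

    def write(self, description):
        self.usn += 1
        self.changes[self.usn] = description

    def snapshot(self):
        return self.usn, dict(self.changes)

    def restore(self, snap):
        # Restoring a snapshot rewinds the USN counter as well as the data.
        self.usn, self.changes = snap[0], dict(snap[1])


class Partner:
    """Toy replication partner that remembers the highest USN it has seen."""

    def __init__(self):
        self.high_watermark = 0
        self.received = []

    def pull_from(self, dc):
        # Only take changes newer than the remembered watermark.
        for usn in sorted(dc.changes):
            if usn > self.high_watermark:
                self.received.append(dc.changes[usn])
        self.high_watermark = max(self.high_watermark, dc.usn)


dc, partner = DomainController("DC1"), Partner()
dc.write("create user A")
snap = dc.snapshot()                 # snapshot taken at USN 1
dc.write("create user B")
dc.write("create user C")
partner.pull_from(dc)                # partner has now seen up to USN 3

dc.restore(snap)                     # DC is rolled back to USN 1
dc.write("create user D")            # reuses USN 2
partner.pull_from(dc)                # partner skips it: 2 is not above watermark 3

print(partner.received)
```

In this model the partner keeps users B and C, which the restored DC no longer has, and never learns about user D. That is the kind of lasting inconsistency the quoted passage warns about.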

For apps like SQL Server, you should ideally do something to quiesce the file system. That might involve stopping SQL Server just before you create the snapshot and then restarting it after the snapshot is created. That will get the DB files in a crash-consistent state before you start to back up the vmdk. Otherwise it's a bit like pulling the plug on the server. http://communities.vmware.com/thread/115035. If the SQL DBs are critical, you might consider using the native SQL backup tools to back up the DB and transaction logs to a network drive, separate vmdk, etc.
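
The stop / snapshot / restart ordering suggested above can be sketched as follows. This is a minimal Python illustration, not a tested backup script: the three callables are hypothetical stand-ins, and how the stop and start steps inside the guest would be coordinated with a snapshot taken from the ESX host is left open. The one point it demonstrates is that the restart should happen even if the snapshot step fails.

```python
from contextlib import contextmanager


@contextmanager
def service_stopped(stop, start):
    """Stop a service for the duration of the block and always restart it."""
    stop()
    try:
        yield
    finally:
        # Runs even if the snapshot step raises, so the database is not left down.
        start()


def snapshot_with_service_stopped(stop, start, take_snapshot):
    with service_stopped(stop, start):
        take_snapshot()


# Stand-ins that only record the order of operations. In practice, stop and start
# might wrap something like "net stop MSSQLSERVER" / "net start MSSQLSERVER"
# (assuming a default SQL Server instance), and take_snapshot would trigger the
# VM snapshot. Both are assumptions, not part of the original advice.
log = []
snapshot_with_service_stopped(
    stop=lambda: log.append("stop sql"),
    start=lambda: log.append("start sql"),
    take_snapshot=lambda: log.append("snapshot"),
)
print(log)
```

Passing the steps in as callables keeps the ordering logic separate from whatever mechanism actually stops the service or takes the snapshot.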
Re: ESX Server 3.5 - Corrupted Active Directory after taking a snapshot Jan 28, 2008 11:38 PM
mike.laspina (Virtuoso), 2,253 posts since May 26, 2006
Hello,

While it is true that snapshots on databases are not a best practice, this should work without corrupting the system because in theory the system is suspended at a point in time.

Now if you were restoring to that point in time I would say that all bets are off, because transactions that were in progress may or may not be complete.

To protect Active directory simply add an NTBackup of AD before the snapshot.

I would start looking at one of three possible tracks here.

1) It could be a bug in the snapshot code; the paint is still wet on ESX 3.5. (Snapshots are quite mature, so this is not very likely.)

2) You may have a hardware/firmware problem. (SAS firmware and drivers are not fully matured, very very likely)

3) Resource starvation, too much I/O at the time of snapshots. (Seen it happen but not very likely)

Hope this helps.

BTW I have been snapshotting DCs and DBs every day for 2.5 years and have never seen a corruption to date. (Hmmm now that I said that ....... I hope I don't eat my words)

Re: ESX Server 3.5 - Corrupted Active Directory after taking a snapshot Jan 29, 2008 3:22 AM
in response to: mike.laspina
alastairc (Novice), 4 posts since Apr 23, 2006
Hi Mike,

Thanks for your help - I think I may have found the cause of the problem. Our ESX Server is currently massively under-utilised, so I/O contention is probably not an issue, but the text: "The supplied user buffer is not valid for the requested operation" in the event description got me wondering if some other system event was causing the IO operation to fail. Looking through the event log, I noticed that at the exact same time as the NTDS error occurred, the following error was logged in the System log:

Event Type: Information
Event Source: LGTO_Sync
Event Category: None
Event ID: 1
Date: 20/01/2008
Time: 6:14:24 AM
User: N/A
Computer: MELDC1
Description:
The description for Event ID ( 1 ) in Source ( LGTO_Sync ) cannot be found. The local computer may not have the necessary registry information or message DLL files to display messages from a remote computer. You may be able to use the /AUXSOURCE= flag to retrieve this description; see Help and Support for details. The following information is part of the event: , Sync Stop done.
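
Matching the two events by timestamp, as done above by eye, is easy to automate when there are many log entries. The following Python sketch parses the Date and Time fields in the format shown in these events and checks whether two entries fall within a small window of each other. The one-second tolerance and the function names are assumptions chosen for illustration, and the AM/PM parsing assumes an English locale.

```python
from datetime import datetime, timedelta

EVENT_TIME_FORMAT = "%d/%m/%Y %I:%M:%S %p"   # e.g. "20/01/2008 6:14:24 AM"


def parse_event_time(date_text, time_text):
    """Combine the Date and Time fields of an event into one datetime."""
    return datetime.strptime(f"{date_text} {time_text}", EVENT_TIME_FORMAT)


def coincide(first, second, tolerance=timedelta(seconds=1)):
    """True if two events were logged within `tolerance` of each other."""
    return abs(first - second) <= tolerance


ntds_error = parse_event_time("20/01/2008", "6:14:24 AM")   # NTDS ISAM, event 482
sync_event = parse_event_time("20/01/2008", "6:14:24 AM")   # LGTO_Sync, event 1

print(coincide(ntds_error, sync_event))
```

A coincidence in time is suggestive rather than conclusive, but it is a quick way to shortlist which other events deserve a closer look.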

LGTO_Sync appears to be the VMware Tools file system sync driver, which (if my understanding is correct) is supposed to quiesce the filesystem before the snapshot is taken. After a bit of googling, it looks like this driver is responsible for quite a bit of grief where databases are concerned:

Here someone is having exactly the same problems as me:
http://supportforums.vizioncore.com/forums/thread/2472.aspx

Here it would appear the Sync driver is causing an Oracle database dismount:
http://support.p2v.net/boards/read.php?1,1163,1165

And here's a VMware KB article detailing issues with the sync driver and MS Exchange:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=5962168

So it would seem that VMware Tools Sync driver + Databases = Trouble. Effective immediately I'm removing the sync driver from all our VMs which host databases of any type.

Could this be described as a bug, or is it just bad practice to install the filesystem sync driver in VMs which host databases?
Re: ESX Server 3.5 - Corrupted Active Directory after taking a snapshot Jan 29, 2008 7:51 AM
in response to: alastairc
mike.laspina (Virtuoso), 2,253 posts since May 26, 2006
That's great info. Thanks very much!
Re: ESX Server 3.5 - Corrupted Active Directory after taking a snapshot Jan 29, 2008 9:02 AM
in response to: alastairc
DigitalVoodoo (Hot Shot), 109 posts since Mar 7, 2006
A quick "me too" - I had a similar problem with my domain controller VMs, and removed the sync driver from those VMs. I haven't had a problem with them since, and wholeheartedly recommend it for DC VMs. I haven't had such problems yet with the few SQL VMs I have, but I would think it to be a good idea as well if you have any concerns at all about it.
Re: ESX Server 3.5 - Corrupted Active Directory after taking a snapshot Jan 29, 2008 10:23 AM
in response to: DigitalVoodoo
Sangokan (Hot Shot), 142 posts since Apr 12, 2007
I had a dismounted store once in Exchange 2003 after a snapshot. Disabling the sync driver in VMware Tools also fixed it for us, as advised on the Vizioncore thread.
Re: ESX Server 3.5 - Corrupted Active Directory after taking a snapshot Jan 29, 2008 10:26 AM
in response to: DigitalVoodoo
Justin King (Enthusiast), 85 posts since Oct 26, 2006
I'll also toss in my own "me too". I've removed the sync driver by default from all of my VMs at this point ... they snap much, much faster that way anyway. Ironically, the sync driver is supposed to prevent that problem (basically verifying all queued writes have occurred before the final pause), but it seems to cause more trouble than it solves in my case.
Re: ESX Server 3.5 - Corrupted Active Directory after taking a snapshot Feb 22, 2008 1:23 PM
in response to: Justin King
mrbrown66 (Novice), 22 posts since Aug 13, 2006

In an Active Directory environment I would recommend using the combination of an NTBackup followed by a file-level VCB backup.

Aside from the quiescing issues mentioned above, restoring a domain controller from a VCB snapshot could result in corruption within AD. Corruption could occur when the domain controller reappears in your forest with a clock time totally different from that of the other domain controllers.

We have set up a nightly NTBackup on each domain controller, and this gets backed up using a VCB file-level backup. The AD restore process is then very straightforward and well documented/supported by Microsoft.
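
One thing a two-stage scheme like this depends on is that the nightly NTBackup file is actually present and recent when the file-level backup picks it up. The check below is a hypothetical addition, not something described in the post: a small Python sketch that treats a backup file as usable only if it exists, is non-empty, and was modified within the last 24 hours. The file name and the 24-hour threshold are assumptions.

```python
import os
import tempfile
import time


def backup_is_fresh(path, max_age_hours=24, now=None):
    """Return True if `path` exists, is non-empty, and was modified recently."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    now = time.time() if now is None else now
    age_seconds = now - os.path.getmtime(path)
    return age_seconds <= max_age_hours * 3600


# Demonstration against a throwaway file; "systemstate.bkf" is a hypothetical name.
with tempfile.TemporaryDirectory() as folder:
    bkf = os.path.join(folder, "systemstate.bkf")
    with open(bkf, "wb") as handle:
        handle.write(b"placeholder")
    fresh_now = backup_is_fresh(bkf)
    fresh_in_two_days = backup_is_fresh(bkf, now=time.time() + 48 * 3600)

print(fresh_now, fresh_in_two_days)
```

A check like this could run just before the file-level backup so that a silently failed NTBackup job is noticed the same night rather than at restore time.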

Hope that helps.


Re: ESX Server 3.5 - Corrupted Active Directory after taking a snapshot Feb 24, 2008 10:32 AM
in response to: mrbrown66
JRink (Hot Shot), 150 posts since Jan 10, 2007
Is the Sync Driver automatically installed when you do a typical installation of the VMTools on a Windows server?
I never heard about this driver before...
Re: ESX Server 3.5 - Corrupted Active Directory after taking a snapshot Feb 24, 2008 2:08 PM
in response to: JRink
alastairc (Novice), 4 posts since Apr 23, 2006
Yes, at least with versions 3 and above of ESX Server, the sync driver is included as part of the default VMware Tools install. Following the problems I had with Active Directory, I don't really trust it anymore and make a point of uninstalling it from any VM that will be running a database of any description.
Re: ESX Server 3.5 - Corrupted Active Directory after taking a snapshot Feb 24, 2008 2:26 PM
in response to: alastairc
joergriether (Hot Shot), 185 posts since Sep 17, 2006
It's generally a good idea to disable the sync driver inside any AD domain controller and inside any Exchange server.
When it comes to SQL, I have never had any problems with the sync driver when snapshots were taken. Strange, but that's what I experienced.

best regards
Joerg
Re: ESX Server 3.5 - Corrupted Active Directory after taking a snapshot Feb 25, 2008 8:54 AM
in response to: joergriether
JRink (Hot Shot), 150 posts since Jan 10, 2007

I mentioned yesterday that I'd never even heard of this sync driver prior to reading this thread. I did some searches on VMware's support site and really didn't pull up anything.

Is there a document from VMware on best practices for snapshotting VMs that explains this and other "gotchas" we should know about? If not, how would I have even known about this?

Re: ESX Server 3.5 - Corrupted Active Directory after taking a snapshot Feb 25, 2008 9:38 AM
in response to: JRink
mike.laspina (Virtuoso), 2,253 posts since May 26, 2006

I have not seen any official KB either. I would tend to think that if this were a serious issue it would be much more widespread. The corruption is more likely caused by something beyond the mere presence of the LGTO_Sync driver.

I would like to look at it more closely in the LAB.

And if you turn it off you have another more serious issue.

http://supportforums.vizioncore.com/forums/permalink/943/986/ShowThread.aspx#986

You will not be crash consistent. Then you will see DB corruption at the time you need to recover the VM. Ugly.

Best bet: NTBackup AD locally and then snapshot it.

Re: ESX Server 3.5 - Corrupted Active Directory after taking a snapshot Feb 25, 2008 10:25 AM
in response to: mike.laspina
Justin King (Enthusiast), 85 posts since Oct 26, 2006
My understanding of this driver is that it was made to directly address the fact that you may snap at a time when not all writes have been committed to disk, so it does a sort of "pause" of the VM to allow it to commit all writes before you snap it. So yes, it is supposed to keep you from having a non-crash-consistent backup.
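
The behaviour described above can be pictured with a toy model. This Python sketch is only an illustration of the idea, not of how the VMware Tools sync driver or a real disk stack works: writes sit in a buffer until flushed, and a snapshot only captures what has reached the disk. The class and its `sync_first` flag are invented for this example.

```python
class ToyDisk:
    """Toy model: writes sit in a memory buffer until they are flushed to disk."""

    def __init__(self):
        self.on_disk = []
        self.pending = []

    def write(self, block):
        self.pending.append(block)

    def flush(self):
        self.on_disk.extend(self.pending)
        self.pending.clear()

    def snapshot(self, sync_first):
        if sync_first:
            self.flush()            # what a sync driver is meant to guarantee
        return list(self.on_disk)   # a snapshot only sees what reached the disk


disk = ToyDisk()
disk.write("log record 1")
disk.flush()
disk.write("log record 2")          # still buffered when the snapshot fires

without_sync = disk.snapshot(sync_first=False)
with_sync = disk.snapshot(sync_first=True)
print(without_sync, with_sync)
```

Without the flush, the snapshot is missing the buffered record; with it, the snapshot contains both.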

However, from my experience, unless you're using some odd DB this is a non-issue anyway, as databases have had built-in resilience to such problems for a long time. Not only that, but the length of time the entire OS is "paused" seems to cause more problems than it solves (from personal experience at least), so I've resorted to removing it in most cases.
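
One common form of the built-in resilience mentioned above is a write-ahead transaction log that is replayed on startup. The Python sketch below is a heavily simplified illustration of that idea, with an invented record format; it is not the recovery algorithm of ESE, SQL Server, or any other product. Changes from committed transactions are applied, and a transaction that was still in flight when the snapshot was taken is discarded.

```python
def recover(log_records):
    """Replay a write-ahead log, keeping only transactions that committed."""
    pending = {}     # txn id -> list of (key, value) not yet committed
    state = {}
    for record in log_records:
        kind, txn = record[0], record[1]
        if kind == "set":
            _, _, key, value = record
            pending.setdefault(txn, []).append((key, value))
        elif kind == "commit":
            for key, value in pending.pop(txn, []):
                state[key] = value
    # Anything left in `pending` was in flight at the crash and is discarded.
    return state


# Log as it might look after a snapshot taken mid-transaction: txn 2 never committed.
log_at_snapshot = [
    ("set", 1, "alice", "enabled"),
    ("commit", 1),
    ("set", 2, "bob", "enabled"),
]
print(recover(log_at_snapshot))
```

This kind of recovery still relies on the log records themselves having reached the disk intact and in order.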

In the context of AD, I'm not worried. I don't run only one AD server, so in the event of a localized failure I can easily force FSMO onto one of the other DCs, deploy from template, run DCPromo, and be back up in 20 minutes. I wouldn't even bother with a restore. If the problem is widespread and I am forced to do an authoritative restore ... having an inconsistent backup that forces it to be another 15 minutes older (as it rolls back the change logs) is the least of my worries.
