ESX Server 3.5 - Corrupted Active Directory after taking a snapshot

alastairc · ‎01-28-2008

Hi there,

I've just had an ESE database corruption, which is pretty concerning, and I'm wondering if anyone else has seen this or can advise on what may be happening here:

One of our Windows Server 2003 R2 Domain Controllers runs as a VM guest under ESX Server 3.5 build 64607. The host hardware is an HP ProLiant DL380 G5 with 16GB RAM, running two quad-core Intel Xeon E5345 CPUs. Hyper-Threading is disabled, so we have a total of 8 HECs from ESX's perspective. We don't have a SAN, so all VM storage is on the locally attached disks (6 HP 15K RPM SAS drives in a RAID-5 configuration, attached to a Smart Array P400 controller).

Anyway, our domain started acting strangely one Monday morning, so I attempted to log in to our ESX-hosted domain controller, only to find it was rejecting the login credentials. After forcing a restart of the guest VM I was able to log in, and from the event logs I could see that the Active Directory database had become read-only after the following error was logged:

Event Type: Error

Event Source: NTDS ISAM

Event Category: General

Event ID: 482

Date: 20/01/2008

Time: 6:14:24 AM

User: N/A

Computer: MELDC1

Description:

NTDS (384) NTDSA: An attempt to write to the file "C:\WINDOWS\NTDS\edb.log" at offset 3230720 (0x0000000000314c00) for 512 (0x00000200) bytes failed after 0 seconds with system error 1784 (0x000006f8): "The supplied user buffer is not valid for the requested operation. ". The write operation will fail with error -1011 (0xfffffc0d). If this error persists then the file may be damaged and may need to be restored from a previous backup.
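The decimal and hexadecimal values in that description can be cross-checked mechanically. The following is a minimal Python sketch that works only from the text quoted above; the regular expression and the `decode_pairs` helper are illustrative names, not part of any Microsoft tooling.

```python
import re

# Numeric parts of the event description quoted above.
EVENT_TEXT = (
    'at offset 3230720 (0x0000000000314c00) for 512 (0x00000200) bytes '
    'failed after 0 seconds with system error 1784 (0x000006f8). '
    'The write operation will fail with error -1011 (0xfffffc0d).'
)

# Matches "<decimal> (0x<hex>)" pairs; hypothetical helper pattern.
PAIR = re.compile(r'(-?\d+) \(0x([0-9a-fA-F]+)\)')


def decode_pairs(text):
    """Return (decimal, hex_value, consistent) for each decimal/hex pair."""
    results = []
    for dec_str, hex_str in PAIR.findall(text):
        dec = int(dec_str)
        raw = int(hex_str, 16)
        # Negative codes are printed as 32-bit two's complement.
        expected = dec & 0xFFFFFFFF if dec < 0 else dec
        results.append((dec, raw, raw == expected))
    return results


pairs = decode_pairs(EVENT_TEXT)
for dec, raw, ok in pairs:
    print(dec, hex(raw), "consistent" if ok else "MISMATCH")

# The failed write starts on a 512-byte boundary and covers one 512-byte block.
offset, length = pairs[0][0], pairs[1][0]
print("offset is 512-byte aligned:", offset % 512 == 0)
```

The check only confirms that the logged numbers agree with each other and that the failed write was a single aligned 512-byte log block; it says nothing about the cause.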

The timing of this error coincides with a cron job which executes vcbSnapAll to back up our VMs. This leads me to suspect that the action of taking the snapshot has somehow caused a write operation to the AD database to fail. Perhaps Windows tried to write during the file system quiesce?

My concern here is that this could happen again. We also host production MS SQL Databases on the same ESX host, and I would hate for these to become corrupted in a similar fashion. Can anyone advise, is this problem likely to have been caused by vcbSnapAll taking a snapshot of the VM? If so, are there best practices for snapshotting domain controllers or database servers?

Dave_Mishchenko · ‎01-28-2008

Microsoft does not support snapshot type backups for Domain Controllers - http://support.microsoft.com/kb/888794.

In particular, Active Directory does not support any method that restores a snapshot of the operating system or the volume the operating system resides on. This kind of method causes an update sequence number (USN) rollback. When a USN rollback occurs, the replication partners of the incorrectly restored domain controller may have inconsistent objects in their Active Directory databases. In this situation, you cannot make these objects consistent.
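The USN rollback described above can be illustrated with a toy model. This is a simplified Python sketch, not how Active Directory is implemented: the `DomainController` class and its methods are hypothetical, and real replication also involves invocation IDs and up-to-dateness vectors, which are ignored here.

```python
class DomainController:
    """Toy model: a local change counter (the USN) plus a dict of objects."""

    def __init__(self, name):
        self.name = name
        self.usn = 0
        self.changes = []          # (usn, key, value) originated on this DC
        self.data = {}
        self.high_watermark = {}   # partner name -> highest USN already pulled

    def write(self, key, value):
        self.usn += 1
        self.changes.append((self.usn, key, value))
        self.data[key] = value

    def pull_from(self, partner):
        """Apply only the partner's changes above the USN already seen."""
        seen = self.high_watermark.get(partner.name, 0)
        for usn, key, value in partner.changes:
            if usn > seen:
                self.data[key] = value
                seen = usn
        self.high_watermark[partner.name] = seen

    def snapshot(self):
        return (self.usn, list(self.changes), dict(self.data))

    def restore(self, snap):
        self.usn, changes, data = snap
        self.changes, self.data = list(changes), dict(data)


dc1, dc2 = DomainController("DC1"), DomainController("DC2")
dc1.write("alice", "created")
snap = dc1.snapshot()              # image-level snapshot taken at USN 1
dc1.write("bob", "created")        # USN 2
dc1.write("carol", "created")      # USN 3
dc2.pull_from(dc1)                 # DC2 has now seen DC1 up to USN 3

dc1.restore(snap)                  # snapshot restore: DC1 is back at USN 1
dc1.write("dave", "created")       # reuses USN 2
dc1.write("erin", "created")       # reuses USN 3
dc2.pull_from(dc1)                 # nothing above USN 3, so nothing is pulled

print("only on DC1:", sorted(set(dc1.data) - set(dc2.data)))
print("only on DC2:", sorted(set(dc2.data) - set(dc1.data)))
```

After the restore, each side holds objects the other will never receive, which is the kind of lasting inconsistency the KB article warns about.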

For apps like SQL Server, you should ideally do something to quiesce the file system. That might involve stopping SQL Server just before you create the snapshot and then restarting it after the snapshot is created. That will get the DB files into a consistent state before you start to back up the vmdk. Otherwise it's a bit like pulling the plug on the server. http://communities.vmware.com/thread/115035. If the SQL DBs are critical, you might consider using the native SQL backup tools to back up the DB and transaction logs to a network drive, separate vmdk, etc.
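The stop, snapshot, restart sequence could be wrapped so the service is always restarted, even if the snapshot step fails. This is a rough Python sketch under several assumptions: `MSSQLSERVER` is the default-instance service name and may differ on your system, `take_snapshot` is a hypothetical placeholder for whatever actually triggers the snapshot, and coordinating a command inside the guest with a snapshot taken from the host is left out entirely. The demo records commands instead of running them.

```python
import subprocess
from contextlib import contextmanager


def run_command(args):
    """Default runner: execute a command and raise if it fails."""
    subprocess.run(args, check=True)


@contextmanager
def service_stopped(service_name, runner=run_command):
    """Stop a Windows service, yield, then always try to start it again."""
    runner(["net", "stop", service_name])
    try:
        yield
    finally:
        runner(["net", "start", service_name])


def snapshot_with_service_stopped(service_name, take_snapshot, runner=run_command):
    """take_snapshot is a caller-supplied callable (hypothetical placeholder)."""
    with service_stopped(service_name, runner):
        take_snapshot()


# Dry run: record the commands instead of executing them.
log = []
snapshot_with_service_stopped(
    "MSSQLSERVER",  # assumed default-instance service name
    take_snapshot=lambda: log.append(["<snapshot happens here>"]),
    runner=log.append,
)
for entry in log:
    print(" ".join(entry))
```

Putting the restart in a `finally` block means a failed snapshot does not leave the database service down.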

mike_laspina · ‎01-28-2008

Hello,

While it is true that snapshots on databases are not a best practice, this should work without corrupting the system because in theory the system is suspended at a point in time.

Now if you were restoring to that point in time I would say that all bets are off, because transactions that were in progress may or may not be complete.

To protect Active Directory, simply add an NTBackup of AD before the snapshot.

I would start looking at one of three possible tracks here.

1) It could be a bug in the snapshot code; the paint is still wet on ESX 3.5. (Snapshots are quite mature, so this is not very likely.)

2) You may have a hardware/firmware problem. (SAS firmware and drivers are not fully matured, very very likely)

3) Resource starvation, too much I/O at the time of snapshots. (Seen it happen but not very likely)

Hope this helps.

BTW I have been snapshotting DCs and DBs every day for 2.5 years and have never seen a corruption to date. (Hmmm, now that I said that ....... I hope I don't eat my words)

http://blog.laspina.ca/ vExpert 2009

alastairc · ‎01-29-2008

Hi Mike,

Thanks for your help - I think I may have found the cause of the problem. Our ESX Server is currently massively under-utilised, so I/O contention is probably not an issue, but the text "The supplied user buffer is not valid for the requested operation" in the event description got me wondering if some other system event was causing the I/O operation to fail. Looking through the event log, I noticed that at the exact same time as the NTDS error occurred, the following event was logged in the System log:

Event Type: Information

Event Source: LGTO_Sync

Event Category: None

Event ID: 1

Date: 20/01/2008

Time: 6:14:24 AM

User: N/A

Computer: MELDC1

Description:

The description for Event ID ( 1 ) in Source ( LGTO_Sync ) cannot be found. The local computer may not have the necessary registry information or message DLL files to display messages from a remote computer. You may be able to use the /AUXSOURCE= flag to retrieve this description; see Help and Support for details. The following information is part of the event: , Sync Stop done.
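Matching events by timestamp, as done by eye above, can be sketched in a few lines of Python. The tuples below are copied from the two events quoted in this thread; the helper names are illustrative, and the date format string assumes the day/month/year layout shown in the event text.

```python
from collections import defaultdict
from datetime import datetime

# (source, event_id, date, time) from the two events quoted in this thread.
EVENTS = [
    ("NTDS ISAM", 482, "20/01/2008", "6:14:24 AM"),
    ("LGTO_Sync", 1, "20/01/2008", "6:14:24 AM"),
]


def parse_timestamp(date_text, time_text):
    """Parse the day/month/year date and 12-hour clock used in the event text."""
    return datetime.strptime(f"{date_text} {time_text}", "%d/%m/%Y %I:%M:%S %p")


def group_by_timestamp(events):
    """Map each timestamp to the list of (source, event_id) logged at that second."""
    groups = defaultdict(list)
    for source, event_id, date_text, time_text in events:
        groups[parse_timestamp(date_text, time_text)].append((source, event_id))
    return groups


# Keep only timestamps at which more than one event was logged.
coincident = {
    ts: items for ts, items in group_by_timestamp(EVENTS).items() if len(items) > 1
}
for ts, items in coincident.items():
    print(ts.isoformat(), items)
```

Two events in the same second show correlation, not causation, so this is a starting point for investigation rather than proof.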

LGTO_Sync appears to be the VMware Tools file system sync driver, which (if my understanding is correct) is supposed to quiesce the filesystem before the snapshot is taken. After a bit of googling, it looks like this driver is responsible for quite a bit of grief where databases are concerned:

Here someone is having exactly the same problems as me:

http://supportforums.vizioncore.com/forums/thread/2472.aspx

Here it would appear the Sync driver is causing an Oracle database dismount:

http://support.p2v.net/boards/read.php?1,1163,1165

And here's a VMware KB article detailing issues with the sync driver and MS Exchange:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=596216...

So it would seem that VMware Tools Sync driver + Databases = Trouble. Effective immediately I'm removing the sync driver from all our VMs which host databases of any type.

Could this be described as a bug, or is it just bad practice to install the filesystem sync driver in VMs which host databases?

mike_laspina · ‎01-29-2008

That's great info. Thanks very much!

DigitalVoodoo · ‎01-29-2008

A quick "me too" - I had a similar problem with my domain controller VMs, and removed the sync driver from those VMs. I haven't had a problem with them since, and wholeheartedly recommend it for DC VMs. I haven't had such problems yet with the few SQL VMs I have, but I would think it to be a good idea as well if you have any concerns at all about it.

Sangokan · ‎01-29-2008

I had a dismounted store once in Exchange 2003 after a snapshot; disabling the sync driver in VMware Tools also fixed it for us, as advised on the Vizioncore thread.

Justin_King · ‎01-29-2008

I'll also toss in my own "me too". I've removed the sync driver by default from all of my VMs at this point ... they snap much, much faster that way anyway. Ironically, the sync driver is supposed to prevent that problem (basically verifying all queued writes have occurred before the final pause), but it seems to cause more trouble than it solves in my case.

mrbrown66 · ‎02-22-2008

In an Active Directory environment I would recommend using the combination of an NTBackup followed by a file-level VCB backup.

Aside from the quiescing issues mentioned above, restoring a domain controller from a VCB snapshot could result in corruption within AD. Corruption could occur when the domain controller appears back in your forest with a clock time totally different from that of the other domain controllers.

We have set up a nightly NTBackup on each domain controller, and this gets backed up using a VCB file-level backup. The AD restore process is then very straightforward and well documented/supported by Microsoft.
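A nightly job along these lines might be sketched as follows. This is only an illustration in Python: the job name and the `D:\Backups\systemstate.bkf` path are hypothetical, the command is built but not executed, and `is_fresh` is an assumed helper that a file-level job could use to avoid picking up a stale backup file. On a domain controller, a System State backup includes the Active Directory database.

```python
from datetime import datetime, timedelta


def ntbackup_systemstate_command(job_name, backup_file):
    """Build an NTBackup command line that writes System State to a .bkf file."""
    return ["ntbackup", "backup", "systemstate", "/J", job_name, "/F", backup_file]


def is_fresh(backup_written_at, now, max_age=timedelta(hours=24)):
    """True if the .bkf was written recently enough to be worth picking up."""
    return timedelta(0) <= now - backup_written_at <= max_age


# Hypothetical job name and path, for illustration only.
command = ntbackup_systemstate_command("AD nightly", r"D:\Backups\systemstate.bkf")
print(" ".join(command))  # display only; a real shell needs the job name quoted

# Example checks the file-level job could run before copying the .bkf.
written = datetime(2008, 2, 21, 23, 0)
print(is_fresh(written, now=datetime(2008, 2, 22, 1, 0)))  # backup is two hours old
print(is_fresh(written, now=datetime(2008, 2, 24, 1, 0)))  # backup is two days old
```

The ordering is what matters: the local backup has to finish before the file-level pass runs, otherwise the copied file is incomplete or a day behind.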

Hope that helps.

JRink · ‎02-24-2008

Is the Sync Driver automatically installed when you do a typical installation of VMware Tools on a Windows server?

I never heard about this driver before...

alastairc · ‎02-24-2008

Yes, at least with versions 3 and above of ESX Server, the Sync driver is included as part of the default VMware Tools install. Following the problems I had with Active Directory, I don't really trust it anymore and make a point of uninstalling it from any VM that will be running a database of any description.

joergriether · ‎02-24-2008

it's generally a good idea to disable the sync driver inside any AD domain controller and inside any Exchange server.

when it comes to SQL i never ever had any problems with the sync driver when snapshots were taken. strange, but that's what i experienced.

best regards

Joerg

JRink · ‎02-25-2008

I mentioned yesterday that I'd never even heard of this sync driver prior to reading this thread. I did some searches on VMware's support site and really didn't pull up anything.

Is there a document from VMware on best practices for snapshotting VMs that explains this and other "gotchas" we should know about? If not, how would I have even known about this?

mike_laspina · ‎02-25-2008

I have not seen any official KB either. I would tend to think that if this were a serious issue it would be much more widespread. The corruption is more likely caused by something other than the mere presence of the LGTOsync driver.

I would like to look at it more closely in the LAB.

And if you turn it off you have another more serious issue.

You will not be crash consistent. Then you will see DB corruption at the time you need to recover the VM. Ugly.

Best bet: NTBackup AD locally and then snapshot it.

Justin_King · ‎02-25-2008

My understanding of this driver is that it was made to directly address the fact that you may snap at a time when not all writes have been committed to disk, so it does a sort of "pause" on the VM to allow it to commit all writes before you snap it. So yes, it is supposed to keep you from having a non-crash-consistent backup.

However, from my experience, unless you're using some odd DB this is a non-issue anyway, as databases already have built-in resilience to such problems and have had for a long time. Not only that, but the length of time the entire OS is "paused" seems to cause more problems than it solves (from personal experience at least), so I've resorted to removing it in most cases.

In the context of AD, I'm not worried. I don't run only one AD server, so in the event of a localized failure I can easily force FSMO onto one of the other DCs, deploy from template, then DCPromo and be back up in 20 minutes. I wouldn't even bother with a restore. If the problem is widespread and I am forced to do an authoritative restore ... having an inconsistent backup that forces it to be another 15 minutes older (as it rolls back the change logs) is the least of my worries.
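The "commit all writes before you snap" idea can be shown with a toy model. This Python sketch is only an illustration of the concept as described in this post; `GuestDisk` and its methods are hypothetical and do not reflect how the actual sync driver is implemented.

```python
class GuestDisk:
    """Toy guest: writes land in a cache first and reach the 'vmdk' only on flush."""

    def __init__(self):
        self.cache = {}   # pending writes not yet on disk
        self.vmdk = {}    # what a snapshot of the virtual disk would capture

    def write(self, block, data):
        self.cache[block] = data

    def flush(self):
        self.vmdk.update(self.cache)
        self.cache.clear()

    def snapshot(self, quiesce):
        if quiesce:
            self.flush()          # what a sync step is meant to guarantee
        return dict(self.vmdk)    # copy of the on-disk state at this instant


disk = GuestDisk()
disk.write(1, "committed earlier")
disk.flush()
disk.write(2, "still in cache")

without_sync = disk.snapshot(quiesce=False)
with_sync = disk.snapshot(quiesce=True)
print("without sync:", without_sync)
print("with sync:   ", with_sync)
```

In this model the unsynced snapshot simply misses the cached write, which is the gap the driver is meant to close.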

mike_laspina · ‎02-25-2008

I can see why we are not on the same page. If VMware would provide a clear description of the functional parts of LGTOsync we would not have these issues or this discussion.

Here is how I understand the functional side of this driver.

Since ESX cannot determine what the VM is doing with disk writes and memory cache at the time a snapshot is requested, it cannot correctly freeze the I/O state of the vmdk for that snapshot to provide integrity of any cached SCSI operations.

VMware needed to monitor this SCSI disk write and memory cache activity, feed that info back to the ESX physical hardware, and complete those disk I/O operations in order to meet the integrity requirements of backup functions. They provided this capability by developing the LGTOsync driver with Legato. So when I think of the LGTOsync driver I am not looking at the VM flushing its writes; it's the underlying host that needs to perform this function. If you disable the driver you lose this capability altogether. So unless you can tell me that you know for certain that the VMware SCSI drivers support forced unit access down to the host OS, it is better to run the driver.

So I consider the following when I build VM's for AD or any DB's

Justin_King · ‎02-25-2008

Huh, I think we're saying the same thing but coming to different conclusions.

The meat of what I'm saying is that AD is simply stored in a Jet database. Just like any other modern database, it keeps log files of data to be committed, and thus this level of data security is already partially accounted for. A partially committed DB change would still have a log present even in a circular setup, and thus I consider the issue minor. Perhaps I'm unique in this area, but I've had a number of problems occur when a snapshot is taken on a DC that has the sync driver installed. Observation suggests the entire VM is paused while the host commits vmdisk writes; simply uninstall the driver, do a snap, and look at the difference in time. At least on any of my ESX hosts it's easily visible, and the entire host is essentially "paused" for a few seconds.
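The log-based resilience described here can be sketched with a toy write-ahead log. This Python example is a deliberately simplified model, not ESE's actual log format or recovery algorithm; `ToyLoggedDB` and its methods are hypothetical names.

```python
class ToyLoggedDB:
    """Toy write-ahead-logged store: the log record is durable before the data page."""

    def __init__(self):
        self.log = []     # durable records: (txn, key, value) or (txn, "COMMIT")
        self.pages = {}   # durable data pages

    def commit(self, txn, key, value, crash_before_page_write=False):
        self.log.append((txn, key, value))
        self.log.append((txn, "COMMIT"))
        if not crash_before_page_write:
            self.pages[key] = value

    def begin_without_commit(self, txn, key, value):
        self.log.append((txn, key, value))

    def recover(self):
        """Replay logged changes of committed transactions; ignore uncommitted ones."""
        committed = {rec[0] for rec in self.log if len(rec) == 2 and rec[1] == "COMMIT"}
        for rec in self.log:
            if len(rec) == 3 and rec[0] in committed:
                self.pages[rec[1]] = rec[2]
        return self.pages


db = ToyLoggedDB()
db.commit(1, "user:alice", "created")
db.commit(2, "user:bob", "created", crash_before_page_write=True)
db.begin_without_commit(3, "user:carol", "created")

# A snapshot taken now captures the log plus whatever pages reached disk.
print("pages at snapshot:", dict(db.pages))
print("after recovery:  ", db.recover())
```

The recovery only works if the log itself is intact, and the event that started this thread was a failed write to edb.log, so the log is exactly the part that needs protecting.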

Only VMware docs I can find that reference FUA is this one:

Still need to read through it though.

mike_laspina · ‎02-25-2008

Yes. Now we are on the same page. This paper identifies the issue of not knowing which writes need to be committed at the hypervisor layer. Now the part we need to know is whether the VMware SCSI driver works with the hypervisor to write FUA-flagged requests immediately or not. Once that is known, we can determine whether we can trust the underlying host OS with this task or whether this only occurs with the LGTOsync driver. And I completely agree that today's DBs will deal with these issues much better than 1st and 2nd gen ones did and will recover to a usable state. They do still get in trouble when under extreme stress, as the requests start saturating the cache and can bottleneck with transaction log and data items in the write-back cache at the failure point.

That's a good paper, thanks for sharing it.

FunkyD · ‎02-28-2008

It seems to me that the golden rule with any server that hosts a database is to back up the database first, either locally or using your file backup solution, and then snapshot the server. To recover, restore the snapshot and then restore the databases following whichever method suits. You cannot reliably snapshot a server with a database and simply restore it and expect it to work - it might work, but it's better not to risk it.

On that basis my strategy is to do a daily snapshot of file servers and SQL (they all have a local backup of the database) and sftp them to the remote site. For database servers (AD, Exchange) I do a monthly snapshot, sftp it to the remote site, and to recover use last night's backup-to-disk files for the databases.

I am hoping to improve things by upgrading Veritas 9 to version 11 so I can use the VCB module.

mike_laspina · ‎02-28-2008

Yes, it is. I completely concur, and that is what I suggested in the first post I made on this thread.
