VMware Cloud Community
alastairc
Contributor

ESX Server 3.5 - Corrupted Active Directory after taking a snapshot

Hi there,

I've just had an ESE database corruption, which is pretty concerning, and I'm wondering if anyone else has seen this or can advise on what may be happening here:

One of our Windows Server 2003 R2 domain controllers runs as a VM guest under ESX Server 3.5 build 64607. The host hardware is an HP ProLiant DL380 G5 with 16GB RAM, running two quad-core Intel Xeon E5345 CPUs; hyperthreading is disabled, so we have a total of 8 HECs from ESX's perspective. We don't have a SAN, so all VM storage is on the locally attached disks (six HP 15K RPM SAS drives in a RAID-5 configuration, attached to a Smart Array P400 controller).

Anyway, our domain started acting strangely one Monday morning, so I attempted to log in to our ESX-hosted domain controller, only to find it was rejecting the login credentials. After forcing a restart of the guest VM I was able to log in, and from the event logs I could see that the Active Directory database had become read-only after the following error was logged:

Event Type: Error
Event Source: NTDS ISAM
Event Category: General
Event ID: 482
Date: 20/01/2008
Time: 6:14:24 AM
User: N/A
Computer: MELDC1
Description:
NTDS (384) NTDSA: An attempt to write to the file "C:\WINDOWS\NTDS\edb.log" at offset 3230720 (0x0000000000314c00) for 512 (0x00000200) bytes failed after 0 seconds with system error 1784 (0x000006f8): "The supplied user buffer is not valid for the requested operation. ". The write operation will fail with error -1011 (0xfffffc0d). If this error persists then the file may be damaged and may need to be restored from a previous backup.
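
For anyone cross-checking the numbers in that event text, here is a minimal Python sketch that reproduces the decimal/hex pairs. The only assumption not taken from the log is the 512-byte sector size used for the alignment check; the helper name is made up for illustration.

```python
# Values copied from the event description above.
offset = 3230720      # byte offset into edb.log
length = 512          # bytes in the failed write
win32_error = 1784    # "system error" reported by Windows
ese_error = -1011     # error returned by the database engine


def as_unsigned32(value: int) -> int:
    """Show a signed 32-bit value the way the log prints it in hex."""
    return value & 0xFFFFFFFF


print(hex(offset))                    # 0x314c00
print(hex(length))                    # 0x200
print(hex(win32_error))               # 0x6f8
print(hex(as_unsigned32(ese_error)))  # 0xfffffc0d

# Assumption: 512-byte sectors. A zero remainder means the failed write
# started exactly on a sector boundary and covered one whole sector.
SECTOR_SIZE = 512
print(divmod(offset, SECTOR_SIZE))    # (6310, 0)
```

So the failed write was a single sector-sized, sector-aligned log write, which is what you would expect for a transaction log append rather than a random write into the database file.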

The timing of this error coincides with a cron job which executes vcbSnapAll to back up our VMs. This leads me to suspect that the action of taking the snapshot has somehow caused a write operation to the AD database to fail. Perhaps Windows tried to write during the file system quiesce?
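
A rough way to check that kind of correlation is to test whether the event timestamp falls inside the backup window. This is only a sketch: the event date and time come from the log above, but the job start time and the 30-minute window are hypothetical placeholders, since the actual cron schedule isn't given here.

```python
from datetime import datetime, timedelta


def parse_event_time(date_text: str, time_text: str) -> datetime:
    """Parse the day/month/year date and 12-hour time used in the event log."""
    return datetime.strptime(f"{date_text} {time_text}", "%d/%m/%Y %I:%M:%S %p")


def in_backup_window(event: datetime, job_start: datetime,
                     max_duration: timedelta) -> bool:
    """True if the event happened while the backup job could have been running."""
    return job_start <= event <= job_start + max_duration


event = parse_event_time("20/01/2008", "6:14:24 AM")

# Hypothetical values: substitute the real crontab start time and a
# realistic upper bound on how long the vcbSnapAll run takes.
job_start = datetime(2008, 1, 20, 6, 0)
window = timedelta(minutes=30)

print(in_backup_window(event, job_start, window))
```

A match only shows the two events overlapped in time; it doesn't prove the snapshot caused the failed write.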

My concern here is that this could happen again. We also host production MS SQL databases on the same ESX host, and I would hate for these to become corrupted in a similar fashion. Can anyone advise whether this problem is likely to have been caused by vcbSnapAll taking a snapshot of the VM? If so, are there best practices for snapshotting domain controllers or database servers?

25 Replies
vmcms
Contributor

Justin,

It looks like you've got a process for recovering from a corrupted DC such that AD corruption seems a non-event should it transpire.

Does this take into account DNS and AD cleanup? My experience is that although these objects are supposed to clean themselves up, they do not always do so. And occasionally there is a DC setting or two in Microsoft Exchange 2000/2003/2007 that gets stuck on an old DC and has to be manually tracked down and corrected.

With these things explicitly considered, how significant an event do you feel AD corruption and recovery to be?

VMCMS

Tounet
Contributor

An alternative method could be to use the VCB pre-freeze script (C:\windows\pre-freeze-script.bat) and post-thaw script (C:\windows\post-thaw-script.bat) inside the virtual machine.

You could easily stop the database services, or run a local backup of the database if you can't stop the services for a few seconds or minutes.

For an AD domain controller, you could use these scripts to run a System State backup locally with ntbackup. Don't forget that a System State backup is the only method Microsoft supports and recommends for backing up an AD domain controller.
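
As a rough illustration of the idea, here is a small Python sketch that generates the lines for those two batch files. Treat it as an example under stated assumptions, not a tested procedure: the helper names are made up, MSSQLSERVER (the default SQL Server instance service) is just a sample service, and the job name and backup path are hypothetical.

```python
from typing import List


def pre_freeze_lines(services: List[str]) -> List[str]:
    """Batch lines that stop each service before the snapshot is taken."""
    return ["@echo off"] + [f'net stop "{name}"' for name in services]


def post_thaw_lines(services: List[str]) -> List[str]:
    """Batch lines that restart the services, in reverse order, afterwards."""
    return ["@echo off"] + [f'net start "{name}"' for name in reversed(services)]


def systemstate_backup_line(job_name: str, backup_file: str) -> str:
    """ntbackup command line for a local System State backup on a DC."""
    return f'ntbackup backup systemstate /J "{job_name}" /F "{backup_file}"'


if __name__ == "__main__":
    services = ["MSSQLSERVER"]  # hypothetical example service
    print("\r\n".join(pre_freeze_lines(services)))
    print("\r\n".join(post_thaw_lines(services)))
    # Hypothetical job name and target path for the DC variant.
    print(systemstate_backup_line("Pre-snapshot System State",
                                  r"D:\Backup\systemstate.bkf"))
```

The stop/start pair suits a database server that can tolerate a short outage; for the domain controller, the pre-freeze script would contain the ntbackup line instead, so a supported backup exists regardless of what state the snapshot captures.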

Jabadakkas
Contributor

A few months ago I had a chat with a VMware pre-sales consultant. He told me that the upcoming ESX 3.5 Update 2, which is due in Q3 2008, will include Volume Shadow Copy Service (VSS) support. This will hopefully eliminate the problems ESX admins are experiencing with snapshots of AD VMs. I googled for VSS support and found that the Virtual Server 2.0 betas already include VSS support. Of course I realize that Virtual Server is a hosted virtualization product and accesses storage in a different way than ESX does, but you could still give it a try.

Have any forum readers tested VSS on Virtual Server? Were you able to snapshot an AD VM without the corruption problems on ESX that were identified at the start of this forum topic?

piyush1414
Contributor

Try the Edb Repair tool to repair the corrupted Active Directory database.

Download the free trial here: http://www.edbrepair.org/edb-active-directory-recovery.php

Josh26
Virtuoso

Hi Mike,

Effective immediately I'm removing the sync driver from all our VMs which host databases of any type.

Could this be described as a bug, or is it just bad practice to install the filesystem sync driver in VMs which host databases?

I see this advice thrown around a lot, with no regard for the consequences. Basically, what the sync driver does is freeze I/O in a "best effort" attempt at getting an application-consistent snapshot. Most databases, e.g. Exchange and Oracle, have major issues with this write lag.

The "just disable the driver" approach can be just as dangerous. It means nothing is remotely consistent in the snapshot. Try restoring one of those snapshots, and doing a chkdsk. You have about a 50% chance of something requiring a repair.

The bigger warning is: don't try restoring a snapshot of your domain controller. It's not pretty.

ODOCChuck
Contributor

Is this still an issue? Have updates to the driver or an Update release resolved the problem with the sync driver, or is it "works as designed" with no fix to be expected? I am not so concerned about a corrupted AD, but it does cause issues when the DB is corrupted. There are more reasons to take a snapshot than restoring a full VM.
