Hi there,
I've just had a ESE database corruption, which is pretty concerning and I'm wondering if anyone else has seen this or advise what may be happening here:
One of our Windows Server 2003 R2 Domain Controllers runs as a VM Guest under ESX Server 3.5 build 64607. The host hardware is a HP Proliant DL380G5 with 16GB RAM, running two quad core Intel Xeon E5345, hyperhreading is disabled so we have a total of 8 HEC's from ESX's perspective. We don't have a SAN, so all VM storage is on the locally attached disks (6 HP 15KRPM SAS drives in a RAID-5 configuration, attached to a Smart Array P400 Controller)
Anyway, our domain started acting strangely one Monday morning, so I had attempted to login to our ESX hosted domain controller to find it was rejecting the login credentials. After forcing a restart of the guest VM I was able to login, and from looking at the event logs I could see that the Active Directory database had become read-only after the following error was logged:
Event Type: Error
Event Source: NTDS ISAM
Event Category: General
Event ID: 482
Date: 20/01/2008
Time: 6:14:24 AM
User: N/A
Computer: MELDC1
Description:
NTDS (384) NTDSA: An attempt to write to the file "C:\WINDOWS\NTDS\edb.log" at offset 3230720 (0x0000000000314c00) for 512 (0x00000200) bytes failed after 0 seconds with system error 1784 (0x000006f8): "The supplied user buffer is not valid for the requested operation. ". The write operation will fail with error -1011 (0xfffffc0d). If this error persists then the file may be damaged and may need to be restored from a previous backup.
The timing of this error concides with a cron job which executes vcbSnapAll to backup our VMs. This leads me to suspect that the action of taking the snapshot has somehow caused a write operation to the AD database to fail. Perhaps windows tried to write during the file system queisce?
My concern here is that this could happen again. We also host production MS SQL Databases on the same ESX host, and I would hate for these to become corrupted in a similar fashion. Can anyone advise, is this problem likely to have been caused by vcbSnapAll taking a snapshot of the VM? If so, are there best practices for snapshotting domain controllers or database servers?