25 Replies. Latest reply: Aug 6, 2009 8:09 AM by ODOCChuck

    ESX Server 3.5 - Corrupted Active Directory after taking a snapshot

    alastairc Novice

       

      Hi there,

       

       

      I've just had an ESE database corruption, which is pretty concerning, and I'm wondering if anyone else has seen this or can advise on what may be happening here:

       

       

      One of our Windows Server 2003 R2 Domain Controllers runs as a VM guest under ESX Server 3.5 build 64607. The host hardware is an HP ProLiant DL380 G5 with 16GB RAM, running two quad-core Intel Xeon E5345 CPUs. Hyperthreading is disabled, so we have a total of 8 HECs from ESX's perspective. We don't have a SAN, so all VM storage is on the locally attached disks (6 HP 15K RPM SAS drives in a RAID-5 configuration, attached to a Smart Array P400 controller).

       

       

      Anyway, our domain started acting strangely one Monday morning, so I attempted to log in to our ESX-hosted domain controller, only to find it was rejecting the login credentials. After forcing a restart of the guest VM I was able to log in, and from looking at the event logs I could see that the Active Directory database had become read-only after the following error was logged:

       

       

      Event Type:    Error

      Event Source:    NTDS ISAM

      Event Category:    General

      Event ID:    482

      Date:        20/01/2008

      Time:        6:14:24 AM

      User:        N/A

      Computer:    MELDC1

      Description:

      NTDS (384) NTDSA: An attempt to write to the file "C:\WINDOWS\NTDS\edb.log" at offset 3230720 (0x0000000000314c00) for 512 (0x00000200) bytes failed after 0 seconds with system error 1784 (0x000006f8): "The supplied user buffer is not valid for the requested operation. ".  The write operation will fail with error -1011 (0xfffffc0d).  If this error persists then the file may be damaged and may need to be restored from a previous backup.
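For anyone cross-checking the numbers in that event description, here is a minimal Python sketch (not part of the original post). It confirms that each hex value matches the decimal value printed next to it, and that -1011 is simply 0xfffffc0d read as a signed 32-bit integer. The helper name `to_signed32` is made up for illustration.

```python
def to_signed32(value: int) -> int:
    """Interpret an unsigned 32-bit value as a signed two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


# Values copied from the Event ID 482 description above.
offset = 0x0000000000314C00        # reported as 3230720
length = 0x00000200                # reported as 512 bytes
system_error = 0x000006F8          # reported as system error 1784
ese_error = to_signed32(0xFFFFFC0D)  # reported as error -1011

print(offset, length, system_error, ese_error)

# The failed write starts on a 512-byte boundary of edb.log.
print(offset % length == 0)
```

The last check only shows that the failed write was a single 512-byte write aligned to a 512-byte boundary in the log file; it says nothing about why the write failed.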

       

       

       

      The timing of this error coincides with a cron job which executes vcbSnapAll to back up our VMs. This leads me to suspect that the action of taking the snapshot has somehow caused a write operation to the AD database to fail. Perhaps Windows tried to write during the file system quiesce?

       

       

      My concern here is that this could happen again. We also host production MS SQL databases on the same ESX host, and I would hate for these to become corrupted in a similar fashion. Can anyone advise whether this problem is likely to have been caused by vcbSnapAll taking a snapshot of the VM? If so, are there best practices for snapshotting domain controllers or database servers?

       

       

        • 1. Re: ESX Server 3.5 - Corrupted Active Directory after taking a snapshot
          Dave.Mishchenko Guru User Moderators

          Microsoft does not support snapshot-type backups for Domain Controllers: http://support.microsoft.com/kb/888794

           

          In particular, Active Directory does not support any method that restores a snapshot of the operating system or the volume the operating system resides on. This kind of method causes an update sequence number (USN) rollback. When a USN rollback occurs, the replication partners of the incorrectly restored domain controller may have inconsistent objects in their Active Directory databases. In this situation, you cannot make these objects consistent.
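To make the USN rollback described above more concrete, here is a toy Python model (not from the KB article). It assumes a heavily simplified picture in which each replication partner only remembers the highest USN it has received from a given domain controller; real Active Directory replication tracks considerably more state, and every class and method name here is hypothetical.

```python
class DomainController:
    """Toy DC: every local change is stamped with the next USN."""

    def __init__(self):
        self.usn = 0
        self.changes = {}  # usn -> description of the change

    def write(self, description):
        self.usn += 1
        self.changes[self.usn] = description
        return self.usn

    def snapshot(self):
        # Capture the counter and the data together, like a VM snapshot.
        return (self.usn, dict(self.changes))

    def restore(self, snap):
        # Roll both the counter and the data back to the snapshot.
        self.usn, self.changes = snap[0], dict(snap[1])


class Partner:
    """Toy replication partner: only asks for USNs above its high watermark."""

    def __init__(self):
        self.high_watermark = 0
        self.received = []

    def pull_from(self, dc):
        for usn in sorted(dc.changes):
            if usn > self.high_watermark:
                self.received.append(dc.changes[usn])
                self.high_watermark = usn


dc, partner = DomainController(), Partner()
dc.write("create user A")
snap = dc.snapshot()          # snapshot taken at USN 1
dc.write("create user B")     # USN 2
dc.write("create user C")     # USN 3
partner.pull_from(dc)         # partner has now seen USNs 1..3

dc.restore(snap)              # snapshot restore: USN counter is back at 1
dc.write("create user D")     # reuses USN 2
partner.pull_from(dc)         # 2 <= 3, so the partner skips this change

print(sorted(dc.changes.values()))  # what the restored DC holds
print(partner.received)             # what the partner holds
```

In this model the restored DC ends up with users A and D while its partner holds A, B and C, and neither side notices the difference. A supported restore avoids this by making the restored DC identify itself to its partners as a new replication source; that detail is left out of the sketch.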

           

          For apps like SQL Server, you should ideally do something to quiesce the file system. That might involve stopping SQL Server just before you create the snapshot and then restarting it after the snapshot is created. That will get the DB files in a crash-consistent state before you start to back up the vmdk. Otherwise it's a bit like pulling the plug on the server: http://communities.vmware.com/thread/115035. If the SQL DBs are critical, you might consider using the native SQL backup tools to back up the DB and transaction logs to a network drive, separate vmdk, etc.

          • 2. Re: ESX Server 3.5 - Corrupted Active Directory after taking a snapshot
            mike.laspina Virtuoso

            Hello,

             

            While it is true that snapshots on databases are not a best practice, this should work without corrupting the system because in theory the system is suspended at a point in time.

             

             

            Now if you were restoring to that point in time, I would say that all bets are off, because transactions that were in progress may or may not be complete.

             

             

            To protect Active Directory, simply add an NTBackup of AD before the snapshot.

             

             

            I would start looking at one of three possible tracks here.

             

             

            1) It could be a bug in the snapshot code; the paint is still wet on ESX 3.5. (Snapshots are quite mature, so this is not very likely)

             

             

            2) You may have a hardware/firmware problem. (SAS firmware and drivers are not fully matured, very very likely)

             

             

            3) Resource starvation, too much I/O at the time of snapshots. (Seen it happen but not very likely)

             

             

            Hope this helps.

             

             

            BTW I have been snapshotting DCs and DBs every day for 2.5 years and have never seen a corruption to date. (Hmmm now that I said that ....... I hope I don't eat my words)

            • 3. Re: ESX Server 3.5 - Corrupted Active Directory after taking a snapshot
              alastairc Novice

              Hi Mike,

               

              Thanks for your help - I think I may have found the cause of the problem. Our ESX Server is currently massively under-utilised, so I/O contention is probably not an issue, but the text "The supplied user buffer is not valid for the requested operation" in the event description got me wondering if some other system event was causing the I/O operation to fail. Looking through the event log, I noticed that at the exact same time as the NTDS error occurred, the following event was logged in the System log:

               

              Event Type:     Information

              Event Source:     LGTO_Sync

              Event Category:     None

              Event ID:     1

              Date:          20/01/2008

              Time:          6:14:24 AM

              User:          N/A

              Computer:     MELDC1

              Description:

              The description for Event ID ( 1 ) in Source ( LGTO_Sync ) cannot be found. The local computer may not have the necessary registry information or message DLL files to display messages from a remote computer. You may be able to use the /AUXSOURCE= flag to retrieve this description; see Help and Support for details. The following information is part of the event: , Sync Stop done.
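The "same timestamp" observation can be checked mechanically. Below is a minimal Python sketch (not part of the original post) that groups event records by timestamp and reports any second in which more than one source logged something. The two records are typed in by hand from the events quoted in this thread; the dictionary layout and the `co_occurring` helper are assumptions for illustration, not an export format of the Windows event log.

```python
from collections import defaultdict
from datetime import datetime

# Values copied from the two events quoted in this thread.
events = [
    {"source": "NTDS ISAM", "event_id": 482, "date": "20/01/2008", "time": "6:14:24 AM"},
    {"source": "LGTO_Sync", "event_id": 1, "date": "20/01/2008", "time": "6:14:24 AM"},
]


def co_occurring(records):
    """Return {timestamp: [sources]} for timestamps shared by several events."""
    by_time = defaultdict(list)
    for rec in records:
        # Day/month/year date and 12-hour clock, as shown in the event text.
        stamp = datetime.strptime(
            f"{rec['date']} {rec['time']}", "%d/%m/%Y %I:%M:%S %p"
        )
        by_time[stamp].append(rec["source"])
    return {stamp: srcs for stamp, srcs in by_time.items() if len(srcs) > 1}


matches = co_occurring(events)
for stamp, sources in matches.items():
    print(stamp.isoformat(), sources)
```

Two events landing in the same second is a strong hint rather than proof of cause, which is why the reports linked further down matter.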

               

              LGTO_Sync appears to be the VMware Tools file system sync driver, which (if my understanding is correct) is supposed to quiesce the filesystem before the snapshot is taken. After a bit of googling, it looks like this driver is responsible for quite a bit of grief where databases are concerned:

               

              Here someone is having exactly the same problems as me:

              http://supportforums.vizioncore.com/forums/thread/2472.aspx

               

              Here it would appear the Sync driver is causing an Oracle database dismount:

              http://support.p2v.net/boards/read.php?1,1163,1165

               

              And here's a VMware KB article detailing issues with the sync driver and MS Exchange:

              http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=5962168

               

              So it would seem that VMware Tools sync driver + databases = trouble. Effective immediately, I'm removing the sync driver from all our VMs which host databases of any type.

               

              Could this be described as a bug, or is it just bad practice to install the filesystem sync driver in VMs which host databases?

              • 4. Re: ESX Server 3.5 - Corrupted Active Directory after taking a snapshot
                mike.laspina Virtuoso

                That's great info. Thanks very much!

                • 5. Re: ESX Server 3.5 - Corrupted Active Directory after taking a snapshot
                  DigitalVoodoo Hot Shot

                  A quick "me too" - I had a similar problem with my domain controller VMs, and removed the sync driver from those VMs. I haven't had a problem with them since, and wholeheartedly recommend it for DC VMs. I haven't had such problems yet with the few SQL VMs I have, but I would think it to be a good idea as well if you have any concerns at all about it.

                  • 6. Re: ESX Server 3.5 - Corrupted Active Directory after taking a snapshot
                    Sangokan Hot Shot

                    I had a dismounted store once in Exchange 2003 after a snapshot. Disabling the sync driver in VMware Tools also fixed it for us, as advised on the Vizioncore thread.

                    • 7. Re: ESX Server 3.5 - Corrupted Active Directory after taking a snapshot
                      Justin King Enthusiast

                      I'll also toss in my own "me too". I've removed the sync driver by default from all of my VMs at this point ... they snap much, much faster that way anyway. Ironically, the sync driver is supposed to prevent that problem (basically verifying all queued writes have occurred before the final pause), but it seems to cause more trouble than it solves in my case.

                      • 8. Re: ESX Server 3.5 - Corrupted Active Directory after taking a snapshot
                        mrbrown66 Novice

                         

                        In an Active Directory environment I would recommend using the combination of an NTBackup followed by a file-level VCB backup.

                         

                         

                        Aside from the quiescing issues mentioned above, restoring a domain controller from a VCB snapshot could result in corruption within AD. Corruption could occur when the domain controller reappears in your forest with a clock time totally different from the other domain controllers.

                         

                         

                        We have set up a nightly NTBackup on each domain controller, and the resulting backup file gets backed up using a VCB file-level backup. The AD restore process is then very straightforward and well documented/supported by Microsoft.
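A nightly job along those lines could be sketched as follows. This is only an illustration in Python, assuming the standard `ntbackup backup systemstate` command line with its `/J` (job name) and `/F` (backup file) switches on Windows Server 2003. The function names, the job name and the `D:\Backups\dc-systemstate.bkf` path are hypothetical, and the VCB file-level backup that picks up the resulting .bkf file is left out because its invocation depends on the setup.

```python
import subprocess


def ntbackup_systemstate_cmd(backup_file, job_name="AD system state"):
    """Build the NTBackup command line for a System State backup to a file."""
    return ["ntbackup", "backup", "systemstate", "/J", job_name, "/F", backup_file]


def run_backup(backup_file, dry_run=True):
    """Run the System State backup, or only return the command when dry_run is set."""
    cmd = ntbackup_systemstate_cmd(backup_file)
    if not dry_run:
        # Raises CalledProcessError if ntbackup exits with a non-zero status,
        # so a scheduler can skip the later snapshot or file-level backup.
        subprocess.run(cmd, check=True)
    return cmd


# Hypothetical target path on the domain controller.
print(run_backup(r"D:\Backups\dc-systemstate.bkf"))
```

On a domain controller the System State includes the Active Directory database, so the .bkf file can be restored through the supported Directory Services Restore Mode process regardless of what state a VM-level snapshot captured.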

                         

                         

                        Hope that helps.

                         

                         

                         

                         

                         

                         

                         

                         

                        • 9. Re: ESX Server 3.5 - Corrupted Active Directory after taking a snapshot
                          JRink Hot Shot

                          Is the Sync Driver automatically installed when you do a typical installation of the VMTools on a Windows server?

                          I never heard about this driver before...

                          • 10. Re: ESX Server 3.5 - Corrupted Active Directory after taking a snapshot
                            alastairc Novice

                            Yes, at least with versions 3 and above of ESX Server, the sync driver is included as part of the default VMware Tools install. Following the problems I had with Active Directory, I don't really trust it anymore and make a point of uninstalling it from any VM that will be running a database of any description.

                            • 11. Re: ESX Server 3.5 - Corrupted Active Directory after taking a snapshot
                              joergriether Hot Shot vExpert

                              It's generally a good idea to disable the sync driver inside any AD domain controller and inside any Exchange server.

                              When it comes to SQL, I never ever had any problems with the sync driver when snapshots were taken. Strange, but that's what I experienced.

                               

                              best regards

                              Joerg

                              • 12. Re: ESX Server 3.5 - Corrupted Active Directory after taking a snapshot
                                JRink Hot Shot

                                 

                                I mentioned yesterday that I'd never even heard of this sync driver prior to reading this thread. I did some searches on VMware's support site and really didn't pull up anything.

                                 

                                 

                                Is there a document from VMware on best practices for snapshotting VMs that explains this and other "gotchas" we should know about? If not, how would I have even known about this?

                                 

                                 

                                • 13. Re: ESX Server 3.5 - Corrupted Active Directory after taking a snapshot
                                  mike.laspina Virtuoso

                                   

                                  I have not seen any official KB either. I would tend to think that if this were a serious issue it would be much more widespread. The corruption is more likely caused by something beyond the mere presence of the LGTO_Sync driver.

                                   

                                   

                                  I would like to look at it more closely in the LAB.

                                   

                                   

                                  And if you turn it off you have another more serious issue.

                                   

                                   

                                  http://supportforums.vizioncore.com/forums/permalink/943/986/ShowThread.aspx#986

                                   

                                   

                                  You will not be crash consistent. Then you will see DB corruption at the time you need to recover the VM. Ugly.

                                   

                                   

                                  Best bet: NTBackup AD locally and then snapshot it.

                                   

                                   

                                  • 14. Re: ESX Server 3.5 - Corrupted Active Directory after taking a snapshot
                                    Justin King Enthusiast

                                    My understanding of this driver is that it was made to directly address the fact that you may snap at a time when not all writes have been committed to disk, so it does a sort of "pause" of the VM to allow it to commit all writes before you snap it. So yes, it is supposed to keep you from having a non-crash-consistent backup.

                                     

                                    However, from my experience, unless you're using some odd DB this is a non-issue anyway, as databases already have built-in resilience to such problems and have had for a long time. Not only that, but the length of time the entire OS is "paused" seems to cause more problems than it solves (from personal experience at least), so I've resorted to removing it in most cases.

                                     

                                     

                                    In the context of AD, I'm not worried. I don't run only one AD server, so in the event of a localized failure I can easily force the FSMO roles onto one of the other DCs, deploy from template, run DCPromo, and be back up in 20 minutes. I wouldn't even bother with a restore. If the problem is widespread and I am forced to do an authoritative restore ... having an inconsistent backup that forces it to be another 15 minutes older (as it rolls back the change logs) is the least of my worries.
