11 Replies Latest reply on Sep 21, 2011 9:46 AM by zzelazek

    Backup fails after adding a new Win XP guest

    dcorsello Novice

      I've been running ghettoVCBg2 nightly to successful completion for a  few months.  Last night, it failed on the first attempted backup of a  new Windows XP guest, called virtual-xp.  Virtual-xp appeared first in a  list of 11 VMs that were to be backed up.

       

      I'm running ESXi 4.1.0 build 348481.  My backup datastore is NFS, and the OS is Windows XP 64-bit. 

       

      ghettoVCBg2.log contains the following, which is not helpful, except that it shows that the job didn't complete:

       

      08-18-2011 00:05:02 --  info: ============================== ghettoVCBg2 LOG START ==============================
              08-18-2011 00:05:02 --  info: CONFIG - BACKUP_LOG_OUTPUT = /tmp/ghettoVCBg2.log
              08-18-2011 00:05:02 --  info: CONFIG - VM_BACKUP_DATASTORE = backups1
              08-18-2011 00:05:02 --  info: CONFIG - VM_BACKUP_DIRECTORY = backups
              08-18-2011 00:05:02 --  info: CONFIG - DISK_BACKUP_FORMAT = thin
              08-18-2011 00:05:02 --  info: CONFIG - ADAPTER_FORMAT = buslogic
              08-18-2011 00:05:02 --  info: CONFIG - POWER_VM_DOWN_BEFORE_BACKUP = NO
              08-18-2011 00:05:02 --  info: CONFIG - VM_SNAPSHOT_MEMORY = NO
              08-18-2011 00:05:02 --  info: CONFIG - VM_SNAPSHOT_QUIESCE = NO
              08-18-2011 00:05:02 --  info: CONFIG - VM_BACKUP_DIR_NAMING_CONVENTION = 2011-08-18
              08-18-2011 00:05:02 --  info: CONFIG - VM_VMDK_FILES = all

       

              08-18-2011 00:05:02 --  debug: Main: Login by vi-fastpass to: 192.168.60.2
              08-18-2011 00:05:02 --  debug: copyTask: Task START
              08-18-2011 00:05:02 --  debug: copyTask: waiting for next job and sleep ...
              08-18-2011 00:05:05 --  info: Initiate backup for virtual-xp found on 192.168.60.2
              08-18-2011 00:05:08 --  debug: virtual-xp original powerState: poweredOn
              08-18-2011 00:05:08 --  debug: Creating Snapshot "ghettoVCBg2-snapshot-2011-08-18" for virtual-xp
              08-18-2011 00:05:18 --  info: virtual-xp has 2 VMDK(s)
               08-18-2011 00:05:18 --  debug: backupVMDK: Backing up "[datastore1]  virtual-xp/virtual-xp.vmdk" to "[backups1]  backups/virtual-xp/virtual-xp-2011-08-18/virtual-xp.vmdk"
              08-18-2011 00:05:18 --  debug: backupVMDK: Signal copyThread to start
              08-18-2011 00:05:18 --  debug: backupVMDK: Backup progress: Elapsed time 0 min
              08-18-2011 00:05:18 --  debug: copyTask: Wake up and follow the white rabbit, with status: doCopy
              08-18-2011 00:05:18 --  debug: CopyThread: Start backing up VMDK(s) ...
              08-18-2011 00:20:20 --  debug: backupVMDK: Backup progress: Elapsed time 15 min
              08-18-2011 00:35:21 --  debug: backupVMDK: Backup progress: Elapsed time 30 min
              08-18-2011 00:50:22 --  debug: backupVMDK: Backup progress: Elapsed time 45 min
              08-18-2011 01:05:23 --  debug: backupVMDK: Backup progress: Elapsed time 60 min
              08-18-2011 01:08:14 --  debug: copyTask: send copySuccess message ...
              08-18-2011 01:08:14 --  debug: copyTask: waiting for next job and sleep ...
               08-18-2011 01:08:24 --  debug: backupVMDK: Successfully completed  backup for [datastore1] virtual-xp/virtual-xp.vmdk Elapsed time: 63 min
               08-18-2011 01:08:24 --  debug: backupVMDK: Backing up "[datastore1]  virtual-xp/virtual-xp_1.vmdk" to "[backups1]  backups/virtual-xp/virtual-xp-2011-08-18/virtual-xp_1.vmdk"
              08-18-2011 01:08:24 --  debug: backupVMDK: Signal copyThread to start
              08-18-2011 01:08:24 --  debug: backupVMDK: Backup progress: Elapsed time 0 min
              08-18-2011 01:08:24 --  debug: copyTask: Wake up and follow the white rabbit, with status: doCopy
              08-18-2011 01:08:24 --  debug: CopyThread: Start backing up VMDK(s) ...
              08-18-2011 01:23:25 --  debug: backupVMDK: Backup progress: Elapsed time 15 min
              08-18-2011 01:39:30 --  debug: copyTask: send copySuccess message ...
              08-18-2011 01:39:30 --  debug: copyTask: waiting for next job and sleep ...

       

       

       

      The first sign of trouble I see in the system logs, based on my limited understanding, is this excerpt from the hostd.log:

       

      [2011-08-18  01:50:10.535 3E344B90 info 'DiskLib'] DISKLIB-VMFS :VmfsMoveDataInt :  failed to move data (Input/output error:0x50009).
      [2011-08-18  01:50:10.535 3E344B90 info 'DiskLib'] DISKLIB-LIB   :  DiskLibCloneChunks: Could not perform clone using vmkernel data mover.  Falling back to non-accelerated clone.

       

       

      I see the following recorded in hostd.log about a half-hour later:

       

      [2011-08-18  02:24:40.547 3E344B90 info 'DiskLib'] DISKLIB-VMFS   :VmfsExtentCommonClose: Failed to close aiohandle 3E4D97B0: Input/output  error (1282)
      [2011-08-18 02:24:40.547 3E344B90 info 'DiskLib']  DISKLIB-VMFS  :  "/vmfs/volumes/4ff2ffda-827547f7/backups/virtual-xp/virtual-xp-2011-08-18/virtual-xp_1-flat.vmdk"  : closed.
      [2011-08-18 02:24:40.547 3E344B90 info 'DiskLib'] DISKLIB-LIB   : Failed to close handle "3DF38C24".
      [2011-08-18  02:24:40.816 3E344B90 info 'DiskLib'] DISKLIB-VMFS  :  "/vmfs/volumes/4ff2ffda-827547f7/backups/virtual-xp/virtual-xp-2011-08-18/virtual-xp_1-flat.vmdk"  : open successful (1041) size = 53711078400, hd = 0. Type 3
      [2011-08-18  02:24:40.969 3E344B90 info 'DiskLib'] DISKLIB-VMFS  :  "/vmfs/volumes/4ff2ffda-827547f7/backups/virtual-xp/virtual-xp-2011-08-18/virtual-xp_1-flat.vmdk"  : closed.
      [2011-08-18 02:24:41.158 3E344B90 info 'DiskLib']  DISKLIB-LIB   : Cannot remove extent  `/vmfs/volumes/4ff2ffda-827547f7/backups/virtual-xp/virtual-xp-2011-08-18/virtual-xp_1-flat.vmdk':  Insufficient permission to access file
      [2011-08-18 02:24:41.158  3E344B90 info 'DiskLib'] DISKLIB-LIB   : Failed to delete disk  '/vmfs/volumes/4ff2ffda-827547f7/backups/virtual-xp/virtual-xp-2011-08-18/virtual-xp_1.vmdk'  or one of its components: Insufficient permission to access file
      [2011-08-18 02:24:41.158 3E344B90 info 'DiskLib'] DISKLIB-LIB   : Failed to clone : Input/output error (327689).
      [2011-08-18 02:24:41.158 3E344B90 warning 'VdisksvcPlugin'] Failed to clone disk : Input/output error (327689).
      [2011-08-18  02:24:41.158 3E344B90 info 'DiskLib'] DISKLIB-VMFS  :  "/vmfs/volumes/4d7e80b5-6fe224bc-d4bc-bcaec53d2323/virtual-xp/virtual-xp_1-flat.vmdk"  : closed.
      [2011-08-18 02:24:41.158 3E344B90 warning 'VdisksvcPlugin'] Got Methodfault exception: vmodl.fault.SystemError
      [2011-08-18  02:24:41.158 3E344B90 info 'TaskManager'] Task Completed :  haTask--vim.VirtualDiskManager.copyVirtualDisk-68879 Status  error       

       

      My understanding of ESXi is limited.  Can someone shed some light?

       

      Thanks.

        • 1. Re: Backup fails after adding a new Win XP guest
          lamw Guru
          Community WarriorsVMware Employees

          The issue looks like it's storage related to your NFS volume, as you can see there are several error messages including some input/output errors. I would suggest investigating your storage and make sure nothing was modified recently. The permission error is also something you'll want to look into, as part of the rotation of the script, it'll remove past archives. Is this the only VM that's having issues? If so, an easy test is to back it up against local storage and see if you run into same issue, but this looks like an issue with your storage sub-system or possibly performance on the host.

          • 2. Re: Backup fails after adding a new Win XP guest
            dcorsello Novice

            Hi lamw,

             

            Thanks for your reply, and many thanks for your script and your blog.

             

            Only one VM is having issues.  It differs from all other VMs in that it's running Windows XP Pro, whereas all others are running Linux or BSD, and it has two virtual disks, instead of one like all the others.  Formerly, another difference was that it was converted from a Workstation VM, but I rebuilt it from scratch within ESXi yesterday.  The rebuild didn't improve things.

             

            These are the symptoms:

             

            -  The copy of the Windows XP  VM to the backup volume goes extremely slowly.

            -  Sometimes, the .vmdk files are copied, but not the .vmx file.

            -  The move of the files between folders on the backup volume takes several hours.

            -  After the files are moved, no cleanup is done to roll off the oldest backup.

             

            No similar problems are encountered while backing up the other guests during the same run.

             

            On my ESXi host, I'm using 4% CPU, 69% memory and 86% storage. There are a total of 12 guests running on a single CPU and 12GB RAM.  There are plenty of resources to spare within each guest, and I see no guest performance issues.

             

            The backup volume is an NFS volume on a machine running Windows XP Pro 64-bit with 8GB RAM and a Highpoint RocketRAID 1720 card with two 2TB drives in RAID 1.  I'm using hanewin nfsd.  Both the OS and the NFS volume are on the same array.  Free space on the NFS volume is 313 GB; total space is 1.81 TB.

             

            A few days ago, I cleaned up inventory in ESXi by deleting from disk a couple of VMs that were not in use, and I rebooted the host.

             

            After this, the backup of the problematic VM actually ran to completion a couple of times, and I thought the problem was solved.

             

            But then I successfully installed the LSI MegaRAID SAS driver version 5.29 on my host, and the backup problems returned.  I don't see how this is related, though, because all guests but one continue to backup with no problem.

             

            I don't think free space on the NFS volume is the issue, because the problematic VM is not the largest VM.  The largest disk that's backed up is 100GB, and it belongs to a FreeBSD VM; the disks in the problematic VM are 20GB and 50GB.

             

            As I mentioned, I rebuilt the Windows VM from scratch in ESXi yesterday.  I also cleaned up the NFS machine by deleting temp files, correcting an insufficient virtual memory setting and upgrading hanewin to the latest version.  I also reset the maximum NFS transfer size in hanewin to the default value of 8192; it had been much higher.

             

            After making these changes, it seems like the files were copied from the host to the backup volume in a more  reasonable amount of time.  However, the moving of files from one folder to another on the backup volume continues to present a problem--it's still running after several hours.

             

            I'm not knowledgeable enough to decipher the hostd log.  I see a number of entries that look like errors, but I have no idea if they're related to my problem, or if they're serious.

             

            Sorry for the long post.  I hope this information gives you a better picture of my setup and my problems.

             

            Thanks,

            Dave

            • 3. Re: Backup fails after adding a new Win XP guest
              Dagnabbit Novice

              I am out of the office Friday, August 26th, and will return Monday August 29th.

               

              Thank you,

               

              Ben Gray

              • 4. Re: Backup fails after adding a new Win XP guest
                dcorsello Novice

                I just watched the backup of a FreeBSD VM with a single, 100GB disk.  The copy from the local datastore to the NFS backup volume took 1.5 hours, and the move from one folder to another on the backup volume happened instantaneously.

                • 5. Re: Backup fails after adding a new Win XP guest
                  chrwei Enthusiast

                  The errors in the hostd log are not detailed, but do relate to storage issues.  that's really all that can be gleaned from them.

                   

                  I have seen similar issues before, but not in quite the same situation.  I had a linux nfs export that was a lessfs fuse based mount.  the lessfs developer knows about issues when using nfs with it, and the issues cause large pauses in IO, which nfs clients think means the host crashed.

                   

                  Your nfsd host is running on gigabit right?  Does does have a server class NIC or desktop class on-board one like a realtek or similar?  I have found that the realtek on board nics tend to get slow at about half a gigabit.

                   

                  from an ESXi perspective, 2 sata disks in raid1 on a highpoint (which are renowned to have issues under high load) is going to eventually trigger the errors in hostd you are seeing.  Due to the way the highpoint firmware works, you will almost never see write performance better than a single sata disk, outside of sequential benchmarks.

                   

                  I'm not familiar with the nfsd you are using, but using llinux's nfsd you have to set the export to use async in order to get any sort of acceptable performance.  This may be part of the issue as well.

                   

                  As for guest OS's and drivers, it doesn't matter.  Once the snapshot is taken, it's just esxi's hostd copying a vmdk file, hostd doens't care what OS it, and the OS doesn't even know it's happening.  However, if the guest has a lot of active reads, it will be accessing the vmdk too, and this simultaneous vmdk file access could slow things down.  It should not causes errors like you are seeing though.

                   

                  In short: a realtek nic, combined with the highpoint and an effective single sata disk, combined with a nfsd that may be in synchronous write mode, combined with ntfs tuned for desktop usage and not server could all add up to just enough IO latency to make ESXi think the server has crashed.  Your lowering of nfs transfer sizes supports this.

                   

                  without replacing hardware, verify async on the nfsd export.  Maybe also lower the transfer size some more, which will make it go slower, but might be more relliable.

                  • 6. Re: Backup fails after adding a new Win XP guest
                    dcorsello Novice

                    chrwei,

                     

                    Thanks for your reply.

                     

                    The nfs host has a desktop-grade gigabit adapter, and the Highpoint RAID card was chosen because it fit my budget.

                     

                    The nfsd has an option for async when using nfs 2, but no such option for nfs 3.  I think that's because nfs 3 supports async by default, right?  And since ESXi requires nfs3, I would assume that the IO is being done asynchronously.

                     

                    I don't think performance issues explain what's happening, though, because every file operation would be affected equally by poor hardware and software performance.  What's happening is that one vmdk, whose provisioned space is twice that of another, is being copied to and moved between folders on the nfs host much faster than the smaller vmdk.

                     

                    Could it be a matter of differences in the amount of used space the disks contain?  From an OS perspective, the number of bytes of used space on each of the larger-provisioned disks is much less than the number of used bytes on the problematic disk.  Does hostd care at all about what the vmdk's contain?  Maybe lots of similar data, like binary zeros, or whatever a virtual disk's free space contains, is moved much faster by hostd?  I think this is a stretch, but I'm at a loss.

                     

                    --Dave

                    • 7. Re: Backup fails after adding a new Win XP guest
                      chrwei Enthusiast

                      if the destination is a thin provisioned vmdk, then yes.  The nfs store will create it as a sparse file and then copies will take less time.  And actually I think esx always makes vmdk's on nfs thin disks, so long as the nfs server says it supports sparse files.

                       

                      From http://www.faqs.org/rfcs/rfc1813.html async was introduced as an official stable feature in nfs3.  I did not read the whole document, but what I did read it is implied that it is optional (usage of "if" around the feature and commands).  In the linux kernel server nfs implementation, async is not enabled by default.

                       

                      Have you monitored the xp host using perfmon.msc durring a failure?  disk queue length would be a good indicator of a storage related issue.

                      1 person found this helpful
                      • 8. Re: Backup fails after adding a new Win XP guest
                        dcorsello Novice

                        ghettoVCBg2 is configured with 'my $DISK_BACKUP_FORMAT = "thin";'.

                         

                        My understanding is that with this configuration, if a source vmdk is thick, then the resulting backup vmdk should be equal in size to the source's provisioned size, and if a source vmdk is thin, then the resulting backup vmdk should be equal in size to the source's used size.  Is this correct?

                         

                        If so, then my results are strange in this regard:  for all thin source vmdk's, the backup vmdk is equal in size to the source's provisioned size, not its used size.  I wonder if this is more evidence of strangeness in the interraction between ESXi and my nfsd.

                         

                        I wasn't aware of perfmon in XP.  Thanks.

                        • 9. Re: Backup fails after adding a new Win XP guest
                          chrwei Enthusiast

                          ESX's "thin" is different that Workstation/player/server's "allocate on demand".  ESX makes a sparse file, which will appear to be the full provisioned size, but on a filesystem that supports sparse files it will only consume the amount of space that it needs.  If this file is then copied, the tool making the copy will also need to support sparse files or it will again be fully allocated. 

                           

                          Also worth noting: on an older (time since creation, not version) vmdk or one that sees a lot of files changing or add/removes, the guests filesystem MAY decide to make new data in previously unsused space so as to preserve a possible undelete.  This will eventually allocate the entire vmdk at a block level.  VMWare products don't communicate directly with the guest OS as to what blocks are currently allocated in the filesystem, they only know that "this block is non-zero and so is used" and so the block is preserved, even on conversion to thin.  You'd need to write zero's to the free space in the guest (dd on *nix, sdelete.exe on windows) in order for the conversion to thin to recognize the free space as actually unused.

                          1 person found this helpful
                          • 10. Re: Backup fails after adding a new Win XP guest
                            vmbru Enthusiast

                            I am currently out of the office until September 26th.  You can reach another

                            associate of the Information Technology Department at 513-763-1822. 

                            Thank you.

                             

                             

                             

                            NOTICE: The information contained in this electronic mail transmission is intended by Kemba Credit Union, Inc. for the use of the named individual or entity to which it is directed and may contain information that is privileged, confidential and exempt from disclosure under applicable law. If the reader of this email is not the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. If you have received this electronic mail transmission in error, please delete it from your system without copying or forwarding it, and notify the sender of the error by reply email, so that the sender's address records can be corrected.

                            • 11. Re: Backup fails after adding a new Win XP guest
                              zzelazek Lurker

                              Uprzejmie informuje ze w dniach 19.09. - 23.09  jestem niedostępny

                              W sprawach pilnych proszę o kontakt  ze Zbigniewem Kamińskim 609 788 035.