Continuous CorruptRedoLog errors with snapshot CID/parentCID issues involved

Here is the scenario...

3 hosts all using iSCSI (no HA, no DRS, no clustering etc...) all running ESX 3.5 build 153875

approx. 6 - 9 guests per host

at night backups run against all powered on VM's using Vizioncore's ESX vRanger product

vRanger sends an API call to VMWare to create snapshots of the VM (up to 67 of them - why I don't know, but they say that is normal....ok)

vRanger then proceeds to back the server up

Once backup completes then vRanger sends a second API call to VMWare to consolidate the snapshots.... Now the problems begin

at some point usually when it gets to the last (or the second to last) snapshot, something happens and the process gets corrupted...

VM then presents a question/error message:

msg.hbacommon.corruptredo: The RedoLog of SERVERNAME-000001.vmdk has been detected to be corrupt. The virtual machine needs to be powered down.
+ If the problem still persists, you need to discard the redolog+

We respond to this event by doing the following:

Finding the VMID with cat /proc/vmware/vm/*/names |grep VMNAME.VMX

Use the VMID to find the Master user world id with less /proc/vmware/vm/VMID#/cpu/status

killing the process that has the VM in stuck state /usr/lib/vmware/bin/vmkload_app -k 9 MASTER WORLD ID#

waiting a bit for the VM to refresh in vCenter

Then, check the snapshot manager to see if there are any snapshots. If there are, then attempt to DELETE ALL. If all works great. power on and move on.

If not, OR if there are no snapshots listed go to the console, CD to the VM's directory and do the following: (*note there are probably easier ways to do this, but this one is the one that is most comfortable to me)

ls -ltrh - to list the files in the directory

cat SERVERNAME-00000x.VMDK (where x is the number of the disk that hard disk one is reported to be running on in vCenter) - this is done to show the CID and parentCID numbers in the VMDK

perform the same step as above until you have listed each VMDK from current to the parent. Then examine the CID and parentCID numbers and adjust them so that they are all pointing to their parent.

This is an example of what we see when the problem occurs : (NOTICE THAT THE FIRST DISK IS POINTING TO ITSELF AS THE PARENT)

# cat ktdnwbdv01_1-000001.vmdk
# Disk DescriptorFile
* parentCID=9203d13e*

# Extent description
RW 85983232 VMFSSPARSE "ktdnwbdv01_1-000001-delta.vmdk"
# The Disk Data Base
# cat ktdnwbdv01_1.vmdk
# Disk DescriptorFile

# Extent description
RW 85983232 VMFS "ktdnwbdv01_1-flat.vmdk"
# The Disk Data Base
ddb.virtualHWVersion = "4"
ddb.uuid = "60 00 C2 95 47 d6 da 47-61 8c 1c 84 96 87 2b 7b"
ddb.geometry.cylinders = "5352"
ddb.geometry.heads = "255"
ddb.geometry.sectors = "63"
ddb.adapterType = "lsilogic"
ddb.toolsVersion = "7303"

The disk pointing to itself as the parent is the reason why the snapshot consolidation fails. So in order to fix this we do the following:

Using nano command, edit the first disk so that it has a unique CID. Next validate that the parentCID is the same as the CID of the parent disk file. Continue this for each disk file down or up to the parent file.

Once the CID/parentCID issue is taken care of, go back to vCenter and if there are snapshots in snapshot manager, then select Delete All. (95% of the time, this will succeed in removing ALL deltas and snaps), power up and move on. If there are no snapshots, create one, then attempt the Delete All (60 - 80% of the time this will fail because the snap that gets created will have CID and parentCID that points to itself. SOOOO you have to go and edit the snapshot file so that it is pointing correctly, THEN Delete All (this will most likely work) Done!

Now here is the real problem.... This will eventually occur again. and we don't know why.

Any ideas. What should I check, what should i be looking at. We previously had some iSCSI problems with cabling where connectors were bad or wire pairs were crossed and it was affecting speed and connectivity to one path. This was filling up the logs so we could never see anything but those errors, but that has long since been fixed (and it did have a positive affect on the number of times this occurs). Also, the problem will occur almost 100% of the time if the VMTools is out of date, but we have turned on the option to update the tools at poweron and made sure that they are all updated. It still occurs randomly and without any rhyme or reason. The latest occurence occured last night on a VM that is affected constantly. Here is the vmware.log around the time of the issue. The area in red is the only ERROR i can find.

Jul 09 21:02:56.822: vmx| DISKLIB-DSCPTR: Opened : "ktdnwbdv01-flat.vmdk" (0xe)
Jul 09 21:02:56.825: vmx| DISKLIB-LINK : Opened '/vmfs/volumes/47e1388d-5470c616-4e55-0019b9f59a10/ktdn
wbdv01_1/ktdnwbdv01.vmdk' (0xe): vmfs, 62914560 sectors / 30720 Mb.
Jul 09 21:02:56.898: vmx| DISKLIB-LIB : Opened "/vmfs/volumes/47e1388d-5470c616-4e55-0019b9f59a10/ktdn
wbdv01_1/ktdnwbdv01-000001.vmdk" (flags 0xa). 878D6DC
Jul 09 21:02:57.375: vmx| DISK: OPEN '/vmfs/volumes/47e1388d-5470c616-4e55-0019b9f59a10/ktdnwbdv01_1/ktd
nwbdv01-000001.vmdk' Geo (3916/255/63) BIOS Geo (0/0/0) freeSpace=261656Mb, lsilogic
Jul 09 21:02:57.377: vmx| Creating virtual dev for 0:0
Jul 09 21:02:57.377: vmx| DumpDiskInfo: scsi0:0 createType=11, capacity = 62914560, numLinks = 2, alloca
tionType = 0
Jul 09 21:02:57.377: vmx| SCSIDiskESXCreateVirtualDev: scsi0:0 passthruTransientErrors set to 1
Jul 09 21:02:57.378: vmx| SCSIDiskESXPopulateVDevDesc: Using FS backend
Jul 09 21:02:57.378: vmx| VMMon_SetStorageSchedPolicy: Not supported
Jul 09 21:02:57.378: vmx| Creating virtual dev for 0:1
Jul 09 21:02:57.378: vmx| DumpDiskInfo: scsi0:1 createType=11, capacity = 85983232, numLinks = 2, alloca
tionType = 0
Jul 09 21:02:57.378: vmx| SCSIDiskESXCreateVirtualDev: scsi0:1 passthruTransientErrors set to 1
Jul 09 21:02:57.378: vmx| SCSIDiskESXPopulateVDevDesc: Using FS backend
Jul 09 21:02:57.378: vmx| VMMon_SetStorageSchedPolicy: Not supported
Jul 09 21:02:57.381: vmx| SnapshotVMX done with snapshot '_EsxRanger_Backup': 329
Jul 09 21:02:57.704: vcpu-0| HBACommon: First write on scsi0:0.fileName='/vmfs/volumes/47e1388d-5470c616
Jul 09 21:02:57.712: vcpu-0| DISKLIB-CHAIN : UpdateContentID: old = 0x635f2c6, new = 0xf48b2c87
Jul 09 22:00:09.765: vcpu-0| Msg_Question:
Jul 09 22:00:09.765: vcpu-0| http://msg.hbacommon.corruptredo The RedoLog of ktdnwbdv01-000001.vmdk has been
detected to be corrupt. The
Jul 09 22:00:09.765: vcpu-0| virtual machine needs to be powered down. If the problem
Jul 09 22:00:09.765: vcpu-0| still persists, you need to discard the redolog
Jul 09 22:00:09.765: vcpu-0| -

Anybody have any assistance they can add is appreciated. I have taken this to VMWare support numerous times with no real resolution.



0 Kudos
1 Reply

0 Kudos