Re: Corrupted redolog again

GlenB · 07-01-2010

It happened again. Another domestic power interruption and another corrupt redolog. It seems like the VMware code is badly enough designed that it leaves open such windows. It should come with a product warning that says "do not use unless you have 100% backup power". Every couple of months I am wasting hours or days rebuilding damaged machines.

So, to start with the simple stuff, I tried to create a snapshot and then delete all snapshots. Creating was no problem. Deleting all was a problem. It got to 95% and timed out after 15 minutes. I tried twice. The snapshot manager first thought the snapshot was still there, but after the second attempt it appears to be gone. The guest starts up OK on restart, then issues the corrupt redolog message again and cancels.

So what does the vmfs think is there?

/vmfs/volumes/4ab6c4ff-de1e6ea7-d316-0024e8734364/User # ls -al
drwxr-xr-x 1 root root 3220 Jul 1 16:20 .
drwxr-xr-t 1 root root 3500 May 7 22:27 ..
-rw------- 1 root root 107389255680 Jul 1 16:17 User-flat.vmdk
-rw------- 1 root root 8684 Jul 1 16:11 User.nvram
-rw------- 1 root root 399 Jul 1 16:12 User.vmdk
-rw------- 1 root root 482 Jul 1 15:25 User.vmsd
-rwxr-xr-x 1 root root 2416 Jul 1 16:11 User.vmx
-rw------- 1 root root 259 Apr 5 05:06 User.vmxf
-rw------- 1 root root 20451952640 Jul 1 16:05 User_1-000001-delta.vmdk
-rw------- 1 root root 248 Jul 1 15:41 User_1-000001.vmdk
-rw------- 1 root root 216711825408 Jul 1 15:41 User_1-000002-delta.vmdk
-rw------- 1 root root 255 Jul 1 15:04 User_1-000002.vmdk
-rw------- 1 root root 17303552 Jul 1 16:16 User_1-000003-delta.vmdk
-rw------- 1 root root 255 Jul 1 16:12 User_1-000003.vmdk
-rw------- 1 root root 274877906944 Apr 3 14:44 User_1-flat.vmdk
-rw------- 1 root root 401 Jul 1 15:41 User_1.vmdk
-rw-r--r-- 1 root root 33607 Jul 1 14:29 vmware-10.log
-rw-r--r-- 1 root root 33109 Jul 1 15:22 vmware-11.log
-rw-r--r-- 1 root root 32550 Jun 5 03:05 vmware-6.log
-rw-r--r-- 1 root root 40331 Jun 10 04:58 vmware-7.log
-rw-r--r-- 1 root root 360448 Jul 1 03:38 vmware-8.log
-rw-r--r-- 1 root root 31845 Jul 1 11:38 vmware-9.log
-rw-r--r-- 1 root root 33914 Jul 1 16:20 vmware.log

So that indicates a disk structure that looks like this:

User.vmx
  C: User.vmdk + User-flat.vmdk (100 GB)
  D: User_1.vmdk + User_1-flat.vmdk (256 GB)
     + -000001.vmdk and -delta ( 20 GB)
     + -000002.vmdk and -delta (210 GB)
     + -000003.vmdk and -delta ( 17 MB)

One of the deltas appears to be corrupted - how do I know which one?

I can edit the vmdk files to relink around the damaged one, but I lose a lot of edits in the process, don't I? Some of those I can probably recover from backups, but I'll never know for sure if I've lost anything. The bad VMware design is becoming very annoying!
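Before relinking anything by hand, it may help to check how the descriptor files currently point at each other. The sketch below reads the small text descriptors (the .vmdk files of a few hundred bytes, not the -flat or -delta files) and walks from a leaf back to the base disk, comparing each child's parentCID with its parent's CID. The file names are taken from the listing above, but the CID values and the chain order shown are made up for illustration. Note the limit of this check: it can reveal a broken link between descriptors, but it cannot tell whether the data inside a delta file is damaged.

```python
import re

def parse_descriptor(text):
    """Pull the chain-related fields out of a VMDK descriptor's text."""
    fields = {}
    for key in ("CID", "parentCID", "parentFileNameHint"):
        m = re.search(rf'^{key}\s*=\s*"?([^"\r\n]*)"?\s*$', text, re.MULTILINE)
        if m:
            fields[key] = m.group(1).strip()
    return fields

def check_chain(descriptors, leaf):
    """Walk from the leaf descriptor to the base disk and list broken links.

    descriptors maps a descriptor file name to its text content.
    """
    problems = []
    name = leaf
    seen = set()
    while name not in seen:
        seen.add(name)
        child = parse_descriptor(descriptors[name])
        parent_name = child.get("parentFileNameHint")
        if parent_name is None:
            break  # no parent hint: this is the base disk
        if parent_name not in descriptors:
            problems.append(f"{name}: parent {parent_name} not found")
            break
        parent = parse_descriptor(descriptors[parent_name])
        if child.get("parentCID") != parent.get("CID"):
            problems.append(
                f"{name}: parentCID {child.get('parentCID')} does not match "
                f"CID {parent.get('CID')} of {parent_name}"
            )
        name = parent_name
    return problems

# Illustrative descriptors only: the CID values here are invented.
EXAMPLE = {
    "User_1.vmdk": 'CID=aaaa1111\nparentCID=ffffffff\n',
    "User_1-000001.vmdk": 'CID=bbbb2222\nparentCID=aaaa1111\n'
                          'parentFileNameHint="User_1.vmdk"\n',
    "User_1-000002.vmdk": 'CID=cccc3333\nparentCID=bbbb2222\n'
                          'parentFileNameHint="User_1-000001.vmdk"\n',
}

print(check_chain(EXAMPLE, "User_1-000002.vmdk"))
```

On a real datastore the dictionary would be filled by reading each descriptor file, and any edit to a descriptor is best made on a copy with the originals kept aside.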

Regards - Glen

RvdNieuwendijk · 07-01-2010

Hi Glen,

If you delete a snapshot and get a timeout after 15 minutes, the timeout comes from the vCenter server, but the snapshot deletion process is still running on the ESX server. So don't try a second time; just wait until the snapshot is gone. This can take hours. For more information see: Large snapshot delete operations time out in VirtualCenter.

To repair the corrupted VM your best bet is probably to create a new VM and restore a backup. Although you might lose information created after the backup, you can be sure that the VM is not corrupted anymore.

If you have a lot of power interruptions I would think about an uninterruptible power supply and an emergency power system.

Regards, Robert

Blog: https://rvdnieuwendijk.com/ | Twitter: @rvdnieuwendijk | Author of: https://www.packtpub.com/virtualization-and-cloud/learning-powercli-second-edition

GlenB · 07-01-2010

Thanks. The snapshot was in existence for only minutes. Its deletion did, as you expected, run to completion - finally. But that did NOT do anything about the corrupted redolog. The message from VM says "if the problem still exists, you need to discard the redolog". But I have yet to find anyone who can tell me HOW to do that. It was suggested that adding a snapshot then deleting all snapshots would do it, but apparently not. Do you have any ideas?

As for the UPS ... it's a great idea. But I need one that can keep a 500W power supply running for a couple of hours in order to prevent the problem. That's a LOT of batteries! Once upon a time I used to have a UPS that had a serial cable that plugged into my Windows server. The UPS told the server to shut itself down when the remaining power in the UPS reached a configured level. That was very useful. Do you know of any product that does that to talk to a VMware host machine running ESX 3.5i?

Regards - Glen

Jackobli · 07-04-2010

>As for the UPS ... it's a great idea. But I need one that can keep a 500W power supply running for a couple of hours in order to prevent the problem.

Most UPS vendors have calculators, like this one. For a couple of hours you would probably have to look at a generator-based system, which sounds kind of pricey. I would shut down earlier.
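To put a rough number on "a lot of batteries", here is a back-of-the-envelope sketch of the arithmetic such calculators do. The 500 W load and the two hours come from the post above; the 12 V battery voltage, the 90% inverter efficiency and the 50% usable battery fraction are assumptions picked for illustration, so substitute the figures for your own hardware.

```python
def battery_amp_hours(load_watts, hours, battery_volts=12.0,
                      inverter_efficiency=0.9, usable_fraction=0.5):
    """Estimate the rated battery capacity (Ah) needed to carry a load.

    All default values are illustrative assumptions, not vendor data.
    """
    energy_wh = load_watts * hours               # energy delivered to the load
    drawn_wh = energy_wh / inverter_efficiency   # energy taken from the battery
    rated_wh = drawn_wh / usable_fraction        # only part of the rating is usable
    return rated_wh / battery_volts

# 500 W for 2 hours with the assumed defaults
print(round(battery_amp_hours(500, 2)))  # 185
```

The 500 W figure is the rating of the power supply, which is a maximum. The host's measured draw may be lower, which would shrink the result accordingly, as would shutting down after a few minutes instead of riding out the outage.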

>Once upon a time I used to have a UPS that had a serial cable that plugged into my Windows server. The UPS told the server to shut itself down when the remaining power in the UPS reached a configured level. That was very useful. Do you know of any product that does that to talk to a VMware host machine running ESX 3.5i?

Have a look at this thread or search for UPS and ESXi. In short, the best option is a UPS with a network (IP) connection.

DSTAVERT · 07-04-2010

Have a look at this script for VM shutdown http://communities.vmware.com/docs/DOC-11902. If memory serves it is for APC but could be crafted to work with others.
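The decision such a script has to make can be sketched independently of any UPS brand. The code below is not the linked script and does not talk to a UPS or to ESX. The UpsStatus type, both function names and the five minute margin are hypothetical, and reading the UPS status and actually shutting down guests and the host are left out because they depend on the UPS model and the host tooling.

```python
from dataclasses import dataclass

@dataclass
class UpsStatus:
    """Hypothetical snapshot of what a UPS reports."""
    on_battery: bool
    runtime_minutes: float  # estimated runtime left on battery

def should_shut_down(status, shutdown_minutes_needed, safety_margin_minutes=5.0):
    """Return True once the remaining runtime no longer covers a clean shutdown."""
    if not status.on_battery:
        return False  # mains power is present, nothing to do
    return status.runtime_minutes <= shutdown_minutes_needed + safety_margin_minutes

def plan_shutdown(vm_names, host_name):
    """Order the work so guests go down first and the host goes last."""
    return [("guest", name) for name in vm_names] + [("host", host_name)]

status = UpsStatus(on_battery=True, runtime_minutes=12.0)
if should_shut_down(status, shutdown_minutes_needed=10.0):
    for kind, name in plan_shutdown(["User"], "esx-host"):
        print(kind, name)
```

Triggering on remaining runtime rather than on the power failure itself lets short interruptions pass without a shutdown, while still leaving enough time for the guests to flush their disks.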

As for your problem, here is a suggestion beyond getting your VMs to shut down before the UPS dies. If your disk controller supports a battery-backed cache module, add one. If your controller doesn't support a battery module, consider getting one that does. Any crash where data isn't properly written to disk can cause corruption. The battery module preserves uncommitted data until power is restored.

-- David -- VMware Communities Moderator