VMware Cloud Community
Nour
Contributor

ESX 3.5: Unable to read *-flat.vmdk on RAID 5 after RAID rebuild

Last night one of the disks in our RAID 5 array failed. After the disk was replaced and the array rebuilt, ESX couldn't start the guest systems, failing with 'Cannot open the disk # or one of the snapshot disks it depends on. Reason: Input/output error.' The array itself is mounted, and its contents are visible in both the shell and the VI Client. I also checked the contents of the .vmdk file, and it looks to be in order. The file I can't open is *-flat.vmdk.

Here is what I get in the vmkernel and vmkwarning logs if I try to download the vmdk from the datastore in the VI Client or simply 'tail' *-flat.vmdk.

vmkernel >

vmkernel: 0:00:12:41.234 cpu1:1075)FS3: 1014: Error reading HB addr 314c00: I/O error

vmkwarning >

vmkernel: 0:00:08:47.844 cpu3:1041)WARNING: FS3: 3368: Failed with bad000a

vmkernel: 0:00:08:47.844 cpu3:1041)WARNING: Fil3: 1791: Failed to reserve volume f530 28 1 496b6034 48dbfc3 3000ecd4 d731c648 0 0 0 0 0 0 0

What could have gone wrong? Any ideas on how to solve it would be appreciated.

15 Replies
Lightbulb
Virtuoso

Your inability to access the file using tail does not bode well for your chances.

I would recommend moving everything that can be moved off this volume, as its functionality is suspect.

Can you verify that the RAID rebuild completed successfully? Is there a utility like HP ADU that you could use to detect issues on the array? You might also want to engage the vendors (VMware and hardware) before proceeding with troubleshooting actions.

Nour
Contributor

There is just one folder on the array's disk, with two files in it: the -flat.vmdk and the corresponding .vmdk.

The RAID controller is an Adaptec 5805; I'll install Storage Manager to see if anything went wrong with the rebuild.

The worst thing is that the disk died during the backup procedure, so there is no way to recover the data. The end of the world as we know it.

Nour
Contributor

In the worst-case scenario, is there a way to salvage at least some data out of the .vmdk?

I was hoping that, since there is just one NTFS partition on this independent persistent disk, there should be some way of recovering partial data.

Nour
Contributor

After Adaptec Storage Manager was installed it reported 'bad stripes' on the array.

I guess ESX has nothing to do with it, and at this point it's off-topic here, but is there any way to recover the good data on the array?

Is it possible to get everything outside the 'bad stripes'? Is there a way to clear the 'bad stripes' table and accept partial data loss?

Any idea will be appreciated.

Lightbulb
Virtuoso

Sorry to say, but you may be out of luck. If the file system cannot be read, it is hard to recover.

You could try some recovery utilities like UBCD4Win (http://ubcd4win.com/) or maybe GParted (http://gparted.sourceforge.net/): boot the VM from the ISOs and see if they can read the file system. If not, I do not know of a way to get the data off. That does not mean that there might not be a way, but I am not hopeful.

kjb007
Immortal

Lightbulb is right. If you can't add the disk to the vm, then you can't get at the data to attempt any type of recovery. Is there a way to fix the bad stripes?

-KjB

VMware vExpert

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
Nour
Contributor

Thank you guys for all the answers so far.

The thing is, since I can read the VMFS partition and there are only two files on it, .vmdk and -flat.vmdk, I was thinking of copying everything I can out of -flat.vmdk onto another disk and then trying to mount it and recover. There is just one NTFS partition inside that vmdk image, and even if some data is corrupted or zeroed, the rest of it can be recovered.

The problem right now is that ESX (or the Adaptec driver) does not allow me to skip the bad data and copy the rest.

I was hoping that somebody knows of a tool that copies the data while ignoring the bad pieces.
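
For what it's worth, once a copy of the -flat.vmdk exists, the NTFS partition inside it could be located by reading the partition table in the first sector. The Python sketch below assumes the flat file is a raw disk image with a classic MBR and 512-byte sectors; the function names are made up for illustration, and nothing in it is specific to VMware.

```python
import struct

SECTOR_SIZE = 512   # assumption: 512-byte sectors
NTFS_TYPE = 0x07    # MBR type 0x07 is used by NTFS (and shared with exFAT/HPFS)


def parse_mbr(sector0):
    """Return (index, type, start_lba, num_sectors) for each used MBR entry."""
    if len(sector0) < 512 or bytes(sector0[510:512]) != b"\x55\xaa":
        raise ValueError("no MBR boot signature (55 AA) at offset 510")
    parts = []
    for i in range(4):
        # The four 16-byte partition entries start at byte offset 446.
        entry = sector0[446 + 16 * i: 446 + 16 * (i + 1)]
        ptype = entry[4]
        # Start LBA and sector count are little-endian 32-bit values at +8, +12.
        start_lba, num_sectors = struct.unpack_from("<II", entry, 8)
        if ptype != 0 and num_sectors != 0:
            parts.append((i, ptype, start_lba, num_sectors))
    return parts


def find_ntfs_offsets(image_path):
    """Return the byte offsets of partitions whose type byte is 0x07."""
    with open(image_path, "rb") as f:
        sector0 = f.read(SECTOR_SIZE)
    return [start * SECTOR_SIZE
            for _, ptype, start, _ in parse_mbr(sector0)
            if ptype == NTFS_TYPE]
```

The byte offset it returns is where recovery tools that work on a partition (rather than a whole disk) would need to start reading. If the first sector itself fell into a bad stripe, the signature check fails and the offset has to be found some other way.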

kjb007
Immortal

Have you tried to see if you can clone the file using vmkfstools?

-KjB

kjb007
Immortal

Also, have you tried to create a new file in that vmfs location to make sure you can still write to it? A simple copy or touch would work as well to verify write access.

-KjB

Lightbulb
Virtuoso

Very good idea

Svante
Enthusiast

Nour, it's tempting to ask "How can there be no other backups?" but that is not helpful at this stage so..

The problem is obviously that there are I/O errors from the disk system when trying to read some, or all, of the blocks that the -flat.vmdk disk file(s) consist of. Your only hope is that those unreadable blocks are very few, and not that important. If the RAID rebuild went sour I don't even dare to guess how bad the damage is though..

I would try to copy the -flat.vmdk file(s) to another file system using some tool that ignores bad blocks and just zeroes them out in the copy, to keep the size and internal structure as close as possible to the original. Then try to add this disk to another VM (not as the system disk though). If you are very lucky it might be possible to rescue some critical files this way. One could also try some FS repair within the guest OS in the new VM (back up the copied disk file first!!) and see if more files can be recovered after that.

I would start by looking at dd_rescue, for instance, for making the initial copy of the damaged -flat.vmdk file(s):

http://www.garloff.de/kurt/linux/ddrescue/

I have successfully rescued data from disks with hard errors this way. It all depends on how much is still there however...
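
To illustrate what such a bad-block-tolerant copy does, here is a minimal Python sketch. It is not dd_rescue and does not claim to reproduce how dd_rescue works; it only shows the idea described above: read in fixed-size blocks, and wherever a read fails with an I/O error, write zeros of the same length so that offsets in the copy still line up with the original. The function names, the 64 KiB block size, and the availability of a Python 3 interpreter on a system that can read the damaged file are all assumptions of this sketch.

```python
import os


def rescue_stream(src, dst, size, block_size=64 * 1024):
    """Copy `size` bytes from src to dst, zero-filling unreadable blocks.

    Returns a list of (offset, length) ranges that could not be read.
    """
    bad_ranges = []
    offset = 0
    while offset < size:
        want = min(block_size, size - offset)
        try:
            src.seek(offset)
            data = src.read(want)
            failed = False
        except OSError:
            # e.g. "Input/output error" from the underlying storage
            data = b""
            failed = True
        if failed:
            dst.write(b"\x00" * want)      # keep size and offsets intact
            bad_ranges.append((offset, want))
            offset += want
        elif not data:
            break                          # unexpected end of file
        else:
            dst.write(data)
            offset += len(data)            # tolerate short reads
    return bad_ranges


def rescue_copy(src_path, dst_path, block_size=64 * 1024):
    """File-path wrapper around rescue_stream (paths are the caller's)."""
    size = os.path.getsize(src_path)
    # buffering=0 so each read request maps to the block being tried
    with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
        return rescue_stream(src, dst, size, block_size)
```

The returned list tells you which byte ranges were zero-filled, which helps when judging how much of the guest file system is likely affected. Smaller block sizes lose less data around each bad spot but make the copy slower. The destination must of course be on a different, healthy disk.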

I wish you the best of luck, I feel your pain :(

EDIT: You should also NOT write anything to that VMFS volume until you are sure nothing else can be recovered from it. You have no idea how bad the allocation has gotten internally, and chances are you are just making things a lot worse. Write-protect it if possible.

Nour
Contributor

Thanks a lot, guys, for looking into it.

Since I have too little experience with ESX and Linux, I decided to hand the disks over to a company specializing in data recovery. Hopefully they'll get some data out of it.

The case is closed, I guess.

There is just one question remaining, and I don't think it's necessary to start a new topic.

Is there some tool for Linux that periodically performs a RAID scan (consistency check, surface scan, or something similar) and sends an email if anything is wrong?

I guess in our case, where there is no full-time sysadmin and the servers are at a remote location shared among several companies, the failure went unnoticed for too long.

Thanks again.

kjb007
Immortal

There are some command-line utilities, but it depends on your RAID storage vendor. HP will have some, Dell will have some, etc. You'll have to find out from your storage vendor what kind of management utilities they offer for Linux.
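
Whichever utility the vendor provides, the periodic check itself can be a small wrapper around it. The Python sketch below is only an outline under assumptions: the status command is passed in by the caller (no particular vendor tool or flags are assumed), the keyword list is a guess that would need tuning to the real output of that tool, and the mail goes through an SMTP server assumed to be listening on localhost.

```python
import smtplib
import subprocess
from email.message import EmailMessage

# Assumption: words that indicate trouble in the vendor tool's report.
# Tune this to the actual output, otherwise false alarms are likely.
BAD_KEYWORDS = ("degraded", "failed", "bad stripe", "offline")


def find_problems(report_text, bad_keywords=BAD_KEYWORDS):
    """Return the report lines that contain any keyword (case-insensitive)."""
    hits = []
    for line in report_text.splitlines():
        lowered = line.lower()
        if any(keyword in lowered for keyword in bad_keywords):
            hits.append(line.strip())
    return hits


def run_status_command(argv):
    """Run the vendor's status command and return its combined output."""
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.stdout + result.stderr


def send_alert(problems, sender, recipient, smtp_host="localhost"):
    """Mail the suspicious lines to the given recipient."""
    msg = EmailMessage()
    msg["Subject"] = "RAID status warning"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(problems))
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)


def check_and_alert(argv, sender, recipient):
    """Run the check once; send mail only if something looks wrong."""
    problems = find_problems(run_status_command(argv))
    if problems:
        send_alert(problems, sender, recipient)
    return problems
```

Run from cron (for example once an hour), this at least makes a degraded array visible to someone. It only reports what the vendor tool reports; scheduling the actual consistency or surface scan is still up to the controller's own utilities.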

-KjB

JohnADCO
Expert

Had this RAID been in service for a long time? Like years to several years?

I only ask because I swear Adaptec RAIDs go whack after several years of service. Not sure why, but I gave up on them entirely last year for good. Over the past decade, every serious unrecoverable RAID issue I have had has been with Adaptec RAIDs that are older than 4 or 5 years.

I guess I am favoring the PERCs these days. We have only had simulated drive failures on our MD3000i SANs, so I guess it remains to be seen with those controllers.

riturajmnnit
Contributor

Hi Nour,

I am also facing a similar issue: ESX crashed due to hard disk issues, and now I'm unable to move the -flat.vmdk file to recreate the VM because of an I/O error.

Were you able to solve this?

Any input will help me a lot.

Regards,
