Re: Recover corrupted VMFS filesystem -- URGENT!

branfarm1 · ‎02-09-2008

Hey Everybody, this is a last ditch effort before I am forced to start over from backups -- is there any way to recover a corrupted vmfs filesystem? Long story short, one of my LUNs on my SAN went bad due to a controller glitch and now I am unable to get any of my ESX hosts to recognize it. They can all see the LUN, but they don't see that there is a vmfs filesystem located there. Does anyone know of a way to get ESX to recover the VMFS filesystem located on the LUN?

Thanks for the help!

contra422 · ‎02-09-2008

Just a suggestion, as I have no experience of this myself: look at the tools e2fsck or tune2fs, or maybe vmkfstools. I would guess that the controller error means the data on the whole disk is bad and might not be recoverable by rolling transactions back.

depping · ‎02-09-2008

You can try vmkfstools -v for a consistency check, but you probably don't see the VMFS filesystem at all? Hold on, I'll have to test something in my lab...

Duncan

My virtualisation blog:

branfarm1 · ‎02-09-2008

Yeah, I can't see the vmfs filesystem at all -- I can see the LUN, but that's it.

depping · ‎02-09-2008

oops, see below

Duncan

depping · ‎02-09-2008

Okay, try the following to label the LUN as VMFS again, but first check whether it is still marked as VMFS (partition Id fb) with "fdisk -lu". If it is, then I can't help; if it's not, do the following:

Start "fdisk /dev/sdX", where X is the letter for the crashed LUN, then enter the following commands:

p

d

n

p

1

default

t

fb

X

b

1

128

W

Then rescan your HBAs.

Duncan

branfarm1 · ‎02-09-2008

Looks like it does show up as fb:

Disk /dev/sda: 1027.7 GB, 1027705528320 bytes
255 heads, 63 sectors/track, 124944 cylinders, total 2007237360 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1             128  2007225359  1003612616   fb  Unknown
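
The figures in this listing can be cross-checked without writing anything to the LUN. What follows is only a minimal sketch in Python, assuming the 512-byte sector size that fdisk reports and that the Blocks column is in 1 KiB units; the function name check_partition is made up for illustration.

```python
SECTOR_BYTES = 512  # from "Units = sectors of 1 * 512 = 512 bytes"
BLOCK_BYTES = 1024  # assumption: fdisk's "Blocks" column counts 1 KiB blocks


def check_partition(total_sectors, total_bytes, start, end, blocks):
    """Return simple consistency checks for one row of `fdisk -lu` output."""
    sectors_in_partition = end - start + 1  # Start and End are both inclusive
    return {
        # Disk size in bytes should equal sector count times sector size.
        "disk_size_matches": total_sectors * SECTOR_BYTES == total_bytes,
        # The Blocks column should agree with the Start/End sector range.
        "blocks_match": sectors_in_partition * SECTOR_BYTES == blocks * BLOCK_BYTES,
        # The partition must end before the last sector of the disk.
        "fits_on_disk": end < total_sectors,
        # Start sector 128 is the offset used in the fdisk steps in this thread.
        "starts_at_128": start == 128,
    }


# Values copied from the fdisk -lu listing above.
result = check_partition(
    total_sectors=2007237360,
    total_bytes=1027705528320,
    start=128,
    end=2007225359,
    blocks=1003612616,
)
print(result)
```

All four checks come out true for these numbers, which suggests the partition table entry itself is intact and still typed as fb. It says nothing about the state of the VMFS metadata inside the partition.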

jhanekom · ‎02-09-2008

Data corruption is an extremely rare event. No SAN should corrupt data just because of a "controller glitch"; sound designs generally safeguard against this.

I'd like to recommend that you check /var/log/vmkwarning for any error messages that might give you additional clues. On the odd chance that the LUN is being detected as a snapshot, that's relatively easy to address (resignaturing or simply correcting the LUN presentation from the controller side.)

If you truly believe that the VMFS file system has been damaged, your best recourse is to log a support call with VMware. There are no public tools to repair VMFS, but PSS have some tools and methods available to them that - I have heard - can work miracles.

Also, as with corruption on any platform, proper diagnosis before action is definitely recommended. Until you know exactly what happened, avoid trying anything that will cause changes to be written to the LUN. Every time you write anything, the chances of recovery diminish. If you do change something, document exactly what you have done so that you can pass that information on to PSS.

depping · ‎02-09-2008

If it's fb, then check the LUN IDs on all hosts and indeed check whether it's seen as a snapshot.

Duncan

branfarm1 · ‎02-09-2008

The LUN IDs are all correct. How do I tell if it's being seen as a snapshot?

jhanekom · ‎02-09-2008

If you examine /var/log/vmkwarning and see a message similar to the following, the LUN may have been detected as a snapshot:

cpu2:1034)LVM: ProbeDeviceInt:4903: vmhabX:Y:Z:1 may be snapshot: disabling access. See resignaturing section in SAN config guide

The snapshot detection mechanism relies on comparing Target:LUN ID information recorded in the datastore's structures to current Target:LUN ID and seeing if they match. If there's a mismatch, ESX makes the safe assumption that this LUN may be a snapshot of an existing LUN already presented to the server and prevents you from accessing it. This is done to prevent writes to the same LUN without appropriate locking mechanisms.
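
The log check described above can be sketched as a small read-only scan. This is a minimal sketch in Python, assuming the message has the same shape as the example quoted above; the exact wording may differ between ESX versions, and the function name and the second sample line are made up for illustration.

```python
import re

# Pattern modelled on the example message quoted in this post. Treat the
# exact wording as an assumption; it may vary between ESX releases.
SNAPSHOT_RE = re.compile(r"LVM: ProbeDeviceInt:\d+: (\S+) may be snapshot")


def find_snapshot_warnings(lines):
    """Return the device names that the log flags as possible snapshots."""
    devices = []
    for line in lines:
        match = SNAPSHOT_RE.search(line)
        if match:
            devices.append(match.group(1))
    return devices


sample = [
    # The example line from this post, with its placeholder device name.
    "cpu2:1034)LVM: ProbeDeviceInt:4903: vmhabX:Y:Z:1 may be snapshot: "
    "disabling access. See resignaturing section in SAN config guide",
    # A made-up unrelated line, to show that other messages are ignored.
    "cpu0:1024)some unrelated warning",
]
flagged = find_snapshot_warnings(sample)
print(flagged)
```

To run it against a real host, the same function can be given an open file, for example `find_snapshot_warnings(open("/var/log/vmkwarning"))`. It only reads the log and never touches the LUN.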

mikepodoherty · ‎02-09-2008

Also, in VirtualCenter, under Configuration > Advanced Settings (I believe under Disk), there is an option that prevents VC from displaying the LUN if it is detected as a snapshot. Change the 1 to 0 and rescan. If you then see the LUN, it is being detected as a snapshot.

contra422 · ‎02-09-2008

If you disconnect all the other ESX servers from the SAN LUN and then rescan and/or reboot the one remaining host, does it detect it?
