Hey everybody, this is a last-ditch effort before I am forced to start over from backups -- is there any way to recover a corrupted VMFS filesystem? Long story short, one of the LUNs on my SAN went bad due to a controller glitch, and now I am unable to get any of my ESX hosts to recognize it. They can all see the LUN, but they don't see that there is a VMFS filesystem on it. Does anyone know of a way to get ESX to recover the VMFS filesystem located on the LUN?
Thanks for the help!
Just a suggestion, as I have not had any experience of this myself: look at the tools e2fsck or tune2fs, or maybe vmkfstools. I would guess that the controller error means all the data on the disk is bad and might not be recoverable by rolling transactions back.
Yeah, I can't see the vmfs filesystem at all -- I can see the LUN, but that's it.
Okay, try the following to label the LUN as VMFS again. First check whether it is still labelled as VMFS with "fdisk -lu". If it is, then I can't help; if it is not, do the following:
Start "fdisk /dev/sdX", where X is the letter for the crashed LUN, then enter the following commands:
p
d
n
p
1
default (press Enter to accept the default at each prompt)
t
fb
x
b
1
128
w
Then rescan your HBAs.
Duncan
Looks like it does show up as fb:
Disk /dev/sda: 1027.7 GB, 1027705528320 bytes
255 heads, 63 sectors/track, 124944 cylinders, total 2007237360 sectors
Units = sectors of 1 * 512 = 512 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1             128  2007225359  1003612616   fb  Unknown
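For anyone who wants to run the same check across several LUNs, here is a minimal sketch that looks for partitions of type fb in text already captured from "fdisk -lu". The function name and the regular expression are my own assumptions, based only on the output layout shown above. It reads text and never touches the disk.

```python
import re

# Assumed layout: device, optional boot flag, start, end, blocks, id, system,
# as in the "fdisk -lu" output quoted in this thread.
PARTITION_LINE = re.compile(
    r"^(?P<device>/dev/\S+)\s+(?P<boot>\*\s+)?(?P<start>\d+)\s+(?P<end>\d+)\s+"
    r"(?P<blocks>\d+)\+?\s+(?P<id>[0-9a-fA-F]{1,2})\s+(?P<system>.*)$"
)

def find_vmfs_partitions(fdisk_output):
    """Return (device, start_sector, end_sector) for every partition with Id fb."""
    found = []
    for line in fdisk_output.splitlines():
        match = PARTITION_LINE.match(line.strip())
        if match and match.group("id").lower() == "fb":
            found.append(
                (match.group("device"), int(match.group("start")), int(match.group("end")))
            )
    return found

# The output posted above, pasted in as a string.
sample = """Disk /dev/sda: 1027.7 GB, 1027705528320 bytes
255 heads, 63 sectors/track, 124944 cylinders, total 2007237360 sectors
Units = sectors of 1 * 512 = 512 bytes
Device Boot Start End Blocks Id System
/dev/sda1 128 2007225359 1003612616 fb Unknown
"""

for device, start, end in find_vmfs_partitions(sample):
    print(device, "is type fb, sectors", start, "to", end)
```

A type fb partition starting at sector 128 means the partition table entry is still in place, so a missing datastore points to something other than a lost label.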
Data corruption is an extremely rare event. No SAN should corrupt data just because of a "controller glitch"; sound designs generally safeguard against this.
I'd like to recommend that you check /var/log/vmkwarning for any error messages that might give you additional clues. On the odd chance that the LUN is being detected as a snapshot, that's relatively easy to address (resignaturing or simply correcting the LUN presentation from the controller side.)
If you truly believe that the VMFS file system has been damaged, your best recourse is to log a support call with VMware. There are no public tools to repair VMFS, but PSS have some tools and methods available to them that - I have heard - can work miracles.
Also, as with corruption on any platform, proper diagnosis before action is definitely recommended. Until you know exactly what happened, avoid trying anything that will cause changes to be written to the LUN. Every time you write anything, the chances of recovery diminish. If you do try something, document exactly what you have done so that you can pass that information on to PSS.
The LUN IDs are all correct. How do I tell if it's being seen as a snapshot?
If you examine /var/log/vmkwarning and see a message similar to the following, the LUN may have been detected as a snapshot:
cpu2:1034)LVM: ProbeDeviceInt:4903: vmhbaX:Y:Z:1 may be snapshot: disabling access. See resignaturing section in SAN config guide
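If the log is long, a small read-only filter can pick out these lines. This is a rough sketch that matches only the "may be snapshot: disabling access" wording quoted above. The function name and the sample lines, including the device path vmhba1:0:5:1, are made up for illustration.

```python
import re

# Matches the device path that precedes the warning text quoted above.
SNAPSHOT_WARNING = re.compile(r"(?P<path>\S+) may be snapshot: disabling access")

def find_snapshot_warnings(log_lines):
    """Return the device paths named in 'may be snapshot' warnings."""
    paths = []
    for line in log_lines:
        match = SNAPSHOT_WARNING.search(line)
        if match:
            paths.append(match.group("path"))
    return paths

# Made-up sample lines; on a real host you would read /var/log/vmkwarning instead.
sample_lines = [
    "cpu2:1034)LVM: ProbeDeviceInt:4903: vmhba1:0:5:1 may be snapshot: "
    "disabling access. See resignaturing section in SAN config guide",
    "cpu1:1033)an unrelated warning line",
]

for path in find_snapshot_warnings(sample_lines):
    print("Possible snapshot LUN:", path)
```

To run it on a host, replace sample_lines with open("/var/log/vmkwarning"), which is opened read-only by default.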
The snapshot detection mechanism relies on comparing Target:LUN ID information recorded in the datastore's structures to current Target:LUN ID and seeing if they match. If there's a mismatch, ESX makes the safe assumption that this LUN may be a snapshot of an existing LUN already presented to the server and prevents you from accessing it. This is done to prevent writes to the same LUN without appropriate locking mechanisms.
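To make the comparison concrete, here is a purely illustrative sketch of the idea. It is not VMware's implementation, and the type and function names are invented. It only shows the rule described above: if the recorded and current Target:LUN IDs differ, treat the LUN as a possible snapshot.

```python
from collections import namedtuple

# Hypothetical stand-in for the Target:LUN ID pair described above.
LunAddress = namedtuple("LunAddress", ["target", "lun"])

def looks_like_snapshot(recorded, current):
    """Return True when the ID stored in the datastore differs from the presented one."""
    return recorded != current

recorded_in_datastore = LunAddress(target=0, lun=5)

# Same presentation as when the datastore was created: access is allowed.
print(looks_like_snapshot(recorded_in_datastore, LunAddress(target=0, lun=5)))

# LUN now presented under a different ID: flagged as a possible snapshot.
print(looks_like_snapshot(recorded_in_datastore, LunAddress(target=0, lun=6)))
```

This is why a change in LUN presentation on the controller side can hide a perfectly healthy datastore.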
Also, in VirtualCenter, under Configuration, Advanced Settings (I believed it was under Disk, but it is LVM.DisallowSnapshotLun under LVM), there is an option to turn off the feature that prevents VC from displaying the LUN if it is detected as a snapshot. Change the 1 to 0 and rescan. If you then see the LUN, it is being detected as a snapshot.
If you disconnect all the other ESX servers from the SAN LUN and then rescan and/or reboot the one remaining host, does it detect the LUN?