Re: Have I lost my vmfs data?

pgregg · ‎11-24-2010

2 days ago, my ESX 3.5 server suffered a pretty major failure. I went into the datacenter to find several failed disks - lots of blinking orange lights. I was unable to repair it at the scene, so pulled the server.

Anyway, the hardware is a Dell PE2800 with:

3 x 15K 73GB disks as a RAID5 array

3 x 10K 300GB disks as a RAID5 array

2 x 10K 73GB disks as HotSpares.

The first array had ESX installed (/dev/sda) and the remaining space, approx 120GB as VMFS storage1 (/dev/sda3).

The second array was all as VMFS storage2 (/dev/sdb1).

When I powered up again, all the disks appeared healthy, but the RAID controller automatically went through a recheck/rebuild on both arrays. When it came back, it did not boot ESX; it hung at the GRUB _ (blinking underscore) and refused to move on. This was a good sign, I thought, because at least it knew about the partitions and got as far as looking in /boot (suggesting the data was intact).

After lots of searching here and elsewhere, I determined I would have to reinstall ESX, but ensuring I retained all VMFS stores. I installed ESX 3.5U5 successfully.

However, only the second array - storage2 /dev/sdb1 - is recognised by ESX as a VMFS store, and I'm able to see the VM data in it OK.

storage1 does not appear.

# fdisk -lu /dev/sda

Disk /dev/sda: 146.5 GB, 146548981760 bytes

255 heads, 63 sectors/track, 17816 cylinders, total 286228480 sectors

Units = sectors of 1 * 512 = 512 bytes

Device     Boot     Start        End     Blocks   Id  System
/dev/sda1  *           63     417689     208813+  83  Linux
/dev/sda2          417690   17189549    8385930   83  Linux
/dev/sda3        17189550  280703744  131757097+  fb  Unknown
/dev/sda4       280703745  286214039    2755147+   f  Win95 Ext'd (LBA)
/dev/sda5       280703808  284896709    2096451   83  Linux
/dev/sda6       284896773  286005194     554211   82  Linux swap
/dev/sda7       286005258  286214039     104391   fc  Unknown

# esxcfg-vmhbadevs -a

vmhba0:0:0 /dev/sda

vmhba0:1:0 /dev/sdb

# esxcfg-mpath -l

Disk vmhba0:0:0 /dev/sda (139760MB) has 1 paths and policy of Fixed

Local 2:14.0 vmhba0:0:0 On active preferred

Disk vmhba0:1:0 /dev/sdb (572160MB) has 1 paths and policy of Fixed

Local 2:14.0 vmhba0:1:0 On active preferred

So far so good.

# esxcfg-vmhbadevs -m

vmhba0:1:0:1 /dev/sdb1 48642536-75d5e1f9-92d3-001143ec7a00

# ls -la /vmfs/devices/disks/vmhba0:0:0:3

-rw------- 1 root root 134919267840 Nov 24 11:20 /vmfs/devices/disks/vmhba0:0:0:3

Again promising, since this is the correct size and location.
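As a quick cross-check of the "correct size" claim, the byte count can be derived from the fdisk start and end sectors for /dev/sda3. This is only a minimal sketch, assuming 512-byte sectors as fdisk reports; partition_bytes is a hypothetical helper name, not an ESX tool.

```shell
# partition_bytes START END: size in bytes of a partition whose first and
# last sectors (both inclusive) are START and END, assuming 512-byte sectors.
# Needs a shell with 64-bit arithmetic.
partition_bytes() {
    echo $(( ($2 - $1 + 1) * 512 ))
}

# /dev/sda3 from the fdisk listing
partition_bytes 17189550 280703744   # prints 134919267840
```

The result matches the size that ls reports for vmhba0:0:0:3, which is consistent with the partition table entry being intact.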

So I know the partition table has not been damaged. Most solutions in this forum relate to repairing the partition table - so that doesn't apply here.

# hexdump -C vmhba0:0:0:3 | less

00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|

*

0000a400 30 ff 2f ff 30 ff 53 4f 48 00 00 00 00 00 00 00 |0./.0.SOH.......|

0000a410 41 00 72 00 70 00 68 00 69 00 63 00 20 00 47 00 |A.r.p.h.i.c. .G.|

0000a420 79 00 6f 00 6b 00 61 00 69 00 6c 00 65 00 6e 00 |y.o.k.a.i.l.e.n.|

0000a430 6d 00 65 00 6e 00 74 00 61 00 69 00 20 00 48 00 |m.e.n.t.a.i. .H.|

0000a440 65 00 61 00 76 00 79 00 20 00 4a 00 49 00 53 00 |e.a.v.y. .J.I.S.|

0000a450 00 00 00 00 00 00 00 00 41 00 72 00 70 00 68 00 |........A.r.p.h.|

0000a460 69 00 63 00 20 00 47 00 79 00 6f 00 6b 00 61 00 |i.c. .G.y.o.k.a.|

0000a470 69 00 6c 00 65 00 6e 00 6d 00 65 00 6e 00 74 00 |i.l.e.n.m.e.n.t.|

0000a480 61 00 69 00 20 00 4c 00 69 00 67 00 68 00 74 00 |a.i. .L.i.g.h.t.|

Now that doesn't look so good - but if I search for data that I know is inside one of the VMs (such as my email address) then I can find that - which gives me some glimmer of hope.

00c9f110 b8 ff ff ff 3a 00 70 00 73 00 65 00 72 00 76 00 |....:.p.s.e.r.v.|

00c9f120 65 00 72 00 3a 00 70 00 67 00 72 00 65 00 67 00 |e.r.:.p.g.r.e.g.|

00c9f130 67 00 40 00 70 00 67 00 72 00 65 00 67 00 67 00 |g.@.p.g.r.e.g.g.|

00c9f140 34 00 3a 00 2f 00 4f 00 53 00 53 00 43 00 56 00 |4.:./.O.S.S.C.V.|
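Note that the text in this dump is stored as UTF-16LE (each character is followed by a 00 byte), so a plain grep for the address on the raw device would not match it. A rough sketch of one way around that, assuming tr and a grep that supports -a are available on the box; count_utf16_hits is a hypothetical name. Because the NUL bytes are stripped first, it reports the number of matching lines, not byte offsets, and it will also count plain ASCII occurrences.

```shell
# count_utf16_hits IMAGE NEEDLE: strip every NUL byte so UTF-16LE text
# collapses to plain ASCII, then count the lines that contain NEEDLE.
# -a treats binary input as text, -F matches NEEDLE as a fixed string.
count_utf16_hits() {
    tr -d '\000' < "$1" | grep -a -c -F -- "$2"
}
```

It could be run as count_utf16_hits /vmfs/devices/disks/vmhba0:0:0:3 'pgregg@'. A non-zero count only shows that some guest data survives; it says nothing about whether the VMFS metadata is intact.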

Rescanning doesn't give me anything more in Storage (SCSI Target 0 - the 130GB partition is listed in Storage Adapters ok - but it always was)... but if I go into Storage / Add Storage, it does show me the 130GB partition but that would (and claims to) destroy all the data on the partition if I were to add it.

I also tried the Resignaturing instructions at but this did not make any difference.

I've temporarily put vsftpd on the box and am copying down /vmfs/devices/disks/vmhba0:0:0:3 so at least I have the file outside of ESX. The VMFS partition only has a single (100GB) VM.

Thoughts? and thanks.

PG

pgregg · ‎11-24-2010

Forgot to add this:

# less /var/log/vmkernel

Nov 24 11:44:22 core01 vmkernel: 0:12:28:04.723 cpu3:1035)SCSI: 863: GetInfo for adapter vmhba0, , max_vports=0, vports_inuse=0, linktype=0, state=0, failreason=0, rv=-1, sts=bad001f

Nov 24 11:44:22 core01 vmkernel: 0:12:28:04.724 cpu3:1035)ScsiScan: 398: Path 'vmhba0:C0:T0:L0': Vendor: 'MegaRAID' Model: 'LD 0 RAID5 139G' Rev: '516A'

Nov 24 11:44:22 core01 vmkernel: 0:12:28:04.724 cpu3:1035)ScsiScan: 399: Type: 0x0, ANSI rev: 2

Nov 24 11:44:22 core01 vmkernel: 0:12:28:04.724 cpu3:1035)ScsiUid: 776: Path 'vmhba0:C0:T0:L0' does not support VPD Serial Id page.

Nov 24 11:44:22 core01 vmkernel: 0:12:28:04.724 cpu3:1035)ScsiUid: 847: Path 'vmhba0:C0:T0:L0' does not support VPD Device Id page.

Nov 24 11:44:22 core01 vmkernel: 0:12:28:04.724 cpu3:1035)ScsiScan: 524: Path 'vmhba0:C0:T0:L0': No standard UID: Failure

Nov 24 11:44:22 core01 vmkernel: 0:12:28:04.724 cpu3:1035)ScsiScan: 398: Path 'vmhba0:C0:T1:L0': Vendor: 'MegaRAID' Model: 'LD 1 RAID5 572G' Rev: '516A'

ThompsG · ‎11-24-2010

Hi,

Run the following from the Service Console:

- esxcfg-volume --list

If this returns the missing data store then try:

- esxcfg-volume --mount <VMFS UUID|label>

If storage 1 does not stay mounted over a restart then run the above command but use --persistent-mount <VMFS UUID|label>

Let me know if any of this works.

Thanks and kind regards.


pgregg · ‎11-24-2010

Thanks for your reply. However I don't have an esxcfg-volume command - I believe that came in with ESX 4.x, whereas I am using ESX 3.5.

ThompsG · ‎11-24-2010

Sorry about that. Forgot what forum I was in.

Given that you have tried the LVM.EnableResignature and SCSI.CompareLUNNumber options, there do not appear to be many other options available to you.

I was wondering if it is possible to attach the VMkernel log (/var/log/vmkernel) from after the ESX server boots? I wouldn't mind looking through it to see if anything jumps out.

Kind regards.

pgregg · ‎11-29-2010

I'm afraid I posted the vmkernel log already (see first comment)...

I couldn't wait on rebuilding the machine, so I ended up taking a backup of the VMFS disk with dd and moving it off the ESX box. I was then able to extract the /usr UFS partition and, roughly knowing the data I wanted, page through it and rebuild the missing files.
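For anyone doing the same, here is a rough sketch of the dd step with a check that the copy matches the source. The exact command used is not recorded in this thread, so this is an illustration only; image_and_verify is a hypothetical helper, and it assumes dd and cksum are present on the service console.

```shell
# image_and_verify SRC DST: copy SRC to DST with dd in 1 MiB blocks, then
# compare the POSIX cksum output (CRC and byte count) of both files.
# Returns 0 only if the copy succeeded and the two checksums match.
image_and_verify() {
    dd if="$1" of="$2" bs=1048576 2>/dev/null || return 1
    [ "$(cksum < "$1")" = "$(cksum < "$2")" ]
}
```

Something like image_and_verify /vmfs/devices/disks/vmhba0:0:0:3 /path/to/storage1.img would do it, where the destination path is a placeholder. The verify pass reads the source a second time, so expect it to roughly double the run time.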

I still have the VMFS image, so if anything magical crops up in the future, I'll try a more thorough restoration of the contents.
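One cheap check that could be run on the saved image first is to look for the VMFS LVM header. The sketch below rests on an assumption that is not confirmed anywhere in this thread: that a VMFS3 partition carries the magic 0xC001D00D at offset 0x100000, as the open-source vmfs-tools headers describe. has_vmfs_lvm_magic is a hypothetical name.

```shell
# has_vmfs_lvm_magic IMAGE: read 4 bytes at offset 0x100000 (1048576) and
# compare them with 0d d0 01 c0, the little-endian form of 0xC001D00D.
# ASSUMPTION: offset and magic come from the open-source vmfs-tools
# headers, not from VMware documentation.
has_vmfs_lvm_magic() {
    magic=$(od -A n -t x1 -j 1048576 -N 4 "$1" | tr -d ' \n')
    [ "$magic" = "0dd001c0" ]
}
```

If the check fails on the image, that would suggest the volume header itself was overwritten, which would fit with ESX refusing to recognise the datastore; if it passes, the damage may be further in.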

Thanks
