VMware Cloud Community
jizenman
Contributor

Trying to recover datastore after RAID10 rebuild

So, we had some drive failures on a box living at a colo facility, and over the course of the reboot and rebuild process, our install has fallen into "The ESXi host does not have persistent storage" error mode, and I haven't been able to recover the datastore.

The partition table still exists, and shows a vmfs partition:

~ # partedUtil getptbl /dev/disks/naa.6848f690ed12bc001a0aa0750395bd28

gpt

145782 255 63 2341994496

1 64 8191 C12A7328F81F11D2BA4B00A0C93EC93B systemPartition 128

5 8224 520191 EBD0A0A2B9E5443387C068B6B72699C7 linuxNative 0

6 520224 1032191 EBD0A0A2B9E5443387C068B6B72699C7 linuxNative 0

7 1032224 1257471 9D27538040AD11DBBF97000C2911D1B8 vmkDiagnostic 0

8 1257504 1843199 EBD0A0A2B9E5443387C068B6B72699C7 linuxNative 0

2 1843200 10229759 EBD0A0A2B9E5443387C068B6B72699C7 linuxNative 0

3 10229760 2341994462 AA31E02A400F11DB9590000C2911D1B8 vmfs 0

However, it will not mount using vmkfstools -V. I am seeing the following errors in vmkernel.log that are most likely related:

2019-05-23T22:50:31.187Z cpu4:6420)WARNING: LVM: 9998: Invalid firstPE 0

2019-05-23T22:50:31.187Z cpu4:6420)WARNING: LVM: 10005: Invalid lastPE 0

2019-05-23T22:50:31.187Z cpu4:6420)WARNING: LVM: 10017: Invalid volume state 0

2019-05-23T22:50:31.187Z cpu4:6420)WARNING: LVM: 5230: Error detected for vol , dev naa.6848f690ed12bc001a0aa0750395bd28:3

2019-05-23T22:50:31.187Z cpu4:6420)LVM: 7121: Device scan failed for <naa.6848f690ed12bc001a0aa0750395bd28:3>: Invalid metadata

Based on an XML config file I found, I know what the UUID of the old datastore was, but when I try to mount using it I get "No matching volume 5277e6a0-2d596de7-495f-b8ca3a5bd3a0 found!"
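(The exact commands aren't quoted above, but on the ESXi shell the rescan and UUID-mount attempts would look roughly like the following; the UUID is the one taken from the XML config:)

# re-scan/refresh VMFS volumes on the host
vmkfstools -V
# list the filesystems the host currently considers mounted
esxcli storage filesystem list
# try to mount the datastore by its VMFS UUID
esxcli storage filesystem mount -u 5277e6a0-2d596de7-495f-b8ca3a5bd3a0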

I'm asking the onsite techs if they can add an external drive, install Linux on that, and try to use vmfs-fuse, but is there anything else I can try using the ESXi CLI? The vast majority of the solutions I've found for lost datastores are for cases where the data is intact but the partition table is lost, but since the partition table still seems to be doing just fine, I'm not sure where to go from here.
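(For reference, the usual vmfs-tools route on a Linux machine is roughly the following; the package name is from Debian/Ubuntu, and /dev/sdb3 is just a placeholder for whichever partition holds the VMFS volume:)

# install the userspace VMFS driver (read-only access)
apt-get install vmfs-tools
mkdir -p /mnt/vmfs
# point vmfs-fuse at the VMFS partition and mount it
vmfs-fuse /dev/sdb3 /mnt/vmfs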

continuum
Immortal

The partition table looks OK to me.

This, however, does not: "Device scan failed for <naa.6848f690ed12bc001a0aa0750395bd28:3>: Invalid metadata".

> I'm asking the onsite techs if they can add an external drive, install Linux on that, and try to use vmfs-fuse,

This will very likely not help - without the LVM metadata vmfs-fuse will fail.

Do you use VMFS 6 or 5?

Can you create a vmfs-header-dump with

dd if=/dev/disks/naa.6848f690ed12bc001a0aa0750395bd28:3 bs=1M count=1536 | gzip -c >  /tmp/jizenman.1536.gz

and then provide a download for jizenman.1536.gz ?

I can tell you what options you have when I see the dump.

Ulli

Create a VMFS-Header-dump using an ESXi-Host in production | VM-Sickbay


________________________________________________
Do you need support with a VMFS recovery problem ? - send a message via skype "sanbarrow"
I do not support Workstation 16 at this time ...

jizenman
Contributor

Thanks. It should be VMFS 5, I believe. Here's a link to the dump: jizenman.1536.gz - Google Drive

jizenman
Contributor

I've confirmed that vmfs-fuse won't do the job:

[root@localhost ~]# vmfs-fuse /dev/sdb3 /mnt/vmfs

VMFS: i/o spanned over several extents is unsupported

VMFS: Unable to read FS information

Unable to open filesystem

continuum
Immortal

Bad news: the RAID10 rebuild failed completely.

The result is not a corrupt VMFS volume but rather a pile of fragments in RAID-stripe-sized pieces.

I highly recommend that you consult the vendor of your RAID controller and ask for specific advice on how to restore the original drive configuration.
If the vendor can't help, you have two more options:

1. Switch the controller to JBOD mode, if that is possible, then boot a Linux live CD and rebuild the array as a software RAID, either with the built-in Linux tools or with, for example, UFS Explorer (see the sketch after these two options).

2. Call Ontrack and prepare for a hefty bill.
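A minimal sketch of option 1, assuming the controller can be put into JBOD mode and the eight disks show up individually under Linux (the device names, chunk size, disk order and layout below are placeholders - they have to match the controller's actual stripe size, member order and mirroring layout, and it is safest to work on ddrescue image copies rather than the original disks):

# 1. image every member disk so the originals are never written to (repeat for all eight)
ddrescue -n /dev/sdb /images/disk0.img /images/disk0.map
# 2. expose the image copies as block devices (repeat; gives /dev/loop0 ... /dev/loop7)
losetup -f --show /images/disk0.img
# 3. assemble a virtual RAID10 over the copies; --assume-clean avoids a resync and
#    --metadata=1.0 keeps the md superblock at the end so the data offset stays 0
mdadm --create /dev/md0 --level=10 --raid-devices=8 --chunk=256 \
      --metadata=1.0 --assume-clean --layout=n2 /dev/loop[0-7]
# 4. re-read the partition table of the assembled array and try vmfs-fuse on partition 3
partprobe /dev/md0
vmfs-fuse /dev/md0p3 /mnt/vmfs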

The only good news I have is that your VMDKs were apparently all thick provisioned.

That means they may be recoverable if option 1 succeeds.

I have attached a list of the .vmx and .vmdk files that I found. Some of the .vmx files were truncated because they did not fit into a single stripe segment.

I did not look for snapshots - the chance to recover any snapshots is almost zero in a case like this.

I highly recommend that you keep a log listing everything you did before and during the RAID rebuild.

Also write down the current RAID configuration - this will be useful if you try to build a software/virtual RAID from the disks.

Feel free to call me via skype if you have further questions - or need help with option 1.

Wish I had better news ....

Ulli



jizenman
Contributor

Thanks for the diagnosis. I'll send the information about the RAID rebuild along to the datacenter guys (we are at RackspaceDedicated, but unfortunately they don't support the ESXi OS, only the hardware) to see if they have any further steps they can pursue.

Ultimately I don't need to restore the ESXi datastore to full function; I'd be happy just to reconstruct the files for one of the VMs, the one that formerly lived at /vmfs/volumes/5277e6a0-2d596de7-495f-b8ca3a5bd3a0/backup/. That should have enough on it to at least get the essential software up and running on another server.

continuum
Immortal

> (we are at RackspaceDedicated, but unfortunately they don't support the ESXi OS, only the hardware)

Damn - then it will probably be almost impossible to access the individual disks to build a virtual RAID.

Can you ask them to send the disks to you via mail or courier?



jizenman
Contributor

In your experience, is it normal for a RAID failure like this to ruin only certain partitions? I don't know a ton about RAID, but in general the ESXi OS is running without issue, and only the datastore is actually broken.

jizenman
Contributor

I've sent your recommendations on to Rackspace; we'll see if they can do anything with them (as well as asking whether they'd be willing to mail out the hardware if needed). I've also reached out to Ontrack to see whether they can do a remote evaluation or would need the hardware in hand first. Now we wait.

continuum
Immortal

Wait - are you saying that the ESXi installation on the same RAID array is still functional?

I can't really believe that - I found several FAT partition fragments and even a small NTFS partition fragment inside the dump.

From what I saw, I assumed that ESXi could not boot at all - maybe not even try to.



jizenman
Contributor

Yeah, ESXi was having trouble before the rebuild attempt, but it runs now. I can SSH to it (that's how I was able to get the partition table info and attempt the vmkfstools -V commands earlier) and connect with vSphere, though I get the warning about no persistent storage, and all the VMs, while still listed in the inventory, come up as Unavailable.

The data center guys confirmed that drive 0:1 failed first and was replaced, but while the rebuild was happening 0:0 also went offline, causing the problem. I'm guessing the damage stayed isolated because the failures only affected one mirrored pair, while the other three pairs (it was an 8-drive array) were fine? I don't know if that makes sense, but it's all I've got.

jizenman
Contributor

It is true that the partition table shows eight partitions but only four volumes are actually mounted, so more than just the datastore is probably damaged. But whatever else might be corrupted, the bulk of the OS is up and running.
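(For anyone following along, the mounted-versus-listed comparison on the ESXi shell is roughly:)

# what the host actually has mounted
df -h
esxcli storage filesystem list
# versus everything the partition table advertises
partedUtil getptbl /dev/disks/naa.6848f690ed12bc001a0aa0750395bd28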
