VMware Cloud Community
techdesign
Contributor

ESXi 5.5 will not show VMFS datastore in vSphere client after RAID rebuild

Someone recently brought me a Dell PE-R720 after a "crash", and after someone else had already replaced a failed hard drive. From what I have been told, it was a simple drive failure and replacement. However, after the replacement, when the server booted, none of the VMs came up, and they are all listed as inaccessible in the vSphere client. The system has five 4TB SATA drives configured as RAID 5, with two volumes on that RAID 5 - 500GB and roughly 14TB. No, apparently the office this came from doesn't have any backups, which is why it has ended up with me.

Inside vSphere, the 500GB device has a datastore that shows up. The 14TB device has a datastore as well, but it does not show up. Under Devices, I can see both of them and the associated VMFS partitions on them.

[Screenshot: vsphere1.png]

[Screenshot: vsphere2.png]

When I connect via SSH, I can see and navigate both VMFS volumes:

/vmfs/volumes # ls -la

total 3076

drwxr-xr-x    1 root     root           512 Sep 14 04:11 .

drwxr-xr-x    1 root     root           512 Sep 13 14:53 ..

drwxr-xr-x    1 root     root             8 Jan  1  1970 560948da-a221fd2d-163f-f8bc1246cd0a

drwxr-xr-t    1 root     root          1260 Sep 28  2015 560948e3-e9884764-70d0-f8bc1246cd0a

drwxr-xr-x    1 root     root             8 Jan  1  1970 560948e6-4e07cd8b-51dc-f8bc1246cd0a

drwxr-xr-t    1 root     root          2380 Nov 23  2015 560954f5-debd2d16-560c-f8bc1246cd0a

drwxr-xr-x    1 root     root             8 Jan  1  1970 82245d7c-3eb4aa98-e653-e1cfc32d3ff8

lrwxr-xr-x    1 root     root            35 Sep 14 04:11 datastore1 -> 560948e3-e9884764-70d0-f8bc1246cd0a

drwxr-xr-x    1 root     root             8 Jan  1  1970 e3cd597c-6b667b0d-70c6-4fd873e04ddd

lrwxr-xr-x    1 root     root            35 Sep 14 04:11 storage -> 560954f5-debd2d16-560c-f8bc1246cd0a

If I do a Rescan All from vSphere, I can see the following in the hostd.log file:

2017-09-14T04:13:16.989Z [7E6C2B70 error 'Hostsvc.FSVolumeProvider' opID=CA63A857-0000027E user=root] RefreshVMFSVolumes: ProcessVmfs threw HostCtlException Unable to get FS Attrs for /vmfs/volumes/560954f5-debd2d16-560c-f8bc1246cd0a

VmFileSystem: SlowRefresh() failed: Unable to get FS Attrs for /vmfs/volumes/560954f5-debd2d16-560c-f8bc1246cd0a. Unable to get FS Attrs for /vmfs/volumes/560954f5-debd2d16-560c-f8bc1246cd0a

2017-09-14T04:13:17.333Z [7E681B70 error 'Hostsvc.FSVolumeProvider' opID=CA63A857-00000280 user=root] RefreshVMFSVolumes: ProcessVmfs threw HostCtlException Unable to get FS Attrs for /vmfs/volumes/560954f5-debd2d16-560c-f8bc1246cd0a

I appear to be able to copy the data files off this volume using WinSCP (at least it is in the middle of a multi-hour copy with no issues so far), so I'm guessing the volume header was damaged somehow in the rebuild, or I wasn't told everything that happened.

Is there any way I can get this VMFS volume back up and functional without hours and hours of copying off data and then recreating the volume and copying the data back? Or is this a case where a single support incident with VMware may be the only resort?


7 Replies
vijayrana968
Virtuoso

This could be detected as a snapshot. Run the command cat /var/log/vmkernel.log | grep snapshot and post the results here.

techdesign
Contributor

Nothing returned:

/var/log # cat /var/log/vmkernel.log | grep snapshot

/var/log #

techdesign
Contributor

Instead of trying to copy 4TB of data via SCP, would it be valid to install a 4 or 6TB hard drive internally and copy the data directly across? I've seen several people advise against doing that with USB-attached drives, but an internal drive should work fine to speed this process up, shouldn't it?
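For a host-side copy along those lines - this is a sketch under assumptions, not something tested on this system - the VM directories could be copied between datastore mount points once the internal drive has been formatted as its own datastore. The datastore name "recovery" below is hypothetical:

```shell
# copy_vms SRC DST - copy every VM directory from one datastore path to another.
# ESXi's busybox cp supports -R for recursive copies.
copy_vms() {
    src=$1
    dst=$2
    for vm in "$src"/*/; do
        cp -R "$vm" "$dst"/ || return 1
    done
}

# On the host this would look something like (datastore names are assumptions):
# copy_vms /vmfs/volumes/storage /vmfs/volumes/recovery
```

Copying host-side avoids the SCP round trip through a workstation entirely, which is usually the bottleneck for multi-terabyte transfers.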

continuum
Immortal

That will surely help, and would probably give you a chance to get rid of the RAID 5.
I tell all of my recovery customers NOT to use RAID 5 for VMFS.
By the way - I would like to see a VMFS header dump - see my notes:
Create a VMFS-Header-dump using an ESXi-Host in production | VM-Sickbay
Maybe there is still a way to get around the copy-out-everything approach.
Ulli


________________________________________________
Do you need support with a VMFS recovery problem ? - send a message via skype "sanbarrow"
I do not support Workstation 16 at this time ...

techdesign
Contributor

I agree. Once this is done, I plan to recommend they rebuild this thing as RAID 10. They've barely touched the space they have.

I followed your article as far as I could, but got an input/output error. I thought it was due to space constraints on /tmp, but when I wrote the output to the volume that shows up okay, I got the same error and ended up with a file of the same size:

/vmfs/volumes/560948e3-e9884764-70d0-f8bc1246cd0a # dd if=/dev/disks/naa.6c81f660f7ba21001b1f527f05382e05:1 bs=1M count=1536 of=/vmfs/volumes/datastore1/baddrive.1536

dd: /dev/disks/naa.6c81f660f7ba21001b1f527f05382e05:1: Input/output error

The resulting baddrive.1536 file is 37,748,736 bytes (exactly 36 MiB) in both cases. So it sounds like the volume has something broken at that point?
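A quick sanity check on that file size pins down where the error begins: dd writes whole blocks until the first failed read, so dividing the byte count by the 1 MiB block size gives the number of blocks that copied cleanly.

```shell
# 37,748,736 bytes / 1 MiB block size = number of complete blocks dd
# wrote before hitting the read error.
echo $((37748736 / (1024 * 1024)))   # prints 36
```

Thirty-six complete blocks means the read error starts somewhere in the 37th block, i.e. at an offset of 36 MiB into the volume.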

I'm getting a temporary drive brought in so I can try to get data copied off more quickly at this point.

continuum
Immortal

Very surprising that you can browse the datastore via ssh and at the same time get an I/O error in the header section.
Anyway, I suggest forgetting about fixing this volume.
Extract the data while you still can - the situation may deteriorate further.
If you run into problems during your extraction attempts - feel free to call me via skype.
Try to skip the bad area with the following commands.

dd if=/dev/zero bs=1M count=1536 of=/vmfs/volumes/datastore1/baddrive.1536

dd if=/dev/disks/naa.6c81f660f7ba21001b1f527f05382e05:1 bs=1M count=36 of=/vmfs/volumes/datastore1/baddrive.1536 conv=notrunc

dd if=/dev/disks/naa.6c81f660f7ba21001b1f527f05382e05:1 bs=1M count=1499 skip=37 seek=37 of=/vmfs/volumes/datastore1/baddrive.1536 conv=notrunc

This assumes the bad area starts 36 MB into the volume and is no larger than 1 MB. In the last command, skip= advances the input offset and seek= the output offset, so both stay aligned and the data after the bad block lands at its original position in the image.
Ulli



techdesign
Contributor

Thanks for the info. I agree that I need to focus on getting the data off. I still got the I/O error, even when trying count=500 seek=1036 (and various combinations leading up to that). So I'll get the data off, boot the VMs on a separate system, run disk checks on the VMs to find any issues there, and then move forward with rebuilding the array and checking those physical disks.

My thought is to get the VMs booted up and then run a chkdsk (they are all Windows-based guests). I'll take stock with the client at that point to see what data remains. We may get lucky!

I appreciate your help!
