Someone recently brought me a Dell PowerEdge R720 after a "crash" and after someone else had replaced a failed hard drive. From what I have been told, it was a simple drive failure and replacement. However, when the server booted after the replacement, none of the VMs came up; they are all listed as inaccessible in the vSphere client. The system has five 4 TB SATA drives configured as RAID 5, with two volumes on that array: one 500 GB and one roughly 14 TB. And no, the office this came from apparently doesn't have any backups, which is why it has ended up with me.
Inside vSphere, the datastore on the 500 GB device shows up, but the datastore on the 14 TB device does not. Under Devices I can see both devices and their associated VMFS partitions.
When I connect via SSH, I can see and navigate both VMFS volumes:
/vmfs/volumes # ls -la
total 3076
drwxr-xr-x 1 root root 512 Sep 14 04:11 .
drwxr-xr-x 1 root root 512 Sep 13 14:53 ..
drwxr-xr-x 1 root root 8 Jan 1 1970 560948da-a221fd2d-163f-f8bc1246cd0a
drwxr-xr-t 1 root root 1260 Sep 28 2015 560948e3-e9884764-70d0-f8bc1246cd0a
drwxr-xr-x 1 root root 8 Jan 1 1970 560948e6-4e07cd8b-51dc-f8bc1246cd0a
drwxr-xr-t 1 root root 2380 Nov 23 2015 560954f5-debd2d16-560c-f8bc1246cd0a
drwxr-xr-x 1 root root 8 Jan 1 1970 82245d7c-3eb4aa98-e653-e1cfc32d3ff8
lrwxr-xr-x 1 root root 35 Sep 14 04:11 datastore1 -> 560948e3-e9884764-70d0-f8bc1246cd0a
drwxr-xr-x 1 root root 8 Jan 1 1970 e3cd597c-6b667b0d-70c6-4fd873e04ddd
lrwxr-xr-x 1 root root 35 Sep 14 04:11 storage -> 560954f5-debd2d16-560c-f8bc1246cd0a
If I do a Rescan All from vSphere, I can see the following in the hostd.log file:
2017-09-14T04:13:16.989Z [7E6C2B70 error 'Hostsvc.FSVolumeProvider' opID=CA63A857-0000027E user=root] RefreshVMFSVolumes: ProcessVmfs threw HostCtlException Unable to get FS Attrs for /vmfs/volumes/560954f5-debd2d16-560c-f8bc1246cd0a
VmFileSystem: SlowRefresh() failed: Unable to get FS Attrs for /vmfs/volumes/560954f5-debd2d16-560c-f8bc1246cd0a. Unable to get FS Attrs for /vmfs/volumes/560954f5-debd2d16-560c-f8bc1246cd0a
2017-09-14T04:13:17.333Z [7E681B70 error 'Hostsvc.FSVolumeProvider' opID=CA63A857-00000280 user=root] RefreshVMFSVolumes: ProcessVmfs threw HostCtlException Unable to get FS Attrs for /vmfs/volumes/560954f5-debd2d16-560c-f8bc1246cd0a
I appear to be able to copy the data files off this volume using WinSCP (at least, it is in the middle of a multi-hour copy with no issues so far), so I'm guessing the volume header was damaged somehow in the rebuild, or I wasn't told everything that happened.
Is there any way to get this VMFS volume back up and functional without hours and hours of copying the data off, recreating the volume, and copying the data back? Or is this a case where a single support incident with VMware may be the only resort?
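To narrow down where the device stops being readable before committing to a full copy, a chunked dd probe can walk the volume and report the first failing offset. This is only a sketch: DEV defaults to a scratch file in /tmp so it can be run anywhere; on the ESXi host it would be pointed at the affected /dev/disks/naa.* path (a placeholder here, not taken from this system), with total_mb raised to cover the volume.

```shell
# Probe a device (or file) in 1 MiB chunks and report the first offset
# that cannot be read. DEV is a placeholder -- on the ESXi host it would
# be the /dev/disks/naa.* path of the affected LUN.
DEV=${DEV:-/tmp/probe_demo.img}

# Demo fixture: create an 8 MiB scratch file if DEV does not exist.
[ -e "$DEV" ] || dd if=/dev/zero of="$DEV" bs=1M count=8 2>/dev/null

total_mb=8        # how far to scan, in MiB
bad_at=-1
i=0
while [ "$i" -lt "$total_mb" ]; do
    # dd exits non-zero on a read error, which is what we are hunting for.
    if ! dd if="$DEV" bs=1M skip="$i" count=1 of=/dev/null 2>/dev/null; then
        bad_at=$i
        break
    fi
    i=$((i + 1))
done
if [ "$bad_at" -ge 0 ]; then
    echo "first unreadable MiB chunk at offset ${bad_at} MiB"
else
    echo "all ${total_mb} MiB chunks readable"
fi
```

On a healthy file, as in this demo, the loop reports all chunks readable; on the real LUN a failing chunk pins down where the bad area starts.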
The volume could have been detected as a snapshot. Run grep snapshot /var/log/vmkernel.log and post the results here.
Nothing returned:
/var/log # cat /var/log/vmkernel.log | grep snapshot
/var/log #
Instead of trying to copy 4 TB of data via SCP, would it be valid to install a 4 or 6 TB hard drive internally and copy the data directly across? I've seen several people advise against doing that with USB-attached drives, but an internal drive should work fine and speed this process up, shouldn't it?
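If the internal drive is added and formatted as a second datastore, a simple copy-and-verify loop avoids trusting a multi-hour copy blindly. This is a sketch with hypothetical /tmp paths and demo fixture files standing in for the broken datastore and the new one; md5sum and cut are available in the ESXi BusyBox shell, so the same loop should work there with real paths.

```shell
# Copy every file from SRC to DST and verify each copy by checksum.
# Paths are hypothetical stand-ins for the broken datastore and the
# new internal-drive datastore.
SRC=${SRC:-/tmp/demo_src}
DST=${DST:-/tmp/demo_dst}

# Demo fixtures: two small files standing in for VM files.
mkdir -p "$SRC" "$DST"
echo "flat vmdk payload" > "$SRC/vm1-flat.vmdk"
echo "descriptor"        > "$SRC/vm1.vmdk"

for f in "$SRC"/*; do
    name=$(basename "$f")
    cp "$f" "$DST/$name"
    # Compare source and destination checksums before trusting the copy.
    src_sum=$(md5sum "$f" | cut -d' ' -f1)
    dst_sum=$(md5sum "$DST/$name" | cut -d' ' -f1)
    if [ "$src_sum" = "$dst_sum" ]; then
        echo "OK  $name"
    else
        echo "BAD $name"
    fi
done
```

Verifying per file also means a read error stops being silent: a file that cannot be copied cleanly shows up as BAD rather than as a corrupt VM discovered weeks later.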
That will surely help, and it would give you a chance to get rid of the RAID 5.
I tell all of my recovery customers NOT to use RAID 5 for VMFS.
By the way, I would like to see a VMFS header dump - see my notes:
Create a VMFS-Header-dump using an ESXi-Host in production | VM-Sickbay
Maybe there is still a way to get around the copy-out-everything approach.
Ulli
I agree. Once this is done, I plan to recommend they rebuild this thing as RAID 10. They've barely touched the space they have.
I followed your article as far as I could, but got an input/output error. I thought it was due to space issues on /tmp, but when I ran it against the volume that shows up okay, I got the same error and ended up with a file of the same size:
/vmfs/volumes/560948e3-e9884764-70d0-f8bc1246cd0a # dd if=/dev/disks/naa.6c81f660f7ba21001b1f527f05382e05:1 bs=1M count=1536 of=/vmfs/volumes/datastore1/baddrive.1536
dd: /dev/disks/naa.6c81f660f7ba21001b1f527f05382e05:1: Input/output error
The resulting baddrive.1536 file is 37,748,736 bytes (exactly 36 MiB) in both cases. So it sounds like the volume has something broken at that point?
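The size reported above is not arbitrary; a quick arithmetic check shows the dd copy stopped on an exact 36 MiB boundary, which is why the bad area looks like it starts 36 MiB into the volume:

```shell
# 37,748,736 bytes is exactly 36 MiB: the dd copy stopped cleanly on a
# 1 MiB chunk boundary, 36 chunks in.
echo $((37748736 / 1024 / 1024))   # -> 36
echo $((36 * 1024 * 1024))         # -> 37748736
```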
I'm getting a temporary drive brought in so I can try to get data copied off more quickly at this point.
Very surprising that you can browse the datastore via SSH and at the same time get an I/O error in the header section.
Anyway, I suggest forgetting about fixing this volume.
Extract the data while you still can - the situation may deteriorate further.
If you run into problems during your extraction attempts, feel free to call me via Skype.
Please try to skip the bad area using the following commands:
dd if=/dev/zero bs=1M count=1536 of=/vmfs/volumes/datastore1/baddrive.1536
dd if=/dev/disks/naa.6c81f660f7ba21001b1f527f05382e05:1 bs=1M count=36 of=/vmfs/volumes/datastore1/baddrive.1536 conv=notrunc
dd if=/dev/disks/naa.6c81f660f7ba21001b1f527f05382e05:1 bs=1M count=1499 skip=37 seek=37 of=/vmfs/volumes/datastore1/baddrive.1536 conv=notrunc
This assumes the bad area starts 36 MB into the volume and is no larger than 1 MB. Note that the third command needs skip=37 as well as seek=37, so it reads from and writes to the same offset and the data past the hole stays aligned.
Ulli
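The zero-fill-then-patch pattern above can be rehearsed on ordinary files before touching the real device. This is only a sketch: the /tmp paths and the small 4 MiB sizes are stand-ins for the naa.* device and datastore targets, with the "bad" 1 MiB region placed at offset 2 MiB.

```shell
# Simulate skipping a bad 1 MiB region at offset 2 MiB of a 4 MiB source.
# /tmp paths stand in for the real device and datastore targets.
SRC=/tmp/dd_demo_src
OUT=/tmp/dd_demo_out

# Source: 4 MiB of 0xFF bytes (easy to tell apart from zero fill).
dd if=/dev/zero bs=1M count=4 2>/dev/null | tr '\0' '\377' > "$SRC"

# 1) Pre-fill the output with zeros so the skipped region reads as zeros.
dd if=/dev/zero bs=1M count=4 of="$OUT" 2>/dev/null
# 2) Copy everything before the bad region (the first 2 MiB).
dd if="$SRC" bs=1M count=2 of="$OUT" conv=notrunc 2>/dev/null
# 3) Skip the bad 1 MiB and copy the remainder, reading and writing at
#    the same offset (skip and seek together) so the data stays aligned.
dd if="$SRC" bs=1M skip=3 seek=3 count=1 of="$OUT" conv=notrunc 2>/dev/null

ls -l "$OUT"
```

Afterwards the output is the full 4 MiB, with real data on either side of a zeroed 1 MiB hole, which is exactly the shape the real recovery copy should have.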
Thanks for the info. I agree that I need to focus on getting the data off. I still got the I/O error, even when trying count=500 seek=1036 (and various combinations leading up to that). So I'll get the data off, boot the VMs on a separate system, run disk checks on the VMs to find any issues there, and then move forward with rebuilding the array and checking the physical disks.
My thought is to get the VMs booted up and then run chkdsk (they are all Windows-based guests). I'll take stock with the client at that point to see what data remains. We may get lucky!
I appreciate your help!