Hopefully someone can help me with a fairly frustrating situation. I have done quite a few searches and read a number of articles here and elsewhere, but none seem to match my situation precisely.
The problem arose after an "inelegant" reboot of my ESXi server. This is one of a few free installations using ScaleIO to map LUNs for the datastores. The server appeared to be hung, so I attempted a controlled reboot (which did not seem to take) and then power-cycled it. When it came back up, the VM running on a local datastore came back fine, but the datastores on the ScaleIO LUNs did not show up. Everything seems to say there is no filesystem on them, but that doesn't make sense for every LUN at once, and when I look at the disks with dd I see the data I would expect.
What I see:
* The LUNs themselves are discovered, along with the partitions on them. When I look at the devices in the GUI or CLI, I see them without any problem.
[root@vm-host101:/var/log] ls /vmfs/devices/disks/eui*
/vmfs/devices/disks/eui.0f60934767ed6d2731f35a0d00000005
/vmfs/devices/disks/eui.0f60934767ed6d273d0a632d00000000
/vmfs/devices/disks/eui.0f60934767ed6d273d0a632d00000000:1
[root@vm-host101:/var/log] esxcli storage core device list | grep eui.
eui.0f60934767ed6d273d0a632d00000000
Display Name: EMC Fibre Channel Disk (eui.0f60934767ed6d273d0a632d00000000)
Devfs Path: /vmfs/devices/disks/eui.0f60934767ed6d273d0a632d00000000
eui.0f60934767ed6d2731f35a0d00000005
Display Name: EMC Fibre Channel Disk (eui.0f60934767ed6d2731f35a0d00000005)
Devfs Path: /vmfs/devices/disks/eui.0f60934767ed6d2731f35a0d00000005
* partedUtil and voma show the GPT partition table still intact.
[root@vm-host101:/var/log] partedUtil getptbl /dev/disks/eui.0f60934767ed6d273d0a632d00000000
gpt
35507 255 63 570425344
1 128 570425304 AA31E02A400F11DB9590000C2911D1B8 vmfs 0
[root@vm-host101:/var/log] voma -m ptbl -f check -d /vmfs/devices/disks/eui.0f60934767ed6d273d0a632d00000000
Running Partition table checker version 0.1 in check mode
Phase 1: Checking device for valid primary GPT
Detected valid GPT signatures
Number Start End Type
1 128 570425304 vmfs
Found a valid partition table on the device
Total Errors Found: 0
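As a quick sanity check on the partedUtil output above, the reported geometry and partition bounds are internally consistent. A small shell sketch (numbers copied from the output; 512-byte sectors assumed):

```shell
# Values taken from the partedUtil getptbl output above.
DISK_SECTORS=570425344        # last field of the geometry line
PART_START=128                # partition 1 start sector
PART_END=570425304            # partition 1 end sector

# The partition must end before the last sector of the disk.
if [ "$PART_END" -lt "$DISK_SECTORS" ]; then
  echo "partition fits on disk"
fi

# Partition size in GiB, assuming 512-byte sectors:
# (end - start + 1) / (2 * 1024 * 1024)
SIZE_GIB=$(( (PART_END - PART_START + 1) / 2097152 ))
echo "partition size: ~${SIZE_GIB} GiB"
```

So the partition table itself looks plausible for a ~272 GiB LUN; the damage appears to be below it, in the VMFS/LVM metadata.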
* esxcli shows no filesystems on those LUNs
[root@vm-host101:/var/log] esxcli storage filesystem list
Mount Point Volume Name UUID Mounted Type Size Free
------------------------------------------------- ----------- ----------------------------------- ------- ------ ------------ ------------
/vmfs/volumes/598ccaea-8afb9c77-80a4-001517d9a462 datastore1 598ccaea-8afb9c77-80a4-001517d9a462 true VMFS-6 241055039488 239537750016
/vmfs/volumes/dcaf3470-159653b7-59a1-55193a835180 dcaf3470-159653b7-59a1-55193a835180 true vfat 261853184 110673920
/vmfs/volumes/2044b777-2642e6ce-6fc4-7407f0f24801 2044b777-2642e6ce-6fc4-7407f0f24801 true vfat 261853184 110866432
/vmfs/volumes/598ccaf4-06170a72-d2b8-001517d9a462 598ccaf4-06170a72-d2b8-001517d9a462 true vfat 4293591040 4285333504
/vmfs/volumes/598ccadb-9d7c923c-6aa1-001517d9a462 598ccadb-9d7c923c-6aa1-001517d9a462 true vfat 299712512 83927040
* esxcfg-volume shows no snapshots
[root@vm-host101:/var/log] esxcfg-volume -l
[root@vm-host101:/var/log]
* voma likewise reports an issue with the filesystem itself.
[root@vm-host101:/var/log] voma -m vmfs -f check -d /vmfs/devices/disks/eui.0f60934767ed6d273d0a632d00000000:1
Checking if device is actively used by other hosts
Running VMFS Checker version 2.1 in check mode
Initializing LVM metadata, Basic Checks will be done
Initializing LVM metadata..-
LVM magic not found at expected Offset,
It might take long time to search in rest of the disk.
VMware ESX Question:
Do you want to continue (Y/N)?
0) _Yes
1) _No
Select a number from 0-1: 0
ERROR: LVM Major or Minor version Mismatch, Not supported
ERROR: Failed to Initialize LVM Metadata
VOMA failed to check device : Not Supported
Total Errors Found: 0
Kindly Consult VMware Support for further assistance
* dd shows data exists and even shows the datastore label I would expect.
[root@vm-host101:/var/log] dd if=/vmfs/devices/disks/eui.0f60934767ed6d273d0a632d00000000:1 | od -c | head
0000000 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0
*
114000000 ^ 0N1 0E3 / 030 \0 \0 \0 Q e S 0J5 X 1 \0 0O1
114000020 0D4 0K4 0J0 P 0C1 0H1 i 0O1 026 \0 \0 \0 V M -
114000040 W i n d o w s - 7 - x 6 4 - P r
114000060 o \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0
114000100 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0
*
114000220 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 002 \0
114000240 \0 \0 \0 020 \0 \0 \0 \0 \0 e S 0J5 X 001 \0 \0
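For targeted inspection it may also help to read specific offsets with dd rather than piping the whole device through od. The 1 MiB offset below is an assumption on my part: VMFS LVM metadata is commonly described as starting about 1 MiB into the partition, which is roughly where voma complains the LVM magic is missing. The sketch uses a scratch image file as a stand-in for the real device:

```shell
# Scratch image standing in for the LUN partition; on the host you would
# use if=/vmfs/devices/disks/<eui>:1 instead.
IMG=/tmp/fake-lun.img
dd if=/dev/zero of="$IMG" bs=1M count=4 2>/dev/null

# Plant a marker at the (assumed) 1 MiB LVM metadata offset so the read
# below has something to find.
printf 'LVMMAGIC' | dd of="$IMG" bs=1M seek=1 conv=notrunc 2>/dev/null

# Read the first 8 bytes at the 1 MiB mark -- this is the technique you
# would use to see what actually sits where the LVM header should be.
dd if="$IMG" bs=1M skip=1 count=1 2>/dev/null | head -c 8
```

On the real device, garbage (or zeros) at that offset, with recognizable data elsewhere, would be consistent with voma's "LVM magic not found at expected Offset" message.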
I had once seen what I thought was a similar issue, which I fixed by mapping the LUN to a different server, but that hasn't worked in this case. I even built a new server and added it to the ScaleIO network, and I still cannot see the filesystem.
In a "typical" Unix environment I would expect to be able to run fsck, but I'm not seeing an equivalent option here.
Any help would be greatly appreciated.
> In a "typical" Unix environment I would expect I could do an fsck, but I'm not seeing such an opportunity here.
The concept of having a tool like fsck or chkdsk is so old school :smileylaugh:
If you have customers who believe that buying redundant hardware plus additional software licenses is way cooler than command-line repair work, it would be counterproductive to supply a solid fsck tool. And it actually works out: companies that think as big as VMware suggests do not need recovery.
Seriously now: read Create a VMFS-Header-dump using an ESXi-Host in production | VM-Sickbay
Create a dump as explained there and provide a download link.
In most cases I can then give you a solid recovery prognosis within about an hour.
Contact me via Skype before you send any data.
If creating a dump fails due to I/O errors, let me know and I will create one myself.
Ulli
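For reference, a header dump of this kind is essentially a dd copy of the first part of the VMFS partition. The exact size the VM-Sickbay article asks for is not stated here, so the count below is a placeholder (check the article before sending anything); the sketch runs against a scratch image rather than a real device:

```shell
# Scratch image standing in for the LUN partition; on the ESXi host the
# source would be /vmfs/devices/disks/<eui>:1 instead.
SRC=/tmp/fake-lun-src.img
DUMP=/tmp/vmfs-header-dump.bin
dd if=/dev/zero of="$SRC" bs=1M count=8 2>/dev/null

# Copy the first N MiB of the partition (N=4 here purely for the demo;
# use the size the VM-Sickbay article specifies on a real volume) and
# compress it for upload.
dd if="$SRC" of="$DUMP" bs=1M count=4 2>/dev/null
gzip -f "$DUMP"
ls -l "${DUMP}.gz"
```

The dump contains only metadata regions, not the full datastore contents, which is why it stays small enough to share for analysis.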
Try to contact continuum via Skype (details in his profile), he might be able to help.
André
According to the outputs, the VMFS LVM is corrupted. Can you fix this?
This can happen due to a storage outage, a firmware issue, or the storage array being powered down and unplugged from the mains with expired/broken batteries.
> Can you fix this?
I do not see any sense in attempting to fix a VMFS that misbehaves after a power failure. To my customers I only recommend evacuating the datastore ASAP, wiping the LUN, and building a new volume from scratch.
The question here is whether the VMFS metadata still allows files to be extracted; I can tell you more after I have seen the dump.
Rule of thumb: the older the VMFS version and the more thick provisioning was used, the better the chances.