VMware Cloud Community
bolgoff
Contributor

Dump VMFS metadata without unmounting datastore

Hi,

I want to check the metadata of a VMFS-5 datastore for corruption, but without unmounting it from all hosts or stopping the running VMs.

VOMA only works when the datastore is unmounted.

Technically, I can dump the metadata of the active datastore with a dd command like this:

dd if=/vmfs/devices/disks/naa.600508b7801cda92f124520ea7f5ff27:1 of=/tmp/naa.600508b7801cda92f124520ea7f5ff27.dd bs=1M count=1500 conv=notrunc


But is it safe to run this command on an active datastore? Could it damage data on the datastore?

I know dd would only be reading from the device, but could it conflict with a host that wants to write to the same region? If the dd process blocked that write operation, could that cause damage?

7 Replies
continuum
Immortal

That command is safe. However, be aware that the dd dump will be inconsistent if the datastore is active during the few minutes the dd command runs.
May I ask how you plan to proceed once you have the dump file?
I would suggest using debugvmfs, which comes with vmfs-tools for Linux.
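
On the consistency point: a rough way to see whether the dumped region is changing while it is read is to read it twice and compare checksums. This is only a sketch. `region_stable` is a made-up helper name, and it assumes `md5sum` is available in the shell where it runs. Matching checksums only mean the two reads agreed; they say nothing about whether the metadata itself is valid.

```shell
# Hypothetical helper: read the first COUNT_MB megabytes of DEVICE twice
# and succeed only if both reads produce the same checksum.
# Usage: region_stable DEVICE COUNT_MB
region_stable() {
    dev=$1
    count=$2
    # Both reads are read-only; nothing is written to the device.
    first=$(dd if="$dev" bs=1M count="$count" 2>/dev/null | md5sum | cut -d' ' -f1)
    second=$(dd if="$dev" bs=1M count="$count" 2>/dev/null | md5sum | cut -d' ' -f1)
    [ "$first" = "$second" ]
}
```

With the device from this thread it would be called as `region_stable /vmfs/devices/disks/naa.600508b7801cda92f124520ea7f5ff27:1 1500 && echo "no change between reads"`.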

Ulli

 

 


________________________________________________
Do you need support with a VMFS recovery problem ? - send a message via skype "sanbarrow"
I do not support Workstation 16 at this time ...

bolgoff
Contributor

Once I have the dump file, I use a little trick that seems to work pretty well.

I created a nested ESXi host (a virtual machine with ESXi installed on it).

That VM has two disks: a main disk, where ESXi is installed and the first datastore is located,

and a second disk that holds only a second datastore for testing.

Then I copy the dump file to the nested ESXi host and, with the same dd command, write the dump onto the second disk, overwriting its metadata with the metadata from the dump, like this:

dd if=/vmfs/volumes/datastore1/tmp/naa.600508b7801cda92f124520ea7f5ff27.dd of=/vmfs/devices/disks/mpx.vmhba1:C0:T1:L0:1 bs=1M count=1500

After that I can run VOMA against that second disk:

voma -m vmfs -f check -d /vmfs/devices/disks/mpx.vmhba1:C0:T1:L0:1

 

I will see messages like these:

ON-DISK ERROR: Invalid device Size

Found stale lock

As I understand it, these are expected. Any other messages should be investigated.
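
To separate the expected messages from everything else, the VOMA output can be saved to a file and filtered. A minimal sketch, assuming the output was redirected to a log file. `unexpected_errors` is a hypothetical helper name, and the two patterns are just the messages quoted above.

```shell
# Hypothetical helper: print ERROR lines from a saved VOMA log,
# skipping the two messages that are expected on the test disk.
# Usage: unexpected_errors LOGFILE
unexpected_errors() {
    grep -v -e 'Invalid device Size' -e 'Found stale lock' "$1" | grep 'ERROR'
}
```

For example: `voma -m vmfs -f check -d /vmfs/devices/disks/mpx.vmhba1:C0:T1:L0:1 > /tmp/voma.log 2>&1`, then `unexpected_errors /tmp/voma.log` (the log path is only an illustration).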

continuum
Immortal

Nice ...

An additional trick if you run into issues because of the incorrect disk size:
Also dump the first MB of the original disk to get the original partition table, then use a thin-provisioned vmdk and set it to the original size.
Finally, fix the GPT by copying the GPT to the end of the disk ...
Then the VMFS volume only needs a resignature to appear as the real deal.
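
For the first step, the partition table sits on the whole-disk device rather than on the partition, so the dump would be taken from the device path without the `:1` partition suffix. A minimal sketch, with `dump_first_mb` as a hypothetical helper name and the output path chosen only for illustration:

```shell
# Hypothetical helper: copy the first 1 MB of SRC to DST.
# On a GPT disk this range covers the protective MBR and the primary GPT.
# Usage: dump_first_mb SRC DST
dump_first_mb() {
    dd if="$1" of="$2" bs=1M count=1 2>/dev/null
}
```

With the disk from this thread that would be `dump_first_mb /vmfs/devices/disks/naa.600508b7801cda92f124520ea7f5ff27 /tmp/naa.600508b7801cda92f124520ea7f5ff27.first1M.dd`.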

By the way - have you tried voma with the fix option yet, or do you use this for analysis only?

Ulli

 



bolgoff
Contributor

I had a PSOD on one of the ESXi hosts, and as a result the metadata on two datastores was corrupted, with some vmdk-flat files missing completely and some vmdk-flat files corrupted. I have already evacuated the rest of the data from those datastores and restored the corrupted files from backups. There is no point in fixing the metadata on those datastores. I will just unpresent the backend LUNs and recreate them.

Now I want to analyze the metadata on the other datastores. They show no obvious signs of corruption so far, but I want to be sure they are not a ticking time bomb.

continuum
Immortal

> along with some vmdk-flat files missing completely and some vmdk-flat files got corrupted ...

If you ever need assistance with problems like that - call me on skype. I deal with that stuff almost daily.

Ulli

 

 



bolgoff
Contributor

> If you ever need assistance with problems like that - call me on skype. I deal with that stuff almost daily.

I sent you some messages several days ago, but you didn't answer.

bolgoff
Contributor

My trick does not always work. For some datastores it works, and for some it does not.

I have a corrupted datastore that is unmounted, so I can run voma on it directly. It shows a lot of errors.

But after I applied my trick and ran voma on the test ESXi host, I got these messages:


Checking if device is actively used by other hosts
Running VMFS Checker version 2.1 in check mode
Initializing LVM metadata, Basic Checks will be done
ON-DISK ERROR: Invalid device Size 2040108400128, should be 11809063424
Phase 1: Checking VMFS header and resource files
Detected VMFS file system (labeled:'*****') with UUID:*****, Version 5:61
ERROR: Short IO access
ON-DISK ERROR: Corruption too severe in resource file [FB]
ERROR: Failed to check fbb.sf.
VOMA failed to check device : IO error

Total Errors Found: 2
Kindly Consult VMware Support for further assistance


But on the original datastore, the voma check looks like this:


Checking if device is actively used by other hosts
Running VMFS Checker version 2.1 in check mode
Initializing LVM metadata, Basic Checks will be done
Phase 1: Checking VMFS header and resource files
Detected VMFS file system (labeled:'*****') with *******, Version 5:61
ON-DISK ERROR: Cluster number 9604 should be 9605
Phase 2: Checking VMFS heartbeat region
Phase 3: Checking all file descriptors.
ON-DISK ERROR: Duplicate addresses found: <FDA cnum 622 rnum 5> <407, -1> (FBA tbz 0 cow 0 blk 1920848)
ON-DISK ERROR: Duplicate addresses found: <FDA cnum 622 rnum 5> <767, -1> (FBA tbz 0 cow 0 blk 1920839)

........

Total Errors Found: 217857

 

For 3 out of 5 other live datastores that are in use, I got exactly the same "Failed to check fbb.sf." error after doing the trick on the test ESXi host.

Why does the trick not work for some datastores, and should I be worried?
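
For reference, the two sizes in the first log can be converted to MiB and put next to the 1500 MiB that was dumped. A small sketch using shell arithmetic. `bytes_to_mib` is a hypothetical helper, and it assumes the shell does 64-bit integer arithmetic.

```shell
# Hypothetical helper: convert a byte count to whole MiB (rounded down).
bytes_to_mib() {
    echo $(( $1 / 1048576 ))
}

bytes_to_mib 2040108400128   # prints 1945598 (about 1900 GiB)
bytes_to_mib 11809063424     # prints 11262 (about 11 GiB)
```

So one of the two sizes VOMA reports is about 1900 GiB and the other about 11 GiB, while the dump covers only the first 1500 MiB. This only quantifies the mismatch; the log does not say whether it is related to the fbb.sf failure.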
