mschenker
Contributor

VMFS 5, metadata corruption, how come?

Shortly before we discovered the corruption we installed VMware View 5.1. The next morning all our ESXi 5 hosts were down, while the storage (HP P6300) was still running. After we restarted the ESXi hosts, one of them could not mount the VMFS partition the VM files reside on.

We sent the logs to VMware support. Their answer was: corrupt metadata, tough luck, reinstall the VMFS after moving 6 TB of data off...

There seem to be 64k "holes" all over the filesystem, around 7000 occurrences.

Is there any other way? When I asked for tools to check the VMFS state, the reply was: no such thing, reinstall... That seems a bit poor for a production filesystem. Repair might be tricky, but I'd like to know what's going on before rebuilding the filesystem just like that.

Anyway, before blowing away the whole system I'd like to get some feedback on how to avoid experiencing this again.

a) Has anybody else had problems with View 5.1?

b) How do I protect the VMFS metadata? We had NO errors on the P6300 at all. The system uses RAID 5 on the disks, and there were no disk failures.

Thanks for any feedback,

Martin

3 Replies
brucekconvergen
Enthusiast

That is a good question -- I've been wondering why VMware hasn't released a disk check tool (fsck, etc.) that works with VMFS volumes, to repair minor issues or even just to detect them. Speaking of which, how did you know that you had corrupt metadata? Care to post the relevant log excerpts so we can check our own systems for similar errors?

mschenker
Contributor

The entries in the log files of server 3 said "Invalid metadata" when trying to mount the data partition.

We have two partitions attached via SAN: a boot-from-SAN system disk (not affected, the server still comes up) and the data partition.

5-23T07:54:22.538Z cpu0:2626)WARNING: Res3: 6121: Invalid clusterNum: expected 29986, read 0

5-23T07:54:22.538Z cpu0:2626)WARNING: J3: 1574: Failed to reserve space for journal on 4e73485a-5f288060-e44f-001e0bd796ae : Invalid metadata

5-23T07:54:22.538Z cpu2:2626)FSS: 890: Failed to get object f530 28 1 4e73485a 5f288060 1e00e44f ae96d70b 0 0 0 0 0 0 0 :Invalid metadata

5-23T07:54:22.538Z cpu2:2626)WARNING: Fil3: 2034: Failed to reserve volume f530 28 1 4e73485a 5f288060 1e00e44f ae96d70b 0 0 0 0 0 0 0

5-23T07:54:22.538Z cpu2:2626)FSS: 890: Failed to get object f530 28 2 4e73485a 5f288060 1e00e44f ae96d70b 4 1 0 0 0 0 0 :Invalid metadata

This prompted the call to VMware; they created a disk dump file and checked its contents.

The other two ESXi 5.0U1 servers are still running and we're preparing the evacuation now...
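For anyone who wants to check their own hosts for the same messages, here is a rough sketch in Python. It only looks for the wording of the warnings quoted above; other ESXi builds may phrase them differently, so a clean result does not prove the volume is healthy. The pattern names and the idea of running it against a copy of the vmkernel log pulled from the host are my own assumptions, not anything VMware support suggested.

```python
import re
import sys
from collections import Counter

# Patterns copied from the warnings quoted in this post.
# The labels on the left are made up for this sketch.
PATTERNS = {
    "invalid_cluster": re.compile(
        r"Res3: \d+: Invalid clusterNum: expected (\d+), read (\d+)"
    ),
    "journal_reserve": re.compile(
        r"J3: \d+: Failed to reserve space for journal on (\S+)"
    ),
    "invalid_metadata": re.compile(r"Invalid metadata"),
}


def scan_lines(lines):
    """Return (counts per pattern, list of (line number, pattern name, text))."""
    counts = Counter()
    hits = []
    for number, line in enumerate(lines, start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                hits.append((number, name, line.rstrip()))
    return counts, hits


def scan_file(path):
    """Scan one log file; undecodable bytes are replaced rather than fatal."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return scan_lines(handle)


if __name__ == "__main__":
    # Usage: python scan_vmfs_log.py <copy-of-vmkernel-log> [more logs...]
    for log_path in sys.argv[1:]:
        counts, hits = scan_file(log_path)
        print(f"{log_path}: {sum(counts.values())} suspicious entries")
        for number, name, text in hits:
            print(f"  line {number} [{name}] {text}")
```

A line can match more than one pattern (the journal warning also contains "Invalid metadata"), so the total counts matches, not distinct log lines.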

Best, Martin

brucekconvergen
Enthusiast

Thanks for the info... I'd suspect that the SAN has some sort of issue, to produce that kind of corruption.
