VMware Cloud Community
bardtr
Contributor

VSAN health check - component metadata health

Hi,

I just installed the VSAN health plugin on a vCenter Server Appliance and four brand-new servers. Each server has a total of 8 SSDs.

Right after running the basic health tests in vCenter, I saw this metadata error on one of the hosts:

(screenshot attached: pastedImage_3.png)

Has anyone encountered this error before and found a solution? From VMware's pages and FAQs, it seems the error can be related to the RAID controller, SSD failure, and other issues.

Any other suggestions for pinpointing the location of this metadata health error?

Thanks and regards :)

BT

6 Replies
srodenburg
Expert

In my environment, I got this message directly after upgrading everything from 6.0 U1 to 6.0 U1b.

I have no idea how to solve it. There is a useless KB Article (VMware KB: Virtual SAN Health Service - Physical Disk Health - Metadata Health).

If only I could find out which disk that component is sitting on and what that component is.

Looking everywhere and anywhere, all the disks, VMs, etc. are fine, so I have no idea what this message means.

I tried to find that object with RVC, to no avail (I blame my limited experience with RVC for that).

CHogan
VMware Employee

I agree - this is a real drag. Unfortunately for you, it will have to be done via RVC.

First, search on the component UUID, to get the disk UUID:

/localhost/Cork-Datacenter/computers> vsan.cmmds_find 0 -u dc3ae056-0c5d-1568-8299-a0369f56ddc0
+---+-------------+--------------------------------------+-------------------------+---------+-----------------------------------------------------------+
| # | Type        | UUID                                 | Owner                   | Health  | Content                                                   |
+---+-------------+--------------------------------------+-------------------------+---------+-----------------------------------------------------------+
| 1 | LSOM_OBJECT | dc3ae056-0c5d-1568-8299-a0369f56ddc0 | esxi-hp-05.rainpole.com | Healthy | {"diskUuid"=>"52e5ec68-00f5-04d6-a776-f28238309453",      |
|   |             |                                      |                         |         |  "compositeUuid"=>"92559d56-1240-e692-08f3-a0369f56ddc0", |
|   |             |                                      |                         |         |  "capacityUsed"=>167772160,                               |
|   |             |                                      |                         |         |  "physCapacityUsed"=>167772160,                           |
|   |             |                                      |                         |         |  "dedupUniquenessMetric"=>0,                              |
|   |             |                                      |                         |         |  "formatVersion"=>1}                                      |
+---+-------------+--------------------------------------+-------------------------+---------+-----------------------------------------------------------+
/localhost/Cork-Datacenter/computers>

Now that you have the diskUuid, you can use that in the next command:

/localhost/Cork-Datacenter/computers> vsan.cmmds_find 0 -t DISK -u 52e5ec68-00f5-04d6-a776-f28238309453
+---+------+--------------------------------------+-------------------------+---------+-------------------------------------------------------+
| # | Type | UUID                                 | Owner                   | Health  | Content                                               |
+---+------+--------------------------------------+-------------------------+---------+-------------------------------------------------------+
| 1 | DISK | 52e5ec68-00f5-04d6-a776-f28238309453 | esxi-hp-05.rainpole.com | Healthy | {"capacity"=>145303273472,                            |
|   |      |                                      |                         |         |  "iops"=>100,                                         |
|   |      |                                      |                         |         |  "iopsWritePenalty"=>10000000,                        |
|   |      |                                      |                         |         |  "throughput"=>200000000,                             |
|   |      |                                      |                         |         |  "throughputWritePenalty"=>0,                         |
|   |      |                                      |                         |         |  "latency"=>3400000,                                  |
|   |      |                                      |                         |         |  "latencyDeviation"=>0,                               |
|   |      |                                      |                         |         |  "reliabilityBase"=>10,                               |
|   |      |                                      |                         |         |  "reliabilityExponent"=>15,                           |
|   |      |                                      |                         |         |  "mtbf"=>1600000,                                     |
|   |      |                                      |                         |         |  "l2CacheCapacity"=>0,                                |
|   |      |                                      |                         |         |  "l1CacheCapacity"=>16777216,                         |
|   |      |                                      |                         |         |  "isSsd"=>0,                                          |
|   |      |                                      |                         |         |  "ssdUuid"=>"52bbb266-3a4e-f93a-9a2c-9a91c066a31e",   |
|   |      |                                      |                         |         |  "volumeName"=>"NA",                                  |
|   |      |                                      |                         |         |  "formatVersion"=>"3",                                |
|   |      |                                      |                         |         |  "devName"=>"naa.600508b1001c5c0b1ac1fac2ff96c2b2:2", |
|   |      |                                      |                         |         |  "ssdCapacity"=>0,                                    |
|   |      |                                      |                         |         |  "rdtMuxGroup"=>80011761497760,                       |
|   |      |                                      |                         |         |  "isAllFlash"=>0,                                     |
|   |      |                                      |                         |         |  "maxComponents"=>47661,                              |
|   |      |                                      |                         |         |  "logicalCapacity"=>0,                                |
|   |      |                                      |                         |         |  "physDiskCapacity"=>0,                               |
|   |      |                                      |                         |         |  "dedupScope"=>0}                                     |
+---+------+--------------------------------------+-------------------------+---------+-------------------------------------------------------+
/localhost/Cork-Datacenter/computers>

In the devName field above, you now have the NAA id (the SCSI id) of the disk.
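If it helps, once you have the NAA id you can usually confirm which physical device it is directly on the host over SSH. This is just a sketch using the id from the example output above (drop the ":2" partition suffix, and substitute your own device name):

esxcli storage core device list -d naa.600508b1001c5c0b1ac1fac2ff96c2b2
esxcli vsan storage list

The first command prints the device details (vendor, model, size), and the second lists the VSAN-claimed disks so you can see which disk group the device belongs to.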

I will leave some feedback on the KB on how to determine the disk through RVC.

A word of caution, however - this health check is transitory in nature. A failure here does not necessarily mean there is anything inherently wrong with the device; it could be that some peak load is running on the system temporarily and the threshold set for the health check has been exceeded. I would revisit the health check regularly and periodically re-test to see whether it is still failing. If you are still concerned, please discuss it with our support organisation.
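If memory serves, the health summary can also be re-run from RVC once the health plugin is installed; something along these lines (check 'help' in RVC if the command name differs in your build):

vsan.health.health_summary <path-to-cluster>

That gives the same pass/fail summary as the web client, which makes it easy to re-check after any peak load has died down.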

http://cormachogan.com
srodenburg
Expert

In my case, I kept having the error for weeks, but everything ran perfectly fine and the device was good.

The error went away when I upgraded to 6.2 and the filesystem was upgraded from v2 to v3. Due to the "re-allocate everything on a node -> upgrade the filesystem -> move it all back" action during the upgrade, the data was apparently rebuilt "correctly" and the bad data got deleted. Something along those lines.
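For anyone who wants to check or drive that on-disk format upgrade from RVC, I believe the relevant commands are roughly these (from memory, so verify with 'help' first):

vsan.disks_stats <path-to-cluster>       # lists the disks and, as far as I recall, their on-disk format version
vsan.ondisk_upgrade <path-to-cluster>    # kicks off the v2 -> v3 on-disk format upgrade

The upgrade does the per-host "evacuate, reformat, move back" cycle described above, so plan for the data movement.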

CHogan
VMware Employee

Yes - we are aware of a cosmetic issue around this health check where it can give a false positive with a status of "invalid state", not "failed". We're working to have that addressed.

In the meantime, if anyone sees this error and wants to check whether it is a false positive, open an SR with our support folks and they can verify it for you once they have the logs.

http://cormachogan.com
Nocturne
Contributor

Hi

If I may, I'd like to explain my workaround for this issue. Please correct me if I am wrong, but this helped me resolve the "wrong" status. It is possibly a one-in-a-hundred solution for this little problem.

1. I put the host named in the Host field (for the components with issues) into maintenance mode and chose "Full data migration" (see the command sketch at the end of this post). I am not sure if this step is necessary.

2. When the data was fully migrated, I rebooted the affected host, because I remembered there is a VSAN preparation process, or something similar, during which the VSAN components for that host are reinitialized.

After the reboot, the health check was completely green and successful.

I hope this can be input for others, and perhaps someone can verify this procedure?
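For what it's worth, step 1 can also be done from the host's command line over SSH; a rough sketch (flag names from memory, so double-check with --help):

esxcli system maintenanceMode set -e true -m evacuateAllData    # enter maintenance mode with full data migration
# ... wait for the evacuation to finish, then reboot the host ...
esxcli system maintenanceMode set -e false                      # exit maintenance mode after the reboot

The -m option also accepts ensureObjectAccessibility or noAction if a full evacuation is not wanted.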

Csh2
Contributor

Hi, Nocturne

It seems this problem has been solved recently. If not, there is a solution: Component metadata health check fails with invalid state error (2145347) | VMware KB

According to that KB, you have to remove the disk from the disk group, or destroy (recreate) the entire disk group.
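If you'd rather do that removal from the host's command line, the relevant esxcli calls are along these lines (naa.<device-id> is a placeholder - use your own device name, and make sure the data is evacuated first):

esxcli vsan storage remove -d naa.<device-id>    # remove a single capacity disk from its disk group
esxcli vsan storage remove -s naa.<device-id>    # removing the cache SSD tears down the whole disk group

After recreating the disk group, give the health check a little time and re-test.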
