VMware Cloud Community
srodenburg
Expert

Cleanup after vSAN crash (due to power-failure)

Hello,

Both of our UPSes failed at the same time (yep, you can't make this sh*t up...), causing both power sections to go **POOF** at once.

Our 8-node vSAN 6.2 environment did not really like that, but I've cleaned up the mess and it's running again.

All is fine now, but I'm left with 6 components that show up as having an invalid state:

VSAN Failed Objects after crash.png

Using RVC with "vsan.cmmds_find" etc., it turns out that these objects no longer seem to exist: for every object in an invalid state, the output looks like this:

RVC Output.png

That's not a lot of info...

Does anyone have a clue how to get rid of these error messages? They stick like tar. Are these objects really gone, or do they still lurk around somewhere (because the WebGUI keeps displaying them...)?

Thanks in advance,

Steve

21 Replies
srodenburg
Expert

Hi Duncan,

No. It's a lab environment on an NFR license, so it's not entitled to support. By the way, I already removed and rebuilt the disk group, wiping that "evidence". There will only be logs (if stuff like this gets logged to such a degree that it would help engineering).

But I think dev is aware that vSAN gets "confused" quite easily when hardware breaks: a disk is suddenly gone, the absent objects get rebuilt successfully, but I guess there remain "things" in metadata that died with the disk and that the current set of tools cannot locate. So support will just say "remove the disk group and start over", as they don't have the means to locate, let alone fix, stuff like this either (disclaimer: that has been my impression so far).

If you are really interested, I could still open an SR (I'll need your blessing as, again, I'm not entitled to it) and maybe engineering can pull useful logs (we have Log Insight running too) and have something to work with.

Otherwise, if it happens again, instead of "fixing" it myself, I could open an SR and show engineering "live" what's going on. If they only ever hear "stories" like this but can't reproduce them or get access to an affected environment, how can they improve the product? I'm happy to help.

Kind regards,

Steve

depping
Leadership

I just started digging, Steve, and we have several of these cases in our bug database, which are either being worked on or have been solved in an upcoming patch release. If we need more info, I will let you know.
