VMware Cloud Community
srodenburg
Expert

Cleanup after vSAN crash (due to power-failure)

Hello,

We had a total power outage on both our UPSes at the same time (yep, you can't make this sh*t up...), causing both power sections to go **POOF** at once.

Our 8-node vSAN 6.2 environment did not really like that, but I've cleaned up the mess and it's running again.

All is fine now, but I'm left with 6 components that show up as being in an invalid state:

VSAN Failed Objects after crash.png

Using RVC and "vsan.cmmds_find" etc., it turns out that these objects no longer seem to exist; for every one of these invalid-state objects, the output looks like this:

RVC Output.png

That's not a lot of info...

Anyone have a clue how to get rid of these error messages? They stick like tar. Are these objects really gone, or do they still lurk around somewhere? (The WebGUI keeps displaying them...)

Thanks in advance,

Steve

21 Replies
Deeban
VMware Employee

Hi,

You can try deleting these objects by using /usr/lib/vmware/osfs/bin/objtool from ESXi shell.

Sample command: /usr/lib/vmware/osfs/bin/objtool delete -u <UUID> -f -v 10


Check the obj_info from RVC before you delete these 6 objects to be sure.
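For example, a rough sequence could look like this (just a sketch; <Object UUID> is a placeholder, and the RVC object-info command is vsan.object_info in the builds I've used):

# In RVC, inspect the object first to see what it is and which VM it belongs to:
> vsan.object_info <cluster> <Object UUID>

# Only if it is confirmed stale, delete it from the ESXi shell of a host in the cluster:
/usr/lib/vmware/osfs/bin/objtool delete -u <Object UUID> -f -v 10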


Once this is done, these should not show up in the vSAN Health UI.


Thanks

DBN

srodenburg
Expert

Hello Deeban,

What do you mean by "Check the obj_info from RVC"? RVC commands never find the components listed in the WebGUI (see my second screenshot).

I would first like to know what they are before I decide on anything related to deleting stuff ;)

Deeban
VMware Employee

Hi Srodenburg,

I meant vsan.obj_info from RVC, but it doesn't take component UUIDs, only DOM object UUIDs.

Instead of the RVC command vsan.cmmds_find, could you please try cmmds-tool find -u <Component UUID> from the ESXi shell?
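Something along these lines (a sketch; <Component UUID> comes from the Health UI listing, and -f json just asks for the full entry if one exists):

# Look up a single CMMDS entry by UUID:
cmmds-tool find -u <Component UUID> -f json

# Or dump entries of a given type and grep for the UUID (DOM_OBJECT is one example type):
cmmds-tool find -t DOM_OBJECT -f json | grep <Component UUID>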

Thanks

DBN

elerium
Hot Shot

You may also want to try vsan.check_state <cluster> --refresh-state in RVC. I had a similar situation after a power outage, and this command corrected my invalid-state objects after a few minutes of running.
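Roughly what I ran, in case it helps (a sketch; the cluster path depends on your RVC inventory):

> vsan.check_state <cluster> --refresh-state
# Re-run it without the flag afterwards to confirm the entries are gone:
> vsan.check_state <cluster>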

DRAGandDROP
Contributor

Hi, I had a similar incident in my environment.

After a crash I had 7 inaccessible objects.

I managed to get all 7 objects accessible again. Maybe this helps you too.

First I used RVC to get the 7 inaccessible objects:

Use vsan.check_state . (with "." being the current cluster path in RVC).

This shows the UUIDs and the ESXi hosts they should belong to.

2016-07-06_13h58_59.jpg

Now log in to the respective ESXi host and execute

vsish -e set /vmkModules/vsan/dom/ownerAbdicate <<Inaccessible UUID>>


This will repair the objects.
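If there are several of them, a small loop in the ESXi shell saves some typing (a sketch; the UUIDs are placeholders taken from the vsan.check_state output):

for uuid in <UUID-1> <UUID-2> <UUID-3>; do
    vsish -e set /vmkModules/vsan/dom/ownerAbdicate $uuid
done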


Regards

Mario

srodenburg
Expert

Hi Deeban,

"Could you please try cmmds-tool find -u <Component UUID> from the ESXi Shell?"

I tried that with all 7 "failed" components on all nodes: nothing. The command gives no output. It just returns an empty line.

Screen Shot 2016-07-18 at 23.30.55.png

@DRAGandDROP: I'm not talking about inaccessible objects. I don't have those at all. I have components in an invalid state. Before I do anything at all, I first need to find them and find out what they are. And **THAT** is my problem: no command can find them. None so far.

Doing stuff like refresh-state etc. does not help. My 7 pigeons are still there, in all their invalid-state glory...

Screen Shot 2016-07-18 at 23.32.48.png

(the VSAN cluster is the cluster numbered #2)

Screen Shot 2016-07-18 at 23.38.15.png

I'm running out of ideas...

elerium
Hot Shot

I've run into this as a cosmetic bug, where if you use cmmds-tool or vsan.cmmds_find on a UUID it returns nothing (different from the usual case, where a lookup would return a response). You may want to open a case with VMware just to verify. In my case I'm being told it's a cosmetic bug that will be fixed in a future release (I'm on VSAN 6.2). I was able to remove it by destroying/recreating the disk group where the inaccessible object shows up, but I was told this was cosmetic and wouldn't affect operation.

VMware support sent me the following 2 months ago:

This issue (DISCARDED_COMPONENTS entries in CMMDS that are not being automatically cleaned up) has been identified as a bug, and VMware engineering is working on a fix.
A resolution date is currently unknown; however, this case will remain open and I will follow up with engineering and update you when a resolution is found.

MichaelGi
Enthusiast

We had the same thing happen in our VSAN 6.2 environment. Some of our Windows servers were corrupted and kept running chkdsk on boot. We ended up rebuilding them or restoring from backup. We also have inaccessible objects in the vSAN. I suppose filing a support request is the next step.

srodenburg
Expert

Everybody keeps talking about "inaccessible objects". I clearly stated several times that that is NOT the issue.

Oh well, I hope that this topic is somehow useful to somebody else, because it sure is useless to me...

I hope it's a cosmetic issue which will go away in some next release.

Before anyone starts: no, I'm not going to open a ticket. It's no use. It will take them 3 weeks to react in the first place, because it's a dev environment and there is no data loss or other urgent problem, so my ticket ends up at the bottom of the list. And I will end up talking to someone who does not have the experience to help anyway. That's just the way it works. Not complaining or ranting, just being realistic and pre-answering the inevitable question of why I did not open a ticket.

MichaelGi
Enthusiast

The biggest problem I see is that vSAN didn't come up gracefully after a power outage for many of us. It's very troubling. I've had the metadata issue also, and support informed me that it's cosmetic and will be fixed at some point with an update.

GreatWhiteTec
VMware Employee

I had the same/similar issues due to power failures and crashes. I wrote a blog post about it, so hopefully this will help:

https://greatwhitetec.com/2016/06/07/vsan-6-2-disk-format-upgrade-fails/

MichaelGi
Enthusiast

Thanks for this post and blog. We are also running HP 380 G9 servers. I'm going to look into upgrading iLO, because I've noticed the servers sometimes don't recognize the SD card when they are rebooted.

VictorQ
Contributor

Unfortunately you will experience the same level of service if it is a SEV1 PRODUCTION DOWN case as well.

They will run you around in circles as you stay up with them while the case follows the sun around all their support offices.

Next thing you know, 4 weeks have passed, your hair is white and your issue is worse.

elerium
Hot Shot

There is now a KB article for fixing invalid-state components:

Component metadata health check fails with invalid state error (2145347) | VMware KB

The short story is that you'll need to destroy/recreate the affected disks/disk groups. In my own experience, you're better off destroying the affected disk group instead of just the individual affected disk. For whatever reason, decommissioning a disk group was faster than decommissioning single disks that had this problem.
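For anyone following along, the rough shape of the procedure from the ESXi shell is below (a sketch only; follow the KB for your version, the device names are placeholders, and check that the --evacuation-mode option exists in your build before relying on it):

# Identify the disk group and its cache-tier device:
esxcli vsan storage list

# Remove the disk group by removing its cache-tier SSD, evacuating the data first:
esxcli vsan storage remove -s <cache_tier_device> -m evacuateAllData

# Re-create the disk group with the same cache and capacity devices:
esxcli vsan storage add -s <cache_tier_device> -d <capacity_device_1> -d <capacity_device_2>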

depping
Leadership

Unfortunately you will experience the same level of service if it is a SEV1 PRODUCTION DOWN case as well.

Actually, let me correct that: we recently opened a live queue for Severity 1 issues, which means that in these situations you are transferred and get someone on the phone. Also, the team has doubled in the past 2 months and we will do the same in the upcoming quarters.

zdickinson
Expert

Good afternoon, that is great to hear. Thank you, Zach.

srodenburg
Expert

Well, it's party time again here in our 8-node cluster. On host ESX03, one of the 4 rotational disks died. No big deal, one would think. After an hour, the objects that got marked as Absent were rebuilt elsewhere and the cluster was fine again. All green. No errors. Let me repeat that: no errors.

Then the time came to swap out the broken disk. So the disk was cleanly removed from its disk group on host 03, and a new disk was inserted into the chassis and added to the disk group. Suddenly, there are 4 objects on that host in an "invalid state" (they appeared after the broken disk was removed from the disk group). Again, these objects cannot be found with ANY tool (and yes, I know them all by now).

So once again, just because one disk in a disk group died, I will need to evacuate all data from the entire disk group (takes a day), destroy this host's disk group and re-create it. Then rebalance (in essence, move the data back in), which again takes quite a while.

I really like vSAN, but it's things like this, happening to us down in the trenches, that really get on my nerves... I said it in another thread already: vSAN is fine until a hardware component craps itself. You can replace the hardware, but you'll often have some form of cleaning up to do after such events (which takes a lot of time).

The sheer amount of time that our lab's vSAN has spent in some form of data evacuation, repair, rebalance, rebuild, re-whatever is staggering. On an array, you lose a disk, replace the disk, it takes a couple of hours to rebuild (if it's a big, slow nearline disk) and it's done. Get a beer.

With vSAN: bend over, grab your ankles and be ready to spend another all-nighter in the datacenter, cleaning up messed-up objects just because a friggin' disk died. Disks die. That's what they do. Why does vSAN rob so much time from me because it cannot look after itself? Honestly...

Sorry for my rant. I just hope VMware sees this and prioritizes vSAN's self-healing capabilities, because the way it is now, I would never take vSAN into large-scale production. The sheer horror of having to fix large environments because "vSAN the object slayer" left dead bodies everywhere, just because a piece of hardware broke, puts me off. I see the potential of vSAN, but it's just not there yet.

elerium
Hot Shot

Cleaning up invalid-state objects sucks. I'm still not even sure what causes these, as I have clusters that end up with them without any hardware failures (I'm on 6.0 U2 patch 4). I can go through the cleanup procedure (evacuate/destroy/recreate), and when I reboot all the nodes in my cluster (using safe maintenance mode), I'll end up with more invalid-state objects. Every time I open a support case for this, the engineer just points me to the KB article I already know about, and there's no resolution on why these occur in the first place. I fully agree that stuff like this needs to be handled automatically by vSAN for production use. I'm also almost certain my clusters/SSDs have gone through many, many more resync writes than actual writes over their whole lifetime.

On the positive side, I tested 6.0 U3 on my lab cluster, and the log write improvements in 6.0 U3 greatly improved congestion/latency during resync operations. Previously, a resync operation caused pretty severe latency problems for me if I didn't use the LSOM log modifications provided by support.

depping
Leadership

Do you have an SR for this issue so that I can point engineering to it?
