Hi,
I've recently upgraded a VSAN cluster from vCS 5.5 U2 to vCS 6.0 U1. All components upgraded successfully without issue.
When trying to upgrade the VSAN on-disk filesystem from v1 to v2, I ran through all the recommended checks using the RVC console. The cluster info is fine, and the disk check also reported everything OK. However, when I run vsan.check_state I receive the following:
2015-09-15 09:42:25 +0100: Step 1: Check for inaccessible VSAN objects
Detected 16 objects to be inaccessible
Detected 373e4e54-f147-c202-e824-ecf4bbc4a6b0 on esxihost003.domain.com to be inaccessible
Detected f9b82d54-e1ba-6307-903a-ecf4bbc4aef8 on esxihost003.domain.com to be inaccessible
Detected b1b94b54-c02f-fa10-2de3-ecf4bbc4b858 on esxihost003.domain.com to be inaccessible
Detected 682d4d54-987c-4028-06eb-ecf4bbc4a6b0 on esxihost003.domain.com to be inaccessible
Detected 61603654-0838-e230-40ec-ecf4bbc4aef8 on esxihost003.domain.com to be inaccessible
Detected 07a72954-88fb-633a-f47d-ecf4bbc4aef8 on esxihost003.domain.com to be inaccessible
Detected a2242354-c6a4-f93b-9d4e-ecf4bbc4b858 on esxihost003.domain.com to be inaccessible
Detected 56dc2354-88e8-d141-c60b-ecf4bbc4b858 on esxihost003.domain.com to be inaccessible
Detected 39554f54-4fc7-4863-fd43-ecf4bbc4a6b0 on esxihost003.domain.com to be inaccessible
Detected 51dc2354-c5bd-7883-34ef-ecf4bbc4a6b0 on esxihost003.domain.com to be inaccessible
Detected 264b3d54-497b-6da6-440d-ecf4bbc4b858 on esxihost003.domain.com to be inaccessible
Detected c53a4054-9929-fcb5-5d17-ecf4bbc4b858 on esxihost003.domain.com to be inaccessible
Detected 57dc2354-9845-ddcf-ea9e-ecf4bbc4a6b0 on esxihost003.domain.com to be inaccessible
Detected 57dc2354-a496-e2df-6548-ecf4bbc4a6b0 on esxihost003.domain.com to be inaccessible
Detected b04f4e54-1748-fbe3-7f89-ecf4bbc4aef8 on esxihost003.domain.com to be inaccessible
Detected 86422454-35ad-2efa-ff43-ecf4bbc4aef8 on esxihost003.domain.com to be inaccessible
2015-09-15 09:42:26 +0100: Step 2: Check for invalid/inaccessible VMs
2015-09-15 09:42:26 +0100: Step 3: Check for VMs for which VC/hostd/vmx are out of sync
Did not find VMs for which VC/hostd/vmx are out of sync
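(For reference, the pre-upgrade checks I ran from RVC were along these lines; exact invocations from memory, so verify against your RVC version:
vsan.cluster_info ~cluster
vsan.disks_stats ~cluster
vsan.check_state ~cluster
The first two came back clean; only check_state reported the problems above.)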
If I query one of the object UUIDs, I receive the following:
/localhost/DATACENTER> vsan.object_info ~cluster/ 682d4d54-987c-4028-06eb-ecf4bbc4a6b0
2015-09-15 09:34:02 +0100: Fetching VSAN disk info from esxihost003.domain.com (may take a moment) ...
2015-09-15 09:34:02 +0100: Fetching VSAN disk info from esxihost002.domain.com (may take a moment) ...
2015-09-15 09:34:02 +0100: Fetching VSAN disk info from esxihost001.domain.com (may take a moment) ...
2015-09-15 09:34:05 +0100: Done fetching VSAN disk infos
DOM Object: 682d4d54-987c-4028-06eb-ecf4bbc4a6b0 (v1, owner: esxihost003.domain.com, policy: No POLICY entry found in CMMDS)
RAID_1
Component: 682d4d54-9ab3-924b-8d06-ecf4bbc4a6b0 (state: ABSENT (6), host: esxihost002.domain.com, md: naa.5000c5007680bc7b, ssd: naa.5001e8200279bea0, note: LSOM object not found,
votes: 1)
Component: 682d4d54-9ffb-934b-d42c-ecf4bbc4a6b0 (state: ABSENT (6), host: esxihost001.domain.com, md: naa.5000c500767f53d3, ssd: naa.5001e8200279bfa4, note: LSOM object not found,
votes: 1)
Witness: 682d4d54-45d6-944b-5637-ecf4bbc4a6b0 (state: ABSENT (6), host: esxihost003.domain.com, md: naa.5000c5007680a9af, ssd: naa.5001e8200279c2c4,
votes: 1, usage: 0.0 GB)
Is this something I need to be concerned about? I do not believe I'll be able to proceed with the upgrade while these objects are inaccessible. I can see that there is a command to purge vswap files, but I do not believe that is related.
Can anyone assist?
As an update on my progress: I tried to run the on-disk upgrade with vsan.v2_ondisk_upgrade --allow-reduced-redundancy ~cluster/ (I only have 3 nodes in this environment).
I then decided to use the purge vswap command as prompted:
/localhost/Astro House> vsan.purge_inaccessible_vswp_objects ~cluster/
2015-09-15 10:18:21 +0100: Collecting all inaccessible Virtual SAN objects...
2015-09-15 10:18:21 +0100: Found 16 inaccessbile objects.
2015-09-15 10:18:21 +0100: Selecting vswp objects from inaccessible objects by checking their extended attributes...
2015-09-15 10:18:23 +0100: Found 8 inaccessible vswp objects.
+--------------------------------------+------------------------------------------------------------------------------------------------------------------------+-----------------------+
| Object UUID | Object Path | Size |
+--------------------------------------+------------------------------------------------------------------------------------------------------------------------+-----------------------+
| 373e4e54-f147-c202-e824-ecf4bbc4a6b0 | /vmfs/volumes/vsan:525bd4560356279a-3a266724259e7439/7ae04454-9c87-b751-1c3f-ecf4bbc4a6b0/view8pdCLS0007-9a1d7133.vswp | 4294967296B (4.00 GB) |
| f9b82d54-e1ba-6307-903a-ecf4bbc4aef8 | /vmfs/volumes/vsan:525bd4560356279a-3a266724259e7439/39432454-c2ed-3aa7-bc6a-ecf4bbc4aef8/view8ppdst0004-96d19321.vswp | 4294967296B (4.00 GB) |
| 61603654-0838-e230-40ec-ecf4bbc4aef8 | /vmfs/volumes/vsan:525bd4560356279a-3a266724259e7439/f02d3454-54fb-8e8a-9742-ecf4bbc4aef8/view8ppdst0006-b1510132.vswp | 4294967296B (4.00 GB) |
| 07a72954-88fb-633a-f47d-ecf4bbc4aef8 | /vmfs/volumes/vsan:525bd4560356279a-3a266724259e7439/2a422454-c645-1a34-fa74-ecf4bbc4aef8/view8ppdst0001-cd35ca2e.vswp | 4294967296B (4.00 GB) |
| 39554f54-4fc7-4863-fd43-ecf4bbc4a6b0 | /vmfs/volumes/vsan:525bd4560356279a-3a266724259e7439/8cdc4454-5abc-de04-ffed-ecf4bbc4b858/view10st0004-f6688e6c.vswp | 4294967296B (4.00 GB) |
| 264b3d54-497b-6da6-440d-ecf4bbc4b858 | /vmfs/volumes/vsan:525bd4560356279a-3a266724259e7439/70242354-3ecb-965b-5a22-ecf4bbc4b858/view7pdcls0006-718d4b74.vswp | 2197815296B (2.05 GB) |
| c53a4054-9929-fcb5-5d17-ecf4bbc4b858 | /vmfs/volumes/vsan:525bd4560356279a-3a266724259e7439/52dc2354-d9d6-07f0-443c-ecf4bbc4b858/view8st0003-d66b41aa.vswp | 4294967296B (4.00 GB) |
| b04f4e54-1748-fbe3-7f89-ecf4bbc4aef8 | /vmfs/volumes/vsan:525bd4560356279a-3a266724259e7439/f7fb4454-46ed-6603-2b5c-ecf4bbc4aef8/view8pdCLS0009-627ab9fd.vswp | 4294967296B (4.00 GB) |
+--------------------------------------+------------------------------------------------------------------------------------------------------------------------+-----------------------+
2015-09-15 10:18:23 +0100: Ready to delete the inaccessible vswp object...
I then deleted all the objects after checking what they belonged to. (They belong to linked-clone VMs which I'm not bothered about recomposing.)
2015-09-15 10:23:14 +0100: Found 2 inaccessible objects left with only witness
but no active data components. In this case extended attributes cannot be
retrieved to determine whether these are vswp objects, so this command will
query all VMs on virtual SAN datastore to see if they are used as namespace or
virtual disk by any VMs.
+--------------------------------------+--------------------------------------+-------------------+
| Object UUID | Witness UUID | In Use by VM:Path |
+--------------------------------------+--------------------------------------+-------------------+
| b1b94b54-c02f-fa10-2de3-ecf4bbc4b858 | b1b94b54-b048-3815-0b5e-ecf4bbc4b858 | |
| 682d4d54-987c-4028-06eb-ecf4bbc4a6b0 | 682d4d54-45d6-944b-5637-ecf4bbc4a6b0 | |
+--------------------------------------+--------------------------------------+-------------------+
Found 2 objects in above table that are not used as namespace or virtual disk
by any VMs. These are possibly vswp objects. Please make sure all hosts are
connected and not running maintenance mode, and make sure all disks are
correctly plugged in and seen by virtual SAN. This way, if some data components
of an inaccessible object come back active, rerun this command and it will be
able to determine whether the object is vswp object. Otherwise, it may possibly
cause a tentative inactive data component to be deleted by forcibly deleting the
inaccessible objects. If all data components of an inaccessible object are
permanently deleted or missing, it is okay to delete the object because it will
not cause data loss by deleting the leftover witnesses.
Are you sure that you want to delete object b1b94b54-c02f-fa10-2de3-ecf4bbc4b858?
[Y] Yes [N] No [C] Cancel to all: y
That leaves me with 6 inaccessible objects after all the vswp objects have been removed. I still cannot run the on-disk upgrade and am stuck. I believe I need to remove these UUIDs manually. If I now run a check against one of them, I see the following:
/localhost/DATACENTER> vsan.cmmds_find -u a2242354-c6a4-f93b-9d4e-ecf4bbc4b858 -t LSOM_OBJECT ~cluster
+---+------+------+-------+--------+---------+
| # | Type | UUID | Owner | Health | Content |
+---+------+------+-------+--------+---------+
+---+------+------+-------+--------+---------+
/localhost/DATACENTER> vsan.cmmds_find -u a2242354-c6a4-f93b-9d4e-ecf4bbc4b858 LSOM_OBJECT ~cluster
no matches for "LSOM_OBJECT"
/localhost/DATACENTER> vsan.cmmds_find -u a2242354-c6a4-f93b-9d4e-ecf4bbc4b858 ~cluster
+---+---------------+--------------------------------------+------------------------+---------+------------------------------------------------------------------+
| # | Type | UUID | Owner | Health | Content |
+---+---------------+--------------------------------------+------------------------+---------+------------------------------------------------------------------+
| 1 | DOM_OBJECT | a2242354-c6a4-f93b-9d4e-ecf4bbc4b858 | esxihost0003.domain.com | Healthy | {"type"=>"Configuration", |
| | | | | | "attributes"=> |
| | | | | | {"CSN"=>1, |
| | | | | | "addressSpace"=>20971520, |
| | | | | | "scrubStartTime"=>1442306168623596, |
| | | | | | "muxGroup"=>506752508677059820, |
| | | | | | "compositeUuid"=>"a2242354-c6a4-f93b-9d4e-ecf4bbc4b858"}, |
| | | | | | "child-1"=> |
| | | | | | {"type"=>"RAID_1", |
| | | | | | "attributes"=>{}, |
| | | | | | "child-1"=> |
| | | | | | {"type"=>"Component", |
| | | | | | "attributes"=> |
| | | | | | {"capacity"=>20971520, |
| | | | | | "addressSpace"=>20971520, |
| | | | | | "componentState"=>5, |
| | | | | | "componentStateTS"=>1411589282, |
| | | | | | "faultDomainId"=>"53e2477c-a35d-d9f7-3f41-ecf4bbc4b858"}, |
| | | | | | "componentUuid"=>"a2242354-91fa-2546-db08-ecf4bbc4b858", |
| | | | | | "diskUuid"=>"5201a2a1-09d4-e3fd-27fb-380c286575a9"}, |
| | | | | | "child-2"=> |
| | | | | | {"type"=>"Component", |
| | | | | | "attributes"=> |
| | | | | | {"capacity"=>20971520, |
| | | | | | "addressSpace"=>20971520, |
| | | | | | "componentState"=>6, |
| | | | | | "componentStateTS"=>1442306168, |
| | | | | | "faultDomainId"=>"53e24760-dc8e-d5cd-d0f2-ecf4bbc4a6b0"}, |
| | | | | | "componentUuid"=>"a2242354-aa16-2746-ba9d-ecf4bbc4b858", |
| | | | | | "diskUuid"=>"52d74bc4-79db-fd3c-074a-b31b990ab7c6"}}, |
| | | | | | "child-2"=> |
| | | | | | {"type"=>"Witness", |
| | | | | | "attributes"=> |
| | | | | | {"componentState"=>6, |
| | | | | | "componentStateTS"=>1442306168, |
| | | | | | "isWitness"=>1, |
| | | | | | "faultDomainId"=>"53e24752-24a7-5d32-6148-ecf4bbc4aef8"}, |
| | | | | | "componentUuid"=>"a2242354-d7df-2746-b44e-ecf4bbc4b858", |
| | | | | | "diskUuid"=>"529296ed-43ac-c932-dbba-b0b39a6c51f8"}} |
| 2 | CONFIG_STATUS | a2242354-c6a4-f93b-9d4e-ecf4bbc4b858 | esxihost0003.domain.com | Healthy | {"state"=>13} |
+---+---------------+--------------------------------------+------------------------+---------+------------------------------------------------------------------+
/localhost/DATACENTER> vsan.object_info ~cluster a2242354-c6a4-f93b-9d4e-ecf4bbc4b858
DOM Object: a2242354-c6a4-f93b-9d4e-ecf4bbc4b858 (v1, owner: esxihost0003.domain.com, policy: No POLICY entry found in CMMDS)
RAID_1
Component: a2242354-91fa-2546-db08-ecf4bbc4b858 (state: ACTIVE (5), host: esxihost0003.domain.com, md: naa.5000c500768290ab, ssd: naa.5001e8200279c2c4,
votes: 1, usage: 0.0 GB)
Component: a2242354-aa16-2746-ba9d-ecf4bbc4b858 (state: ABSENT (6), host: esxihost0002.domain.com, md: naa.5000c5007680ee03, ssd: naa.5001e8200279bea0, note: LSOM object not found,
votes: 1)
Witness: a2242354-d7df-2746-b44e-ecf4bbc4b858 (state: ABSENT (6), host: esxihost0001.domain.com, md: naa.5000c50076809aff, ssd: naa.5001e8200279bfa4, note: LSOM object not found,
votes: 1)
Still a little stuck with this now. I've raised a ticket with VMware Support asking for assistance and directed them to this post.
Hopefully someone can help me...
I think it is best you wait for support, but in the meantime, what is the SR number?
Hi Duncan. Thanks, I've submitted the ticket and am waiting for a response from Support. The SR number is 15755998509.
You can, by the way, delete objects by going to a host and using the objtool delete command, but personally I would recommend you let GSS do that, because if you make a mistake you could easily mess up the whole cluster.
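For reference, and with the strong caveat above, objtool lives on the ESXi host itself rather than in RVC. From memory the invocation is roughly the following (flags from memory, so confirm the exact syntax with GSS before touching anything):
/usr/lib/vmware/osfs/bin/objtool delete -u <object-uuid> -f -v 10
where -u takes the DOM object UUID, -f forces the delete, and -v sets verbosity. Run against the wrong UUID this destroys a live object, which is exactly why I'd let GSS drive.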
Thanks again; through my research I have seen that this is possible. I discussed it with a colleague, and we both decided it would be best to have support assist with that, as we didn't fancy the risk without having someone on standby!
Fully agree, no point in taking that huge risk.
I'm intending to post an update to this soon with the resolution. I've been in talks with our support representative, but for some reason the commands to manually remove the objects did not work.
My incident has now been escalated to engineering with questions about why this is the case. It is unfortunately holding up all of our upgrade plans for this environment and for Production (which follows after). It would be nice to have a proper fix or resolution before moving forward in Production, in case we encounter the same issue there.
I am also wondering, given that this is a 3-node cluster: would it be possible to remove the host (or disk group) from the VSAN configuration and essentially wipe and re-format it? This is a bit of a "steamroller" approach to the problem, but we are keen to continue our upgrade.
Any thoughts?
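To illustrate the sort of thing I mean (command names from memory; I would obviously confirm with support before doing this), removing and re-creating a disk group on one host would look something like:
esxcli vsan storage remove -s <ssd-device>    (drops the whole disk group fronted by that SSD)
esxcli vsan storage add -s <ssd-device> -d <md-device>    (re-creates it with the magnetic disk(s))
followed by letting VSAN resync. With only 3 nodes and --allow-reduced-redundancy already in play, though, I appreciate this leaves no redundancy at all during the rebuild.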
I would talk to engineering, since your case was transferred to them. They may need to take some additional logs from your system to further understand what is going on and to reproduce it; it might also be a good idea to "image" one of the hosts in question so engineering can do some research with it. Perhaps you can come to an agreement where they give you a time frame and two options: the first would be to wait until the issue is fully reproduced and fixed, the second for them to assist you in moving the VMs to another datastore, re-creating the VirstoFS, checking it, and moving the VMs back.
Best,
Joerg
Just posting the resolution to my problem, as I eventually sorted it all last week. No point in going into detail here, but I've listed the exact issue and fix in a blog post which can be read here.
In short, we had to work with GSS to remove the objects manually from the affected node in our cluster.
Thank you all for the help involved!