Hi,
I've recently upgraded a VSAN cluster from vCS 5.5 U2 to vCS 6.0 U1. All components upgraded successfully without issue.
When trying to upgrade the VSAN on-disk filesystem from v1 to v2, I ran through all the recommended checks using the RVC console. The cluster info is fine, and the disk check also reported everything OK. However, when I run vsan.check_state I receive the following:
2015-09-15 09:42:25 +0100: Step 1: Check for inaccessible VSAN objects
Detected 16 objects to be inaccessible
Detected 373e4e54-f147-c202-e824-ecf4bbc4a6b0 on esxihost003.domain.com to be inaccessible
Detected f9b82d54-e1ba-6307-903a-ecf4bbc4aef8 on esxihost003.domain.com to be inaccessible
Detected b1b94b54-c02f-fa10-2de3-ecf4bbc4b858 on esxihost003.domain.com to be inaccessible
Detected 682d4d54-987c-4028-06eb-ecf4bbc4a6b0 on esxihost003.domain.com to be inaccessible
Detected 61603654-0838-e230-40ec-ecf4bbc4aef8 on esxihost003.domain.com to be inaccessible
Detected 07a72954-88fb-633a-f47d-ecf4bbc4aef8 on esxihost003.domain.com to be inaccessible
Detected a2242354-c6a4-f93b-9d4e-ecf4bbc4b858 on esxihost003.domain.com to be inaccessible
Detected 56dc2354-88e8-d141-c60b-ecf4bbc4b858 on esxihost003.domain.com to be inaccessible
Detected 39554f54-4fc7-4863-fd43-ecf4bbc4a6b0 on esxihost003.domain.com to be inaccessible
Detected 51dc2354-c5bd-7883-34ef-ecf4bbc4a6b0 on esxihost003.domain.com to be inaccessible
Detected 264b3d54-497b-6da6-440d-ecf4bbc4b858 on esxihost003.domain.com to be inaccessible
Detected c53a4054-9929-fcb5-5d17-ecf4bbc4b858 on esxihost003.domain.com to be inaccessible
Detected 57dc2354-9845-ddcf-ea9e-ecf4bbc4a6b0 on esxihost003.domain.com to be inaccessible
Detected 57dc2354-a496-e2df-6548-ecf4bbc4a6b0 on esxihost003.domain.com to be inaccessible
Detected b04f4e54-1748-fbe3-7f89-ecf4bbc4aef8 on esxihost003.domain.com to be inaccessible
Detected 86422454-35ad-2efa-ff43-ecf4bbc4aef8 on esxihost003.domain.com to be inaccessible
2015-09-15 09:42:26 +0100: Step 2: Check for invalid/inaccessible VMs
2015-09-15 09:42:26 +0100: Step 3: Check for VMs for which VC/hostd/vmx are out of sync
Did not find VMs for which VC/hostd/vmx are out of sync
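(For reference, the pre-upgrade checks I ran from RVC were along these lines; exact invocations from memory, so verify against your RVC version:
vsan.cluster_info ~cluster
vsan.disks_stats ~cluster
vsan.check_state ~cluster
The first two came back clean; only check_state reported the problems above.)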
If I query one of the object UUIDs, I receive the following:
/localhost/DATACENTER> vsan.object_info ~cluster/ 682d4d54-987c-4028-06eb-ecf4bbc4a6b0
2015-09-15 09:34:02 +0100: Fetching VSAN disk info from esxihost003.domain.com (may take a moment) ...
2015-09-15 09:34:02 +0100: Fetching VSAN disk info from esxihost002.domain.com (may take a moment) ...
2015-09-15 09:34:02 +0100: Fetching VSAN disk info from esxihost001.domain.com (may take a moment) ...
2015-09-15 09:34:05 +0100: Done fetching VSAN disk infos
DOM Object: 682d4d54-987c-4028-06eb-ecf4bbc4a6b0 (v1, owner: esxihost003.domain.com, policy: No POLICY entry found in CMMDS)
RAID_1
Component: 682d4d54-9ab3-924b-8d06-ecf4bbc4a6b0 (state: ABSENT (6), host: esxihost002.domain.com, md: naa.5000c5007680bc7b, ssd: naa.5001e8200279bea0, note: LSOM object not found,
votes: 1)
Component: 682d4d54-9ffb-934b-d42c-ecf4bbc4a6b0 (state: ABSENT (6), host: esxihost001.domain.com, md: naa.5000c500767f53d3, ssd: naa.5001e8200279bfa4, note: LSOM object not found,
votes: 1)
Witness: 682d4d54-45d6-944b-5637-ecf4bbc4a6b0 (state: ABSENT (6), host: esxihost003.domain.com, md: naa.5000c5007680a9af, ssd: naa.5001e8200279c2c4,
votes: 1, usage: 0.0 GB)
Is this something I need to be concerned about? I do not believe I'll be able to proceed with the upgrade while these objects are inaccessible. I can see that there is a command to purge vswap files, but I do not believe that is related.
Can anyone assist?
As an update on my progress: I tried to run the on-disk upgrade with vsan.v2_ondisk_upgrade --allow-reduced-redundancy ~cluster/ (I only have 3 nodes in this environment).
I then decided to use the purge vswap command as prompted:
/localhost/Astro House> vsan.purge_inaccessible_vswp_objects ~cluster/
2015-09-15 10:18:21 +0100: Collecting all inaccessible Virtual SAN objects...
2015-09-15 10:18:21 +0100: Found 16 inaccessbile objects.
2015-09-15 10:18:21 +0100: Selecting vswp objects from inaccessible objects by checking their extended attributes...
2015-09-15 10:18:23 +0100: Found 8 inaccessible vswp objects.
+--------------------------------------+------------------------------------------------------------------------------------------------------------------------+-----------------------+
| Object UUID | Object Path | Size |
+--------------------------------------+------------------------------------------------------------------------------------------------------------------------+-----------------------+
| 373e4e54-f147-c202-e824-ecf4bbc4a6b0 | /vmfs/volumes/vsan:525bd4560356279a-3a266724259e7439/7ae04454-9c87-b751-1c3f-ecf4bbc4a6b0/view8pdCLS0007-9a1d7133.vswp | 4294967296B (4.00 GB) |
| f9b82d54-e1ba-6307-903a-ecf4bbc4aef8 | /vmfs/volumes/vsan:525bd4560356279a-3a266724259e7439/39432454-c2ed-3aa7-bc6a-ecf4bbc4aef8/view8ppdst0004-96d19321.vswp | 4294967296B (4.00 GB) |
| 61603654-0838-e230-40ec-ecf4bbc4aef8 | /vmfs/volumes/vsan:525bd4560356279a-3a266724259e7439/f02d3454-54fb-8e8a-9742-ecf4bbc4aef8/view8ppdst0006-b1510132.vswp | 4294967296B (4.00 GB) |
| 07a72954-88fb-633a-f47d-ecf4bbc4aef8 | /vmfs/volumes/vsan:525bd4560356279a-3a266724259e7439/2a422454-c645-1a34-fa74-ecf4bbc4aef8/view8ppdst0001-cd35ca2e.vswp | 4294967296B (4.00 GB) |
| 39554f54-4fc7-4863-fd43-ecf4bbc4a6b0 | /vmfs/volumes/vsan:525bd4560356279a-3a266724259e7439/8cdc4454-5abc-de04-ffed-ecf4bbc4b858/view10st0004-f6688e6c.vswp | 4294967296B (4.00 GB) |
| 264b3d54-497b-6da6-440d-ecf4bbc4b858 | /vmfs/volumes/vsan:525bd4560356279a-3a266724259e7439/70242354-3ecb-965b-5a22-ecf4bbc4b858/view7pdcls0006-718d4b74.vswp | 2197815296B (2.05 GB) |
| c53a4054-9929-fcb5-5d17-ecf4bbc4b858 | /vmfs/volumes/vsan:525bd4560356279a-3a266724259e7439/52dc2354-d9d6-07f0-443c-ecf4bbc4b858/view8st0003-d66b41aa.vswp | 4294967296B (4.00 GB) |
| b04f4e54-1748-fbe3-7f89-ecf4bbc4aef8 | /vmfs/volumes/vsan:525bd4560356279a-3a266724259e7439/f7fb4454-46ed-6603-2b5c-ecf4bbc4aef8/view8pdCLS0009-627ab9fd.vswp | 4294967296B (4.00 GB) |
+--------------------------------------+------------------------------------------------------------------------------------------------------------------------+-----------------------+
2015-09-15 10:18:23 +0100: Ready to delete the inaccessible vswp object...
I then deleted all the objects after checking what they belonged to. (They belong to linked-clone VMs which I'm not bothered about recomposing.)
2015-09-15 10:23:14 +0100: Found 2 inaccessible objects left with only witness
but no active data components. In this case extended attributes cannot be
retrieved to determine whether these are vswp objects, so this command will
query all VMs on virtual SAN datastore to see if they are used as namespace or
virtual disk by any VMs.
+--------------------------------------+--------------------------------------+-------------------+
| Object UUID | Witness UUID | In Use by VM:Path |
+--------------------------------------+--------------------------------------+-------------------+
| b1b94b54-c02f-fa10-2de3-ecf4bbc4b858 | b1b94b54-b048-3815-0b5e-ecf4bbc4b858 | |
| 682d4d54-987c-4028-06eb-ecf4bbc4a6b0 | 682d4d54-45d6-944b-5637-ecf4bbc4a6b0 | |
+--------------------------------------+--------------------------------------+-------------------+
Found 2 objects in above table that are not used as namespace or virtual disk
by any VMs. These are possibly vswp objects. Please make sure all hosts are
connected and not running maintenance mode, and make sure all disks are
correctly plugged in and seen by virtual SAN. This way, if some data components
of an inaccessible object come back active, rerun this command and it will be
able to determine whether the object is vswp object. Otherwise, it may possibly
cause a tentative inactive data component to be deleted by forcibly deleting the
inaccessible objects. If all data components of an inaccessible object are
permanently deleted or missing, it is okay to delete the object because it will
not cause data loss by deleting the leftover witnesses.
Are you sure that you want to delete object b1b94b54-c02f-fa10-2de3-ecf4bbc4b858?
[Y] Yes [N] No [C] Cancel to all: y
That leaves me with 6 inaccessible objects after all the vswp objects have been removed. I still cannot run the on-disk upgrade and am stuck. I believe I need to remove these UUIDs manually. If I now run a check against one of them, I see the following:
/localhost/DATACENTER> vsan.cmmds_find -u a2242354-c6a4-f93b-9d4e-ecf4bbc4b858 -t LSOM_OBJECT ~cluster
+---+------+------+-------+--------+---------+
| # | Type | UUID | Owner | Health | Content |
+---+------+------+-------+--------+---------+
+---+------+------+-------+--------+---------+
/localhost/DATACENTER> vsan.cmmds_find -u a2242354-c6a4-f93b-9d4e-ecf4bbc4b858 LSOM_OBJECT ~cluster
no matches for "LSOM_OBJECT"
/localhost/DATACENTER> vsan.cmmds_find -u a2242354-c6a4-f93b-9d4e-ecf4bbc4b858 ~cluster
+---+---------------+--------------------------------------+------------------------+---------+------------------------------------------------------------------+
| # | Type | UUID | Owner | Health | Content |
+---+---------------+--------------------------------------+------------------------+---------+------------------------------------------------------------------+
| 1 | DOM_OBJECT | a2242354-c6a4-f93b-9d4e-ecf4bbc4b858 | esxihost0003.domain.com | Healthy | {"type"=>"Configuration", |
| | | | | | "attributes"=> |
| | | | | | {"CSN"=>1, |
| | | | | | "addressSpace"=>20971520, |
| | | | | | "scrubStartTime"=>1442306168623596, |
| | | | | | "muxGroup"=>506752508677059820, |
| | | | | | "compositeUuid"=>"a2242354-c6a4-f93b-9d4e-ecf4bbc4b858"}, |
| | | | | | "child-1"=> |
| | | | | | {"type"=>"RAID_1", |
| | | | | | "attributes"=>{}, |
| | | | | | "child-1"=> |
| | | | | | {"type"=>"Component", |
| | | | | | "attributes"=> |
| | | | | | {"capacity"=>20971520, |
| | | | | | "addressSpace"=>20971520, |
| | | | | | "componentState"=>5, |
| | | | | | "componentStateTS"=>1411589282, |
| | | | | | "faultDomainId"=>"53e2477c-a35d-d9f7-3f41-ecf4bbc4b858"}, |
| | | | | | "componentUuid"=>"a2242354-91fa-2546-db08-ecf4bbc4b858", |
| | | | | | "diskUuid"=>"5201a2a1-09d4-e3fd-27fb-380c286575a9"}, |
| | | | | | "child-2"=> |
| | | | | | {"type"=>"Component", |
| | | | | | "attributes"=> |
| | | | | | {"capacity"=>20971520, |
| | | | | | "addressSpace"=>20971520, |
| | | | | | "componentState"=>6, |
| | | | | | "componentStateTS"=>1442306168, |
| | | | | | "faultDomainId"=>"53e24760-dc8e-d5cd-d0f2-ecf4bbc4a6b0"}, |
| | | | | | "componentUuid"=>"a2242354-aa16-2746-ba9d-ecf4bbc4b858", |
| | | | | | "diskUuid"=>"52d74bc4-79db-fd3c-074a-b31b990ab7c6"}}, |
| | | | | | "child-2"=> |
| | | | | | {"type"=>"Witness", |
| | | | | | "attributes"=> |
| | | | | | {"componentState"=>6, |
| | | | | | "componentStateTS"=>1442306168, |
| | | | | | "isWitness"=>1, |
| | | | | | "faultDomainId"=>"53e24752-24a7-5d32-6148-ecf4bbc4aef8"}, |
| | | | | | "componentUuid"=>"a2242354-d7df-2746-b44e-ecf4bbc4b858", |
| | | | | | "diskUuid"=>"529296ed-43ac-c932-dbba-b0b39a6c51f8"}} |
| 2 | CONFIG_STATUS | a2242354-c6a4-f93b-9d4e-ecf4bbc4b858 | esxihost0003.domain.com | Healthy | {"state"=>13} |
+---+---------------+--------------------------------------+------------------------+---------+------------------------------------------------------------------+
/localhost/DATACENTER> vsan.object_info ~cluster a2242354-c6a4-f93b-9d4e-ecf4bbc4b858
DOM Object: a2242354-c6a4-f93b-9d4e-ecf4bbc4b858 (v1, owner: esxihost0003.domain.com, policy: No POLICY entry found in CMMDS)
RAID_1
Component: a2242354-91fa-2546-db08-ecf4bbc4b858 (state: ACTIVE (5), host: esxihost0003.domain.com, md: naa.5000c500768290ab, ssd: naa.5001e8200279c2c4,
votes: 1, usage: 0.0 GB)
Component: a2242354-aa16-2746-ba9d-ecf4bbc4b858 (state: ABSENT (6), host: esxihost0002.domain.com, md: naa.5000c5007680ee03, ssd: naa.5001e8200279bea0, note: LSOM object not found,
votes: 1)
Witness: a2242354-d7df-2746-b44e-ecf4bbc4b858 (state: ABSENT (6), host: esxihost0001.domain.com, md: naa.5000c50076809aff, ssd: naa.5001e8200279bfa4, note: LSOM object not found,
votes: 1)
Still a little stuck with this now. I've raised a ticket with VMware Support asking for assistance and directed them to this post.
Hopefully someone can help me...
I think it is best you wait for support, but in the meantime, what is the SR number?
Hi Duncan. Thanks, I've submitted the ticket and am waiting for a response from Support. The SR number is 15755998509.
You can, by the way, delete objects by going to a host and using the objtool delete command, but personally I would recommend you let GSS do that, because if you make a mistake you could easily mess up the whole cluster.
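For reference, and with the strong caveat above, objtool lives on the ESXi host itself rather than in RVC. From memory the invocation is roughly the following (flags from memory, so confirm the exact syntax with GSS before touching anything):
/usr/lib/vmware/osfs/bin/objtool delete -u <object-uuid> -f -v 10
where -u takes the DOM object UUID, -f forces the delete, and -v sets verbosity. Run against the wrong UUID this destroys a live object, which is exactly why I'd let GSS drive.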
Thanks again; through my research I have seen that this is possible. I discussed it with a colleague, and we both decided it would be best to have support assist with that, as we didn't fancy the risk without having someone on standby!
Fully agree, no point in taking that huge risk.
I'm intending to post an update to this soon with the resolution. I've been in talks with our support representative, but for some reason the commands to manually remove the objects did not work.
My incident has now been escalated to engineering with questions about why this is the case. It is unfortunately holding up all of our upgrade plans for this environment and for Production (which follows after). It would be nice to have a proper fix or resolution before moving forward in Production, in case we encounter the same issue there.
I am also wondering, given that this is a 3-node cluster: would it be possible to remove the host (or disk group) from the VSAN configuration and essentially wipe and re-format it? This is a bit of a "steamroller" approach to the problem, but we are keen to continue our upgrade.
Any thoughts?
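To illustrate the sort of thing I mean (command names from memory; I would obviously confirm with support before doing this), removing and re-creating a disk group on one host would look something like:
esxcli vsan storage remove -s <ssd-device>    (drops the whole disk group fronted by that SSD)
esxcli vsan storage add -s <ssd-device> -d <md-device>    (re-creates it with the magnetic disk(s))
followed by letting VSAN resync. With only 3 nodes and --allow-reduced-redundancy already in play, though, I appreciate this leaves no redundancy at all during the rebuild.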
I would talk to engineering, since your case was transferred to them. They may need to take some additional logs from your system to further understand what is going on and to reproduce it; it might also be a good idea to "image" one of the hosts in question so engineering can do some research with it. Perhaps you can come to an agreement where they give you a time frame and two options: the first would be to wait until the issue is fully reproduced and fixed, the second for them to assist you in moving the VMs to another datastore, re-creating the VirstoFS, checking it, and moving the VMs back.
Best,
Joerg
Just posting the resolution to my problem, as I eventually sorted it all last week. No point in going into detail here, but I've listed the exact issue and fix in a blog post which can be read here.
In short, we had to work with GSS to remove the objects manually from the affected node in our cluster.
Thank you all for the help involved!