VMware Cloud Community
JeremeyWise
Enthusiast

vSAN Pickle - EVC Move

I think I got myself into a pickle here and need to figure out a way to get out of it.

Inventory:

Three physical nodes. Two are Sandy Bridge, one is Westmere (did not know that when I built it out).

vSAN with SSD and 10Gb.

vSphere 7 and vCenter 7 latest.

One cluster, one vSAN; vCenter runs as a VM on the vSAN datastore.

Single DVSwitch for mgmt and guest VM port groups. One standard switch dedicated to 10Gb vSAN / vMotion.

 

###

Issue: when I created the cluster I did not enable EVC, and vMotion is now broken. Trying to fix it, I shut down all VMs except vCenter, but it would not let me change the EVC mode of the existing cluster (it kept saying the nodes were not in maintenance mode).

 

I created a new cluster, put one node into maintenance mode, and moved it to that new cluster with EVC mode set correctly (Westmere as the lowest common denominator for these hosts). Then I took it out of maintenance mode.

 

I took the second node, without ANY VMs on it, and put it into maintenance mode to move it to the new cluster (the goal being a rolling move to the new EVC cluster). But once the second node went into maintenance mode, vCenter went down (vSAN insisted on entering maintenance mode along with the host).

I should have known better. Now I am trying to get the hosts back up. Here is the state:

###########

[root@odin:~] esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates
Value of IgnoreClusterMemberListUpdates is 1
[root@odin:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2021-08-26T15:09:23Z
Local Node UUID: 60f58b67-5dc7-eada-b179-a0423f35e8ee
Local Node Type: NORMAL
Local Node State: BACKUP
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 60f591d6-36a6-1390-8715-98be9459fea0
Sub-Cluster Backup UUID: 60f58b67-5dc7-eada-b179-a0423f35e8ee
Sub-Cluster UUID: 52f2089b-1819-0833-66e0-5c9b09f7312b
Sub-Cluster Membership Entry Revision: 1
Sub-Cluster Member Count: 2
Sub-Cluster Member UUIDs: 60f591d6-36a6-1390-8715-98be9459fea0, 60f58b67-5dc7-eada-b179-a0423f35e8ee
Sub-Cluster Member HostNames: medusa.penguinpages.local, odin.penguinpages.local
Sub-Cluster Membership UUID: 1da62761-d003-66d8-2ea5-98be9459fea0
Unicast Mode Enabled: true
Maintenance Mode State: ON
Config Generation: d55a702d-74ec-462f-9f64-9a3e7a82144e 14 2021-08-26T13:50:11.716
Mode: REGULAR

####

[root@thor:~] esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates
Value of IgnoreClusterMemberListUpdates is 1
[root@thor:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2021-08-26T15:09:25Z
Local Node UUID: 60f584a0-1d04-3c42-154b-a0423f377a7e
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 60f584a0-1d04-3c42-154b-a0423f377a7e
Sub-Cluster Backup UUID:
Sub-Cluster UUID: 527f51ca-7912-41e1-ff9b-a4482942519d
Sub-Cluster Membership Entry Revision: 0
Sub-Cluster Member Count: 1
Sub-Cluster Member UUIDs: 60f584a0-1d04-3c42-154b-a0423f377a7e
Sub-Cluster Member HostNames: thor.penguinpages.local
Sub-Cluster Membership UUID: 5ba62761-ea26-404e-2a01-a0423f377a7e
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: d55a702d-74ec-462f-9f64-9a3e7a82144e 2 2021-08-26T13:46:18.927
Mode: REGULAR
[root@thor:~]

###

[root@medusa:~] esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates
Value of IgnoreClusterMemberListUpdates is 1
[root@medusa:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2021-08-26T15:09:18Z
Local Node UUID: 60f591d6-36a6-1390-8715-98be9459fea0
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 60f591d6-36a6-1390-8715-98be9459fea0
Sub-Cluster Backup UUID: 60f58b67-5dc7-eada-b179-a0423f35e8ee
Sub-Cluster UUID: 52f2089b-1819-0833-66e0-5c9b09f7312b
Sub-Cluster Membership Entry Revision: 1
Sub-Cluster Member Count: 2
Sub-Cluster Member UUIDs: 60f591d6-36a6-1390-8715-98be9459fea0, 60f58b67-5dc7-eada-b179-a0423f35e8ee
Sub-Cluster Member HostNames: medusa.penguinpages.local, odin.penguinpages.local
Sub-Cluster Membership UUID: 1da62761-d003-66d8-2ea5-98be9459fea0
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: d55a702d-74ec-462f-9f64-9a3e7a82144e 14 2021-08-26T13:50:11.709
Mode: REGULAR
[root@medusa:~]

 

 

Suggestions?

 

What would make this easiest is if the ESXi hosts booted up thinking they are under "newcluster" with its EVC requirement, but I see no CLI means to do that...


Nerd needing coffee
TheBobkin
Champion

@JeremeyWise, vCenter went down because you placed the node in Maintenance Mode with 'No Action' option while the data was functionally FTT=0 - that could be easily remediated by simply taking Odin out of MM.

 

Anyway, that won't really help here as you still won't be able to vMotion vCenter to the lower EVC cluster - better would be just to get all the nodes joined to the new cluster and power-on all the VMs, this is quite simple to achieve:

 

On both Medusa and Odin:

# esxcli vsan cluster leave
# esxcli vsan cluster join -u 527f51ca-7912-41e1-ff9b-a4482942519d
# esxcli vsan cluster unicastagent add -U 1 -t node -u 60f584a0-1d04-3c42-154b-a0423f377a7e -a <vSAN-IP-of-Thor>

 

On Thor:
# esxcli vsan cluster unicastagent add -U 1 -t node -u 60f591d6-36a6-1390-8715-98be9459fea0 -a <vSAN-IP-of-Medusa>
# esxcli vsan cluster unicastagent add -U 1 -t node -u 60f58b67-5dc7-eada-b179-a0423f35e8ee -a <vSAN-IP-of-Odin>

More information on these commands:
https://kb.vmware.com/s/article/2150303
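To sanity-check that the cluster re-forms after these steps, the 'key: value' output of `esxcli vsan cluster get` (as shown elsewhere in this thread) can be parsed with a few lines of scripting - a rough sketch, not any official tooling:

```python
# Sketch: parse the "Key: Value" lines of `esxcli vsan cluster get` output
# and check the cluster has fully formed. Sample text is trimmed from the
# real output posted in this thread.

def parse_cluster_get(output: str) -> dict[str, str]:
    """Split each line on its first colon; skip lines without one."""
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info

sample = """Cluster Information
Enabled: true
Local Node State: MASTER
Sub-Cluster UUID: 527f51ca-7912-41e1-ff9b-a4482942519d
Sub-Cluster Member Count: 3
Maintenance Mode State: OFF"""

info = parse_cluster_get(sample)
# A formed 3-node cluster should report all members and the target UUID.
assert info["Sub-Cluster Member Count"] == "3"
assert info["Sub-Cluster UUID"] == "527f51ca-7912-41e1-ff9b-a4482942519d"
```

Run this against each host's output; if "Sub-Cluster Member Count" is not 3 on every node, the unicastagent lists still need fixing.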

 

Odin may or may not be just in vSAN Decom-State but not in vSphere MM - if this is the case then that can be remediated by putting it in vSphere MM (No Action option) and then taking it out of MM.

 

Move Medusa and Odin into the new vSphere cluster (should be possible one at a time using MM with the EA option, or you can use the disconnect, remove and re-add trick). vSAN Health will complain about 'vCenter is Authoritative' (because the unicastagent list was changed manually) - set IgnoreClusterMemberListUpdates back to 0 on all nodes and then click 'Update ESXi configuration' (or whatever the button is called in your version). If the lists were updated correctly, this shouldn't cause a cluster partition nor change anything really (maybe it will populate the vmk, port and cert fields, but these are not mandatory when manually configuring).

 

Let me know how the above goes.

 

EDIT: I used incorrect UUIDs for populating unicastagent entries, corrected this.

JeremeyWise
Enthusiast

Very much appreciated. I am still learning which sets of commands put data at risk.

 

 

Let me post back what I believe the commands did so I can learn here a bit:

[root@thor:~]
[root@thor:~] esxcli vsan cluster unicastagent add -U 1 -t node -u 60f591d6-36a6-1390-8715-98be9459fea0 -a 172.16.100.102  # 60f591d6-36a6-1390-8715-98be9459fea0 is ID of odin and medusa. Add that cluster ID.. then specify by IP odin on 10Gb to join
[root@thor:~] esxcli vsan cluster unicastagent add -U 1 -t node -u 60f591d6-36a6-1390-8715-98be9459fea0 -a 172.16.100.103 ## same as first command just to join Medusa to join
[root@thor:~]

 

 

# Below  Commands:

[root@odin:~] esxcli vsan cluster leave   ###  Leave cluster 52f2089b-1819-0833-66e0-5c9b09f7312b  which odin and medusa are in
[root@odin:~] esxcli vsan cluster join -u 527f51ca-7912-41e1-ff9b-a4482942519d  ## Join Thor's cluster UUID. Node moved to the new datacenter EVC-enabled cluster.
[root@odin:~] esxcli vsan cluster unicastagent add -U 1 -t node -u 60f584a0-1d04-3c42-154b-a0423f377a7e -a 172.16.101.101  ## tell vSAN to reestablish connection to node 60f584a0-1d04-3c42-154b-a0423f377a7e, which is thor's ID, via the 10Gb NIC on IP 172.16.101.101
[root@odin:~]

[root@medusa:~] esxcli vsan cluster leave  ## Same as odin  ##
[root@medusa:~] esxcli vsan cluster join -u 527f51ca-7912-41e1-ff9b-a4482942519d  ## Same as odin  ##

[root@medusa:~] esxcli vsan cluster unicastagent add -U 1 -t node -u 60f584a0-1d04-3c42-154b-a0423f377a7e -a 172.16.101.101  ## Same as odin  ##

 

VMs did not come back online, so I rebooted the hosts, thinking this would let vSAN restart the services it needed. The nodes came back:

[root@thor:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2021-08-27T20:35:24Z
Local Node UUID: 60f584a0-1d04-3c42-154b-a0423f377a7e
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 60f584a0-1d04-3c42-154b-a0423f377a7e
Sub-Cluster Backup UUID:
Sub-Cluster UUID: 527f51ca-7912-41e1-ff9b-a4482942519d
Sub-Cluster Membership Entry Revision: 0
Sub-Cluster Member Count: 1
Sub-Cluster Member UUIDs: 60f584a0-1d04-3c42-154b-a0423f377a7e
Sub-Cluster Member HostNames: thor.penguinpages.local
Sub-Cluster Membership UUID: 1c4b2961-a063-d046-5df1-a0423f377a7e
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: d55a702d-74ec-462f-9f64-9a3e7a82144e 4 2021-08-27T20:17:37.0
Mode: REGULAR
[root@thor:~]

[root@odin:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2021-08-27T20:35:30Z
Local Node UUID: 60f58b67-5dc7-eada-b179-a0423f35e8ee
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 60f58b67-5dc7-eada-b179-a0423f35e8ee
Sub-Cluster Backup UUID:
Sub-Cluster UUID: 527f51ca-7912-41e1-ff9b-a4482942519d
Sub-Cluster Membership Entry Revision: 0
Sub-Cluster Member Count: 1
Sub-Cluster Member UUIDs: 60f58b67-5dc7-eada-b179-a0423f35e8ee
Sub-Cluster Member HostNames: odin.penguinpages.local
Sub-Cluster Membership UUID: 164b2961-740c-aafa-c860-a0423f35e8ee
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: 60f58b67-5dc7-eada-b179-a0423f35e8ee 1 2021-08-27T20:16:08.0
Mode: REGULAR
[root@odin:~]

[root@medusa:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2021-08-27T20:35:34Z
Local Node UUID: 60f591d6-36a6-1390-8715-98be9459fea0
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 60f591d6-36a6-1390-8715-98be9459fea0
Sub-Cluster Backup UUID:
Sub-Cluster UUID: 527f51ca-7912-41e1-ff9b-a4482942519d
Sub-Cluster Membership Entry Revision: 0
Sub-Cluster Member Count: 1
Sub-Cluster Member UUIDs: 60f591d6-36a6-1390-8715-98be9459fea0
Sub-Cluster Member HostNames: medusa.penguinpages.local
Sub-Cluster Membership UUID: eb4a2961-1b56-4741-ad68-98be9459fea0
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: 60f591d6-36a6-1390-8715-98be9459fea0 1 2021-08-27T20:17:28.0
Mode: REGULAR
[root@medusa:~]

 

I think this means each of the three nodes considers itself independent. I think I need to run a similar sequence again to get them all into the ONE same "subcluster ID". All three system states are "Normal / Master / Healthy".

Really appreciate the response. And if there is some RTFM on this, please post.


Nerd needing coffee
TheBobkin
Champion

@JeremeyWise 

No, you did that incorrectly - I specified to input the UUIDs as they are, which were correct ("60f591d6-36a6-1390-8715-98be9459fea0" is the node-UUID of Medusa, "60f58b67-5dc7-eada-b179-a0423f35e8ee" is the node-UUID of Odin), and to substitute just the IPs. You used the same node-UUID with 2 different IPs, which makes no sense. Remove the incorrect entry and re-add it on Thor:

# esxcli vsan cluster unicastagent remove -a 172.16.100.102 #(ASSUMING Medusa is 172.16.100.103)
# esxcli vsan cluster unicastagent add -U 1 -t node -u 60f58b67-5dc7-eada-b179-a0423f35e8ee -a 172.16.100.102

JeremeyWise
Enthusiast

I think I follow.

SubClusterUuid is "527f51ca-7912-41e1-ff9b-a4482942519d", which all members of the cluster share in common; each node has a unique NodeUuid that has to be in the list with its defined replication IP.

 

My site

60f584a0-1d04-3c42-154b-a0423f377a7e  -> thor -> 172.16.101.101 

60f58b67-5dc7-eada-b179-a0423f35e8ee  -> odin -> 172.16.101.102

60f591d6-36a6-1390-8715-98be9459fea0  -> medusa -> 172.16.101.103

527f51ca-7912-41e1-ff9b-a4482942519d   --> Cluster ID for "vSAN"

I cleaned up:

[root@thor:~] esxcli vsan cluster unicastagent remove -a 172.16.100.102

 

## The list is now correct

[root@thor:~] esxcli vsan cluster unicastagent add -U 1 -t node -u 60f584a0-1d04-3c42-154b-a0423f377a7e -a 172.16.101.101
[root@thor:~] esxcli vsan cluster unicastagent list
NodeUuid IsWitness Supports Unicast IP Address Port Iface Name Cert Thumbprint SubClusterUuid
------------------------------------ --------- ---------------- -------------- ----- ---------- --------------- --------------
60f591d6-36a6-1390-8715-98be9459fea0 0 true 172.16.101.103 12321 527f51ca-7912-41e1-ff9b-a4482942519d
60f58b67-5dc7-eada-b179-a0423f35e8ee 0 true 172.16.101.102 12321 527f51ca-7912-41e1-ff9b-a4482942519d
[root@thor:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2021-08-28T02:19:10Z
Local Node UUID: 60f584a0-1d04-3c42-154b-a0423f377a7e
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 60f584a0-1d04-3c42-154b-a0423f377a7e
Sub-Cluster Backup UUID: 60f591d6-36a6-1390-8715-98be9459fea0
Sub-Cluster UUID: 527f51ca-7912-41e1-ff9b-a4482942519d
Sub-Cluster Membership Entry Revision: 2
Sub-Cluster Member Count: 3
Sub-Cluster Member UUIDs: 60f584a0-1d04-3c42-154b-a0423f377a7e, 60f591d6-36a6-1390-8715-98be9459fea0, 60f58b67-5dc7-eada-b179-a0423f35e8ee
Sub-Cluster Member HostNames: thor.penguinpages.local, medusa.penguinpages.local, odin.penguinpages.local
Sub-Cluster Membership UUID: 1c4b2961-a063-d046-5df1-a0423f377a7e
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: d55a702d-74ec-462f-9f64-9a3e7a82144e 14 2021-08-28T02:18:48.0
Mode: REGULAR
[root@thor:~]

[root@odin:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2021-08-28T02:24:04Z
Local Node UUID: 60f58b67-5dc7-eada-b179-a0423f35e8ee
Local Node Type: NORMAL
Local Node State: AGENT
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 60f584a0-1d04-3c42-154b-a0423f377a7e
Sub-Cluster Backup UUID: 60f591d6-36a6-1390-8715-98be9459fea0
Sub-Cluster UUID: 527f51ca-7912-41e1-ff9b-a4482942519d
Sub-Cluster Membership Entry Revision: 2
Sub-Cluster Member Count: 3
Sub-Cluster Member UUIDs: 60f584a0-1d04-3c42-154b-a0423f377a7e, 60f591d6-36a6-1390-8715-98be9459fea0, 60f58b67-5dc7-eada-b179-a0423f35e8ee
Sub-Cluster Member HostNames: thor.penguinpages.local, medusa.penguinpages.local, odin.penguinpages.local
Sub-Cluster Membership UUID: 1c4b2961-a063-d046-5df1-a0423f377a7e
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: 60f58b67-5dc7-eada-b179-a0423f35e8ee 1 2021-08-27T20:16:08.0
Mode: REGULAR
[root@odin:~] esxcli vsan cluster unicastagent list
NodeUuid IsWitness Supports Unicast IP Address Port Iface Name Cert Thumbprint SubClusterUuid
------------------------------------ --------- ---------------- -------------- ----- ---------- --------------- --------------
60f584a0-1d04-3c42-154b-a0423f377a7e 0 true 172.16.101.101 12321 527f51ca-7912-41e1-ff9b-a4482942519d
[root@odin:~]

[root@medusa:~] esxcli vsan cluster unicastagent list
NodeUuid IsWitness Supports Unicast IP Address Port Iface Name Cert Thumbprint SubClusterUuid
------------------------------------ --------- ---------------- -------------- ----- ---------- --------------- --------------
60f584a0-1d04-3c42-154b-a0423f377a7e 0 true 172.16.101.101 12321 527f51ca-7912-41e1-ff9b-a4482942519d
[root@medusa:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2021-08-28T02:24:37Z
Local Node UUID: 60f591d6-36a6-1390-8715-98be9459fea0
Local Node Type: NORMAL
Local Node State: BACKUP
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 60f584a0-1d04-3c42-154b-a0423f377a7e
Sub-Cluster Backup UUID: 60f591d6-36a6-1390-8715-98be9459fea0
Sub-Cluster UUID: 527f51ca-7912-41e1-ff9b-a4482942519d
Sub-Cluster Membership Entry Revision: 2
Sub-Cluster Member Count: 3
Sub-Cluster Member UUIDs: 60f584a0-1d04-3c42-154b-a0423f377a7e, 60f591d6-36a6-1390-8715-98be9459fea0, 60f58b67-5dc7-eada-b179-a0423f35e8ee
Sub-Cluster Member HostNames: thor.penguinpages.local, medusa.penguinpages.local, odin.penguinpages.local
Sub-Cluster Membership UUID: 1c4b2961-a063-d046-5df1-a0423f377a7e
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: 60f591d6-36a6-1390-8715-98be9459fea0 1 2021-08-27T20:17:28.0
Mode: REGULAR
[root@medusa:~] esxcli vsan cluster unicastagent list
NodeUuid IsWitness Supports Unicast IP Address Port Iface Name Cert Thumbprint SubClusterUuid
------------------------------------ --------- ---------------- -------------- ----- ---------- --------------- --------------
60f584a0-1d04-3c42-154b-a0423f377a7e 0 true 172.16.101.101 12321 527f51ca-7912-41e1-ff9b-a4482942519d
[root@medusa:~]

# I rebooted after the above, but "Virtual Machines" are still listed like "/vmfs/volumes/vsan:52f2089b18190833-66e05c9b09f7312b/....." and if you browse the datastore "vsanDatastore (1)" it shows folders as number strings with no content.

## Seems to be what is expected: all three hosts' UUIDs are in the "Sub-Cluster Member UUIDs" list. The only odd thing is that thor lists two entries while the other two hosts only know of thor as a target IP. Do I need to run the command on odin and medusa pointing to the other nodes in a mesh format?

Example: above odin knows of thor but I need to add medusa:

[root@odin:~] esxcli vsan cluster unicastagent add -U 1 -t node -u 60f591d6-36a6-1390-8715-98be9459fea0 -a 172.16.101.103

 

Example: above medusa knows of thor but I need to add odin:

[root@medusa:~] esxcli vsan cluster unicastagent add -U 1 -t node -u 60f591d6-36a6-1390-8715-98be9459fea0 -a 172.16.101.102

 

It all looked clean then, but no change: VMs are still not showing up with correct names / inventory. I placed all three in maintenance mode and took them back out; that had no effect.


Nerd needing coffee
JeremeyWise
Enthusiast

 

I really appreciate VMware support offering help on this kind of community forum. Hoping NOT to just wipe and rebuild, but to learn repair, as well as show what not to do 😛 so others can learn also.

<<< Updates>>>

 

 

Update: 

With each of the three nodes listing the IPs and UUIDs of all three nodes in the same context, but vSAN not coming back online (even after a reboot), I went down the path of deploying a temporary vCenter VM (IP only, as the DNS servers are down). "vcenter02" now has all three hosts in a correctly EVC-enabled cluster.

 

I ran the wizard to allow the new vCenter system to take over vSAN and then told it to remediate the hosts. Below is an image with details of where it stands and how things were done:

vSANrepair.png

Please feel free to jump in and correct, or recommend how things SHOULD have been done better. As this is a POC space for my education and others', I have the luxury of more time to repair. If this were production, data recovery from backup would be faster. Most likely this is due to my missteps rather than any deficiency of vSAN, but I am open to correction.

 

 


Nerd needing coffee
TheBobkin
Champion

@JeremeyWise, just for awareness, I don't answer on here in any official GS capacity, and 95% of it is in my own time (because apparently trying to fix broken clusters is fun for me 😂).

 

To clarify what a unicastagent list should contain (and it looks to still be a bit incorrect in the previous post) - each node here should have ONLY the unicast entries for the other 2 nodes (e.g. their UUIDs and IPs); a node should never have its own entry (e.g. no-one needs their own phone number written down, just their friends' numbers). Fun tidbit: adding a node's own unicastagent entry to itself in older versions could cause the node to PSOD.
The cluster looks to have formed only because the current Master had both other nodes' unicastagent entries correct (if the Master changed, the remaining nodes would become isolated from one another, so this won't work long-term).
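That per-host rule can be written down mechanically. A rough sketch (just illustrative scripting, using the node UUIDs and vSAN IPs already posted in this thread) that prints the add commands each host should run - every peer, never itself:

```python
# Sketch: generate per-host `esxcli vsan cluster unicastagent add` commands
# from a node-UUID -> vSAN-IP map. Each host lists every peer, never itself.

NODES = {
    "thor":   ("60f584a0-1d04-3c42-154b-a0423f377a7e", "172.16.101.101"),
    "odin":   ("60f58b67-5dc7-eada-b179-a0423f35e8ee", "172.16.101.102"),
    "medusa": ("60f591d6-36a6-1390-8715-98be9459fea0", "172.16.101.103"),
}

def unicast_cmds(host: str) -> list[str]:
    """Commands to run on `host`: one add per peer, self excluded."""
    return [
        f"esxcli vsan cluster unicastagent add -U 1 -t node -u {uuid} -a {ip}"
        for peer, (uuid, ip) in sorted(NODES.items())
        if peer != host
    ]

for host in sorted(NODES):
    print(f"# on {host}:")
    print("\n".join(unicast_cmds(host)))
```

In a 3-node cluster every host therefore ends up with exactly 2 entries, and its own UUID appears nowhere in its own list.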

 

The reason you see "vsanDatastore (1)" is likely a result of enabling vSAN on the 'new' cluster prior to moving nodes into it - this created a new vSAN cluster UUID and datastore ID. This can prove problematic in scenarios where the original "vsanDatastore" is not accessible. There are (admittedly high-effort) workarounds for such scenarios, including changing all data-Object paths so they point to the new location, e.g. changing the lookup path from /vmfs/volumes/oldvsandatastore/12345678-abcdefgh-1234-1234abcd5678/VMname.vmx to /vmfs/volumes/newvsandatastore/12345678-abcdefgh-1234-1234abcd5678/VMname.vmx, or creating dummy VMs and repointing all of the vmdks to the original Objects.
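That lookup-path change can be sketched as a simple text rewrite of the datastore component - illustrative only, using the same placeholder names as above (`oldvsandatastore`, `newvsandatastore`), not real values:

```python
# Sketch of the repointing workaround: rewrite /vmfs/volumes/<old-ds>/ paths
# in .vmx text so objects resolve via the new datastore name. Placeholder
# names only; the object-UUID part of each path stays unchanged.

def repoint_vmx(vmx_text: str, old_ds: str, new_ds: str) -> str:
    """Swap the datastore component of /vmfs/volumes/... lookup paths."""
    return vmx_text.replace(f"/vmfs/volumes/{old_ds}/",
                            f"/vmfs/volumes/{new_ds}/")

vmx = 'scsi0:0.fileName = "/vmfs/volumes/oldvsandatastore/12345678-abcdefgh-1234-1234abcd5678/VMname.vmdk"'
print(repoint_vmx(vmx, "oldvsandatastore", "newvsandatastore"))
```

Again, high-effort and error-prone by hand, which is why it is not the first option below.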

 

But these are not good first options where it is now, so here is what I would advise:
Please get the cluster fully formed by remediating the unicastagent entries - you can do this manually, or simply by disabling the /VSAN/IgnoreClusterMemberListUpdates flag and clicking 'Update ESXi Configuration' via the 'vCenter is Authoritative' health test - if this didn't update last time you clicked it, that was likely because /VSAN/IgnoreClusterMemberListUpdates was still set to '1'.
Try registering some VMs via the .vmx files you can see in the VM folders in the datastore browser - it is possibly just seeing the .vmdk descriptor sizes (e.g. not the size of the backing Objects), and thus a size in bytes/KB can be normal.
If still no joy, then 'going back' by joining the 'old' vSAN cluster (52f2089b-1819-0833-66e0-5c9b09f7312b) may allow it to see "vsanDatastore" as the only/primary datastore.

 

Out of interest: when you click on "vsanDatastore" and go to Hosts tab, which hosts does it show with this attached to?

JeremeyWise
Enthusiast

Help is very much appreciated. All help is "as is", and I totally understand. I am using this to learn how to debug and repair, so if this were production I would know better how to root-cause, repair, and debug.

 

I think I do have the vSAN state correct now.

##############

## THOR state
[root@thor:~]
[root@thor:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2021-08-30T19:58:39Z
Local Node UUID: 60f584a0-1d04-3c42-154b-a0423f377a7e
Local Node Type: NORMAL
Local Node State: AGENT
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 60f591d6-36a6-1390-8715-98be9459fea0
Sub-Cluster Backup UUID: 60f58b67-5dc7-eada-b179-a0423f35e8ee
Sub-Cluster UUID: 527f51ca-7912-41e1-ff9b-a4482942519d
Sub-Cluster Membership Entry Revision: 2
Sub-Cluster Member Count: 3
Sub-Cluster Member UUIDs: 60f591d6-36a6-1390-8715-98be9459fea0, 60f58b67-5dc7-eada-b179-a0423f35e8ee, 60f584a0-1d04-3c42-154b-a0423f377a7e
Sub-Cluster Member HostNames: medusa.penguinpages.local, odin.penguinpages.local, thor.penguinpages.local
Sub-Cluster Membership UUID: e6192d61-3982-5ce5-80dc-98be9459fea0
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: d55a702d-74ec-462f-9f64-9a3e7a82144e 14 2021-08-28T02:18:48.0
Mode: REGULAR
[root@thor:~] esxcli vsan cluster unicastagent list
NodeUuid IsWitness Supports Unicast IP Address Port Iface Name Cert Thumbprint SubClusterUuid
------------------------------------ --------- ---------------- -------------- ----- ---------- --------------- --------------
60f591d6-36a6-1390-8715-98be9459fea0 0 true 172.16.101.103 12321 527f51ca-7912-41e1-ff9b-a4482942519d
60f58b67-5dc7-eada-b179-a0423f35e8ee 0 true 172.16.101.102 12321 527f51ca-7912-41e1-ff9b-a4482942519d
[root@thor:~] esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates
Value of IgnoreClusterMemberListUpdates is 1
[root@thor:~]


### ODIN state
[root@odin:~]
[root@odin:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2021-08-30T19:59:51Z
Local Node UUID: 60f58b67-5dc7-eada-b179-a0423f35e8ee
Local Node Type: NORMAL
Local Node State: BACKUP
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 60f591d6-36a6-1390-8715-98be9459fea0
Sub-Cluster Backup UUID: 60f58b67-5dc7-eada-b179-a0423f35e8ee
Sub-Cluster UUID: 527f51ca-7912-41e1-ff9b-a4482942519d
Sub-Cluster Membership Entry Revision: 2
Sub-Cluster Member Count: 3
Sub-Cluster Member UUIDs: 60f591d6-36a6-1390-8715-98be9459fea0, 60f58b67-5dc7-eada-b179-a0423f35e8ee, 60f584a0-1d04-3c42-154b-a0423f377a7e
Sub-Cluster Member HostNames: medusa.penguinpages.local, odin.penguinpages.local, thor.penguinpages.local
Sub-Cluster Membership UUID: e6192d61-3982-5ce5-80dc-98be9459fea0
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: 60f58b67-5dc7-eada-b179-a0423f35e8ee 2 2021-08-28T02:37:24.0
Mode: REGULAR
[root@odin:~] esxcli vsan cluster unicastagent list
NodeUuid IsWitness Supports Unicast IP Address Port Iface Name Cert Thumbprint SubClusterUuid
------------------------------------ --------- ---------------- -------------- ----- ---------- --------------- --------------
60f584a0-1d04-3c42-154b-a0423f377a7e 0 true 172.16.101.101 12321 527f51ca-7912-41e1-ff9b-a4482942519d
60f591d6-36a6-1390-8715-98be9459fea0 0 true 172.16.101.103 12321 527f51ca-7912-41e1-ff9b-a4482942519d
[root@odin:~] esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates
Value of IgnoreClusterMemberListUpdates is 1
[root@odin:~]


### MEDUSA state
[root@medusa:~]
[root@medusa:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2021-08-30T20:00:35Z
Local Node UUID: 60f591d6-36a6-1390-8715-98be9459fea0
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 60f591d6-36a6-1390-8715-98be9459fea0
Sub-Cluster Backup UUID: 60f58b67-5dc7-eada-b179-a0423f35e8ee
Sub-Cluster UUID: 527f51ca-7912-41e1-ff9b-a4482942519d
Sub-Cluster Membership Entry Revision: 2
Sub-Cluster Member Count: 3
Sub-Cluster Member UUIDs: 60f591d6-36a6-1390-8715-98be9459fea0, 60f58b67-5dc7-eada-b179-a0423f35e8ee, 60f584a0-1d04-3c42-154b-a0423f377a7e
Sub-Cluster Member HostNames: medusa.penguinpages.local, odin.penguinpages.local, thor.penguinpages.local
Sub-Cluster Membership UUID: e6192d61-3982-5ce5-80dc-98be9459fea0
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: 60f591d6-36a6-1390-8715-98be9459fea0 5 2021-08-28T02:41:49.0
Mode: REGULAR
[root@medusa:~] esxcli vsan cluster unicastagent list
NodeUuid IsWitness Supports Unicast IP Address Port Iface Name Cert Thumbprint SubClusterUuid
------------------------------------ --------- ---------------- -------------- ----- ---------- --------------- --------------
60f584a0-1d04-3c42-154b-a0423f377a7e 0 true 172.16.101.101 12321 527f51ca-7912-41e1-ff9b-a4482942519d
60f58b67-5dc7-eada-b179-a0423f35e8ee 0 true 172.16.101.102 12321 527f51ca-7912-41e1-ff9b-a4482942519d
[root@medusa:~] esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates
Value of IgnoreClusterMemberListUpdates is 1
[root@medusa:~]

 

##########

What I think the next step is:

"...if this didn't update last time you clicked it, it was likely due to /VSAN/IgnoreClusterMemberListUpdates still set to '1'..." - What should it be set to?

"...Update ESXi Configuration' via the vCenter is Authoritative health test..." -> I ran "Update ESXi Configuration" via the "vCenter is Authoritative" health test on each node; it said it completed, but there was no change.

 

"..Try registering some VMs via the .vmx.." -> On the only host where I can browse "vsanDatastore (1)", I browsed into a VM folder (e.g. ns01) and the import went fine, but on power-on I get this error:

 

Operation failed!

  • Task name: Power On virtual machine
  • Target: ns01
  • Status: Unable to enumerate all disks.

Nerd needing coffee
JeremeyWise
Enthusiast

Found this article:  VMware Knowledge Base

So ran command on each node: 

esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates

 

Ran "Update ESXi Configuration" from temp vCenter VM.

Medusa seems to still have a problem:

Different VC (60f591d6-36a6-1390-8715-98be9459fea0) - 2021-08-28 02:41:49 UTC - If vCenter Server was replaced/recovered from backup, and the current host list in vCenter Server is correct, then perform action 'Update ESXi configuration'

 

 

 

 


Nerd needing coffee
JeremeyWise
Enthusiast

I wanted to post back to this thread that the vSAN cluster, and full EVC / vMotion, is now working.

 

I was provided some additional assistance from the community, and it was / is greatly appreciated.

 

Step 1) Get all nodes back to the original vSAN cluster UUID

Try joining the nodes back to the original cluster UUID - it is clearly confused about the access route to the namespaces, but the data is clearly still there, as otherwise it wouldn't show as it does in the Health Data check.

Set ALL nodes to ignore vCenter membership lists once more:
# esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates

On ALL nodes clear the current unicastagent lists (because I think they may try to use the current cluster UUID and I am unsure where they are pulling this metadata from):
# esxcli vsan cluster unicastagent clear

On ALL nodes leave current cluster (NOTE:something has either changed in 7.0 U1 or the HOL I am using to test this has issues, it wouldn't leave when tried):
# esxcli vsan cluster leave

On ALL nodes, join old cluster UUID (if this doesn't work then sorry but outcome looks bad or at least not a simple fix by any means):
# esxcli vsan cluster join -u 52f2089b-1819-0833-66e0-5c9b09f7312b

On thor:
# esxcli vsan cluster unicastagent add -U 1 -t node -u 60f58b67-5dc7-eada-b179-a0423f35e8ee -a 172.16.101.102
# esxcli vsan cluster unicastagent add -U 1 -t node -u 60f591d6-36a6-1390-8715-98be9459fea0 -a 172.16.101.103

On odin:
# esxcli vsan cluster unicastagent add -U 1 -t node -u 60f584a0-1d04-3c42-154b-a0423f377a7e -a 172.16.101.101
# esxcli vsan cluster unicastagent add -U 1 -t node -u 60f591d6-36a6-1390-8715-98be9459fea0 -a 172.16.101.103

On medusa:
# esxcli vsan cluster unicastagent add -U 1 -t node -u 60f584a0-1d04-3c42-154b-a0423f377a7e -a 172.16.101.101
# esxcli vsan cluster unicastagent add -U 1 -t node -u 60f58b67-5dc7-eada-b179-a0423f35e8ee -a 172.16.101.102

Validate cluster is formed:
# esxcli vsan cluster get

Validate if the namespaces are now normally accessible and see if you can register and power on a VM via the vSphere client.

The issue with the above was that host thor would not leave vSAN. It halted for 40 seconds or so, then claimed it could not exit. I had to reboot it to get the command to complete. Once I did, the commands above worked on all nodes. This allowed the original vCenter VM to show up. Other VMs were listed but "offline"; I removed them one at a time and then re-imported them.

Then it was a matter of playing around to fix the fact that thor was in the new cluster while the other two were in the old (non-EVC) cluster. I got EVC enabled on the old cluster (with all VMs off). For some reason that then allowed medusa to move over to the new cluster... and after a reboot and removing VMs from inventory, I got odin into the cluster too. Once all nodes were in the cluster, it was just a matter of starting things up and repairing the broken DVSwitch and backing devices. Hope this thread helps others trying to repair vSAN, or at least makes clear that EVC mode really NEEDS to be set before you start vSAN. vCenter outside the cluster is always preferred, but in my case that is just not a reasonable option.

 

Thanks for the responses and community support.

 


Nerd needing coffee