VMware Cloud Community
future2000
Enthusiast

Cannot disable or stop vSAN

ESXi 6.5.1, vCenter 6.5.1.

vSAN Stretched Cluster with 2 Hosts, 1 Witness.

I applied a Host Profile to an ESXi host, which broke the host. I needed to reset the ESXi host completely.

I re-configured the failed host and added all the required vmkernel adapters.

Health check would insist the ESXi Host in question did not have a vmk enabled for vSAN. Rebooted vCenter. Rebooted all hosts again. Re-added another vmkernel. No difference.

Attempts to disable vSAN fail with "Failed to remove witness host from cluster".

Attempts to add another primary fault domain (which is where the failed host existed) fail.

Attempts to add the fault domain via the CLI result in:

Unable to set fault domain name: Failed to update fault domain entry in CMMDS: Operation not allowed because the VMKernel is shutting down.

I've migrated everything in the cluster to NFS. The Health check now insists my witness doesn't have a vmkernel adapter enabled for vSAN. It does!!

I'm now unable to disable vSAN or fix the mess. I was hoping 6.5 Update 1 would not have issues like these.

Has anyone seen anything similar?

4 Replies
jameseydoyle
VMware Employee

Hi,

Rebuilding the host will not rebuild the broken cluster settings. The CMMDS directory on the other hosts in the cluster will have a record of the previously configured host settings, including entries with UUIDs for the VMkernel ports used by each host. If you are trying to make your newly rebuilt host look like the same host that was there before, it will not work, as its VMkernel adapter will have a completely different UUID. With Stretched/2-node Clusters this is more complex, as each host has a very specific role to play in the cluster.

The best course of action to take here would be to create a new cluster and move your hosts into it. Configure the new cluster as a Stretched 2-node Cluster once the hosts are in place. Destroy the old cluster once it has been evacuated of hosts.

GreatWhiteTec
VMware Employee

Ideally, you want to remove the disk groups prior to disabling vSAN. You want to do the clean up ahead of time. Deleting the disk groups will delete the vSAN partitions.

future2000
Enthusiast

Hi James,

Thanks for the information. I see the issue now.

After a complete re-install of ESXi on the failed host, I was able to get the cluster operational again with the full capacity of the two hosts. My problem was that I couldn't add the Primary fault domain back again; after the rebuild I could add the Primary fault domain and then add the host to it. Once that was fixed, getting the cluster up and running was OK.

My only lingering issues are a failed data test and the vSAN object health check: 21 items are reported as inaccessible. I migrated everything across to NFS, so I guess these must be what was there before the problem. Is there a method to clean that up?

Cheers

TheBobkin
Champion

Hello future2000,

I had intended to write key points on 'completely dismantling vSAN' here a few days ago but I think my colleagues jameseydoyle and GreatWhiteTec pretty much covered everything.

'Ghost' references after incomplete decommissioning appear to be more prevalent in Stretched clusters - there are more elements involved here and over more sites so it stands to reason that this is more likely to occur.

You can check the current state of the remaining Objects by using the clipboard function in the Health GUI (on the bottom-right) to get the UUIDs, then from any host in the cluster use cmmds-tool to identify the state of their components:

# cmmds-tool find -f json -u <Object UUID>

Further identification (including Friendly name of Object and/or Object Path) can be done using objtool:

# /usr/lib/vmware/osfs/bin/objtool getAttr -u <Object UUID>
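With 21 Objects to check, the two read-only queries above can be generated in bulk. The following is a minimal sketch, not an official tool: it assumes the UUIDs copied from the Health GUI have been saved one per line in a file, and the function names `inspect_cmds` and `inspect_all` are hypothetical helpers. It only prints the commands, so they can be reviewed before being run on a host in the cluster.

```shell
#!/bin/sh
# Sketch: build the read-only inspection commands from this thread for a
# list of Object UUIDs. Nothing here is executed against vSAN; the commands
# are only printed. Function names are illustrative, not part of any tool.

# Print the two inspection commands for a single Object UUID.
inspect_cmds() {
    echo "cmmds-tool find -f json -u $1"
    echo "/usr/lib/vmware/osfs/bin/objtool getAttr -u $1"
}

# Print the inspection commands for every non-empty line of a file
# (assumed format: one Object UUID per line).
inspect_all() {
    while read -r uuid || [ -n "$uuid" ]; do
        if [ -n "$uuid" ]; then
            inspect_cmds "$uuid"
        fi
    done < "$1"
}
```

For example, `inspect_all uuids.txt` (where `uuids.txt` is a placeholder file name) prints two commands per UUID, which can then be pasted into the host shell.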

If the Object only has one active component (state 5 in json format) you will have to query the active component UUID from the host that this resides on (should be the DOM Owner of the Object which is shown in the output of the 1st command):

# /usr/lib/vmware/osfs/bin/objtool getAttr --bypassDom -u <active component UUID> -c

Object deletion using objtool can be done using this command (may have to run from the DOM Owner host):

# /usr/lib/vmware/osfs/bin/objtool delete -u <Object UUID> -f

Note to anyone reading this comment in future: Object deletion using the above command is PERMANENT and IRREVERSIBLE, please confirm (multiple times!) that you know what you are deleting as this responsibility is completely your own.
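One way to build a small pause into the deletion step is sketched below. This is an illustrative wrapper only: `delete_cmd` is a hypothetical helper name, not part of objtool, and it prints the delete command from this thread only when the Object UUID has been supplied twice and both copies match. It does not delete anything itself and is no substitute for checking what the Object actually is.

```shell
#!/bin/sh
# Sketch: print the objtool delete command only if the same Object UUID
# is given twice. delete_cmd is a hypothetical name used for illustration.
delete_cmd() {
    if [ -z "$1" ] || [ "$1" != "$2" ]; then
        echo "UUID missing or mismatched; no command printed" >&2
        return 1
    fi
    echo "/usr/lib/vmware/osfs/bin/objtool delete -u $1 -f"
}
```

Calling `delete_cmd <Object UUID> <Object UUID>` prints the command for review; a mismatch returns a non-zero status and prints nothing to standard output.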

Bob