VMware Cloud Community
wsanders11

How to reinitialize a host that has previously been in a VSAN cluster

I am trying to get a host (call it the "new" host) that was previously a member of a different VSAN cluster to join an existing cluster. "esxcli vsan cluster join -u <existing UUID>" works, but the new host ends up as its own master with the same Sub-Cluster UUID as the existing cluster. The VSAN disk view sees the new host but none of the disks on it.

Existing cluster:

Cluster Information

   Enabled: true

   Current Local Time: 2018-09-20T19:31:01Z

   Local Node UUID: 593fcd1f-04aa-a2e4-ad96-0cc47ad3fb4e

   Local Node Type: NORMAL

   Local Node State: AGENT

   Local Node Health State: HEALTHY

   Sub-Cluster Master UUID: 59440460-205e-0c14-63fd-0cc47ad3f8de

   Sub-Cluster Backup UUID: 5943b368-15f8-8c06-a9d7-0cc47ad353ae

   Sub-Cluster UUID: 52aac3f6-0daf-b3df-2ab7-f444ee7a223a

   Sub-Cluster Membership Entry Revision: 13

   Sub-Cluster Member Count: 8

   Sub-Cluster Member UUIDs: 5943b368-15f8-8c06-a9d7-0cc47ad353ae, 59440460-205e-0c14-63fd-0cc47ad3f8de, 5943b0f8-59f6-941c-fff1-0cc47ad3f8d2, 593fe58f-8160-c534-2522-0cc47ad39596, 593f09f8-dd5d-0478-dd5e-0cc47ad3f8ea, 59401516-e700-fb7a-2517-0cc47ad3fb52, 593fcd1f-04aa-a2e4-ad96-0cc47ad3fb4e, 593ff9b8-6eb7-f2f4-73ea-0cc47ad35846

New Host:

   Enabled: true

   Current Local Time: 2018-09-20T20:33:29Z

   Local Node UUID: 594420d7-0fe4-c512-f94b-0cc47ad3960a

   Local Node Type: NORMAL

   Local Node State: MASTER

   Local Node Health State: HEALTHY

   Sub-Cluster Master UUID: 594420d7-0fe4-c512-f94b-0cc47ad3960a

   Sub-Cluster Backup UUID:

   Sub-Cluster UUID: 52aac3f6-0daf-b3df-2ab7-f444ee7a223a

   Sub-Cluster Membership Entry Revision: 0

   Sub-Cluster Member Count: 1

   Sub-Cluster Member UUIDs: 594420d7-0fe4-c512-f94b-0cc47ad3960a

How can I wipe the new host clean so it can join the existing cluster? "esxcli vsan cluster leave" does not work for this, and it causes the host to immediately trigger a "vSphere HA host status" failure and a "Host connection and power state" failure even though the host is fine.

All hosts are running VMware ESXi, 6.5.0, 8294253.

w

4 Replies
TheBobkin

Hello wsanders11,

Are you trimming some information from that esxcli vsan cluster get output? I ask because, based on the build version you indicated, it should also report whether Unicast mode is enabled and the Maintenance Mode state.

Has the new node been added to the vSphere-level cluster or is it currently residing somewhere else?

Obviously you should verify your network configuration and ensure that you can ping between the vSAN-enabled interfaces on the existing and new node e.g.:

# vmkping -I vmkX xxx.xxx.xxx.xxx

You could also verify that the new node receives traffic over port 12321 using tcpdump-uw.
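A minimal sketch of such a check (substitute the vSAN-enabled vmkernel interface on the new node for vmkX):

# tcpdump-uw -i vmkX port 12321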

What on-disk format version do the disks have? Older versions can prevent a node from joining a cluster running newer versions/Unicast. If you are POSITIVE that you won't ever need whatever data was/is on those disks, wipe them via the Web Client: Host > Configure > Storage Adapters > select the adapter > select the disk > All Actions > Erase partitions.
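The on-disk format version can also be read from the host shell; a quick sketch (the exact field label varies a little between builds, hence the loose grep):

# esxcli vsan storage list | grep -i version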

If all of the above is okay (e.g. residing in the cluster, no partitions on the disks), then check the Unicast agent list on the new node to confirm it contains the correct info for all other cluster members, and that they all contain the new node's info. If they don't have the correct info, either add the entries manually or remediate the cluster via Cluster > Monitor > vSAN > Health > 'vCenter state is authoritative' > Remediate cluster.
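The unicast agent list itself can be dumped from each node's shell, e.g. this sketch; missing entries can be added with the matching unicastagent add sub-command:

# esxcli vsan cluster unicastagent list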

If remediation via GUI doesn't work then check that this is set to 0 on all nodes:

# esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates
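If a node comes back with 1, the same advanced option can be flipped back per node (a sketch; apply it on every host reporting the wrong value):

# esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates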

Probably not part of the problem but why does there appear to be a 1 hour time difference between the nodes there?

Bob

wsanders11

Just now I've uncovered the erase partitions option. That was the trick.

The host is booted off a USB drive; deleting partitions (skipping the USB drive, of course) cleared out all but one SATA disk on each host. That SATA disk has a vmkdump partition on it that can't be deleted, even though the vmkdump device is set to the USB drive. So I'm working on that for now.
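For reference, a quick sketch of how the configured and active coredump targets can be checked from the host shell:

# esxcli system coredump partition get
# esxcli system coredump partition list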

[later]

Aha! It was configured as the scratch partition somehow, which is not supposed to happen in a USB-booted host AFAIK.

TheBobkin (Accepted Solution)

Hello wsanders11,

So it clustered fine after removing the vSAN disk partitions, correct? Out of interest, what on-disk format were they? FYI: this method won't let you wipe the installation medium (which is why I didn't mention that as a precaution).

Bob

wsanders11

Yes, I was able to add all but one disk after removing partitions. It took a few steps, but eventually I was able to configure scratch to /tmp/scratch, and the scratch config survived a reboot. We send syslog to Log Insight, so the relevant parameters are:

ScratchConfig.ConfiguredScratchLocation: /tmp

ScratchConfig.CurrentScratchLocation: /tmp

Syslog.global.logDir: []/scratch/log

Syslog.global.logHost: udp://loginsights
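For reference, a quick sketch of confirming the same values from the host shell (output formatting varies by build; /scratch is a symlink to the active scratch location):

# esxcli system syslog config get
# ls -l /scratch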

I could not find a way to completely zero out ScratchConfig.ConfiguredScratchLocation and /etc/vmware/locker.conf. There are still log files being written to /tmp/log, which concerns me somewhat given that I want to preserve the life of my USB drive as much as possible.

After a reboot, deleting the VMFS volume that was previously busy worked, and the disk could be added to the pool of VSAN devices.
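In case it helps anyone else doing the same cleanup, the configured scratch location can be repointed from the shell; a sketch assuming the /tmp/scratch path used above (the change only takes effect after a reboot):

# vim-cmd hostsvc/advopt/update ScratchConfig.ConfiguredScratchLocation string /tmp/scratch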
