I am trying to get a host (call it the "new" host) that was previously a member of a different VSAN cluster to join an existing cluster. "esxcli vsan cluster join -u <existing UUID>" works, but the new host ends up as its own master with the same sub-cluster UUID as the existing cluster. The VSAN disk view sees the new host but none of the disks on it.
Existing cluster:
Cluster Information
Enabled: true
Current Local Time: 2018-09-20T19:31:01Z
Local Node UUID: 593fcd1f-04aa-a2e4-ad96-0cc47ad3fb4e
Local Node Type: NORMAL
Local Node State: AGENT
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 59440460-205e-0c14-63fd-0cc47ad3f8de
Sub-Cluster Backup UUID: 5943b368-15f8-8c06-a9d7-0cc47ad353ae
Sub-Cluster UUID: 52aac3f6-0daf-b3df-2ab7-f444ee7a223a
Sub-Cluster Membership Entry Revision: 13
Sub-Cluster Member Count: 8
Sub-Cluster Member UUIDs: 5943b368-15f8-8c06-a9d7-0cc47ad353ae, 59440460-205e-0c14-63fd-0cc47ad3f8de, 5943b0f8-59f6-941c-fff1-0cc47ad3f8d2, 593fe58f-8160-c534-2522-0cc47ad39596, 593f09f8-dd5d-0478-dd5e-0cc47ad3f8ea, 59401516-e700-fb7a-2517-0cc47ad3fb52, 593fcd1f-04aa-a2e4-ad96-0cc47ad3fb4e, 593ff9b8-6eb7-f2f4-73ea-0cc47ad35846
New Host:
Enabled: true
Current Local Time: 2018-09-20T20:33:29Z
Local Node UUID: 594420d7-0fe4-c512-f94b-0cc47ad3960a
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 594420d7-0fe4-c512-f94b-0cc47ad3960a
Sub-Cluster Backup UUID:
Sub-Cluster UUID: 52aac3f6-0daf-b3df-2ab7-f444ee7a223a
Sub-Cluster Membership Entry Revision: 0
Sub-Cluster Member Count: 1
Sub-Cluster Member UUIDs: 594420d7-0fe4-c512-f94b-0cc47ad3960a
How can I wipe the new host clean so it can join the existing cluster? "esxcli vsan cluster leave" does not work for this, and it causes the host to immediately trigger a "vSphere HA host status" failure and a "Host connection and power state" failure even though the host is fine.
All hosts are running VMware ESXi, 6.5.0, 8294253.
wsanders11
Hello wsanders11,
Are you trimming some information from that 'esxcli vsan cluster get' output? I ask because, based on the build version you indicated, it should also show whether Unicast mode is enabled and the Maintenance Mode state.
Has the new node been added to the vSphere-level cluster or is it currently residing somewhere else?
Obviously you should verify your network configuration and ensure that you can ping between the vSAN-enabled interfaces on the existing and new node e.g.:
# vmkping -I vmkX xxx.xxx.xxx.xxx
You could also verify that the new node receives traffic over port 12321 using tcpdump-uw.
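Something like this should do it (vmkX here is a placeholder for whichever vmkernel interface is vSAN-enabled on the new node):
# tcpdump-uw -i vmkX udp port 12321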
What on-disk format version do the disks have? Some versions can prevent addition to a cluster with newer versions/Unicast - if you are POSITIVE that you won't ever need whatever data was/is on those disks then wipe them via the Web Client: Host > Configure > Storage Adapters > Select adapter > Select disk > All Actions > Erase partitions
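The same can be done from the CLI with partedUtil if you prefer - a rough sketch, where the naa.* name is a placeholder for your actual device (triple-check the device before deleting anything):
# partedUtil getptbl /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxx
# partedUtil delete /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxx 1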
If all of the above is okay (e.g. the host is residing in the cluster and there are no partitions on the disks), then check the Unicast agent list on the new node to confirm it contains the correct info for all other cluster members, and that they all contain the new node's info. If they don't have the correct info, either add the entries manually or remediate the cluster via Cluster > Monitor > vSAN > Health > 'vCenter state is authoritative' > Remediate cluster.
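You can check that list per node from the CLI, e.g.:
# esxcli vsan cluster unicastagent list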
If remediation via the GUI doesn't work, then check that this is set to 0 on all nodes:
# esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates
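If it comes back as 1 on any node, you can set it back with the same advanced option, e.g.:
# esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates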
Probably not part of the problem, but why does there appear to be a one-hour time difference between the nodes?
Bob
Just now I uncovered the 'Erase partitions' option. That was the trick.
The host is booted off a USB drive; deleting partitions (skipping the USB drive, of course) cleared out all but one SATA disk on each host. That SATA disk has a vmkdump on it that can't be deleted, even though the vmkdump device is set to the USB drive. So I'm working on that for now.
[later]
Aha! It was configured as the scratch partition somehow, which is not supposed to happen in a USB-booted host AFAIK.
Hello wsanders11,
So it clustered fine following removing the vSAN disk partitions, correct? Out of interest, what on-disk format were they? FYI: it won't allow you to wipe the installation medium via this (thus why I didn't mention it as precaution).
Bob
Yes, I was able to add all but one disk after removing partitions. It took a few steps, but eventually I was able to configure scratch to /tmp/scratch, and the scratch config survived a reboot. We send syslog to Log Insight, so the relevant parameters are:
ScratchConfig.ConfiguredScratchLocation: /tmp
ScratchConfig.CurrentScratchLocation: /tmp
Syslog.global.logDir: []/scratch/log
Syslog.global.logHost: udp://loginsights
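For anyone else doing this, the scratch location can be set from the CLI with something like the following (the /tmp/scratch path is just an example; a reboot is needed for it to take effect):
# vim-cmd hostsvc/advopt/update ScratchConfig.ConfiguredScratchLocation string /tmp/scratch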
I could not find a way to completely zero out ScratchConfig.ConfiguredScratchLocation and /etc/vmware/locker.conf. There are still log files being written to /tmp/log, which concerns me somewhat given that I want to preserve the life of my USB drive as much as possible.
After a reboot, deleting the VMFS datastore that was previously busy worked, and the disk could be added to the pool of VSAN devices.
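For completeness, disks can also be claimed for vSAN from the CLI - a sketch with placeholder device names for the cache and capacity tiers:
# esxcli vsan storage add -s naa.CACHE_DEVICE -d naa.CAPACITY_DEVICE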