Cannot add a node back into a VSAN cluster

vmsysadmin20111 · 04-21-2017

Hi all,

I have a 4-node VSAN cluster. One of the nodes (apparently the master node) was improperly removed from the cluster (disconnected, then removed). The three remaining nodes are fine, but I am now unable to return the missing node to the cluster. When I add it back, it appears to form a separate cluster with itself as the only member, and as a result I cannot browse the VSAN datastore in the vCenter client (the VMs are OK, though).

Node 1 before joining the VSAN cluster:

[root@fx2-esxi-01:~] esxcli vsan cluster get
Virtual SAN Clustering is not enabled on this host

Working VSAN cluster before node1 is joined:

[root@fx2-esxi-04:~] esxcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2017-04-21T16:17:33Z
   Local Node UUID: 58d04aea-1952-3758-4c9d-107d1a8fb9a7
   Local Node Type: NORMAL
   Local Node State: MASTER
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 58d04aea-1952-3758-4c9d-107d1a8fb9a7
   Sub-Cluster Backup UUID: 58cc1241-1e61-a964-ed3f-107d1a8fb3ef
   Sub-Cluster UUID: 52311b70-024e-7173-ac6e-92638c796a1a
   Sub-Cluster Membership Entry Revision: 12
   Sub-Cluster Member Count: 3
   Sub-Cluster Member UUIDs: 58d04aea-1952-3758-4c9d-107d1a8fb9a7, 58cc1241-1e61-a964-ed3f-107d1a8fb3ef, 58cc1fa4-bc1c-71ad-9f0d-107d1a8fb369
   Sub-Cluster Membership UUID: c193f958-30b3-2c5e-833c-107d1a8fb9a7

Node 1 after it is added back into the cluster:

[root@fx2-esxi-01:~] esxcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2017-04-21T16:48:43Z
   Local Node UUID: 58cc0c20-eddb-7b02-7e25-107d1a8fb301
   Local Node Type: NORMAL
   Local Node State: MASTER
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 58cc0c20-eddb-7b02-7e25-107d1a8fb301
   Sub-Cluster Backup UUID:
   Sub-Cluster UUID: 52311b70-024e-7173-ac6e-92638c796a1a
   Sub-Cluster Membership Entry Revision: 0
   Sub-Cluster Member Count: 1
   Sub-Cluster Member UUIDs: 58cc0c20-eddb-7b02-7e25-107d1a8fb301
   Sub-Cluster Membership UUID: e837fa58-35df-3663-af01-107d1a8fb301

Note that the Sub-Cluster UUID is the same, but the cluster member count is 1. The VSAN health check does show cluster partitioning and multicast issues.
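The comparison described above (same Sub-Cluster UUID, different member lists) can be checked mechanically. The following is a minimal sketch, assuming the text of `esxcli vsan cluster get` has already been collected from each host; the function names are hypothetical and the sample strings are trimmed copies of the two outputs quoted in this post.

```python
def parse_cluster_get(text):
    """Parse the 'Key: value' lines of `esxcli vsan cluster get` output into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps in the value stay intact.
        key, sep, value = line.strip().partition(":")
        if sep and key:
            info[key.strip()] = value.strip()
    return info


def member_set(info):
    """Return the set of member UUIDs a host reports."""
    raw = info.get("Sub-Cluster Member UUIDs", "")
    return {u.strip() for u in raw.split(",") if u.strip()}


def is_partitioned(a, b):
    """True when two hosts claim the same cluster but see different members."""
    same_cluster = a.get("Sub-Cluster UUID") == b.get("Sub-Cluster UUID")
    return same_cluster and member_set(a) != member_set(b)


# Trimmed copies of the outputs from fx2-esxi-04 and fx2-esxi-01.
NODE4_OUTPUT = """\
Sub-Cluster UUID: 52311b70-024e-7173-ac6e-92638c796a1a
Sub-Cluster Member Count: 3
Sub-Cluster Member UUIDs: 58d04aea-1952-3758-4c9d-107d1a8fb9a7, 58cc1241-1e61-a964-ed3f-107d1a8fb3ef, 58cc1fa4-bc1c-71ad-9f0d-107d1a8fb369
"""
NODE1_OUTPUT = """\
Sub-Cluster UUID: 52311b70-024e-7173-ac6e-92638c796a1a
Sub-Cluster Member Count: 1
Sub-Cluster Member UUIDs: 58cc0c20-eddb-7b02-7e25-107d1a8fb301
"""

node4 = parse_cluster_get(NODE4_OUTPUT)
node1 = parse_cluster_get(NODE1_OUTPUT)
print("partitioned:", is_partitioned(node1, node4))
```

Comparing member sets rather than member counts also catches the case where two partitions happen to be the same size.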

I don't think there are multicast issues, since everything was working fine before the first node was removed. Looking at the packets tab, heartbeats from the original master (.201) are received by all 4 nodes, but heartbeats from the new master (.204) are received only by the 3 surviving nodes and not by .201 (same group, though).
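The one-way delivery described above can be written down as a small reception table, which makes the asymmetric pair easy to spot. This is only an illustrative sketch: the function name is hypothetical, and the labels `.202` and `.203` for the other two surviving hosts are assumptions, since the post names only `.201` and `.204`.

```python
def missing_deliveries(senders, heard_by):
    """Return (sender, receiver) pairs where the receiver never saw the sender's heartbeats."""
    missing = []
    for sender in senders:
        for receiver, heard in heard_by.items():
            if receiver != sender and sender not in heard:
                missing.append((sender, receiver))
    return missing


# Hosts that send master heartbeats, as described in the post.
SENDERS = [".201", ".204"]

# For each host, the set of masters whose heartbeats it receives.
# .202 and .203 are assumed labels for the two unnamed surviving hosts.
HEARD_BY = {
    ".201": {".201"},
    ".202": {".201", ".204"},
    ".203": {".201", ".204"},
    ".204": {".201", ".204"},
}

for sender, receiver in missing_deliveries(SENDERS, HEARD_BY):
    print(f"heartbeats from {sender} do not reach {receiver}")
```

A single missing direction like this is consistent with the switch forwarding the group to only some ports, rather than with multicast being down altogether.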

All nodes are connected through a single 10GbE uplink (the second one is in standby) to an internal switch on the Dell FX2 system (Dell PowerEdge FN 410S IOM in standalone mode):

Dell#sh ip igmp snooping groups detail
Interface Vlan 227
Group 224.1.2.3
Uptime 4w2d
Expires 00:02:05
Router mode EXCLUDE
Last reporter 192.168.227.204
Last reporter mode EXCLUDE
Last report received IS_EXCL
Group source list
Source address Uptime Expires

Interface Vlan 227
Group 224.2.3.4
Uptime 4w2d
Expires 00:02:05
Router mode EXCLUDE
Last reporter 192.168.227.204
Last reporter mode EXCLUDE
Last report received IS_EXCL
Group source list
Source address Uptime Expires
Dell#

Any thoughts on what else to check? Thanks in advance!

admin · 04-23-2017

Hi

If your host has been moved out of the cluster, I think you need to re-create the disk group (DG):

Put the host into maintenance mode (MM).

Delete the DG.

Remove the host from the cluster.

Take the host out of MM.

Add a vmkernel interface for vSAN.

Join the cluster and make sure everything works fine.

Re-create the DG.

Before putting the host into MM, make sure your VMs are OK and that no resync is in progress.
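For reference, the sequence above can be laid out as an ordered plan of host-side commands. This is a rough sketch, not a tested procedure: mapping each step to an `esxcli` command is an assumption (the reply may well have meant vSphere Client actions), the function name is hypothetical, and the interface name `vmk2` and the `naa.*` device names are placeholders. Only the cluster UUID comes from the output quoted earlier in the thread.

```python
def rejoin_plan(cluster_uuid, vmk, cache_disk, capacity_disks):
    """Return (step, command) pairs in the order given in the reply."""
    disks = " ".join(f"--disks {d}" for d in capacity_disks)
    return [
        ("put host into MM",
         "esxcli system maintenanceMode set --enable true "
         "--vsanmode ensureObjectAccessibility"),
        # Removing the cache device removes the whole disk group.
        ("delete DG", f"esxcli vsan storage remove --ssd {cache_disk}"),
        ("leave the vSAN cluster", "esxcli vsan cluster leave"),
        ("exit MM", "esxcli system maintenanceMode set --enable false"),
        ("tag vmkernel for vSAN",
         f"esxcli vsan network ipv4 add --interface-name {vmk}"),
        ("join cluster",
         f"esxcli vsan cluster join --cluster-uuid {cluster_uuid}"),
        ("verify membership", "esxcli vsan cluster get"),
        ("re-create DG",
         f"esxcli vsan storage add --ssd {cache_disk} {disks}"),
    ]


plan = rejoin_plan(
    cluster_uuid="52311b70-024e-7173-ac6e-92638c796a1a",  # from the post
    vmk="vmk2",                                            # placeholder
    cache_disk="naa.CACHE_DEVICE",                         # placeholder
    capacity_disks=["naa.CAPACITY_1", "naa.CAPACITY_2"],   # placeholders
)
for number, (step, command) in enumerate(plan, start=1):
    print(f"{number}. {step}: {command}")
```

The script only prints the commands; running them is left to the operator, so each one can be checked against the installed ESXi version first.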

vmsysadmin20111 · 04-25-2017

Thanks. I ended up removing the DG on the failed host and then adding it back, but I still had multicast issues. In the end, I had to switch both FX2 FN-410S modules from standalone to PMUX mode, which disables IGMP snooping on the IOMs, and configure snooping and a querier on the TOR Brocade VDX switch. This fixed the multicast issues.