VMware Cloud Community
rectorat-vers
Contributor

VSAN Cluster partition problem

Hello,


I recently installed an all-SSD cluster (3 Dell R630 nodes, PERC H730P) running vSAN 6.7 / ESXi 6.7, connected to a Dell S4048-ON switch.
The network configuration of the nodes and the switch is OK. A 10G interface is dedicated to vSAN traffic, and the nodes can ping each other over this link.

[root@esx-vsan-1:~] vmkping -I vmk1 -s 8000 169.254.160.2
PING 169.254.160.2 (169.254.160.2): 8000 data bytes
8008 bytes from 169.254.160.2: icmp_seq=0 ttl=64 time=0.273 ms
8008 bytes from 169.254.160.2: icmp_seq=1 ttl=64 time=0.252 ms
8008 bytes from 169.254.160.2: icmp_seq=2 ttl=64 time=0.210 ms

--- 169.254.160.2 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.210/0.245/0.273 ms

[root@esx-vsan-2:~] vmkping -I vmk1 -s 8000 169.254.160.1
PING 169.254.160.1 (169.254.160.1): 8000 data bytes
8008 bytes from 169.254.160.1: icmp_seq=0 ttl=64 time=0.244 ms
8008 bytes from 169.254.160.1: icmp_seq=1 ttl=64 time=0.227 ms
8008 bytes from 169.254.160.1: icmp_seq=2 ttl=64 time=0.204 ms

--- 169.254.160.1 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.204/0.225/0.244 ms


The unicast agent is also visible from each node. No problem with the firewall either.

[root@esx-vsan-1:~] esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness  Supports Unicast  IP Address      Port  Iface Name
------------------------------------  ---------  ----------------  -------------  -----  ----------
00000000-0000-0000-0000-000af7be4a16          0              true  169.254.160.2  12321
00000000-0000-0000-0000-000af7be4bb0          0              true  169.254.160.3  12321
[root@esx-vsan-2:~] esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness  Supports Unicast  IP Address      Port  Iface Name
------------------------------------  ---------  ----------------  -------------  -----  ----------
00000000-0000-0000-0000-000af7be4bba          0              true  169.254.160.1  12321
00000000-0000-0000-0000-000af7be4bb0          0              true  169.254.160.3  12321
[root@esx-vsan-3:~] esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness  Supports Unicast  IP Address      Port  Iface Name
------------------------------------  ---------  ----------------  -------------  -----  ----------
00000000-0000-0000-0000-000af7be4bba          0              true  169.254.160.1  12321
00000000-0000-0000-0000-000af7be4a16          0              true  169.254.160.2  12321



Despite this, the configuration wizard reports a failure on the network configuration test and a partition problem.

vSphere_Web_Client_-_Google_Chrome_2018-05-02_14-20-29.png


Indeed, each node reports itself as MASTER with a Sub-Cluster Member Count of 1, even though all three share the same Sub-Cluster UUID. I do not understand why. Does anyone have an idea?

[root@esx-vsan-1:~] esxcli vsan cluster get
Cluster Information
    Enabled: true
    Current Local Time: 2018-05-02T12:26:58Z
    Local Node UUID: 00000000-0000-0000-0000-000af7be4bba
    Local Node Type: NORMAL
    Local Node State: MASTER
    Local Node Health State: HEALTHY
    Sub-Cluster Master UUID: 00000000-0000-0000-0000-000af7be4bba
    Sub-Cluster Backup UUID:
    Sub-Cluster UUID: 52544668-9b49-8737-f8db-7c41f895ccd8
    Sub-Cluster Membership Entry Revision: 0
    Sub-Cluster Member Count: 1
    Sub-Cluster Member UUIDs: 00000000-0000-0000-0000-000af7be4bba
    Sub-Cluster Membership UUID: 39d2e85a-bc79-cda9-d709-000af7be4bba
    Unicast Mode Enabled: true
    Maintenance Mode State: OFF
    Config Generation: ae613ac5-ac9e-4058-971e-bdda071cae52 5 2018-05-02T10:27:04.304

[root@esx-vsan-2:~] esxcli vsan cluster get
Cluster Information
    Enabled: true
    Current Local Time: 2018-05-02T12:27:08Z
    Local Node UUID: 00000000-0000-0000-0000-000af7be4a16
    Local Node Type: NORMAL
    Local Node State: MASTER
    Local Node Health State: HEALTHY
    Sub-Cluster Master UUID: 00000000-0000-0000-0000-000af7be4a16
    Sub-Cluster Backup UUID:
    Sub-Cluster UUID: 52544668-9b49-8737-f8db-7c41f895ccd8
    Sub-Cluster Membership Entry Revision: 0
    Sub-Cluster Member Count: 1
    Sub-Cluster Member UUIDs: 00000000-0000-0000-0000-000af7be4a16
    Sub-Cluster Membership UUID: 3cd2e85a-1c9a-7ce7-692f-000af7be4a16
    Unicast Mode Enabled: true
    Maintenance Mode State: OFF
    Config Generation: ae613ac5-ac9e-4058-971e-bdda071cae52 5 2018-05-02T10:27:04.365

[root@esx-vsan-3:~] esxcli vsan cluster get
Cluster Information
    Enabled: true
    Current Local Time: 2018-05-02T12:27:11Z
    Local Node UUID: 00000000-0000-0000-0000-000af7be4bb0
    Local Node Type: NORMAL
    Local Node State: MASTER
    Local Node Health State: HEALTHY
    Sub-Cluster Master UUID: 00000000-0000-0000-0000-000af7be4bb0
    Sub-Cluster Backup UUID:
    Sub-Cluster UUID: 52544668-9b49-8737-f8db-7c41f895ccd8
    Sub-Cluster Membership Entry Revision: 0
    Sub-Cluster Member Count: 1
    Sub-Cluster Member UUIDs: 00000000-0000-0000-0000-000af7be4bb0
    Sub-Cluster Membership UUID: 3cd2e85a-281a-bb75-8556-000af7be4bb0
    Unicast Mode Enabled: true
    Maintenance Mode State: OFF
    Config Generation: ae613ac5-ac9e-4058-971e-bdda071cae52 5 2018-05-02T10:27:04.404
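
The symptom in the three outputs above (every host naming itself as master) can be detected mechanically. This is a rough Python sketch under stated assumptions: `parse_cluster_get`, `partition_report` and `is_partitioned` are hypothetical helper names, the input is the raw text of `esxcli vsan cluster get` collected from each host by whatever means you prefer, and the sample is trimmed to the relevant fields from this post.

```python
def parse_cluster_get(text):
    """Parse the 'Key: value' lines of `esxcli vsan cluster get` into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon, such as 'Cluster Information'
            info[key.strip()] = value.strip()
    return info


def partition_report(outputs):
    """Group hostnames by the Sub-Cluster Master UUID each host reports."""
    groups = {}
    for host, text in outputs.items():
        master = parse_cluster_get(text).get("Sub-Cluster Master UUID", "")
        groups.setdefault(master, []).append(host)
    return groups


def is_partitioned(outputs):
    """True when the hosts do not all agree on a single master."""
    return len(partition_report(outputs)) > 1


def sample(node_uuid):
    # Trimmed copy of the output above; only the node UUID differs per host.
    return f"""Cluster Information
    Enabled: true
    Local Node UUID: {node_uuid}
    Local Node State: MASTER
    Sub-Cluster Master UUID: {node_uuid}
    Sub-Cluster Backup UUID:
    Sub-Cluster UUID: 52544668-9b49-8737-f8db-7c41f895ccd8
    Sub-Cluster Member Count: 1
"""


OUTPUTS = {
    "esx-vsan-1": sample("00000000-0000-0000-0000-000af7be4bba"),
    "esx-vsan-2": sample("00000000-0000-0000-0000-000af7be4a16"),
    "esx-vsan-3": sample("00000000-0000-0000-0000-000af7be4bb0"),
}
print(is_partitioned(OUTPUTS))
```

In a healthy three-node cluster all hosts would report the same master, so `partition_report` would return a single group.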

10 Replies
TheBobkin
Champion

Hello rectorat-vers,

How was this cluster set-up? (e.g. CLI, PowerCLI, Web Client)

Is the vCenter managing this cluster 6.7 also?

"The unicast agent is also visible from each node. No problem with the firewall either."

You should be able to verify that the hosts can actually communicate over the port used for unicast (UDP 12321) by running the following on the hosts. You should see traffic while a host is attempting to join the cluster:

# tcpdump-uw -i vmk1 -n udp port 12321 &

# esxcli vsan cluster join -u <Sub-Cluster UUID>

If you don't see any inter-node traffic then potentially something is preventing this.

This is unlikely to be the cause, but is vCenter set as the source of truth, and are the hosts all in the same vSphere cluster?

# esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates

Bob

rectorat-vers
Contributor

Yes, the vCenter Server is also 6.7, and the vSAN configuration was done through the GUI.

I also checked UDP traffic, and it is OK:

[root@esx-vsan-1:~] nc -z -w 1 -s 169.254.200.1 -u 169.254.200.2 12321
[root@esx-vsan-2:~] tcpdump-uw -i vmk1 port 12321
tcpdump-uw: verbose output suppressed, use -v or -vv for full protocol decode
listening on vmk1, link-type EN10MB (Ethernet), capture size 262144 bytes
10:22:18.389966 IP 169.254.200.1.40327 > 169.254.200.2.12321: UDP, length 1
169.254.200.1.40327 > 169.254.200.2.12321: UDP, length 1
2 packets captured
2 packets received by filter
0 packets dropped by kernel

[root@esx-vsan-1:~] nc -z -w 1 -s 169.254.200.1 -u 169.254.200.2 12345
[root@esx-vsan-2:~] tcpdump-uw -i vmk1 port 12345
tcpdump-uw: verbose output suppressed, use -v or -vv for full protocol decode
listening on vmk1, link-type EN10MB (Ethernet), capture size 262144 bytes
10:22:35.750290 IP 169.254.200.1.53870 > 169.254.200.2.12345: UDP, length 1
10:22:35.750656 IP 169.254.200.1.53870 > 169.254.200.2.12345: UDP, length 1
2 packets captured
2 packets received by filter
0 packets dropped by kernel

[root@esx-vsan-1:~] nc -z -w 1 -s 169.254.200.1 -u 169.254.200.2 23451
[root@esx-vsan-2:~] tcpdump-uw -i vmk1 port 23451
tcpdump-uw: verbose output suppressed, use -v or -vv for full protocol decode
listening on vmk1, link-type EN10MB (Ethernet), capture size 262144 bytes
10:22:58.126649 IP 169.254.200.1.19016 > 169.254.200.2.23451: UDP, length 1
10:22:58.129136 IP 169.254.200.1.19016 > 169.254.200.2.23451: UDP, length 1
2 packets captured
2 packets received by filter
0 packets dropped by kernel


But when I make a node leave and rejoin the cluster, I don't see any traffic:

[root@esx-vsan-2:~] esxcli vsan cluster leave
[root@esx-vsan-2:~] esxcli vsan cluster join -u 52544668-9b49-8737-f8db-7c41f895ccd8


[root@esx-vsan-1:~] tcpdump-uw -i vmk1 -n udp port 12321
tcpdump-uw: verbose output suppressed, use -v or -vv for full protocol decode
listening on vmk1, link-type EN10MB (Ethernet), capture size 262144 bytes

vCenter is the source of truth, and the hosts are in the same cluster.

[root@esx-vsan-2:~] esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates
Value of IgnoreClusterMemberListUpdates is 0

mschubi
Enthusiast

Did you accidentally enable another VMkernel port for vSAN?

rectorat-vers
Contributor

No, I have only one VMkernel interface for vSAN on each node:

2018-05-03_14-18-36.png

When I check the network connections, I don't see anything on port 12321. Is that expected?

[root@esx-vsan-1:~] esxcli network ip connection list | grep 169.254
tcp         0       0  169.254.160.1:427                              0.0.0.0:0             LISTEN        2100930  newreno
tcp         0       0  169.254.5.25:427                               0.0.0.0:0             LISTEN        2100930  newreno
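
The check above, looking for a socket on the vSAN clustering port, can also be scripted against saved `esxcli network ip connection list` output. This is only a sketch: `has_udp_socket` is a hypothetical helper, the assumption that the local address is the fourth whitespace-separated field comes from the rows quoted in this thread, and the `AFTER` row is the one posted later in the thread once the cluster was working.

```python
def has_udp_socket(connection_list, port):
    """True if any UDP row has a local address ending in ':<port>'."""
    for line in connection_list.splitlines():
        fields = line.split()
        # Assumed layout: proto, recv-q, send-q, local address, foreign address, ...
        if len(fields) >= 4 and fields[0] == "udp" and fields[3].endswith(f":{port}"):
            return True
    return False


# Rows seen while the vSAN VMkernel interface used a 169.254.x.x address.
BEFORE = """\
tcp         0       0  169.254.160.1:427    0.0.0.0:0    LISTEN    2100930  newreno
tcp         0       0  169.254.5.25:427     0.0.0.0:0    LISTEN    2100930  newreno
"""

# Row seen after the address change reported later in this thread.
AFTER = """\
udp         0       0  10.254.160.1:12321   0.0.0.0:0    2102161   VSAN_0x4321713395d8_CMMDSProces
"""

print(has_udp_socket(BEFORE, 12321))
print(has_udp_socket(AFTER, 12321))
```

Note that piping through `grep 169.254` hides sockets bound to any other address, so it is safer to feed the unfiltered listing to a check like this.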

rectorat-vers
Contributor

Eureka! It works!

Simply put: don't use APIPA (169.254.x.x link-local) addresses for your vSAN network.

I changed the vSAN network from 169.254.160.0 to 10.254.160.0 and immediately saw traffic on UDP port 12321:

[root@esx-vsan-1:~] esxcli network ip connection list | grep 12321
udp         0       0  10.254.160.1:12321                             0.0.0.0:0                           2102161           VSAN_0x4321713395d8_CMMDSProces

[root@esx-vsan-2:~] tcpdump-uw -i vmk1 port 12321
13:23:45.084288 IP 10.254.160.3.12321 > 10.254.160.2.12321: UDP, length 200
13:23:45.262885 IP 10.254.160.1.12321 > 10.254.160.2.12321: UDP, length 200
13:23:45.293362 IP 10.254.160.2.12321 > 10.254.160.1.12321: UDP, length 472
13:23:45.293380 IP 10.254.160.2.12321 > 10.254.160.3.12321: UDP, length 472
13:23:46.084309 IP 10.254.160.3.12321 > 10.254.160.2.12321: UDP, length 200
13:23:46.262896 IP 10.254.160.1.12321 > 10.254.160.2.12321: UDP, length 200
13:23:46.293372 IP 10.254.160.2.12321 > 10.254.160.1.12321: UDP, length 472
13:23:46.293394 IP 10.254.160.2.12321 > 10.254.160.3.12321: UDP, length 472
13:23:47.084296 IP 10.254.160.3.12321 > 10.254.160.2.12321: UDP, length 200
13:23:47.262913 IP 10.254.160.1.12321 > 10.254.160.2.12321: UDP, length 200

Curiously, I have not found any documentation or KB article about this.
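
For anyone who wants to screen a planned vSAN addressing scheme for this pitfall, here is a small Python sketch using the standard `ipaddress` module. The helper names `is_apipa` and `apipa_hosts` are made up for illustration, and the host-to-IP mapping is inferred from the outputs in this thread (the address for esx-vsan-3 in particular is an assumption).

```python
import ipaddress

# IPv4 link-local range used for automatic private addressing (APIPA).
APIPA = ipaddress.ip_network("169.254.0.0/16")


def is_apipa(address):
    """True if the address falls inside 169.254.0.0/16."""
    return ipaddress.ip_address(address) in APIPA


def apipa_hosts(vsan_ips):
    """vsan_ips maps hostname -> vSAN VMkernel IP; return the hosts using APIPA."""
    return sorted(host for host, ip in vsan_ips.items() if is_apipa(ip))


before = {
    "esx-vsan-1": "169.254.160.1",
    "esx-vsan-2": "169.254.160.2",
    "esx-vsan-3": "169.254.160.3",
}
after = {
    "esx-vsan-1": "10.254.160.1",
    "esx-vsan-2": "10.254.160.2",
    "esx-vsan-3": "10.254.160.3",
}

print(apipa_hosts(before))  # all three hosts are flagged
print(apipa_hosts(after))   # nothing is flagged
```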

iliketurbos
Enthusiast

After 4 hours of troubleshooting, this saved my butt. Thanks for reporting back...

Same problem on vSAN 6.7.

TheBobkin
Champion

Hello iliketurbos,

Sorry to hear you struggled with that.

vSAN traffic over APIPA addresses is unsupported and also won't function on any release later than 6.1. I could have *sworn* we had this documented externally, but I just looked in a few places and cannot find it stated. I will see what I can do about getting public documentation sorted.

Bob

briantilburgs2
Contributor

I have the same problem following a network issue: my distributed switch configuration was broken by an external orchestrator, and the vSAN nodes became isolated.

vCenter still resides on the vSAN datastore but is down because vSAN is unavailable.

Is it possible to recreate the cluster without the vCenter?

 

[root@esx01:~] esxcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2021-07-08T08:39:46Z
   Local Node UUID: 59c41bd7-2fff-b50c-3516-ac1f6b16cdb4
   Local Node Type: NORMAL
   Local Node State: MASTER
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 59c41bd7-2fff-b50c-3516-ac1f6b16cdb4
   Sub-Cluster Backup UUID: 
   Sub-Cluster UUID: 52dc2bda-1552-6d58-c502-32641a99e2b9
   Sub-Cluster Membership Entry Revision: 0
   Sub-Cluster Member Count: 1
   Sub-Cluster Member UUIDs: 59c41bd7-2fff-b50c-3516-ac1f6b16cdb4
   Sub-Cluster Member HostNames: esx01.tilburgs.eu
   Sub-Cluster Membership UUID: e4b0e660-3b5e-1334-c03f-ac1f6b16cdb4
   Unicast Mode Enabled: true
   Maintenance Mode State: OFF
   Config Generation: d341678b-752c-4b21-9e26-aa17c2797817 490 2021-06-22T06:31:57.810
[root@esx01:~] 


[root@esx02:~] esxcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2021-07-08T08:39:11Z
   Local Node UUID: 59c38eb3-4a8c-3490-33e9-ac1f6b16ce7a
   Local Node Type: NORMAL
   Local Node State: MASTER
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 59c38eb3-4a8c-3490-33e9-ac1f6b16ce7a
   Sub-Cluster Backup UUID: 
   Sub-Cluster UUID: 52dc2bda-1552-6d58-c502-32641a99e2b9
   Sub-Cluster Membership Entry Revision: 0
   Sub-Cluster Member Count: 1
   Sub-Cluster Member UUIDs: 59c38eb3-4a8c-3490-33e9-ac1f6b16ce7a
   Sub-Cluster Member HostNames: esx02.tilburgs.eu
   Sub-Cluster Membership UUID: bbb0e660-3c71-4337-5927-ac1f6b16ce7a
   Unicast Mode Enabled: true
   Maintenance Mode State: OFF
   Config Generation: d341678b-752c-4b21-9e26-aa17c2797817 490 2021-06-22T06:31:57.814
[root@esx02:~] 


[root@esx04:~] esxcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2021-07-08T10:30:45Z
   Local Node UUID: 5f79f311-d555-53c0-f336-bc4a563e6dca
   Local Node Type: NORMAL
   Local Node State: MASTER
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 5f79f311-d555-53c0-f336-bc4a563e6dca
   Sub-Cluster Backup UUID: 
   Sub-Cluster UUID: 52dc2bda-1552-6d58-c502-32641a99e2b9
   Sub-Cluster Membership Entry Revision: 0
   Sub-Cluster Member Count: 1
   Sub-Cluster Member UUIDs: 5f79f311-d555-53c0-f336-bc4a563e6dca
   Sub-Cluster Member HostNames: esx04.tilburgs.eu
   Sub-Cluster Membership UUID: eeb0e660-f23b-a739-0457-bc4a5645bd1a
   Unicast Mode Enabled: true
   Maintenance Mode State: OFF
   Config Generation: d341678b-752c-4b21-9e26-aa17c2797817 490 2021-06-22T06:31:57.794
[root@esx04:~] 

 

I moved the vSAN interface to a standard vSwitch to try to get everything running again:

# Create a standard vSwitch and attach an uplink
esxcli network vswitch standard add --vswitch-name=vSwitch0
esxcli network vswitch standard uplink add --uplink-name=vmnic7 --vswitch-name=vSwitch0
# Create a temporary port group for vSAN on VLAN 55
esxcli network vswitch standard portgroup add --portgroup-name=tmp-vsan --vswitch-name=vSwitch0
esxcli network vswitch standard portgroup set -p tmp-vsan --vlan-id 55
# Create a new VMkernel interface with a static IP on that port group
esxcli network ip interface add --interface-name=vmk15 --portgroup-name=tmp-vsan
esxcli network ip interface ipv4 set --interface-name=vmk15 --ipv4=10.100.15.12 --netmask=255.255.255.0 --type=static
# Tag the new interface for vSAN traffic and untag the old one
esxcli vsan network ip add -i vmk15 -T vsan
esxcli vsan network ip remove -i vmk5
# Restart the management services
services.sh restart

 

Does anybody have an idea how to repair this?

 

TheBobkin
Champion

@briantilburgs2 "Is it possible to recreate the cluster without the vCenter?"

Yes, assuming they can all ping each other on the newly created vSAN IPs (correct MTU end-to-end, VLAN, etc.), you should just need to populate the unicastagent list on each node so that it contains the new vSAN IPs of all the other nodes:

https://kb.vmware.com/s/article/2150303
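
As a rough illustration of what "populate the unicastagent list on each node" means, the Python sketch below prints, for each host, one `unicastagent add` command per other node. Treat it as a sketch under assumptions: the flag spelling (`-t node -u <uuid> -U true -a <ip> -p 12321`) is written from memory of the KB article and should be checked against the KB or `esxcli vsan cluster unicastagent add --help` before use, the helper name `unicastagent_add_commands` is made up, the node UUIDs are taken from the outputs posted above, and the mapping of 10.100.15.x addresses to hosts is hypothetical.

```python
def unicastagent_add_commands(nodes, port=12321):
    """nodes maps hostname -> (node_uuid, vsan_ip).

    Returns {hostname: [commands to run on that host]}, one per *other* node,
    because a host never lists itself as a unicast agent.
    """
    plan = {}
    for host in nodes:
        plan[host] = [
            # Flag names are an assumption; verify them against the KB first.
            f"esxcli vsan cluster unicastagent add -t node -u {uuid} -U true -a {ip} -p {port}"
            for peer, (uuid, ip) in nodes.items()
            if peer != host
        ]
    return plan


# UUIDs come from the `esxcli vsan cluster get` outputs above;
# the IP-to-host assignment is hypothetical.
NODES = {
    "esx01": ("59c41bd7-2fff-b50c-3516-ac1f6b16cdb4", "10.100.15.11"),
    "esx02": ("59c38eb3-4a8c-3490-33e9-ac1f6b16ce7a", "10.100.15.12"),
    "esx04": ("5f79f311-d555-53c0-f336-bc4a563e6dca", "10.100.15.14"),
}

for host, commands in unicastagent_add_commands(NODES).items():
    print(f"# run on {host}")
    print("\n".join(commands))
```

Stale entries that still point at the old addresses would also need to be removed; the KB article describes the full procedure.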

briantilburgs2
Contributor

Wow, thanks! That was indeed the problem: the unicast agent list was still populated with the old addresses.

Problem solved.
