rectorat-vers
Contributor

VSAN Cluster partition problem

Hello,


I recently installed an all-flash cluster (3 Dell R630 nodes, PERC H730P) running vSAN 6.7 / ESXi 6.7, connected to a Dell S4048-ON switch.
The network configuration of the nodes and switch is OK: a 10G interface is dedicated to vSAN traffic, and the nodes ping each other fine over this link.

[root@esx-vsan-1:~] vmkping -I vmk1 -s 8000 169.254.160.2
PING 169.254.160.2 (169.254.160.2): 8000 data bytes
8008 bytes from 169.254.160.2: icmp_seq=0 ttl=64 time=0.273 ms
8008 bytes from 169.254.160.2: icmp_seq=1 ttl=64 time=0.252 ms
8008 bytes from 169.254.160.2: icmp_seq=2 ttl=64 time=0.210 ms

--- 169.254.160.2 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.210/0.245/0.273 ms

[root@esx-vsan-2:~] vmkping -I vmk1 -s 8000 169.254.160.1
PING 169.254.160.1 (169.254.160.1): 8000 data bytes
8008 bytes from 169.254.160.1: icmp_seq=0 ttl=64 time=0.244 ms
8008 bytes from 169.254.160.1: icmp_seq=1 ttl=64 time=0.227 ms
8008 bytes from 169.254.160.1: icmp_seq=2 ttl=64 time=0.204 ms

--- 169.254.160.1 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.204/0.225/0.244 ms


The unicast agent is also visible from each node. No problem with the firewall either.

[root@esx-vsan-1:~] esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness  Supports Unicast  IP Address      Port  Iface Name
------------------------------------  ---------  ----------------  -------------  -----  ----------
00000000-0000-0000-0000-000af7be4a16          0              true  169.254.160.2  12321
00000000-0000-0000-0000-000af7be4bb0          0              true  169.254.160.3  12321
[root@esx-vsan-2:~] esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness  Supports Unicast  IP Address      Port  Iface Name
------------------------------------  ---------  ----------------  -------------  -----  ----------
00000000-0000-0000-0000-000af7be4bba          0              true  169.254.160.1  12321
00000000-0000-0000-0000-000af7be4bb0          0              true  169.254.160.3  12321
[root@esx-vsan-3:~] esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness  Supports Unicast  IP Address      Port  Iface Name
------------------------------------  ---------  ----------------  -------------  -----  ----------
00000000-0000-0000-0000-000af7be4bba          0              true  169.254.160.1  12321
00000000-0000-0000-0000-000af7be4a16          0              true  169.254.160.2  12321
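
One way to confirm that the unicast agent lists above are complete is to check that every host lists every other host. The following is a minimal Python sketch, not a VMware tool: `peer_ips` and `missing_peers` are hypothetical helper names, and the parsing assumes the column layout shown in this thread (UUID first, IP address in the fourth column).

```python
def peer_ips(listing):
    """Collect peer IPs from saved `esxcli vsan cluster unicastagent list` output."""
    ips = set()
    for line in listing.splitlines():
        fields = line.split()
        # Data rows start with a node UUID (exactly four dashes); header and rule lines do not.
        if len(fields) >= 5 and fields[0].count("-") == 4:
            ips.add(fields[3])
    return ips


def missing_peers(listings_by_own_ip):
    """Map each host's own vSAN IP to the peer IPs absent from its list."""
    all_ips = set(listings_by_own_ip)
    return {
        own_ip: sorted(all_ips - {own_ip} - peer_ips(listing))
        for own_ip, listing in listings_by_own_ip.items()
    }


# Rows copied from the output in this thread, keyed by each host's own vSAN IP.
listings = {
    "169.254.160.1": """
NodeUuid                              IsWitness  Supports Unicast  IP Address      Port  Iface Name
00000000-0000-0000-0000-000af7be4a16          0              true  169.254.160.2  12321
00000000-0000-0000-0000-000af7be4bb0          0              true  169.254.160.3  12321
""",
    "169.254.160.2": """
00000000-0000-0000-0000-000af7be4bba          0              true  169.254.160.1  12321
00000000-0000-0000-0000-000af7be4bb0          0              true  169.254.160.3  12321
""",
    "169.254.160.3": """
00000000-0000-0000-0000-000af7be4bba          0              true  169.254.160.1  12321
00000000-0000-0000-0000-000af7be4a16          0              true  169.254.160.2  12321
""",
}

print(missing_peers(listings))
```

An empty list for every host means the unicast lists are complete, as they are here, so the partition has to come from somewhere else.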



Despite this, the configuration wizard reports a failure on the network configuration test and a partitioning problem.

vSphere_Web_Client_-_Google_Chrome_2018-05-02_14-20-29.png


Indeed, each node shows itself as MASTER of the cluster (same Sub-Cluster UUID), with a member count of 1. I do not understand why. Does anyone have an idea?

[root@esx-vsan-1:~] esxcli vsan cluster get
Cluster Information
    Enabled: true
    Current Local Time: 2018-05-02T12:26:58Z
    Local Node UUID: 00000000-0000-0000-0000-000af7be4bba
    Local Node Type: NORMAL
    Local Node State: MASTER
    Local Node Health State: HEALTHY
    Sub-Cluster Master UUID: 00000000-0000-0000-0000-000af7be4bba
    Sub-Cluster Backup UUID:
    Sub-Cluster UUID: 52544668-9b49-8737-f8db-7c41f895ccd8
    Sub-Cluster Membership Entry Revision: 0
    Sub-Cluster Member Count: 1
    Sub-Cluster Member UUIDs: 00000000-0000-0000-0000-000af7be4bba
    Sub-Cluster Membership UUID: 39d2e85a-bc79-cda9-d709-000af7be4bba
    Unicast Mode Enabled: true
    Maintenance Mode State: OFF
    Config Generation: ae613ac5-ac9e-4058-971e-bdda071cae52 5 2018-05-02T10:27:04.304

[root@esx-vsan-2:~] esxcli vsan cluster get
Cluster Information
    Enabled: true
    Current Local Time: 2018-05-02T12:27:08Z
    Local Node UUID: 00000000-0000-0000-0000-000af7be4a16
    Local Node Type: NORMAL
    Local Node State: MASTER
    Local Node Health State: HEALTHY
    Sub-Cluster Master UUID: 00000000-0000-0000-0000-000af7be4a16
    Sub-Cluster Backup UUID:
    Sub-Cluster UUID: 52544668-9b49-8737-f8db-7c41f895ccd8
    Sub-Cluster Membership Entry Revision: 0
    Sub-Cluster Member Count: 1
    Sub-Cluster Member UUIDs: 00000000-0000-0000-0000-000af7be4a16
    Sub-Cluster Membership UUID: 3cd2e85a-1c9a-7ce7-692f-000af7be4a16
    Unicast Mode Enabled: true
    Maintenance Mode State: OFF
    Config Generation: ae613ac5-ac9e-4058-971e-bdda071cae52 5 2018-05-02T10:27:04.365

[root@esx-vsan-3:~] esxcli vsan cluster get
Cluster Information
    Enabled: true
    Current Local Time: 2018-05-02T12:27:11Z
    Local Node UUID: 00000000-0000-0000-0000-000af7be4bb0
    Local Node Type: NORMAL
    Local Node State: MASTER
    Local Node Health State: HEALTHY
    Sub-Cluster Master UUID: 00000000-0000-0000-0000-000af7be4bb0
    Sub-Cluster Backup UUID:
    Sub-Cluster UUID: 52544668-9b49-8737-f8db-7c41f895ccd8
    Sub-Cluster Membership Entry Revision: 0
    Sub-Cluster Member Count: 1
    Sub-Cluster Member UUIDs: 00000000-0000-0000-0000-000af7be4bb0
    Sub-Cluster Membership UUID: 3cd2e85a-281a-bb75-8556-000af7be4bb0
    Unicast Mode Enabled: true
    Maintenance Mode State: OFF
    Config Generation: ae613ac5-ac9e-4058-971e-bdda071cae52 5 2018-05-02T10:27:04.404
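
The pattern in these three outputs (same Sub-Cluster UUID, but each host only counts itself) can be detected mechanically. Below is a minimal Python sketch that works on saved `esxcli vsan cluster get` text; `parse_cluster_get` and `is_partitioned` are names invented for this illustration, and the sample strings are abbreviated from the output above.

```python
def parse_cluster_get(text):
    """Parse 'Key: value' lines from `esxcli vsan cluster get` output into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:  # skip lines without a colon, such as "Cluster Information"
            info[key.strip()] = value.strip()
    return info


def is_partitioned(outputs):
    """True if all hosts share one Sub-Cluster UUID but do not all see every host."""
    infos = [parse_cluster_get(text) for text in outputs]
    same_cluster = len({i.get("Sub-Cluster UUID") for i in infos}) == 1
    all_see_everyone = all(
        int(i.get("Sub-Cluster Member Count", "0")) == len(infos) for i in infos
    )
    return same_cluster and not all_see_everyone


# Abbreviated from the three outputs in this post.
TEMPLATE = """Cluster Information
    Local Node UUID: {uuid}
    Local Node State: MASTER
    Sub-Cluster UUID: 52544668-9b49-8737-f8db-7c41f895ccd8
    Sub-Cluster Member Count: 1
"""
node1 = TEMPLATE.format(uuid="00000000-0000-0000-0000-000af7be4bba")
node2 = TEMPLATE.format(uuid="00000000-0000-0000-0000-000af7be4a16")
node3 = TEMPLATE.format(uuid="00000000-0000-0000-0000-000af7be4bb0")

print(is_partitioned([node1, node2, node3]))  # prints True
```

In a healthy 3-node cluster each host would report a member count of 3, and only one of them would be MASTER.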

7 Replies
TheBobkin
VMware Employee

Hello rectorat-vers,

How was this cluster set up? (e.g. CLI, PowerCLI, Web Client)

Is the vCenter managing this cluster 6.7 also?

"The unicast agent is also visible from each node. No problem with the firewall either."

You should be able to verify that the hosts can actually communicate over the port used for unicast (UDP 12321). Run the following on the hosts; while a node attempts to join the cluster, you should see traffic:

# tcpdump-uw -i vmk1 -n udp port 12321 &

# esxcli vsan cluster join -u <Sub-Cluster UUID>

If you don't see any inter-node traffic, then something may be blocking it.

This is unlikely to be the cause, but is vCenter set as the source of truth, and are the hosts all in the same vSphere cluster?

# esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates

Bob

rectorat-vers
Contributor

Yes, the vCenter Server is also 6.7 and the vSAN configuration was done through the GUI.

I also checked UDP traffic and it is OK:

[root@esx-vsan-1:~] nc -z -w 1 -s 169.254.200.1 -u 169.254.200.2 12321
[root@esx-vsan-2:~] tcpdump-uw -i vmk1 port 12321
tcpdump-uw: verbose output suppressed, use -v or -vv for full protocol decode
listening on vmk1, link-type EN10MB (Ethernet), capture size 262144 bytes
10:22:18.389966 IP 169.254.200.1.40327 > 169.254.200.2.12321: UDP, length 1
IP 169.254.200.1.40327 > 169.254.200.2.12321: UDP, length 1
2 packets captured
2 packets received by filter
0 packets dropped by kernel

[root@esx-vsan-1:~] nc -z -w 1 -s 169.254.200.1 -u 169.254.200.2 12345
[root@esx-vsan-2:~] tcpdump-uw -i vmk1 port 12345
tcpdump-uw: verbose output suppressed, use -v or -vv for full protocol decode
listening on vmk1, link-type EN10MB (Ethernet), capture size 262144 bytes
10:22:35.750290 IP 169.254.200.1.53870 > 169.254.200.2.12345: UDP, length 1
10:22:35.750656 IP 169.254.200.1.53870 > 169.254.200.2.12345: UDP, length 1
2 packets captured
2 packets received by filter
0 packets dropped by kernel

[root@esx-vsan-1:~] nc -z -w 1 -s 169.254.200.1 -u 169.254.200.2 23451
[root@esx-vsan-2:~] tcpdump-uw -i vmk1 port 23451
tcpdump-uw: verbose output suppressed, use -v or -vv for full protocol decode
listening on vmk1, link-type EN10MB (Ethernet), capture size 262144 bytes
10:22:58.126649 IP 169.254.200.1.19016 > 169.254.200.2.23451: UDP, length 1
10:22:58.129136 IP 169.254.200.1.19016 > 169.254.200.2.23451: UDP, length 1
2 packets captured
2 packets received by filter
0 packets dropped by kernel


But when a node leaves and rejoins the cluster, I don't see any traffic:

[root@esx-vsan-2:~] esxcli vsan cluster leave
[root@esx-vsan-2:~] esxcli vsan cluster join -u 52544668-9b49-8737-f8db-7c41f895ccd8


[root@esx-vsan-1:~] tcpdump-uw -i vmk1 -n udp port 12321
tcpdump-uw: verbose output suppressed, use -v or -vv for full protocol decode
listening on vmk1, link-type EN10MB (Ethernet), capture size 262144 bytes

vCenter is the source of truth and the hosts are in the same cluster.

[root@esx-vsan-2:~] esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates
Value of IgnoreClusterMemberListUpdates is 0

mschubi
Enthusiast

Did you accidentally enable another VMkernel port for vSAN?

rectorat-vers
Contributor

No, I have only one VMkernel interface for vSAN on each node:

2018-05-03_14-18-36.png

When I check the network connections, I don't see anything on port 12321:

[root@esx-vsan-1:~] esxcli network ip connection list | grep 169.254
tcp         0       0  169.254.160.1:427                              0.0.0.0:0             LISTEN        2100930  newreno
tcp         0       0  169.254.5.25:427                               0.0.0.0:0             LISTEN        2100930  newreno
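
The listing above shows only TCP port 427 bound on the 169.254.x.x addresses and nothing on 12321. A minimal Python sketch of that check follows. It assumes the column layout seen in this thread (protocol first, local address in the fourth column), and `udp_local_addresses` is a hypothetical helper name, not part of esxcli.

```python
def udp_local_addresses(listing, port):
    """Return local addresses that have a UDP socket bound to `port`.

    `listing` is saved text from `esxcli network ip connection list`.
    Assumes whitespace-separated columns: proto, recv-q, send-q, local address, ...
    """
    found = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] != "udp":
            continue
        address, _, local_port = fields[3].rpartition(":")
        if local_port == str(port):
            found.append(address)
    return found


# The two lines reported in this post for the 169.254.x.x interfaces.
listing = """
tcp  0  0  169.254.160.1:427  0.0.0.0:0  LISTEN  2100930  newreno
tcp  0  0  169.254.5.25:427   0.0.0.0:0  LISTEN  2100930  newreno
"""

print(udp_local_addresses(listing, 12321))  # prints []
```

An empty result means nothing on the host is bound to that UDP port. That is a separate question from whether packets can reach the interface, which the earlier `nc` and `tcpdump-uw` tests already showed.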

rectorat-vers
Contributor

Eureka! It works!

Simply put: don't use APIPA addresses for your vSAN network.

I changed the vSAN network from 169.254.160.0 to 10.254.160.0 and immediately saw traffic on the UDP port:

[root@esx-vsan-1:~] esxcli network ip connection list | grep 12321
udp         0       0  10.254.160.1:12321                             0.0.0.0:0                           2102161           VSAN_0x4321713395d8_CMMDSProces

[root@esx-vsan-2:~] tcpdump-uw -i vmk1 port 12321
13:23:45.084288 IP 10.254.160.3.12321 > 10.254.160.2.12321: UDP, length 200
13:23:45.262885 IP 10.254.160.1.12321 > 10.254.160.2.12321: UDP, length 200
13:23:45.293362 IP 10.254.160.2.12321 > 10.254.160.1.12321: UDP, length 472
13:23:45.293380 IP 10.254.160.2.12321 > 10.254.160.3.12321: UDP, length 472
13:23:46.084309 IP 10.254.160.3.12321 > 10.254.160.2.12321: UDP, length 200
13:23:46.262896 IP 10.254.160.1.12321 > 10.254.160.2.12321: UDP, length 200
13:23:46.293372 IP 10.254.160.2.12321 > 10.254.160.1.12321: UDP, length 472
13:23:46.293394 IP 10.254.160.2.12321 > 10.254.160.3.12321: UDP, length 472
13:23:47.084296 IP 10.254.160.3.12321 > 10.254.160.2.12321: UDP, length 200
13:23:47.262913 IP 10.254.160.1.12321 > 10.254.160.2.12321: UDP, length 200

Curiously, I have not seen any documentation or KB about this fact.
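
For anyone who wants to check their own addressing plan before building a cluster, here is a minimal Python sketch using the standard `ipaddress` module. It only tests whether an address is link-local (169.254.0.0/16 for IPv4, which is the APIPA range); `link_local_addresses` is a name made up for this example, and the sample addresses are the ones from this thread.

```python
import ipaddress


def link_local_addresses(addresses):
    """Return the addresses that are link-local (169.254.0.0/16 for IPv4)."""
    return [a for a in addresses if ipaddress.ip_address(a).is_link_local]


old_vsan_ips = ["169.254.160.1", "169.254.160.2", "169.254.160.3"]
new_vsan_ips = ["10.254.160.1", "10.254.160.2", "10.254.160.3"]

print(link_local_addresses(old_vsan_ips))  # all three addresses are flagged
print(link_local_addresses(new_vsan_ips))  # prints []
```

Any address this flags is one to avoid for the vSAN VMkernel interface, based on the behaviour described in this thread.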

iliketurbos
Enthusiast

After 4 hours of troubleshooting, this saved my butt. Thanks for reporting back.

Same problem on vSAN 6.7.

TheBobkin
VMware Employee

Hello iliketurbos,

Sorry to hear you struggled with that.

vSAN traffic using APIPA addresses is unsupported and also won't function on any release >6.1. I could have *sworn* we had this documented externally, but I had a look just now in a few places and cannot find it stated. I will see what I can do about getting public documentation sorted.

Bob
