Hello,
I recently installed an all-flash VSAN 6.7 / ESXi 6.7 cluster (3 Dell R630 nodes, PERC H730P) on a Dell S4048-ON switch.
The network configuration of the nodes and the switch is fine: a dedicated 10G interface carries the VSAN traffic, and the nodes ping each other successfully on this link.
[root@esx-vsan-1:~] vmkping -I vmk1 -s 8000 169.254.160.2
PING 169.254.160.2 (169.254.160.2): 8000 data bytes
8008 bytes from 169.254.160.2: icmp_seq=0 ttl=64 time=0.273 ms
8008 bytes from 169.254.160.2: icmp_seq=1 ttl=64 time=0.252 ms
8008 bytes from 169.254.160.2: icmp_seq=2 ttl=64 time=0.210 ms
--- 169.254.160.2 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.210/0.245/0.273 ms
[root@esx-vsan-2:~] vmkping -I vmk1 -s 8000 169.254.160.1
PING 169.254.160.1 (169.254.160.1): 8000 data bytes
8008 bytes from 169.254.160.1: icmp_seq=0 ttl=64 time=0.244 ms
8008 bytes from 169.254.160.1: icmp_seq=1 ttl=64 time=0.227 ms
8008 bytes from 169.254.160.1: icmp_seq=2 ttl=64 time=0.204 ms
--- 169.254.160.1 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.204/0.225/0.244 ms
The unicast agent list is also populated on each node, and there is no firewall problem either.
[root@esx-vsan-1:~] esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness Supports Unicast IP Address    Port  Iface Name
------------------------------------ --------- ---------------- ------------- ----- ----------
00000000-0000-0000-0000-000af7be4a16         0 true             169.254.160.2 12321
00000000-0000-0000-0000-000af7be4bb0         0 true             169.254.160.3 12321
[root@esx-vsan-2:~] esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness Supports Unicast IP Address    Port  Iface Name
------------------------------------ --------- ---------------- ------------- ----- ----------
00000000-0000-0000-0000-000af7be4bba         0 true             169.254.160.1 12321
00000000-0000-0000-0000-000af7be4bb0         0 true             169.254.160.3 12321
[root@esx-vsan-3:~] esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness Supports Unicast IP Address    Port  Iface Name
------------------------------------ --------- ---------------- ------------- ----- ----------
00000000-0000-0000-0000-000af7be4bba         0 true             169.254.160.1 12321
00000000-0000-0000-0000-000af7be4a16         0 true             169.254.160.2 12321
Despite this, the configuration wizard reports a failure on the network configuration test and a partitioning problem.
Indeed, each node reports itself as MASTER of the cluster (with the same Sub-Cluster UUID). I don't understand why; does anyone have an idea?
[root@esx-vsan-1:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2018-05-02T12:26:58Z
Local Node UUID: 00000000-0000-0000-0000-000af7be4bba
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 00000000-0000-0000-0000-000af7be4bba
Sub-Cluster Backup UUID:
Sub-Cluster UUID: 52544668-9b49-8737-f8db-7c41f895ccd8
Sub-Cluster Membership Entry Revision: 0
Sub-Cluster Member Count: 1
Sub-Cluster Member UUIDs: 00000000-0000-0000-0000-000af7be4bba
Sub-Cluster Membership UUID: 39d2e85a-bc79-cda9-d709-000af7be4bba
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: ae613ac5-ac9e-4058-971e-bdda071cae52 5 2018-05-02T10:27:04.304
[root@esx-vsan-2:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2018-05-02T12:27:08Z
Local Node UUID: 00000000-0000-0000-0000-000af7be4a16
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 00000000-0000-0000-0000-000af7be4a16
Sub-Cluster Backup UUID:
Sub-Cluster UUID: 52544668-9b49-8737-f8db-7c41f895ccd8
Sub-Cluster Membership Entry Revision: 0
Sub-Cluster Member Count: 1
Sub-Cluster Member UUIDs: 00000000-0000-0000-0000-000af7be4a16
Sub-Cluster Membership UUID: 3cd2e85a-1c9a-7ce7-692f-000af7be4a16
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: ae613ac5-ac9e-4058-971e-bdda071cae52 5 2018-05-02T10:27:04.365
[root@esx-vsan-3:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2018-05-02T12:27:11Z
Local Node UUID: 00000000-0000-0000-0000-000af7be4bb0
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 00000000-0000-0000-0000-000af7be4bb0
Sub-Cluster Backup UUID:
Sub-Cluster UUID: 52544668-9b49-8737-f8db-7c41f895ccd8
Sub-Cluster Membership Entry Revision: 0
Sub-Cluster Member Count: 1
Sub-Cluster Member UUIDs: 00000000-0000-0000-0000-000af7be4bb0
Sub-Cluster Membership UUID: 3cd2e85a-281a-bb75-8556-000af7be4bb0
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: ae613ac5-ac9e-4058-971e-bdda071cae52 5 2018-05-02T10:27:04.404
Hello rectorat-vers,
How was this cluster set up? (e.g., CLI, PowerCLI, Web Client)
Is the vCenter managing this cluster also 6.7?
"The unicast agent list is also populated on each node, and there is no firewall problem either."
You should be able to verify that the hosts can actually communicate over the port used for unicast (UDP 12321) by running this on the hosts while attempting to join the cluster; you should see traffic:
# tcpdump-uw -i vmk1 -n udp port 12321 &
# esxcli vsan cluster join -u <Sub-Cluster UUID>
If you don't see any inter-node traffic, then something is likely blocking it.
It is unlikely to be implicated here, but do you have vCenter set as the source of truth, and are the hosts all in the same vSphere cluster?
# esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates
Bob
Yes, the vCenter Server is also 6.7, and the VSAN configuration was done through the GUI.
I also checked the UDP traffic, and it is OK:
[root@esx-vsan-1:~] nc -z -w 1 -s 169.254.200.1 -u 169.254.200.2 12321
[root@esx-vsan-2:~] tcpdump-uw -i vmk1 port 12321
tcpdump-uw: verbose output suppressed, use -v or -vv for full protocol decode
listening on vmk1, link-type EN10MB (Ethernet), capture size 262144 bytes
10:22:18.389966 IP 169.254.200.1.40327 > 169.254.200.2.12321: UDP, length 1
2 packets captured
2 packets received by filter
0 packets dropped by kernel
[root@esx-vsan-1:~] nc -z -w 1 -s 169.254.200.1 -u 169.254.200.2 12345
[root@esx-vsan-2:~] tcpdump-uw -i vmk1 port 12345
tcpdump-uw: verbose output suppressed, use -v or -vv for full protocol decode
listening on vmk1, link-type EN10MB (Ethernet), capture size 262144 bytes
10:22:35.750290 IP 169.254.200.1.53870 > 169.254.200.2.12345: UDP, length 1
10:22:35.750656 IP 169.254.200.1.53870 > 169.254.200.2.12345: UDP, length 1
2 packets captured
2 packets received by filter
0 packets dropped by kernel
[root@esx-vsan-1:~] nc -z -w 1 -s 169.254.200.1 -u 169.254.200.2 23451
[root@esx-vsan-2:~] tcpdump-uw -i vmk1 port 23451
tcpdump-uw: verbose output suppressed, use -v or -vv for full protocol decode
listening on vmk1, link-type EN10MB (Ethernet), capture size 262144 bytes
10:22:58.126649 IP 169.254.200.1.19016 > 169.254.200.2.23451: UDP, length 1
10:22:58.129136 IP 169.254.200.1.19016 > 169.254.200.2.23451: UDP, length 1
2 packets captured
2 packets received by filter
0 packets dropped by kernel
But when I make a node leave and rejoin the cluster, I don't see any traffic:
[root@esx-vsan-2:~] esxcli vsan cluster leave
[root@esx-vsan-2:~] esxcli vsan cluster join -u 52544668-9b49-8737-f8db-7c41f895ccd8
[root@esx-vsan-1:~] tcpdump-uw -i vmk1 -n udp port 12321
tcpdump-uw: verbose output suppressed, use -v or -vv for full protocol decode
listening on vmk1, link-type EN10MB (Ethernet), capture size 262144 bytes
vCenter is the primary source of truth, and the hosts are in the same cluster.
[root@esx-vsan-2:~] esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates
Value of IgnoreClusterMemberListUpdates is 0
Did you accidentally enable another VMkernel port for vSAN?
No, I have only one VMkernel interface for VSAN on each node.
When I check the network connections, I don't see anything listening on port 12321:
[root@esx-vsan-1:~] esxcli network ip connection list | grep 169.254
tcp 0 0 169.254.160.1:427 0.0.0.0:0 LISTEN 2100930 newreno
tcp 0 0 169.254.5.25:427 0.0.0.0:0 LISTEN 2100930 newreno
Eureka! It works!!
Simply put: don't use APIPA addresses for your VSAN network.
I changed the subnet from 169.254.160.0 to 10.254.160.0, and I immediately saw traffic on the UDP port:
[root@esx-vsan-1:~] esxcli network ip connection list | grep 12321
udp 0 0 10.254.160.1:12321 0.0.0.0:0 2102161 VSAN_0x4321713395d8_CMMDSProces
[root@esx-vsan-2:~] tcpdump-uw -i vmk1 port 12321
13:23:45.084288 IP 10.254.160.3.12321 > 10.254.160.2.12321: UDP, length 200
13:23:45.262885 IP 10.254.160.1.12321 > 10.254.160.2.12321: UDP, length 200
13:23:45.293362 IP 10.254.160.2.12321 > 10.254.160.1.12321: UDP, length 472
13:23:45.293380 IP 10.254.160.2.12321 > 10.254.160.3.12321: UDP, length 472
13:23:46.084309 IP 10.254.160.3.12321 > 10.254.160.2.12321: UDP, length 200
13:23:46.262896 IP 10.254.160.1.12321 > 10.254.160.2.12321: UDP, length 200
13:23:46.293372 IP 10.254.160.2.12321 > 10.254.160.1.12321: UDP, length 472
13:23:46.293394 IP 10.254.160.2.12321 > 10.254.160.3.12321: UDP, length 472
13:23:47.084296 IP 10.254.160.3.12321 > 10.254.160.2.12321: UDP, length 200
13:23:47.262913 IP 10.254.160.1.12321 > 10.254.160.2.12321: UDP, length 200
Curiously, I have not seen any documentation or KB article about this.
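For reference, the fix is just re-addressing the VSAN VMkernel interface on each host; the equivalent CLI change would be something like this (a minimal sketch, assuming vmk1 is the VSAN interface and using the per-host addresses above):
# esxcli network ip interface ipv4 set -i vmk1 -I 10.254.160.1 -N 255.255.255.0 -t static
Repeat with 10.254.160.2 and 10.254.160.3 on the other two hosts, then re-run the health test.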
After 4 hours of troubleshooting, this saved my butt; thanks for reporting back...
Same problem on vSAN 6.7.
Hello iliketurbos,
Sorry to hear you struggled with that.
vSAN traffic using APIPA addresses is unsupported and also won't function on any release newer than 6.1. I could have *sworn* we had this documented externally, but I looked just now in a few places and cannot find it stated; I will see what I can do about getting public documentation sorted.
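In the meantime, a quick way to check which VMkernel interface is tagged for vSAN traffic and whether it is sitting on an APIPA (169.254.0.0/16) address is something like:
# esxcli vsan network list
# esxcli network ip interface ipv4 get -i vmk1
If the vSAN-tagged vmk shows a 169.254.x.x address, re-IP it onto a proper subnet as described above.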
Bob
I have the same problem after a network incident: my distributed switch configuration was broken by an external orchestrator, and the VSAN nodes became isolated.
vCenter still lives on the VSAN datastore but has died because VSAN is unavailable.
Is it possible to recreate the cluster without vCenter?
[root@esx01:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2021-07-08T08:39:46Z
Local Node UUID: 59c41bd7-2fff-b50c-3516-ac1f6b16cdb4
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 59c41bd7-2fff-b50c-3516-ac1f6b16cdb4
Sub-Cluster Backup UUID:
Sub-Cluster UUID: 52dc2bda-1552-6d58-c502-32641a99e2b9
Sub-Cluster Membership Entry Revision: 0
Sub-Cluster Member Count: 1
Sub-Cluster Member UUIDs: 59c41bd7-2fff-b50c-3516-ac1f6b16cdb4
Sub-Cluster Member HostNames: esx01.tilburgs.eu
Sub-Cluster Membership UUID: e4b0e660-3b5e-1334-c03f-ac1f6b16cdb4
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: d341678b-752c-4b21-9e26-aa17c2797817 490 2021-06-22T06:31:57.810
[root@esx01:~]
[root@esx02:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2021-07-08T08:39:11Z
Local Node UUID: 59c38eb3-4a8c-3490-33e9-ac1f6b16ce7a
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 59c38eb3-4a8c-3490-33e9-ac1f6b16ce7a
Sub-Cluster Backup UUID:
Sub-Cluster UUID: 52dc2bda-1552-6d58-c502-32641a99e2b9
Sub-Cluster Membership Entry Revision: 0
Sub-Cluster Member Count: 1
Sub-Cluster Member UUIDs: 59c38eb3-4a8c-3490-33e9-ac1f6b16ce7a
Sub-Cluster Member HostNames: esx02.tilburgs.eu
Sub-Cluster Membership UUID: bbb0e660-3c71-4337-5927-ac1f6b16ce7a
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: d341678b-752c-4b21-9e26-aa17c2797817 490 2021-06-22T06:31:57.814
[root@esx02:~]
[root@esx04:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2021-07-08T10:30:45Z
Local Node UUID: 5f79f311-d555-53c0-f336-bc4a563e6dca
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 5f79f311-d555-53c0-f336-bc4a563e6dca
Sub-Cluster Backup UUID:
Sub-Cluster UUID: 52dc2bda-1552-6d58-c502-32641a99e2b9
Sub-Cluster Membership Entry Revision: 0
Sub-Cluster Member Count: 1
Sub-Cluster Member UUIDs: 5f79f311-d555-53c0-f336-bc4a563e6dca
Sub-Cluster Member HostNames: esx04.tilburgs.eu
Sub-Cluster Membership UUID: eeb0e660-f23b-a739-0457-bc4a5645bd1a
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: d341678b-752c-4b21-9e26-aa17c2797817 490 2021-06-22T06:31:57.794
[root@esx04:~]
I moved the VSAN interface to a standard vSwitch to get it all running again:
# Create a temporary standard vSwitch and vSAN port group
esxcli network vswitch standard add --vswitch-name=vSwitch0
esxcli network vswitch standard uplink add --uplink-name=vmnic7 --vswitch-name=vSwitch0
esxcli network vswitch standard portgroup add --portgroup-name=tmp-vsan --vswitch-name=vSwitch0
esxcli network vswitch standard portgroup set -p tmp-vsan --vlan-id 55
# Create a new VMkernel interface and tag it for vSAN traffic
esxcli network ip interface add --interface-name=vmk15 --portgroup-name=tmp-vsan
esxcli network ip interface ipv4 set --interface-name=vmk15 --ipv4=10.100.15.12 --netmask=255.255.255.0 --type=static
esxcli vsan network ip add -i vmk15 -T vsan
# Remove the old (broken) vSAN interface and restart host services
esxcli vsan network ip remove -i vmk5
services.sh restart
Does anybody have an idea how to repair this?
@briantilburgs2 "Is it possible to recreate the cluster without vCenter?"
Yes, assuming they can all ping each other on the newly created vSAN IPs (correct MTU end-to-end, VLAN, etc.), you should just need to populate the unicastagent list on each node so that it has all the other nodes' new vSAN IPs:
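A minimal sketch of what that looks like (node UUIDs taken from the 'esxcli vsan cluster get' output above; the new vSAN IPs are placeholders). For example, on esx01 you would add entries for esx02 and esx04:
# esxcli vsan cluster unicastagent add -t node -u 59c38eb3-4a8c-3490-33e9-ac1f6b16ce7a -U true -a <esx02-new-vSAN-IP> -p 12321
# esxcli vsan cluster unicastagent add -t node -u 5f79f311-d555-53c0-f336-bc4a563e6dca -U true -a <esx04-new-vSAN-IP> -p 12321
With vCenter down, you may also want to set /VSAN/IgnoreClusterMemberListUpdates to 1 first (esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates) so the manual entries are not overwritten, then verify with 'esxcli vsan cluster unicastagent list' and confirm that 'esxcli vsan cluster get' shows the full member count.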
Wow, thanks!! That was indeed the problem: the unicast agent list was still filled with the old addresses.
Problem solved.