YanSi
Contributor

vSAN 6.6 Stretched Cluster in a Nested Environment: Network Partition

When the stretched cluster grows to 8 data hosts, a network partition occurs.
All nested hosts run ESXi 6.5 GA, build 5310538 (vSAN 6.6).
I deployed 7 nested ESXi hosts and configured them as a 3 + 3 + 1 vSAN stretched cluster (Site A: 3 hosts, Site B: 3 hosts, plus 1 witness appliance). No network partition occurred.

The yellow exclamation mark on the hosts only appears because I enabled SSH.

[Screenshot: 3 + 3 + 1 stretched cluster, no partition]

Once the vSAN cluster has 8 data hosts (Site A: 4 hosts, Site B: 4 hosts, plus 1 witness appliance), a network partition occurs.

[Screenshots: 4 + 4 + 1 stretched cluster showing the network partition]

About the lab environment:

Site A Hosts
esxi01-a.corp.demo.com
MGMT VMK=192.168.50.1/24 GW=192.168.50.254
VSAN VMK=172.30.0.11/24 GW=172.30.0.1

esxi02-a.corp.demo.com
MGMT VMK=192.168.50.2/24 GW=192.168.50.254
VSAN VMK=172.30.0.12/24 GW=172.30.0.1

esxi03-a.corp.demo.com
MGMT VMK=192.168.50.3/24 GW=192.168.50.254
VSAN VMK=172.30.0.13/24 GW=172.30.0.1

esxi04-a.corp.demo.com
MGMT VMK=192.168.50.4/24 GW=192.168.50.254
VSAN VMK=172.30.0.14/24 GW=172.30.0.1

Before enabling vSAN, add this static route on hosts esxi01-a through esxi04-a:

esxcli network ip route ipv4 add -n 147.80.0.0/24 -g 172.30.0.1

Site B Hosts
esxi01-b.corp.demo.com
MGMT VMK=192.168.50.6/24 GW=192.168.50.254
VSAN VMK=172.30.0.21/24 GW=172.30.0.254

esxi02-b.corp.demo.com
MGMT VMK=192.168.50.7/24 GW=192.168.50.254
VSAN VMK=172.30.0.22/24 GW=172.30.0.254

esxi03-b.corp.demo.com
MGMT VMK=192.168.50.8/24 GW=192.168.50.254
VSAN VMK=172.30.0.23/24 GW=172.30.0.254

esxi04-b.corp.demo.com
MGMT VMK=192.168.50.9/24 GW=192.168.50.254
VSAN VMK=172.30.0.24/24 GW=172.30.0.254

Before enabling vSAN, add this static route on hosts esxi01-b through esxi04-b:

esxcli network ip route ipv4 add -n 147.80.0.0/24 -g 172.30.0.254

Site C Hosts
witness.corp.demo.com
MGMT VMK=192.168.50.250/24 GW=192.168.50.254
VSAN VMK=147.80.0.15/24 GW=147.80.0.1 or 147.80.0.254

Before enabling vSAN, add these static routes on the witness host:

esxcli network ip route ipv4 add -n 172.30.0.11/32 -g 147.80.0.1
esxcli network ip route ipv4 add -n 172.30.0.12/32 -g 147.80.0.1
esxcli network ip route ipv4 add -n 172.30.0.13/32 -g 147.80.0.1
esxcli network ip route ipv4 add -n 172.30.0.14/32 -g 147.80.0.1

esxcli network ip route ipv4 add -n 172.30.0.21/32 -g 147.80.0.254
esxcli network ip route ipv4 add -n 172.30.0.22/32 -g 147.80.0.254
esxcli network ip route ipv4 add -n 172.30.0.23/32 -g 147.80.0.254
esxcli network ip route ipv4 add -n 172.30.0.24/32 -g 147.80.0.254
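Because Site A and Site B share the 172.30.0.0/24 vSAN subnet but use different gateways, the witness cannot rely on a single subnet route and needs one /32 route per data host. A small dry-run loop makes it harder to miss a host when the site size changes. This is only a sketch: `witness_routes` is a hypothetical helper name, and it prints the commands rather than running them.

```sh
#!/bin/sh
# Dry-run sketch: print (do not execute) the witness /32 routes listed above.
# witness_routes is a hypothetical helper; addresses are this lab's values.
witness_routes() {
  for last in 11 12 13 14; do   # Site A vSAN VMKs, reached via 147.80.0.1
    echo "esxcli network ip route ipv4 add -n 172.30.0.${last}/32 -g 147.80.0.1"
  done
  for last in 21 22 23 24; do   # Site B vSAN VMKs, reached via 147.80.0.254
    echo "esxcli network ip route ipv4 add -n 172.30.0.${last}/32 -g 147.80.0.254"
  done
}

witness_routes
```

Adding a host to a site then means adding one more octet to the matching loop.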

This is my lab topology:

[Screenshot: lab topology]

I use 4 Linux virtual machines as static routers, plus 3 vSwitches.

From the witness host, I tested the paths as follows:

Traceroute to the vSAN VMkernel port of every Site A host (expected via the 147.80.0.1 gateway):

[root@witness:~] traceroute -i vmk1 172.30.0.11
traceroute to 172.30.0.11 (172.30.0.11), 30 hops max, 40 byte packets
1  147.80.0.1 (147.80.0.1)  0.174 ms  0.128 ms  0.154 ms
2  10.10.0.1 (10.10.0.1)  0.275 ms  0.241 ms  0.162 ms
3  172.30.0.11 (172.30.0.11)  0.420 ms  0.549 ms  0.280 ms

[root@witness:~] traceroute -i vmk1 172.30.0.12
traceroute to 172.30.0.12 (172.30.0.12), 30 hops max, 40 byte packets
1  147.80.0.1 (147.80.0.1)  0.129 ms  0.077 ms  0.102 ms
2  10.10.0.1 (10.10.0.1)  0.181 ms  0.100 ms  0.094 ms
3  172.30.0.12 (172.30.0.12)  0.324 ms  0.391 ms  0.343 ms

[root@witness:~] traceroute -i vmk1 172.30.0.13
traceroute to 172.30.0.13 (172.30.0.13), 30 hops max, 40 byte packets
1  * 147.80.0.1 (147.80.0.1)  0.250 ms  0.149 ms
2  10.10.0.1 (10.10.0.1)  0.166 ms  0.163 ms  0.124 ms
3  172.30.0.13 (172.30.0.13)  0.577 ms  0.808 ms  0.344 ms

[root@witness:~] traceroute -i vmk1 172.30.0.14
traceroute to 172.30.0.14 (172.30.0.14), 30 hops max, 40 byte packets
1  147.80.0.1 (147.80.0.1)  0.179 ms  0.068 ms  0.149 ms
2  10.10.0.1 (10.10.0.1)  0.168 ms  0.199 ms *
3  172.30.0.14 (172.30.0.14)  0.493 ms  0.378 ms  0.310 ms

Traceroute to the vSAN VMkernel port of every Site B host (expected via the 147.80.0.254 gateway):

[root@witness:~] traceroute -i vmk1 172.30.0.21
traceroute to 172.30.0.21 (172.30.0.21), 30 hops max, 40 byte packets
1  147.80.0.254 (147.80.0.254)  0.197 ms  0.151 ms  0.075 ms
2  10.10.0.254 (10.10.0.254)  0.241 ms  0.248 ms  0.147 ms
3  172.30.0.21 (172.30.0.21)  0.433 ms  0.489 ms  0.305 ms

[root@witness:~] traceroute -i vmk1 172.30.0.22
traceroute to 172.30.0.22 (172.30.0.22), 30 hops max, 40 byte packets
1  147.80.0.254 (147.80.0.254)  0.129 ms  0.118 ms  0.080 ms
2  10.10.0.254 (10.10.0.254)  0.163 ms  0.130 ms  0.119 ms
3  172.30.0.22 (172.30.0.22)  0.388 ms  0.467 ms  0.323 ms

[root@witness:~] traceroute -i vmk1 172.30.0.23
traceroute to 172.30.0.23 (172.30.0.23), 30 hops max, 40 byte packets
1  * 147.80.0.254 (147.80.0.254)  0.200 ms  0.075 ms
2  10.10.0.254 (10.10.0.254)  0.194 ms  0.170 ms  0.131 ms
3  172.30.0.23 (172.30.0.23)  0.359 ms  0.623 ms  0.420 ms

[root@witness:~] traceroute -i vmk1 172.30.0.24
traceroute to 172.30.0.24 (172.30.0.24), 30 hops max, 40 byte packets
1  147.80.0.254 (147.80.0.254)  0.227 ms  0.163 ms  0.151 ms
2  10.10.0.254 (10.10.0.254)  0.203 ms  0.185 ms  0.190 ms
3  172.30.0.24 (172.30.0.24)  0.384 ms  0.363 ms  0.291 ms

[root@witness:~]

How can I solve this network partition problem?

If it cannot be solved, I will not be able to test vSAN RAID-5/6 storage policies.

Thanks for all the help :)

AishR
VMware Employee

During network partition, components in the active site appear to be absent.

During a network partition in a vSAN 2 host or stretched cluster, the vSphere Web Client might display a view of the cluster from the perspective of the non-active site. You might see active components in the primary site displayed as absent.

Workaround: Use RVC commands to query the state of objects in the cluster.

YanSi
Contributor

I don't understand why the esxi04-b host became master.

/localhost/Datacenter/computers> vsan.cluster_info 0
2017-07-15 00:45:58 +0800: Fetching host info from esxi01-a.corp.demo.com (may take a moment) ...
2017-07-15 00:45:58 +0800: Fetching host info from esxi02-b.corp.demo.com (may take a moment) ...
2017-07-15 00:45:58 +0800: Fetching host info from esxi03-b.corp.demo.com (may take a moment) ...
2017-07-15 00:45:58 +0800: Fetching host info from esxi04-b.corp.demo.com (may take a moment) ...
2017-07-15 00:45:58 +0800: Fetching host info from esxi03-a.corp.demo.com (may take a moment) ...
2017-07-15 00:45:58 +0800: Fetching host info from esxi01-b.corp.demo.com (may take a moment) ...
2017-07-15 00:45:58 +0800: Fetching host info from witness.corp.demo.com (may take a moment) ...
2017-07-15 00:45:58 +0800: Fetching host info from esxi04-a.corp.demo.com (may take a moment) ...
2017-07-15 00:45:58 +0800: Fetching host info from esxi02-a.corp.demo.com (may take a moment) ...
Host: esxi01-a.corp.demo.com
  Product: VMware ESXi 6.5.0 build-5310538
  vSAN enabled: yes
  Cluster info:
    Cluster role: master
    Cluster UUID: 52576d7e-8094-d951-745b-885a36c778dd
    Node UUID: 596302bf-1a29-4839-97e4-000c29c1488d
    Member UUIDs: ["596302c3-1d16-741b-cf50-000c2902b1ce", "596302ce-5c6c-2e35-df9b-000c299b5777", "596302bf-1a29-4839-97e4-000c29c1488d", "596302cb-5fd0-745f-9d9b-000c299654c0", "596302c9-137c-5bca-e942-000c2992f9eb", "596302ce-7cb1-b0fd-3d36-000c29ef1351", "596302d5-90f8-92e6-be9b-000c296de5aa", "596302ce-dfe2-ddb3-1768-000c29ddb5e5"] (8)
  Node evacuated: no
  Storage info:
    Auto claim: no
    Disk Mappings:
      SSD: Local VMware Disk (mpx.vmhba1:C0:T1:L0) - 40 GB, v5
      MD: Local VMware Disk (mpx.vmhba1:C0:T2:L0) - 60 GB, v5
      MD: Local VMware Disk (mpx.vmhba1:C0:T4:L0) - 60 GB, v5
      MD: Local VMware Disk (mpx.vmhba1:C0:T3:L0) - 60 GB, v5
  FaultDomainInfo:
    Preferred
  NetworkInfo:
    Adapter: vmk1 (172.30.0.11)

Host: esxi02-a.corp.demo.com
  Product: VMware ESXi 6.5.0 build-5310538
  vSAN enabled: yes
  Cluster info:
    Cluster role: agent
    Cluster UUID: 52576d7e-8094-d951-745b-885a36c778dd
    Node UUID: 596302c9-137c-5bca-e942-000c2992f9eb
    Member UUIDs: ["596302c3-1d16-741b-cf50-000c2902b1ce", "596302ce-5c6c-2e35-df9b-000c299b5777", "596302bf-1a29-4839-97e4-000c29c1488d", "596302cb-5fd0-745f-9d9b-000c299654c0", "596302c9-137c-5bca-e942-000c2992f9eb", "596302ce-7cb1-b0fd-3d36-000c29ef1351", "596302d5-90f8-92e6-be9b-000c296de5aa", "596302ce-dfe2-ddb3-1768-000c29ddb5e5"] (8)
  Node evacuated: no
  Storage info:
    Auto claim: no
    Disk Mappings:
      SSD: Local VMware Disk (mpx.vmhba1:C0:T1:L0) - 40 GB, v5
      MD: Local VMware Disk (mpx.vmhba1:C0:T2:L0) - 60 GB, v5
      MD: Local VMware Disk (mpx.vmhba1:C0:T4:L0) - 60 GB, v5
      MD: Local VMware Disk (mpx.vmhba1:C0:T3:L0) - 60 GB, v5
  FaultDomainInfo:
    Preferred
  NetworkInfo:
    Adapter: vmk1 (172.30.0.12)

Host: esxi03-a.corp.demo.com
  Product: VMware ESXi 6.5.0 build-5310538
  vSAN enabled: yes
  Cluster info:
    Cluster role: agent
    Cluster UUID: 52576d7e-8094-d951-745b-885a36c778dd
    Node UUID: 596302ce-5c6c-2e35-df9b-000c299b5777
    Member UUIDs: ["596302c3-1d16-741b-cf50-000c2902b1ce", "596302ce-5c6c-2e35-df9b-000c299b5777", "596302bf-1a29-4839-97e4-000c29c1488d", "596302cb-5fd0-745f-9d9b-000c299654c0", "596302c9-137c-5bca-e942-000c2992f9eb", "596302ce-7cb1-b0fd-3d36-000c29ef1351", "596302d5-90f8-92e6-be9b-000c296de5aa", "596302ce-dfe2-ddb3-1768-000c29ddb5e5"] (8)
  Node evacuated: no
  Storage info:
    Auto claim: no
    Disk Mappings:
      SSD: Local VMware Disk (mpx.vmhba1:C0:T1:L0) - 40 GB, v5
      MD: Local VMware Disk (mpx.vmhba1:C0:T2:L0) - 60 GB, v5
      MD: Local VMware Disk (mpx.vmhba1:C0:T4:L0) - 60 GB, v5
      MD: Local VMware Disk (mpx.vmhba1:C0:T3:L0) - 60 GB, v5
  FaultDomainInfo:
    Preferred
  NetworkInfo:
    Adapter: vmk1 (172.30.0.13)

Host: esxi01-b.corp.demo.com
  Product: VMware ESXi 6.5.0 build-5310538
  vSAN enabled: yes
  Cluster info:
    Cluster role: backup
    Cluster UUID: 52576d7e-8094-d951-745b-885a36c778dd
    Node UUID: 596302c3-1d16-741b-cf50-000c2902b1ce
    Member UUIDs: ["596302c3-1d16-741b-cf50-000c2902b1ce", "596302ce-5c6c-2e35-df9b-000c299b5777", "596302bf-1a29-4839-97e4-000c29c1488d", "596302cb-5fd0-745f-9d9b-000c299654c0", "596302c9-137c-5bca-e942-000c2992f9eb", "596302ce-7cb1-b0fd-3d36-000c29ef1351", "596302d5-90f8-92e6-be9b-000c296de5aa", "596302ce-dfe2-ddb3-1768-000c29ddb5e5"] (8)
  Node evacuated: no
  Storage info:
    Auto claim: no
    Disk Mappings:
      SSD: Local VMware Disk (mpx.vmhba1:C0:T1:L0) - 40 GB, v5
      MD: Local VMware Disk (mpx.vmhba1:C0:T2:L0) - 60 GB, v5
      MD: Local VMware Disk (mpx.vmhba1:C0:T4:L0) - 60 GB, v5
      MD: Local VMware Disk (mpx.vmhba1:C0:T3:L0) - 60 GB, v5
  FaultDomainInfo:
    Secondary
  NetworkInfo:
    Adapter: vmk1 (172.30.0.21)

Host: esxi02-b.corp.demo.com
  Product: VMware ESXi 6.5.0 build-5310538
  vSAN enabled: yes
  Cluster info:
    Cluster role: agent
    Cluster UUID: 52576d7e-8094-d951-745b-885a36c778dd
    Node UUID: 596302cb-5fd0-745f-9d9b-000c299654c0
    Member UUIDs: ["596302c3-1d16-741b-cf50-000c2902b1ce", "596302ce-5c6c-2e35-df9b-000c299b5777", "596302bf-1a29-4839-97e4-000c29c1488d", "596302cb-5fd0-745f-9d9b-000c299654c0", "596302c9-137c-5bca-e942-000c2992f9eb", "596302ce-7cb1-b0fd-3d36-000c29ef1351", "596302d5-90f8-92e6-be9b-000c296de5aa", "596302ce-dfe2-ddb3-1768-000c29ddb5e5"] (8)
  Node evacuated: no
  Storage info:
    Auto claim: no
    Disk Mappings:
      SSD: Local VMware Disk (mpx.vmhba1:C0:T1:L0) - 40 GB, v5
      MD: Local VMware Disk (mpx.vmhba1:C0:T2:L0) - 60 GB, v5
      MD: Local VMware Disk (mpx.vmhba1:C0:T4:L0) - 60 GB, v5
      MD: Local VMware Disk (mpx.vmhba1:C0:T3:L0) - 60 GB, v5
  FaultDomainInfo:
    Secondary
  NetworkInfo:
    Adapter: vmk1 (172.30.0.22)

Host: esxi03-b.corp.demo.com
  Product: VMware ESXi 6.5.0 build-5310538
  vSAN enabled: yes
  Cluster info:
    Cluster role: agent
    Cluster UUID: 52576d7e-8094-d951-745b-885a36c778dd
    Node UUID: 596302ce-7cb1-b0fd-3d36-000c29ef1351
    Member UUIDs: ["596302c3-1d16-741b-cf50-000c2902b1ce", "596302ce-5c6c-2e35-df9b-000c299b5777", "596302bf-1a29-4839-97e4-000c29c1488d", "596302cb-5fd0-745f-9d9b-000c299654c0", "596302c9-137c-5bca-e942-000c2992f9eb", "596302ce-7cb1-b0fd-3d36-000c29ef1351", "596302d5-90f8-92e6-be9b-000c296de5aa", "596302ce-dfe2-ddb3-1768-000c29ddb5e5"] (8)
  Node evacuated: no
  Storage info:
    Auto claim: no
    Disk Mappings:
      SSD: Local VMware Disk (mpx.vmhba1:C0:T1:L0) - 40 GB, v5
      MD: Local VMware Disk (mpx.vmhba1:C0:T2:L0) - 60 GB, v5
      MD: Local VMware Disk (mpx.vmhba1:C0:T4:L0) - 60 GB, v5
      MD: Local VMware Disk (mpx.vmhba1:C0:T3:L0) - 60 GB, v5
  FaultDomainInfo:
    Secondary
  NetworkInfo:
    Adapter: vmk1 (172.30.0.23)

Host: witness.corp.demo.com
  Product: VMware ESXi 6.5.0 build-5310538
  vSAN enabled: yes
  Cluster info:
    Cluster role: agent
    Cluster UUID: 52576d7e-8094-d951-745b-885a36c778dd
    Node UUID: 596302d5-90f8-92e6-be9b-000c296de5aa
    Node Type: Witness
    Member UUIDs: ["596302c3-1d16-741b-cf50-000c2902b1ce", "596302ce-5c6c-2e35-df9b-000c299b5777", "596302bf-1a29-4839-97e4-000c29c1488d", "596302cb-5fd0-745f-9d9b-000c299654c0", "596302c9-137c-5bca-e942-000c2992f9eb", "596302ce-7cb1-b0fd-3d36-000c29ef1351", "596302d5-90f8-92e6-be9b-000c296de5aa", "596302ce-dfe2-ddb3-1768-000c29ddb5e5"] (8)
  Node evacuated: no
  Storage info:
    Auto claim: no
    Disk Mappings:
      SSD: Local VMware Disk (mpx.vmhba1:C0:T1:L0) - 40 GB, v5
      MD: Local VMware Disk (mpx.vmhba1:C0:T2:L0) - 60 GB, v5
      MD: Local VMware Disk (mpx.vmhba1:C0:T4:L0) - 60 GB, v5
      MD: Local VMware Disk (mpx.vmhba1:C0:T3:L0) - 60 GB, v5
  FaultDomainInfo:
    Not configured
  NetworkInfo:
    Adapter: vmk1 (147.80.0.10)

Host: esxi04-a.corp.demo.com
  Product: VMware ESXi 6.5.0 build-5310538
  vSAN enabled: yes
  Cluster info:
    Cluster role: agent
    Cluster UUID: 52576d7e-8094-d951-745b-885a36c778dd
    Node UUID: 596302ce-dfe2-ddb3-1768-000c29ddb5e5
    Member UUIDs: ["596302c3-1d16-741b-cf50-000c2902b1ce", "596302ce-5c6c-2e35-df9b-000c299b5777", "596302bf-1a29-4839-97e4-000c29c1488d", "596302cb-5fd0-745f-9d9b-000c299654c0", "596302c9-137c-5bca-e942-000c2992f9eb", "596302ce-7cb1-b0fd-3d36-000c29ef1351", "596302d5-90f8-92e6-be9b-000c296de5aa", "596302ce-dfe2-ddb3-1768-000c29ddb5e5"] (8)
  Node evacuated: no
  Storage info:
    Auto claim: no
    Disk Mappings:
      SSD: Local VMware Disk (mpx.vmhba1:C0:T1:L0) - 40 GB, v5
      MD: Local VMware Disk (mpx.vmhba1:C0:T2:L0) - 60 GB, v5
      MD: Local VMware Disk (mpx.vmhba1:C0:T4:L0) - 60 GB, v5
      MD: Local VMware Disk (mpx.vmhba1:C0:T3:L0) - 60 GB, v5
  FaultDomainInfo:
    Not configured
  NetworkInfo:
    Adapter: vmk1 (172.30.0.14)

Host: esxi04-b.corp.demo.com
  Product: VMware ESXi 6.5.0 build-5310538
  vSAN enabled: yes
  Cluster info:
    Cluster role: master
    Cluster UUID: 52576d7e-8094-d951-745b-885a36c778dd
    Node UUID: 596302ce-5a99-6494-aa70-000c29302a2c
    Member UUIDs: ["596302ce-5a99-6494-aa70-000c29302a2c"] (1)
  Node evacuated: no
  Storage info:
    Auto claim: no
    Disk Mappings:
      SSD: Local VMware Disk (mpx.vmhba1:C0:T1:L0) - 40 GB, v5
      MD: Local VMware Disk (mpx.vmhba1:C0:T2:L0) - 60 GB, v5
      MD: Local VMware Disk (mpx.vmhba1:C0:T4:L0) - 60 GB, v5
      MD: Local VMware Disk (mpx.vmhba1:C0:T3:L0) - 60 GB, v5
  FaultDomainInfo:
    Not configured
  NetworkInfo:
    Adapter: vmk1 (172.30.0.24)


Cluster has fault domains configured:
Preferred: esxi01-a.corp.demo.com, esxi02-a.corp.demo.com, esxi03-a.corp.demo.com
Secondary: esxi01-b.corp.demo.com, esxi02-b.corp.demo.com, esxi03-b.corp.demo.com
Not Configured: esxi04-a.corp.demo.com, esxi04-b.corp.demo.com
Preferred fault domain: Preferred
Preferred fault domain UUID: a054ccb4-ff68-4c73-cbc2-d272d45e32df
/localhost/Datacenter/computers>

TheBobkin
VMware Employee

Hello YanSi,

It is master because it is isolated from the rest of the cluster, so it is the only cluster member it knows of.

Can you show the output of 'esxcli vsan cluster get' from a host in the main partition and from the isolated one?

Also show the output of 'esxcli vsan cluster unicastagent list' from both.
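A quick way to compare the `esxcli vsan cluster get` output from each host is to pull out the "Sub-Cluster Member Count" field and check it against the number of nodes you expect; for this lab that would be 8 data hosts plus the witness, i.e. 9. The sketch below assumes the field layout shown elsewhere in this thread, and `check_membership` is a hypothetical helper, not an esxcli command.

```sh
#!/bin/sh
# Sketch: read `esxcli vsan cluster get` output on stdin and compare
# "Sub-Cluster Member Count" with the expected number of nodes ($1).
# check_membership is a hypothetical helper, not part of esxcli.
check_membership() {
  expected="$1"
  # Split "   Sub-Cluster Member Count: 8" on ": " and keep the value.
  count=$(awk -F': *' '/Sub-Cluster Member Count/ { print $2 }')
  if [ -z "$count" ]; then
    echo "UNKNOWN: no member count found"
    return 2
  fi
  if [ "$count" -eq "$expected" ]; then
    echo "OK: $count of $expected members"
  else
    echo "PARTITIONED: $count of $expected members"
    return 1
  fi
}

# Usage on a host (8 data hosts + 1 witness = 9 expected):
#   esxcli vsan cluster get | check_membership 9
```

Running it on every host shows at a glance which ones see a smaller sub-cluster than the rest.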

If the unicastagent list is blank or has null entries, you should only need to add unicast agent entries for all the other nodes on the isolated node:

e.g. esxcli vsan cluster unicastagent add -i vmk1 -a 172.30.0.11 -t node -U 1

(Sometimes you also need to specify -u <Local Node UUID> of each of the other hosts.)
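When several entries are missing, typing each add command by hand is error-prone. The dry-run sketch below takes "NODE_UUID IP" pairs for the data nodes plus the local node UUID and prints one add command per other node, using the `-i vmk1 ... -t node -U 1` form from the example above together with `-u`. `peer_add_cmds` is a hypothetical helper; it does not handle the witness entry and does not check what is already in the list.

```sh
#!/bin/sh
# Dry-run sketch: stdin holds "NODE_UUID IP" pairs for every data node,
# $1 is the local node UUID. Prints one unicastagent add command per
# other node. peer_add_cmds is a hypothetical helper; the witness entry
# (IsWitness 1) is not handled here.
peer_add_cmds() {
  local_uuid="$1"
  while read -r uuid ip; do
    [ -z "$uuid" ] && continue               # skip blank lines
    [ "$uuid" = "$local_uuid" ] && continue  # a host needs no entry for itself
    echo "esxcli vsan cluster unicastagent add -i vmk1 -a $ip -t node -U 1 -u $uuid"
  done
}

# Usage (example pairs from this thread, run on esxi03-a):
#   printf '%s\n' \
#     "596302bf-1a29-4839-97e4-000c29c1488d 172.30.0.11" \
#     "596302ce-5c6c-2e35-df9b-000c299b5777 172.30.0.13" |
#     peer_add_cmds 596302ce-5c6c-2e35-df9b-000c299b5777
```

Printing instead of executing lets you compare the generated lines with the existing `unicastagent list` before running anything.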

http://pubs.vmware.com/vsphere-6-0/index.jsp?topic=%2Fcom.vmware.vcli.ref.doc%2Fesxcli_vsan.html

Let me know if this is the case or if something else is at fault here.

Bob

YanSi
Contributor

Hi Bob

Output from a host in the main partition:
[root@esxi01-a:~] esxcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2017-07-16T16:27:00Z
   Local Node UUID: 596302bf-1a29-4839-97e4-000c29c1488d
   Local Node Type: NORMAL
   Local Node State: MASTER
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 596302bf-1a29-4839-97e4-000c29c1488d
   Sub-Cluster Backup UUID: 596302c3-1d16-741b-cf50-000c2902b1ce
   Sub-Cluster UUID: 5272ac4f-8b03-fe07-8f56-813ebab4d8c5
   Sub-Cluster Membership Entry Revision: 7
   Sub-Cluster Member Count: 8
   Sub-Cluster Member UUIDs: 596302c3-1d16-741b-cf50-000c2902b1ce, 596302bf-1a29-4839-97e4-000c29c1488d, 596302cb-5fd0-745f-9d9b-000c299654c0, 596302ce-5a99-6494-aa70-000c29302a2c, 596302ce-dfe2-ddb3-1768-000c29ddb5e5, 596302c9-137c-5bca-e942-000c2992f9eb, 596302ce-7cb1-b0fd-3d36-000c29ef1351, 596302d5-90f8-92e6-be9b-000c296de5aa
   Sub-Cluster Membership UUID: a68c6b59-0e79-f9a6-bbc0-000c29c1488d
   Unicast Mode Enabled: true
   Maintenance Mode State: OFF
[root@esxi01-a:~] esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness  Supports Unicast  IP Address    Port  Iface Name
------------------------------------  ---------  ----------------  -----------  -----  ----------
00000000-0000-0000-0000-000000000000          1              true  147.80.0.10  12321
596302ce-5c6c-2e35-df9b-000c299b5777          0              true  172.30.0.13  12321
596302c9-137c-5bca-e942-000c2992f9eb          0              true  172.30.0.12  12321
596302cb-5fd0-745f-9d9b-000c299654c0          0              true  172.30.0.22  12321
596302c3-1d16-741b-cf50-000c2902b1ce          0              true  172.30.0.21  12321
596302ce-5a99-6494-aa70-000c29302a2c          0              true  172.30.0.24  12321
596302ce-dfe2-ddb3-1768-000c29ddb5e5          0              true  172.30.0.14  12321
596302ce-7cb1-b0fd-3d36-000c29ef1351          0              true  172.30.0.23  12321
[root@esxi01-a:~]


Output from the host in the isolated partition:
[root@esxi03-a:~] esxcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2017-07-16T16:27:04Z
   Local Node UUID: 596302ce-5c6c-2e35-df9b-000c299b5777
   Local Node Type: NORMAL
   Local Node State: MASTER
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 596302ce-5c6c-2e35-df9b-000c299b5777
   Sub-Cluster Backup UUID:
   Sub-Cluster UUID: 5272ac4f-8b03-fe07-8f56-813ebab4d8c5
   Sub-Cluster Membership Entry Revision: 2
   Sub-Cluster Member Count: 1
   Sub-Cluster Member UUIDs: 596302ce-5c6c-2e35-df9b-000c299b5777
   Sub-Cluster Membership UUID: 69936b59-fd9b-5775-475d-000c299b5777
   Unicast Mode Enabled: true
   Maintenance Mode State: OFF
[root@esxi03-a:~] esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness  Supports Unicast  IP Address    Port  Iface Name
------------------------------------  ---------  ----------------  -----------  -----  ----------
00000000-0000-0000-0000-000000000000          1              true  147.80.0.10  12321
596302c3-1d16-741b-cf50-000c2902b1ce          0              true  172.30.0.21  12321
596302c9-137c-5bca-e942-000c2992f9eb          0              true  172.30.0.12  12321
596302cb-5fd0-745f-9d9b-000c299654c0          0              true  172.30.0.22  12321
596302ce-5a99-6494-aa70-000c29302a2c          0              true  172.30.0.24  12321
596302ce-dfe2-ddb3-1768-000c29ddb5e5          0              true  172.30.0.14  12321
596302ce-7cb1-b0fd-3d36-000c29ef1351          0              true  172.30.0.23  12321
[root@esxi03-a:~]

I'm not sure how to fix this. Here is what I tried:

[root@esxi03-a:~] esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness  Supports Unicast  IP Address    Port  Iface Name
------------------------------------  ---------  ----------------  -----------  -----  ----------
00000000-0000-0000-0000-000000000000          1              true  147.80.0.10  12321
596302c3-1d16-741b-cf50-000c2902b1ce          0              true  172.30.0.21  12321
596302c9-137c-5bca-e942-000c2992f9eb          0              true  172.30.0.12  12321
596302cb-5fd0-745f-9d9b-000c299654c0          0              true  172.30.0.22  12321
596302ce-5a99-6494-aa70-000c29302a2c          0              true  172.30.0.24  12321
596302ce-dfe2-ddb3-1768-000c29ddb5e5          0              true  172.30.0.14  12321
596302ce-7cb1-b0fd-3d36-000c29ef1351          0              true  172.30.0.23  12321
[root@esxi03-a:~] esxcli vsan cluster unicastagent add -a 172.30.0.11 -t node -U 0 -u 596302bf-1a29-4839-97e4-000c29c1488d
[root@esxi03-a:~] esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness  Supports Unicast  IP Address    Port  Iface Name
------------------------------------  ---------  ----------------  -----------  -----  ----------
00000000-0000-0000-0000-000000000000          1              true  147.80.0.10  12321
596302c3-1d16-741b-cf50-000c2902b1ce          0              true  172.30.0.21  12321
596302c9-137c-5bca-e942-000c2992f9eb          0              true  172.30.0.12  12321
596302cb-5fd0-745f-9d9b-000c299654c0          0              true  172.30.0.22  12321
596302ce-5a99-6494-aa70-000c29302a2c          0              true  172.30.0.24  12321
596302ce-dfe2-ddb3-1768-000c29ddb5e5          0              true  172.30.0.14  12321
596302ce-7cb1-b0fd-3d36-000c29ef1351          0              true  172.30.0.23  12321
596302bf-1a29-4839-97e4-000c29c1488d          0             false  172.30.0.11  12321
[root@esxi03-a:~]

What does this mean?

Thanks for your help.

TheBobkin
VMware Employee

Hello,

This is esxi03-a, but it was esxi04-b that was isolated before. Did you get that host back into the cluster normally?

Anyway, you added esxi01-a (172.30.0.11) incorrectly: you specified -U 0, which means Supports Unicast = false.

Remove the current entry for 172.30.0.11 and add it back with -U 1 (which specifies Supports Unicast = true).
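The same fix can be prepared as a dry run: scan the `unicastagent list` output for data-node entries (IsWitness 0) whose Supports Unicast column is `false`, and print a remove and re-add pair with `-U 1`. This sketch assumes the column order shown in this thread and that `esxcli vsan cluster unicastagent remove -a <IP>` is available on this build (check `esxcli vsan cluster unicastagent remove --help` first); `fix_non_unicast` is a hypothetical helper.

```sh
#!/bin/sh
# Dry-run sketch: read `esxcli vsan cluster unicastagent list` on stdin and,
# for every data-node entry (IsWitness 0) whose "Supports Unicast" column
# is false, print a remove + re-add pair with -U 1.
# Assumes columns: NodeUuid IsWitness SupportsUnicast IP Port [Iface].
# fix_non_unicast is a hypothetical helper; the remove syntax is assumed.
fix_non_unicast() {
  awk '$2 == "0" && $3 == "false" {
    print "esxcli vsan cluster unicastagent remove -a " $4
    print "esxcli vsan cluster unicastagent add -a " $4 " -t node -U 1 -u " $1
  }'
}

# Usage on a host:
#   esxcli vsan cluster unicastagent list | fix_non_unicast
```

Header and separator lines never match because their second column is not "0".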

Bob

YanSi
Contributor

[root@esxi03-a:~] esxcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2017-07-16T19:47:26Z
   Local Node UUID: 596302ce-5c6c-2e35-df9b-000c299b5777
   Local Node Type: NORMAL
   Local Node State: MASTER
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 596302ce-5c6c-2e35-df9b-000c299b5777
   Sub-Cluster Backup UUID:
   Sub-Cluster UUID: 52b0d964-ca50-1969-02ca-0e527e56508b
   Sub-Cluster Membership Entry Revision: 2
   Sub-Cluster Member Count: 1
   Sub-Cluster Member UUIDs: 596302ce-5c6c-2e35-df9b-000c299b5777
   Sub-Cluster Membership UUID: f2c16b59-b9a0-1d63-05e2-000c299b5777
   Unicast Mode Enabled: true
   Maintenance Mode State: OFF
[root@esxi03-a:~] esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness  Supports Unicast  IP Address    Port  Iface Name
------------------------------------  ---------  ----------------  -----------  -----  ----------
00000000-0000-0000-0000-000000000000          1              true  147.80.0.10  12321
596302c9-137c-5bca-e942-000c2992f9eb          0              true  172.30.0.12  12321
596302c3-1d16-741b-cf50-000c2902b1ce          0              true  172.30.0.21  12321
596302ce-dfe2-ddb3-1768-000c29ddb5e5          0              true  172.30.0.14  12321
596302cb-5fd0-745f-9d9b-000c299654c0          0              true  172.30.0.22  12321
596302ce-7cb1-b0fd-3d36-000c29ef1351          0              true  172.30.0.23  12321
596302ce-5a99-6494-aa70-000c29302a2c          0              true  172.30.0.24  12321
596302bf-1a29-4839-97e4-000c29c1488d          0              true  172.30.0.11  12321
596302ce-5c6c-2e35-df9b-000c299b5777          0              true  172.30.0.13  12321
[root@esxi03-a:~]

But the role did not change.

TheBobkin
VMware Employee

Hello,

Hosts do not need an entry for the local node (themselves) in the unicastagent list, so remove that entry. Then try leaving and rejoining the cluster; it looks like you have the wrong Sub-Cluster UUID:

# esxcli vsan cluster leave

# esxcli vsan cluster join -u 5272ac4f-8b03-fe07-8f56-813ebab4d8c5  (double-check that this is the current Sub-Cluster UUID of the non-isolated hosts)
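Both checks Bob mentions can be scripted before running leave/join: whether the host lists its own UUID in its unicast agent list, and which Sub-Cluster UUID it reports, so you can compare it with a non-isolated host. This is a sketch that only parses the text output shown in this thread; `self_entry_ip` and `subcluster_uuid` are hypothetical helper names.

```sh
#!/bin/sh
# Sketch of two checks before leave/join (hypothetical helper names).

# Print the IP of any unicast agent entry whose NodeUuid equals $1
# (the local node UUID). Stdin: `esxcli vsan cluster unicastagent list`.
self_entry_ip() {
  awk -v me="$1" '$1 == me { print $4 }'
}

# Print the "Sub-Cluster UUID" value. Stdin: `esxcli vsan cluster get`.
# The pattern does not match "Sub-Cluster Member UUIDs" or the
# Master/Backup/Membership UUID lines.
subcluster_uuid() {
  awk -F': *' '/Sub-Cluster UUID:/ { print $2 }'
}

# Possible wiring on a host (not executed here):
#   me=$(esxcli vsan cluster get | awk -F': *' '/Local Node UUID/ { print $2 }')
#   ip=$(esxcli vsan cluster unicastagent list | self_entry_ip "$me")
#   [ -n "$ip" ] && echo "self entry found: $ip"
#   esxcli vsan cluster get | subcluster_uuid   # compare across hosts
```

On esxi03-a's output above, the first check would flag 172.30.0.13 and the second would print 52b0d964-ca50-1969-02ca-0e527e56508b, which differs from the main partition's 5272ac4f-8b03-fe07-8f56-813ebab4d8c5.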

Bob

YanSi
Contributor

I think this is caused by the cluster's communication mechanism: as soon as an eighth host joins, one host is randomly isolated.

YanSi
Contributor

Perhaps the text is not clear enough, so here are some screenshots.

This screenshot shows 7 hosts with no network partition problem.

[Screenshot: 7 hosts.png]

When adding an eighth host, a network partitioning problem occurs.

This time, esxi03-a is the master of the main partition and esxi04-b is the master of the isolated partition.

The Sub-Cluster UUID is the same on both:
Sub-Cluster UUID: 52dcf5b7-fb2c-803b-9af9-cbf8e900cc4f

[Screenshots: 8 hosts1.png, 8 hosts 2.png]

So the issue is very strange :(

djanne
Contributor

I believe I have the same issue.

I just configured a stretched cluster of 8 hosts (4 per site) + 1 witness.

One host keeps getting partitioned.

If I remove any host from the cluster, everything is fine, but as soon as I add it back so there are 8 again, it fails.

I also noticed that the Sub-Cluster UUID is the same in the cluster as well as on the partitioned host.

It seems very strange.

djanne
Contributor

Update:

I just noticed that 8 seems to be the magic number...

That's because when the cluster is configured as stretched, the witness is of course also a member.

So 'esxcli vsan cluster get' shows 8 members; it breaks when I add the 9th.

I tried recreating the cluster as a regular one without fault domains, and without the witness.

I could add 8 regular hosts, but when I add the 9th again it becomes partitioned...

depping
Leadership

Strange. I'm rolling out a vSAN 6.6 lab myself to see what happens when I do this with 10 hosts.

GreatWhiteTec
VMware Employee

Do you see any dropped packets in your network?

Aside from misconfigurations, you can get network partitioning if the network is overloaded. vSAN tolerates only a small number of dropped packets on the vSAN network.

You can check this with esxtop > n (network view) and look at %DRPRX.
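If you want a number you can log instead of watching esxtop interactively, a rough alternative is to compute a receive-drop ratio from the NIC counters. This sketch assumes the "Packets received" and "Receive packets dropped" labels printed by `esxcli network nic stats get -n <vmnic>` (verify them on your build). Because those counters are cumulative, the result is a lifetime ratio, not esxtop's live per-interval %DRPRX. `rx_drop_pct` is a hypothetical helper.

```sh
#!/bin/sh
# Sketch: receive-drop percentage from `esxcli network nic stats get -n <vmnic>`
# on stdin. Counters are cumulative, so this is a lifetime ratio rather than
# esxtop's live per-interval %DRPRX. Field labels are assumed; verify them.
rx_drop_pct() {
  awk -F': *' '
    /^ *Packets received:/        { rx = $2 }
    /^ *Receive packets dropped:/ { drop = $2 }
    END {
      # Avoid dividing by zero on an idle or missing counter.
      if (rx + 0 == 0) { print "0.00"; exit }
      printf "%.2f\n", 100 * drop / rx
    }'
}

# Usage on a host (vmnic name is an example):
#   esxcli network nic stats get -n vmnic1 | rx_drop_pct
```

Taking two readings a few minutes apart and comparing them gives a better sense of whether drops are still occurring.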

A+, DCSE, MCP, MCSA, MCSE, MCTS, MCITP, MCDBA, NCDA, NCIE-SAN, NCIE-BR, VCP4, VCP5, VCP5-DT, VCAP5-DCA _____________________ If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful.
depping
Leadership

I just tested it with vSAN 6.6 (vSphere 6.5 U1); going from 8 nodes to 10 worked fine for me. I would recommend calling support and letting them have a look.

MarcHuppert
Enthusiast

I have exactly the same issue. I cannot configure more than 8 vSAN members in my cluster.

I increased the RAM from 6 to 8 GB and tested with encryption and deduplication/compression; nothing works.

@Duncan

How is your configuration different from ours?

VCDX #181, VSP, VTSP, VCA, VCP-DCV(2+3+4+5+6+6.5+6.7+2019), VCP-DT, VCP-NV, VCAP(DCA4+5+DCD4+5), VCIX-NV, VCIX-DCV, VCI, vExpert, vEpxert NSX, vExpert VSAN and VCDX
MarcHuppert
Enthusiast

Problem fixed: each nested ESXi host needs 16 GB of RAM.

I have now configured my 15 + 15 + 1 nested stretched cluster, including deduplication/compression and encryption.
