VMware Cloud Community
kastlr
Expert

Weird vSAN stretched Cluster issue

Hi wise ones,

I'm currently running into a strange issue with an eight-node vSAN cluster (6.7, build 17167734) when creating a stretched vSAN cluster.

We use Witness Traffic Separation (WTS) and can vmkping between the ESXi vSAN vmks and the witness vmk1 (and vice versa).
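For reference, the reachability tests were along these lines (vmk5 is our WTS interface on the data nodes; the witness address shown here is a placeholder, not one of our real IPs):

```shell
# Run from a data-site ESXi host; -I pins the outgoing vmkernel interface.
vmkping -I vmk5 192.168.193.245

# -s 1472 -d additionally verifies a standard 1500-byte MTU path
# works without fragmentation.
vmkping -I vmk5 -s 1472 -d 192.168.193.245
```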

But Skyline Health checks still report a cluster partition.

But as you can see, the witness node is listed twice in this report!

[Screenshot: kastlr_0-1611611058799.png]

Has anybody ever seen similar behavior with a vSAN stretched cluster?

And if so, how did you get it fixed?

Would really appreciate any feedback on this.

 

Regards,

Ralf

 


Hope this helps a bit.
Greetings from Germany. (CEST)
3 Replies
TheBobkin
Champion

@kastlr, this could be caused by numerous things; I would advise checking these first:

- Is the witness vmk (on the Witness node) tagged for both 'vsan' and 'witness' traffic? It should carry only 'vsan' (yes, we are aware that is a tad confusing). If it is, remove the 'witness' traffic tag (on vmk1 if you are using vmk0).

- Does the Witness node have multiple vmks tagged for 'vsan' traffic? It should have only one (vmk1 comes tagged when deployed). If so, remove the extra one.

- Is there any disparity in the static routes between nodes, if you use these to communicate with the Witness node? (esxcfg-route -l lists them)

- Are there any other network health checks showing which vmk fails to communicate with which?
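A minimal sketch of those checks on the Witness node (vmk numbers are examples; verify what is actually tagged before removing anything):

```shell
# List every vmk the Witness node has tagged for vSAN traffic
# (there should be exactly one entry).
esxcli vsan network list

# If a stray interface is tagged, remove it; only run this against
# the interface you do NOT want carrying vSAN traffic.
esxcli vsan network remove -i vmk0

# Compare static routes across all data nodes and the witness.
esxcfg-route -l
```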

kastlr
Expert

Hi,

first let me say thanks for joining the ride.

I'm not sure I got all of your points correctly, but here are some more details.

  1. Output from Witness
    esxcli vsan network list
    Interface
    VmkNic Name: vmk1
    IP Protocol: IP
    Interface UUID: df2d0f60-4cc9-8566-8f84-005056a080b6
    Agent Group Multicast Address: 224.2.3.4
    Agent Group IPv6 Multicast Address: ff19::2:3:4
    Agent Group Multicast Port: 23451
    Master Group Multicast Address: 224.1.2.3
    Master Group IPv6 Multicast Address: ff19::1:2:3
    Master Group Multicast Port: 12345
    Host Unicast Channel Bound Port: 12321
    Multicast TTL: 5
    Traffic Type: vsan

  2. I'm unsure what you mean by the Witness node having multiple vmks tagged for 'vsan' traffic.
    We do not plan to assign multiple stretched clusters to a single vSAN witness VM.

    Currently we don't have any vSAN stretched cluster in our environment, but we plan to use several of them.
    The idea is that each vSAN stretched cluster will get its own dedicated witness VM (a 1:1 relationship).
    Per stretched cluster we'll create three small dedicated VLANs: WTS for Site 1, WTS for Site 2, and WTS for the witness vmk1.

    In case I did misinterpret your statement please let me know.

  3. Output from Node on Site 1
    VMkernel Routes:
    Network Netmask Gateway Interface
    192.168.193.240 255.255.255.240 192.168.193.190 vmk5  --> WTS        (vLAN on Site 1 only)
    192.168.193.64 255.255.255.224 Local Subnet vmk2      --> Management (vLAN across both Sites)
    192.168.193.96 255.255.255.224 Local Subnet vmk3      --> vSAN       (vLAN across both Sites)
    192.168.193.160 255.255.255.224 Local Subnet vmk5
    default 0.0.0.0 192.168.193.94 vmk2

    Output from Node on Site 2
    VMkernel Routes:
    Network Netmask Gateway Interface
    192.168.193.240 255.255.255.240 192.168.193.222 vmk5  --> WTS        (vLAN on Site 2 only)
    192.168.193.64 255.255.255.224 Local Subnet vmk2      --> Management (vLAN across both Sites)
    192.168.193.96 255.255.255.224 Local Subnet vmk3      --> vSAN       (vLAN across both Sites)
    192.168.193.192 255.255.255.224 Local Subnet vmk5
    default 0.0.0.0 192.168.193.94 vmk2

    Output from Witness vSAN VM
    VMkernel Routes:
    Network Netmask Gateway Interface
    192.168.193.224 255.255.255.240 Local Subnet vmk0      --> Management (vLAN on Witness Site only)
    192.168.193.240 255.255.255.240 Local Subnet vmk1      --> WTS        (vLAN on Witness Site only)
    192.168.193.160 255.255.255.224 192.168.193.254 vmk1
    192.168.193.192 255.255.255.224 192.168.193.254 vmk1
    default 0.0.0.0 192.168.193.238 vmk0
  4. As you can see from the screenshot in the initial post, every other network-related Skyline health check is marked with a green check mark.
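To cross-check the tables above, here is a small portable sketch of the matching logic a route lookup applies to a destination address (plain POSIX shell; the sample addresses are taken from the Site 1 routes above):

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# in_subnet IP NETWORK NETMASK -> exit 0 if IP falls inside NETWORK/NETMASK,
# i.e. the test a static route entry applies when matching a destination.
in_subnet() {
  [ $(( $(ip_to_int "$1") & $(ip_to_int "$3") )) -eq \
    $(( $(ip_to_int "$2") & $(ip_to_int "$3") )) ]
}

# The Site 1 vSAN subnet from the table above (192.168.193.96/27 covers .96-.127):
in_subnet 192.168.193.100 192.168.193.96 255.255.255.224 && echo "matches"
```

If an address a node needs to reach matches none of the non-default entries, traffic falls back to the default route on vmk2, which is one way a partition can sneak in despite vmkping succeeding elsewhere.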

Thanks again for providing feedback on this weird issue.

 

Regards,

Ralf

 


Jasemccarty
Immortal

On your vSAN Witness Appliance, are vmk0 & vmk1 on the same network segment?

If so, it would be necessary to untag "vSAN Traffic" on vmk1 and tag it on vmk0.
vSAN uses the same TCP/IP stack as management, so in this situation multi-homing comes into play (https://kb.vmware.com/kb/2010877).
While vmk1 is tagged for vSAN traffic, the traffic actually leaves via vmk0.

The discrepancy can cause a partition. 
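A sketch of that retagging on the Witness appliance (assuming vmk0 is the interface that should carry the traffic; confirm with esxcli vsan network list first):

```shell
# Remove the vSAN traffic tag from vmk1 ...
esxcli vsan network remove -i vmk1

# ... and tag vmk0 instead (-T selects the traffic type).
esxcli vsan network ip add -i vmk0 -T vsan
```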

Jase McCarty - @jasemccarty