Been wrestling with this for weeks now.
I have a vSAN 6.6 2-host cluster with external witness.
When I run the configuration check, everything shows green except for the "vSAN cluster partition error" as seen below:
I see that the witness host is running on partition 1 and my two vSAN hosts are running on partition 2. Is this the cause of the failure?
I cannot seem to troubleshoot this successfully. Any help would be greatly appreciated. All other tests pass in the Configuration Test.
Hi GatorMania93
Did you configure the static routes from your 2-node ROBO data site to the witness site using esxcfg-route -a commands?
Regards,
Currently I have all components running in the same VLAN.
You should have a stretched L2 network for the vSAN network in your 2-node ROBO site, and another VLAN for vSAN traffic in the witness site.
Static routes should be configured between the vSAN vmkernel ports in the ROBO site and the vSAN vmkernel port in the witness site.
This is a network requirement for a vSAN ROBO configuration. Please see below:
Network Design for Stretched Clusters
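For reference, a rough sketch of what those static routes might look like. The subnets and gateway addresses below are placeholders, not taken from this environment — substitute your actual vSAN networks:

```shell
# On each ROBO data node: add a route to the witness site's vSAN
# subnet via the data site's gateway (placeholder addresses).
esxcfg-route -a 192.168.20.0/24 192.168.10.1

# On the witness host: add the route back to the ROBO site's
# vSAN subnet via the witness site's gateway.
esxcfg-route -a 192.168.10.0/24 192.168.20.1

# Verify the routing table on each host afterwards:
esxcfg-route -l
```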
Regards,
Hello GatorMania93,
Welcome to Communities! Some useful info on participating here:
https://communities.vmware.com/docs/DOC-12286
"Been wrestling with this for weeks now."
Sorry to hear that, what have you tried/checked so far?
"I see that the witness host is running on partition 1 and my two vsan hosts are running on partition 2. Is this the cause of the failure?"
Cluster members need to be able to communicate with one another and should never be network partitioned.
This cluster is on 6.6 so going to assume Unicast.
Check the cluster config on all three nodes to ensure they are all *trying* to be part of the same cluster and all have Unicast mode Enabled: true :
# esxcli vsan cluster get
Check the unicastagent lists on each node:
# esxcli vsan cluster unicastagent list
Each node should have the 2 other nodes in its list (don't worry if the witness shows as 0000 for UUID, just look at the IP; these should state whether they have Unicast enabled).
If these are all good, then check the network connectivity from the vSAN-enabled vmk on each host to the IP of the vmk on the others:
Get the IP of the vSAN interface on each node:
# esxcfg-vmknic -l
Confirm how this is configured (in case you have multiple interfaces or Witness Traffic Separation in use):
# esxcli vsan network list
Ping the other interfaces from data-nodes to Witness:
# vmkping -I vmk# <Other_nodes_vsan_IP>
Check this BOTH directions.
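The two-way ping check above can be scripted; a minimal sketch, assuming vmk1 is the vSAN/witness-tagged interface and using placeholder peer IPs (swap in the actual vmk IPs gathered from esxcfg-vmknic -l):

```shell
# Run on each data node and on the witness host, listing the
# OTHER nodes' vSAN/witness vmk IPs (placeholders shown here).
for peer in 192.168.10.11 192.168.10.12 192.168.10.13; do
    # -I selects the outgoing vmkernel interface; -c 3 sends 3 pings.
    vmkping -I vmk1 -c 3 "$peer" || echo "FAILED: $peer"
done
```

Running it from every host in turn covers both directions of each path.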
If this fails, then start looking at your network configuration and gateways; other issues such as busted vmk interfaces can occasionally occur, so removing and reconfiguring the interface on the Witness might be an approach.
FYI Witness appliances are very simple to redeploy in 6.6 and there is an in-built check for basic network configuration etc. when adding this to a node.
Bob
Thanks Bob.
I simply enabled vSAN on my cluster, set up the storage successfully, and passed all of the checks. Then, when I enabled stretched cluster and set up my two fault domains, I was able to add the witness with no errors either. So I'm not sure why this is happening. Here are a few screenshots:
From Witness:
From Host 1:
From Host 2:
I've tried adding 3 different witness hosts, all of which were installed from scratch.
I'll get to work trying some of your troubleshooting tips.
Hello GatorMania93,
Are you installing these Witnesses with the same ESXi build as the data-nodes?
How are your interfaces configured?
Are these all on the same L2 network in same subnets?
If there is no communication between nodes over these interfaces, then try untagging the existing vsan and/or witness interfaces, create a new interface on the Witness in the same subnet as the vSAN-enabled vmk on the data-nodes, tag it for vsan traffic only, not witness (-T=vsan), and see whether they can communicate.
Edit: Yes I am fully aware this is not how a 2-node DirectConnect should ideally be configured but testing where the issue is here.
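As a rough sketch of that test — the interface name, portgroup name, and IP below are placeholders, assuming a suitable portgroup already exists on the Witness:

```shell
# On the Witness: create a test vmkernel interface on an existing
# portgroup (vmk2, "TestPG" and the IP are placeholders).
esxcli network ip interface add -i vmk2 -p "TestPG"
esxcli network ip interface ipv4 set -i vmk2 -I 192.168.10.13 -N 255.255.255.0 -t static

# Tag it for vsan traffic (not witness), then confirm the tagging:
esxcli vsan network ipv4 add -i vmk2 -T=vsan
esxcli vsan network list
```

If the data nodes can then ping this interface, the problem is in the original interface or its tagging rather than in the physical network.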
Bob
Looks like that might be my issue, Bob. I can ping the VMkernel NICs between hosts, but cannot ping the Witness from either host (or vice versa).
I should also mention that the VMkernel NICs are directly connected between the two hosts. However, I thought that route could be established by running
esxcli vsan network ipv4 add -i vmk0 -T=witness on each host, where vmk0 is my management interface.
I found what I did wrong here.
TheBobkin
When adding the command esxcli vsan network ipv4 add -i vmk0 -T=witness.... to each of my two remote hosts, for some reason I also added it to my witness host, whose witness traffic runs over vmk1. Running esxcli vsan network ipv4 add -i vmk1 -T=witness on my witness host resolved the error.
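For anyone hitting the same partition error, the working tagging from this thread ends up looking roughly like this (the vmk numbers are specific to this setup; yours may differ):

```shell
# On each of the two data hosts: witness traffic is tagged on
# the management interface (vmk0 in this environment).
esxcli vsan network ipv4 add -i vmk0 -T=witness

# On the witness host: witness traffic runs over vmk1, not vmk0,
# so the tag must go on vmk1 there.
esxcli vsan network ipv4 add -i vmk1 -T=witness

# Verify on every host which vmk carries which traffic type:
esxcli vsan network list
```

The key point is that the traffic tag must land on whichever vmk actually carries that traffic on each individual host.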
Thanks Bob for pointing me in the right direction with those ping tests.