derSpielmann
Contributor

Basic vSAN network issue running nested ESXi

Hello y'all!

I am taking my first steps with the workload management capabilities. I am not a vSAN admin, but in order to get to the workload management piece, I need a functional vSAN cluster in my lab. My questions might seem basic to most of you, so please bear with me.

I am running two ESXi 7.0.0 beta hosts nested on a physical ESXi 6.7 host.

1. I deployed vCenter 7 appliance

2. I used the "Quick Start" wizard to configure a cluster

After I select the physical cache and capacity disks for each of the two hosts, the wizard completes without errors.

However, the next health check surfaces a partitioned network:

[root@esxi7-alpha:~] esxcli vsan health cluster list
Health Test Name                                    Status
--------------------------------------------------  ------
Overall health                                      red (Network misconfiguration)
Network                                             red
  Hosts with connectivity issues                    green
  vSAN cluster partition                            red
  All hosts have a vSAN vmknic configured           green
  vSAN: Basic (unicast) connectivity check          green
  vSAN: MTU check (ping with large packet size)     green
  vMotion: Basic (unicast) connectivity check       green
  vMotion: MTU check (ping with large packet size)  green
  Network latency check                             green
Performance service                                 yellow
  Performance service status                        yellow
Physical disk                                       green
  Operation health                                  green
  Disk capacity                                     green
  Congestion                                        green
  Component limit health                            green
  Component metadata health                         green
  Memory pools (heaps)                              green
  Memory pools (slabs)                              green
Data                                                green
  vSAN object health                                green
Cluster                                             green
  Advanced vSAN configuration in sync               green
  vSAN daemon liveness                              green
  vSAN Disk Balance                                 green
  Resync operations throttling                      green
  Software version compatibility                    green
  Disk format version                               green
Capacity utilization                                green
  Disk space                                        green
  Read cache reservations                           green
  Component                                         green
  What if the most consumed host fails              green
[root@esxi7-alpha:~]

I have verified with ping and vmkping that all nodes can reach each other (which also seems to be confirmed by the green status of all the connectivity checks).

Many have reported this particular issue and found the root cause to be duplicate host UUIDs after cloning their nested ESXi instances. I did NOT clone; I installed both hosts individually, and they both have unique UUIDs.
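
For reference, this is how I compared them (run on each host; as far as I understand, this system UUID is the one that ends up duplicated when hosts are cloned):

# esxcli system uuid get

Both hosts return different values.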

As suggested in some threads, I removed one of the hosts from the cluster and re-added it, and I end up in the same state. I also validated a consistent MTU of 9000 on the VDS and the vmkernel ports (see the commands below).
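
This is how I checked the MTU, in case I am missing something there:

# esxcli network ip interface list        (MTU per vmkernel port)
# esxcli network vswitch dvs vmware list  (MTU configured on the distributed switch)

Both report 9000 everywhere.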

There is one message in the UI hinting at a connectivity issue, but I am at my wits' end as to where that issue might be and how I could test for and fix it, given that all the connectivity tests pass:

[Screenshot attached: Screen Shot 2020-04-23 at 1.30.36 PM.png]

I conclude that my issue is so basic that I haven't even considered looking in the right place.

I welcome any pointers!

Cheers.

Volker

7 Replies
TheBobkin
Champion

Hello Volker,

Welcome to Communities and vSAN.

"I conclude that my issue is so basic, that I haven't even considered to look in the right place."

Not necessarily - I have been playing with vSAN 7.0 (Beta and GA) hosts at home on VMware Workstation and have noted a difference from 6.7: if the nested hosts are assigned insufficient memory, the network join can actually fail during cluster set-up (as opposed to failing only when creating Disk-Groups, as was the case and expected in previous versions).

Could you let me know how much memory these hosts have available, and increase it if possible (anything less than 6GB just won't work from what I have seen)?
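
A quick way to check from the host itself, if that is easier than looking in vCenter:

# esxcli hardware memory get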

If you could try configuring the cluster normally (not through QuickStart), this will rule out any potential issues with that aspect.

You could also validate whether manually configuring a cluster via the CLI (as opposed to via vCenter) works, though it should work given adequate resources - see the rough outline below.
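
Roughly, that would look like the following, assuming vmk2 is the vSAN-enabled vmk on both hosts (in unicast mode you would also need to populate the unicastagent list on each host, which we can get to if it comes to that):

# esxcli vsan network ip add -i vmk2               (tag the vmk for vSAN traffic, if not already tagged)
# esxcli vsan cluster new                          (on the first host - creates the cluster)
# esxcli vsan cluster get                          (note the Sub-Cluster UUID)
# esxcli vsan cluster join -u <Sub-Cluster-UUID>   (on the second host)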

How exactly are you testing ping between the hosts? You should be doing a vmkping -I from the vSAN-enabled vmk to the IP of the vSAN-enabled vmk on the other host, using -s <configured-MTU-minus-28> -d.

e.g. with jumbo frames disabled:

# vmkping -I vmk2 192.168.164.32 -s 1472 -d

Bob

derSpielmann
Contributor

Hi Bob,

thank you for your answer and suggestions!

Both my ESXi 7 VMs have 32GB of memory. I have read in some threads that memory issues on a single host can sometimes disrupt a whole cluster, but I figured 32GB of hard-reserved memory would be enough.

Here are my vmkping results:

[root@esxi7-alpha:~] vmkping -I vmk2 169.254.86.39 -s 8972 -d
PING 169.254.86.39 (169.254.86.39): 8972 data bytes
8980 bytes from 169.254.86.39: icmp_seq=0 ttl=64 time=0.620 ms
8980 bytes from 169.254.86.39: icmp_seq=1 ttl=64 time=0.726 ms
8980 bytes from 169.254.86.39: icmp_seq=2 ttl=64 time=0.755 ms

--- 169.254.86.39 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.620/0.700/0.755 ms
[root@esxi7-alpha:~]

and

[root@esxi7-bravo:~] vmkping -I vmk2 169.254.192.23 -s 8972 -d
PING 169.254.192.23 (169.254.192.23): 8972 data bytes
8980 bytes from 169.254.192.23: icmp_seq=0 ttl=64 time=0.638 ms
8980 bytes from 169.254.192.23: icmp_seq=1 ttl=64 time=0.732 ms
8980 bytes from 169.254.192.23: icmp_seq=2 ttl=64 time=0.831 ms

--- 169.254.192.23 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.638/0.734/0.831 ms
[root@esxi7-bravo:~]

Cheers.

Volker

a_p_
Leadership

A while back I created a nested Stretched vSAN cluster on an ESXi host and had a similar issue, although everything seemed to be configured properly. IIRC the setup worked after enabling either Forged Transmits or MAC Address Changes on the physical host's vSwitch(es). Maybe worth a try!?
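
If the nested hosts sit on a standard vSwitch on the physical host, something along these lines should do it (vSwitch0 is just a placeholder for whatever your uplink vSwitch is called; Promiscuous Mode is often needed for nested labs as well):

# esxcli network vswitch standard policy security set -v vSwitch0 --allow-forged-transmits=true --allow-mac-change=true --allow-promiscuous=true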

André

derSpielmann
Contributor

Thank you very much, André, for your suggestion!

I already had the following settings in place:

[Screenshot attached: Screen Shot 2020-04-24 at 5.42.41 PM.png]

Please let me know if anything else comes to mind!

Cheers.

Volker

TheBobkin
Champion

Hello Volker,

32GB of memory should be more than adequate, so it is very unlikely you are encountering the same issue I mentioned.

[root@esxi7-alpha:~] vmkping -I vmk2 169.254.86.39 -s 8972 -d
...
[root@esxi7-bravo:~] vmkping -I vmk2 169.254.192.23 -s 8972 -d

Is this a 255.255.0.0 (or similar) subnet or are the vmks on each host in different subnets?

They need to be in the same subnet.
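
A quick way to confirm the address and netmask on each host:

# esxcli network ip interface ipv4 get -i vmk2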

Bob

derSpielmann
Contributor

Hi Bob,

Yes, I am using a 255.255.0.0 subnet. I went all-DHCP, and this is what the Quick Start wizard chose by default.

Cheers.

Volker

TheBobkin
Champion

Hello Volker,

Using DHCP for vSAN traffic isn't supported unless reservations are set.

To rule out a few things, can you set static IPs for vSAN traffic and try configuring the cluster without QuickStart?
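
For example, something like this on each host, using a made-up 172.16.10.0/24 range - substitute whatever fits your lab:

# esxcli network ip interface ipv4 set -i vmk2 -t static -I 172.16.10.11 -N 255.255.255.0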

Note that the Health UI in the vSphere Client generally gives more verbose information about the source of the issue than the esxcli version on the host (e.g. whether "vCenter is authoritative" is green, or whether your unicast agent lists are incomplete).

If the above doesn't yield a cluster, I would like to validate whether we can manually configure this from the CLI.

To get more insight, could you share the current output of the following on both nodes:

# esxcli vsan cluster get
# esxcli vsan network list
# esxcli vsan cluster unicastagent list
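
If the unicastagent list turns out to be empty or missing the other node, it can be populated manually along these lines (run on each host, using the other host's Local Node UUID from its esxcli vsan cluster get output and its vSAN vmk IP; only do this if vCenter isn't already managing the list):

# esxcli vsan cluster unicastagent add -t node -u <remote-host-UUID> -U true -a <remote-vSAN-IP> -p 12321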

Bob
