derSpielmann
Contributor

Basic vSAN network issue running nested ESXi

Hello y'all!

I am taking my first steps with the workload management capabilities. I am not a vSAN admin, but in order to get to the workload management piece, I need a functional vSAN cluster in my lab. My questions might seem basic to most of you, so please bear with me.

I am running two ESXi 7.0.0 beta hosts nested on a physical ESXi 6.7 host.

1. I deployed vCenter 7 appliance

2. I used the "Quick Start" wizard to configure a cluster

After I select the physical cache and capacity disks for each of the two hosts, the wizard completes without errors.

However, the next health check surfaces a partitioned network:

[root@esxi7-alpha:~] esxcli vsan health cluster list
Health Test Name                                    Status
--------------------------------------------------  ------
Overall health                                      red (Network misconfiguration)
Network                                             red
  Hosts with connectivity issues                    green
  vSAN cluster partition                            red
  All hosts have a vSAN vmknic configured           green
  vSAN: Basic (unicast) connectivity check          green
  vSAN: MTU check (ping with large packet size)     green
  vMotion: Basic (unicast) connectivity check       green
  vMotion: MTU check (ping with large packet size)  green
  Network latency check                             green
Performance service                                 yellow
  Performance service status                        yellow
Physical disk                                       green
  Operation health                                  green
  Disk capacity                                     green
  Congestion                                        green
  Component limit health                            green
  Component metadata health                         green
  Memory pools (heaps)                              green
  Memory pools (slabs)                              green
Data                                                green
  vSAN object health                                green
Cluster                                             green
  Advanced vSAN configuration in sync               green
  vSAN daemon liveness                              green
  vSAN Disk Balance                                 green
  Resync operations throttling                      green
  Software version compatibility                    green
  Disk format version                               green
Capacity utilization                                green
  Disk space                                        green
  Read cache reservations                           green
  Component                                         green
  What if the most consumed host fails              green
[root@esxi7-alpha:~]

I have verified with ping and vmkping that all nodes can reach each other (which also seems to be confirmed by the green status of all the connectivity checks).

Many have reported this particular issue and found the root cause to be duplicate host UUIDs after cloning their nested ESXi instances. I did NOT clone; I installed both hosts individually, and they both have unique UUIDs.
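
For reference, this is how I compared them (run on each host; as far as I understand, this system UUID is the one that ends up duplicated when hosts are cloned):

# esxcli system uuid get

Both hosts return different values.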

As suggested in some threads, I removed one of the hosts from the cluster and re-added it, and I end up in the same state. I also validated a consistent MTU of 9000 on the VDS and the vmkernel ports (see the commands below).
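
This is how I checked the MTU, in case I am missing something there:

# esxcli network ip interface list        (MTU per vmkernel port)
# esxcli network vswitch dvs vmware list  (MTU configured on the distributed switch)

Both report 9000 everywhere.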

There is one message in the UI hinting at a connectivity issue, but I am at my wits' end as to where that issue might be and how I could test for and fix it, given that all the connectivity tests pass:

[Screenshot attached: Screen Shot 2020-04-23 at 1.30.36 PM.png]

I conclude that my issue is so basic that I haven't even considered looking in the right place.

I welcome any pointers!

Cheers.

Volker

7 Replies
TheBobkin
Champion

Hello Volker,

Welcome to Communities and vSAN.

"I conclude that my issue is so basic, that I haven't even considered to look in the right place."

Not necessarily - I have been playing with vSAN 7.0 (Beta and GA) hosts at home on VMware Workstation and have noted a difference from 6.7: if the nested hosts are assigned insufficient memory, the network join can actually fail during cluster set-up (as opposed to failing only when creating Disk-Groups, as was the case and expected in previous versions).

Could you let me know how much memory these hosts have available, and increase it if possible (anything less than 6GB just won't work from what I have seen)?
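
A quick way to check from the host itself, if that is easier than looking in vCenter:

# esxcli hardware memory get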

If you could try configuring the cluster normally (not through QuickStart), this will rule out any potential issues with that aspect.

You could also validate whether manually configuring a cluster via the CLI (as opposed to via vCenter) works, though it should work given adequate resources - see the rough outline below.
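
Roughly, that would look like the following, assuming vmk2 is the vSAN-enabled vmk on both hosts (in unicast mode you would also need to populate the unicastagent list on each host, which we can get to if it comes to that):

# esxcli vsan network ip add -i vmk2               (tag the vmk for vSAN traffic, if not already tagged)
# esxcli vsan cluster new                          (on the first host - creates the cluster)
# esxcli vsan cluster get                          (note the Sub-Cluster UUID)
# esxcli vsan cluster join -u <Sub-Cluster-UUID>   (on the second host)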

How exactly are you testing ping between the hosts? You should be doing a vmkping -I from the vSAN-enabled vmk to the IP of the vSAN-enabled vmk on the other host, using -s <configured-MTU-minus-28> -d.

e.g. with jumbo frames disabled:

# vmkping -I vmk2 192.168.164.32 -s 1472 -d

Bob

derSpielmann
Contributor

Hi Bob,

thank you for your answer and suggestions!

Both my ESXi 7 VMs have 32GB of memory. I have read in some threads that memory issues on a single host can sometimes disrupt a whole cluster, but I figured 32GB of hard-reserved memory would be enough.

Here are my vmkping results:

[root@esxi7-alpha:~] vmkping -I vmk2 169.254.86.39 -s 8972 -d
PING 169.254.86.39 (169.254.86.39): 8972 data bytes
8980 bytes from 169.254.86.39: icmp_seq=0 ttl=64 time=0.620 ms
8980 bytes from 169.254.86.39: icmp_seq=1 ttl=64 time=0.726 ms
8980 bytes from 169.254.86.39: icmp_seq=2 ttl=64 time=0.755 ms

--- 169.254.86.39 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.620/0.700/0.755 ms
[root@esxi7-alpha:~]

and

[root@esxi7-bravo:~] vmkping -I vmk2 169.254.192.23 -s 8972 -d
PING 169.254.192.23 (169.254.192.23): 8972 data bytes
8980 bytes from 169.254.192.23: icmp_seq=0 ttl=64 time=0.638 ms
8980 bytes from 169.254.192.23: icmp_seq=1 ttl=64 time=0.732 ms
8980 bytes from 169.254.192.23: icmp_seq=2 ttl=64 time=0.831 ms

--- 169.254.192.23 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.638/0.734/0.831 ms
[root@esxi7-bravo:~]

Cheers.

Volker

a_p_
Leadership

A while back I created a nested Stretched vSAN cluster on an ESXi host and had a similar issue, although everything seemed to be configured properly. IIRC the setup worked after enabling either Forged Transmits or MAC Address Changes on the physical host's vSwitch(es). Maybe worth a try!?
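
If the nested hosts sit on a standard vSwitch on the physical host, something along these lines should do it (vSwitch0 is just a placeholder for whatever your uplink vSwitch is called; Promiscuous Mode is often needed for nested labs as well):

# esxcli network vswitch standard policy security set -v vSwitch0 --allow-forged-transmits=true --allow-mac-change=true --allow-promiscuous=true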

André

derSpielmann
Contributor

Thank you very much, André, for your suggestion!

I already had the following settings in place:

[Screenshot attached: Screen Shot 2020-04-24 at 5.42.41 PM.png]

Please let me know if anything else comes to mind!

Cheers.

Volker

TheBobkin
Champion

Hello Volker,

32GB of memory should be more than adequate, so it is very unlikely you are encountering the same issue I mentioned.

[root@esxi7-alpha:~] vmkping -I vmk2 169.254.86.39 -s 8972 -d
...
[root@esxi7-bravo:~] vmkping -I vmk2 169.254.192.23 -s 8972 -d

Is this a 255.255.0.0 (or similar) subnet or are the vmks on each host in different subnets?

They need to be in the same subnet.
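
A quick way to confirm the address and netmask on each host:

# esxcli network ip interface ipv4 get -i vmk2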

Bob

derSpielmann
Contributor

Hi Bob,

Yes, I am using a 255.255.0.0 subnet. I went all-DHCP, and this is what the Quick Start wizard chose by default.

Cheers.

Volker

TheBobkin
Champion

Hello Volker,

Using DHCP for vSAN traffic isn't supported unless reservations are set.

To rule out a few things, can you set static IPs for vSAN traffic and try configuring the cluster without QuickStart?
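
For example, something like this on each host, using a made-up 172.16.10.0/24 range - substitute whatever fits your lab:

# esxcli network ip interface ipv4 set -i vmk2 -t static -I 172.16.10.11 -N 255.255.255.0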

Note that the Health UI in the vSphere Client generally gives more verbose information about the source of the issue than the esxcli version on the host (e.g. whether "vCenter is authoritative" is green, or whether your unicast agent lists are incomplete).

If the above doesn't yield a cluster, I would like to validate whether we can manually configure this from the CLI.

To get more insight, could you share the current output of the following on both nodes:

# esxcli vsan cluster get
# esxcli vsan network list
# esxcli vsan cluster unicastagent list
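
If the unicastagent list turns out to be empty or missing the other node, it can be populated manually along these lines (run on each host, using the other host's Local Node UUID from its esxcli vsan cluster get output and its vSAN vmk IP; only do this if vCenter isn't already managing the list):

# esxcli vsan cluster unicastagent add -t node -u <remote-host-UUID> -U true -a <remote-vSAN-IP> -p 12321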

Bob
