Hello y'all!
I am taking my first steps with workload management capabilities. I am not a vSAN admin, but in order to get to the workload management piece, I need my lab to have a functional vSAN cluster. My questions might seem basic to most of you, so please bear with me.
I am running 2 ESXi 7.0.0 beta hosts nested on a physical ESXi 6.7.
1. I deployed vCenter 7 appliance
2. I used the "Quick Start" wizard to configure a cluster
After I select the physical cache and capacity disks for each of the two hosts, the wizard completes without errors.
However, the next health check surfaces a partitioned network:
[root@esxi7-alpha:~] esxcli vsan health cluster list
Health Test Name                                    Status
--------------------------------------------------  ------
Overall health                                      red (Network misconfiguration)
Network                                             red
  Hosts with connectivity issues                    green
  vSAN cluster partition                            red
  All hosts have a vSAN vmknic configured           green
  vSAN: Basic (unicast) connectivity check          green
  vSAN: MTU check (ping with large packet size)     green
  vMotion: Basic (unicast) connectivity check       green
  vMotion: MTU check (ping with large packet size)  green
  Network latency check                             green
Performance service                                 yellow
  Performance service status                        yellow
Physical disk                                       green
  Operation health                                  green
  Disk capacity                                     green
  Congestion                                        green
  Component limit health                            green
  Component metadata health                         green
  Memory pools (heaps)                              green
  Memory pools (slabs)                              green
Data                                                green
  vSAN object health                                green
Cluster                                             green
  Advanced vSAN configuration in sync               green
  vSAN daemon liveness                              green
  vSAN Disk Balance                                 green
  Resync operations throttling                      green
  Software version compatibility                    green
  Disk format version                               green
Capacity utilization                                green
  Disk space                                        green
  Read cache reservations                           green
  Component                                         green
  What if the most consumed host fails              green
[root@esxi7-alpha:~]
I have verified with ping and vmkping that all nodes can reach each other (which the green status of all the connectivity checks also seems to confirm).
Many have reported this particular issue and found the root cause to be duplicate host UUIDs resulting from cloning their nested ESXi instances. I did NOT do that: I installed both individually, and they both have unique UUIDs.
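For anyone checking the same thing, the host UUID can be read on each node with:
# esxcli system uuid get
The two values must differ.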
I have removed and re-added one of the hosts from and to the cluster, as some suggested, and ended up in the same state. I also validated a consistent 9000 MTU setting on the VDS and the VMKs.
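For reference, the MTU each vmk actually carries can be listed with:
# esxcli network ip interface list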
There is one message in the UI hinting at a connectivity issue, but I am at my wits' end as to where that issue might be, how I could test for it, and how to fix it, given that all the connectivity tests pass.
I conclude that my issue is so basic that I haven't even considered looking in the right place.
I welcome any pointers!
Cheers.
Volker
Hello Volker,
Welcome to Communities and vSAN.
"I conclude that my issue is so basic, that I haven't even considered to look in the right place."
Not necessarily - I have been playing with vSAN 7.0 (Beta and GA) hosts at home on VMware Workstation and have noticed a difference from 6.7: insufficient memory assigned to the VM-hosts can actually cause the network join to fail during cluster set-up (as opposed to only failing when creating Disk-Groups, as in previous versions, which is expected).
Please let me know how much memory these hosts have available, and increase it if possible (anything less than 6GB just won't work, from what I have seen).
If you can, try configuring the cluster normally rather than through QuickStart; this will rule out any potential issues with that part of the process.
You could also validate whether manually configuring a cluster via the CLI, as opposed to via vCenter, works (though it should, given adequate resources).
How exactly are you testing ping between the hosts? You should be doing a vmkping -I from the vSAN-enabled vmk to the IP of the vSAN-enabled vmk on the other host, using -s <MTU-configured-minus-28> -d
e.g. with jumbo frames disabled:
# vmkping -I vmk2 192.168.164.32 -s 1472 -d
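and with jumbo frames (9000 MTU) enabled, the equivalent check (same placeholder address) would be:
# vmkping -I vmk2 192.168.164.32 -s 8972 -d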
Bob
Hi Bob,
Thank you for your answer and suggestions!
Both my ESXi 7 VMs have 32GB of memory. I have read in some threads that memory issues on a single host can sometimes disrupt a whole cluster, but I figured 32GB of hard-reserved memory would be enough.
Here are my vmkping results:
[root@esxi7-alpha:~] vmkping -I vmk2 169.254.86.39 -s 8972 -d
PING 169.254.86.39 (169.254.86.39): 8972 data bytes
8980 bytes from 169.254.86.39: icmp_seq=0 ttl=64 time=0.620 ms
8980 bytes from 169.254.86.39: icmp_seq=1 ttl=64 time=0.726 ms
8980 bytes from 169.254.86.39: icmp_seq=2 ttl=64 time=0.755 ms
--- 169.254.86.39 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.620/0.700/0.755 ms
[root@esxi7-alpha:~]
and
[root@esxi7-bravo:~] vmkping -I vmk2 169.254.192.23 -s 8972 -d
PING 169.254.192.23 (169.254.192.23): 8972 data bytes
8980 bytes from 169.254.192.23: icmp_seq=0 ttl=64 time=0.638 ms
8980 bytes from 169.254.192.23: icmp_seq=1 ttl=64 time=0.732 ms
8980 bytes from 169.254.192.23: icmp_seq=2 ttl=64 time=0.831 ms
--- 169.254.192.23 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.638/0.734/0.831 ms
[root@esxi7-bravo:~]
Cheers.
Volker
A while back I created a nested Stretched vSAN Cluster on an ESXi host and had a similar issue, although everything seemed to be configured properly. IIRC the setup worked after enabling either Forged Transmits or MAC address changes on the physical host's vSwitch(es). Maybe worth a try!?
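On a standard vSwitch these can also be toggled from the physical host's CLI - just a sketch, assuming the vSwitch is named vSwitch0:
# esxcli network vswitch standard policy security set -v vSwitch0 -f true -m true
(-f allows Forged Transmits, -m allows MAC address changes; adjust the vSwitch name to your environment.)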
André
Thank you very much, André, for your suggestions!
I already have those settings in place on the physical host's vSwitch.
Please let me know if anything else comes to mind!
Cheers.
Volker
Hello Volker,
32GB memory should be more than adequate, so it is very unlikely you are encountering the same issue I mentioned.
[root@esxi7-alpha:~] vmkping -I vmk2 169.254.86.39 -s 8972 -d
...
[root@esxi7-bravo:~] vmkping -I vmk2 169.254.192.23 -s 8972 -d
Is this a 255.255.0.0 (or similar) subnet or are the vmks on each host in different subnets?
They need to be in the same subnet.
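You can check what address and netmask each vmk actually has with:
# esxcli network ip interface ipv4 get -i vmk2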
Bob
Hi Bob,
Yes, I am using a 255.255.0.0 subnet. I went all DHCP, and this is what the Quick Start wizard chose by default.
Cheers.
Volker
Hello Volker,
Using DHCP for vSAN traffic isn't supported unless reservations are set.
To rule out a few things, can you set static IPs for vSAN traffic and try configuring the cluster without QuickStart?
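For example, with hypothetical addresses (adjust to whatever suits your lab):
# esxcli network ip interface ipv4 set -i vmk2 -t static -I 192.168.100.11 -N 255.255.255.0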
Note that the Health UI in the vSphere Client generally gives more verbose information about the source of the issue than the esxcli version on the host (e.g. whether 'vCenter is authoritative' is green, or whether your unicast agent lists are incomplete).
If the above doesn't yield a working cluster, I would like to validate whether we can manually configure it from the CLI.
To get more insight, could you share the current output of the following from both nodes:
# esxcli vsan cluster get
# esxcli vsan network list
# esxcli vsan cluster unicastagent list
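For reference, manually forming a two-node cluster would look roughly like this - a sketch with placeholder values throughout; the real sub-cluster UUID, host UUIDs, and vSAN IPs come from the outputs above.
On the first host:
# esxcli vsan cluster new
On the second host, joining the sub-cluster UUID reported by 'esxcli vsan cluster get' on the first:
# esxcli vsan cluster join -u <sub-cluster-uuid>
Then on EACH host, add the OTHER host's vSAN vmk IP and host UUID (from esxcli system uuid get) to the unicast agent list:
# esxcli vsan cluster unicastagent add -t node -u <other-host-uuid> -U true -a <other-host-vsan-ip> -p 12321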
Bob