Hello there,
I'm running a 4+4+1 configuration on 6.7 U1, with L3 everywhere for vSAN replication and witness traffic. It is currently in the testing phase.
In my previous testing phase I was using L2 everywhere (not supported) but wasn't having any networking issues.
After I moved to L3 everywhere, strange things started happening, and I had such a hard time recovering the cluster that I ended up erasing my vSAN and recreating it from scratch.
But the issues still persist, so before opening an SR, maybe someone can give me some hints.
To summarize:
administration VLAN: 192.168.49.0/24, L2 style; all ESXi hosts and the witness have an IP in this network, and DNS names resolve in this network
siteA, vSAN witness VLAN: 172.16.90.0/27; gateway 172.16.90.1
siteB, vSAN witness VLAN: 172.16.90.64/27; gateway 172.16.90.65
siteA, vSAN replication VLAN: 172.16.90.128/27; gateway 172.16.90.129
siteB, vSAN replication VLAN: 172.16.90.192/27; gateway 172.16.90.193
siteC, vSAN VLAN: 172.16.92.0/29; gateway 172.16.92.1
On siteA, I added the following routes:
esxcli network ip route ipv4 add -n 172.16.90.64/27 -g 172.16.90.1
esxcli network ip route ipv4 add -n 172.16.90.192/27 -g 172.16.90.129
esxcli network ip route ipv4 add -n 172.16.92.0/29 -g 172.16.90.1
On siteB, I added the following routes:
esxcli network ip route ipv4 add -n 172.16.90.0/27 -g 172.16.90.65
esxcli network ip route ipv4 add -n 172.16.90.128/27 -g 172.16.90.193
esxcli network ip route ipv4 add -n 172.16.92.0/29 -g 172.16.90.65
On siteC (witness), I added the following routes:
esxcli network ip route ipv4 add -n 172.16.90.64/27 -g 172.16.92.1
esxcli network ip route ipv4 add -n 172.16.90.192/27 -g 172.16.92.1
esxcli network ip route ipv4 add -n 172.16.90.0/27 -g 172.16.92.1
esxcli network ip route ipv4 add -n 172.16.90.128/27 -g 172.16.92.1
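A quick way to rule out a typo in the static routes above is to check them against the addressing plan. The following is a minimal Python sketch, not something from the original troubleshooting: the subnets, destinations and gateways are copied from this post, while the dictionary layout and the `check_routes` helper are illustrative assumptions. It only validates the plan on paper; it does not query the hosts.

```python
from ipaddress import ip_address, ip_network

# Subnets each site is directly attached to (copied from the post).
LOCAL = {
    "siteA": ["172.16.90.0/27", "172.16.90.128/27"],
    "siteB": ["172.16.90.64/27", "172.16.90.192/27"],
    "siteC": ["172.16.92.0/29"],
}

# (destination, gateway) pairs as added with `esxcli network ip route ipv4 add`.
ROUTES = {
    "siteA": [("172.16.90.64/27", "172.16.90.1"),
              ("172.16.90.192/27", "172.16.90.129"),
              ("172.16.92.0/29", "172.16.90.1")],
    "siteB": [("172.16.90.0/27", "172.16.90.65"),
              ("172.16.90.128/27", "172.16.90.193"),
              ("172.16.92.0/29", "172.16.90.65")],
    "siteC": [("172.16.90.64/27", "172.16.92.1"),
              ("172.16.90.192/27", "172.16.92.1"),
              ("172.16.90.0/27", "172.16.92.1"),
              ("172.16.90.128/27", "172.16.92.1")],
}


def check_routes(local, routes):
    """Return a list of problems found; an empty list means the plan is consistent."""
    problems = []
    for site, entries in routes.items():
        own = [ip_network(n) for n in local[site]]
        for dest, gw in entries:
            # The next hop must sit in a subnet the host is directly attached to.
            if not any(ip_address(gw) in n for n in own):
                problems.append(f"{site}: gateway {gw} is not on a local subnet")
            # A static route towards a directly attached subnet is suspicious.
            if any(ip_network(dest).overlaps(n) for n in own):
                problems.append(f"{site}: {dest} overlaps a local subnet")
        # Every subnet belonging to the other sites should have a route.
        remote = {n for s, nets in local.items() if s != site for n in nets}
        for n in sorted(remote - {d for d, _ in entries}):
            problems.append(f"{site}: no route to {n}")
    return problems


if __name__ == "__main__":
    print(check_routes(LOCAL, ROUTES) or "route plan is consistent")
```

With the values from this post the check finds nothing to report, which fits the observation that vmkping works on every path.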
vmkping works from every ESXi host to every other ESXi host and to the witness.
esxcli vsan health cluster list gives green status on all health checks:
Health Test Name                                    Status
--------------------------------------------------  ----------
Overall health                                      green (OK)
Cluster                                             green
  ESXi vSAN Health service installation             green
  vSAN Health Service up-to-date                    green
  Advanced vSAN configuration in sync               green
  vSAN CLOMD liveness                               green
  vSAN Disk Balance                                 green
  Resync operations throttling                      green
  Software version compatibility                    green
  Disk format version                               green
Network                                             green
  Hosts disconnected from VC                        green
  Hosts with connectivity issues                    green
  vSAN cluster partition                            green
  All hosts have a vSAN vmknic configured           green
  vSAN: Basic (unicast) connectivity check          green
  vSAN: MTU check (ping with large packet size)     green
  vMotion: Basic (unicast) connectivity check       green
  vMotion: MTU check (ping with large packet size)  green
  Network latency check                             green
Data                                                green
  vSAN object health                                green
Limits                                              green
  Current cluster situation                         green
  After 1 additional host failure                   green
  Host component limit                              green
Physical disk                                       green
  Operation health                                  green
  Disk capacity                                     green
  Congestion                                        green
  Component limit health                            green
  Component metadata health                         green
  Memory pools (heaps)                              green
  Memory pools (slabs)                              green
Performance service                                 green
  Stats DB object                                   green
  Stats master election                             green
  Performance data collection                       green
  All hosts contributing stats                      green
  Stats DB object conflicts                         green
In the web client, however, health is not OK.
First, the "vSAN cluster configuration consistency" check is yellow: two hosts and the witness are in a warning state.
So I used the "Remediate inconsistent configuration" action icon, and I now have a "Remediate vSAN cluster" task that has been running for a long time, stuck at 81%.
Edit: it has now successfully failed after 1 hour. You will also notice another failed task that I can't explain:
Also, "Hosts with connectivity issues" is red, with 2 hosts showing as not OK:
The Flash client is just... lost? You can see empty fields that should not be empty:
It also reports hosts as being pre-6.5. What the...?
All ESXi hosts and vCenter are on 6.7 U1, as is the witness host.
Configuration Assist shows only one problem (one ESXi host has a faulty network card, so only 1 uplink is enabled):
The HTML5 client works, or not, depending on its mood:
Same view in the Flash client, no better:
All in all, I think my problems are network related, but I can't find out what the problem is.
As you can see, there is only 1 network partition:
esxcli vsan cluster unicastagent list, from SA-ESXi-01:
[root@sa-esxi-01:~] esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness  Supports Unicast  IP Address     Port   Iface Name
------------------------------------  ---------  ----------------  -------------  -----  ----------
5bd9adf2-8ad7-5458-4d76-246e96d75924          0              true  172.16.90.134  12321
5bd9b2d5-ed54-0642-5345-246e96d78d24          0              true  172.16.90.135  12321
5b87dbbc-3a01-062c-b9d1-e4434b0139a8          0              true  172.16.90.197  12321
5bd8969c-9f9f-2d08-a5aa-246e96d755c4          0              true  172.16.90.133  12321
5b896631-415e-6c96-05bf-e4434b013828          0              true  172.16.90.198  12321
5b896aa1-32a5-7ad8-d1bb-e4434b013318          0              true  172.16.90.199  12321
5b7ea1a4-4bee-88f6-b5bc-e4434b0132d8          0              true  172.16.90.196  12321
5c1107b6-6197-81be-e9ff-005056927cdc          1              true  172.16.92.4    12321
esxcli vsan cluster unicastagent list, from SB-ESXi-01:
[root@SB-ESXi-01:~] esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness  Supports Unicast  IP Address     Port   Iface Name
------------------------------------  ---------  ----------------  -------------  -----  ----------
5bd8969c-9f9f-2d08-a5aa-246e96d755c4          0              true  172.16.90.133  12321
5bd9adf2-8ad7-5458-4d76-246e96d75924          0              true  172.16.90.134  12321
5b896631-415e-6c96-05bf-e4434b013828          0              true  172.16.90.198  12321
5b87dbbc-3a01-062c-b9d1-e4434b0139a8          0              true  172.16.90.197  12321
5b896aa1-32a5-7ad8-d1bb-e4434b013318          0              true  172.16.90.199  12321
5bd0aaec-711d-c56a-0be1-246e96d75264          0              true  172.16.90.132  12321
5bd9b2d5-ed54-0642-5345-246e96d78d24          0              true  172.16.90.135  12321
5c1107b6-6197-81be-e9ff-005056927cdc          1              true  172.16.92.4    12321
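Comparing these listings by eye is error-prone, so here is a small Python sketch that diffs them. It assumes the command output has been pasted into strings; the names `parse_unicastagent` and `missing_peers` are made up for this example. A host normally does not list itself, so one missing data node per host is expected, and it is presumably the host's own vSAN address.

```python
SA_VIEW = """\
5bd9adf2-8ad7-5458-4d76-246e96d75924 0 true 172.16.90.134 12321
5bd9b2d5-ed54-0642-5345-246e96d78d24 0 true 172.16.90.135 12321
5b87dbbc-3a01-062c-b9d1-e4434b0139a8 0 true 172.16.90.197 12321
5bd8969c-9f9f-2d08-a5aa-246e96d755c4 0 true 172.16.90.133 12321
5b896631-415e-6c96-05bf-e4434b013828 0 true 172.16.90.198 12321
5b896aa1-32a5-7ad8-d1bb-e4434b013318 0 true 172.16.90.199 12321
5b7ea1a4-4bee-88f6-b5bc-e4434b0132d8 0 true 172.16.90.196 12321
5c1107b6-6197-81be-e9ff-005056927cdc 1 true 172.16.92.4 12321
"""

SB_VIEW = """\
5bd8969c-9f9f-2d08-a5aa-246e96d755c4 0 true 172.16.90.133 12321
5bd9adf2-8ad7-5458-4d76-246e96d75924 0 true 172.16.90.134 12321
5b896631-415e-6c96-05bf-e4434b013828 0 true 172.16.90.198 12321
5b87dbbc-3a01-062c-b9d1-e4434b0139a8 0 true 172.16.90.197 12321
5b896aa1-32a5-7ad8-d1bb-e4434b013318 0 true 172.16.90.199 12321
5bd0aaec-711d-c56a-0be1-246e96d75264 0 true 172.16.90.132 12321
5bd9b2d5-ed54-0642-5345-246e96d78d24 0 true 172.16.90.135 12321
5c1107b6-6197-81be-e9ff-005056927cdc 1 true 172.16.92.4 12321
"""


def parse_unicastagent(text):
    """Map each listed vSAN IP to True if that entry is the witness."""
    peers = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a 36-character UUID; headers and rulers do not.
        if len(fields) >= 5 and len(fields[0]) == 36 and fields[0].count("-") == 4:
            peers[fields[3]] = fields[1] == "1"
    return peers


def missing_peers(views):
    """For each host, list the addresses seen by other hosts but absent from its own list."""
    everyone = set().union(*(v.keys() for v in views.values()))
    return {host: sorted(everyone - set(v)) for host, v in views.items()}


if __name__ == "__main__":
    views = {
        "SA-ESXi-01": parse_unicastagent(SA_VIEW),
        "SB-ESXi-01": parse_unicastagent(SB_VIEW),
    }
    for host, missing in missing_peers(views).items():
        print(host, "does not list:", missing)
```

For the two listings in this post, each host is missing exactly one address (172.16.90.132 on SA-ESXi-01, 172.16.90.196 on SB-ESXi-01) and both list the witness, so the unicast configuration itself looks consistent.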
The HTML5 and Flash interfaces have been really slow when browsing the vSAN tabs since I enabled the stretched cluster.
My guess is that vCenter is trying to communicate with the ESXi hosts over the vSAN networks, but the vSAN networks are not reachable from vCenter (why would they be? vCenter is supposed to communicate over the administration network). And if that were the problem, why would only a few hosts show up in an error state?
I'll say it again: using vmkping from each host to each host, same site or different site, with the right vmkX specified with -I, every network path works flawlessly.
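To make the full host-to-host test repeatable, the command matrix can be generated rather than typed by hand. This is only a sketch under stated assumptions: the addresses are the vSAN IPs from the unicastagent listings in this post, `vmk2` is a placeholder interface name, and 1472 bytes is the ICMP payload that fills a 1500-byte MTU (1500 minus 28 bytes of IP and ICMP headers). With a separate witness traffic vmknic, as in this setup, the witness would have to be tested from that interface instead.

```python
from itertools import permutations

# vSAN vmknic addresses taken from the unicastagent listings in this post.
VSAN_IPS = [
    "172.16.90.132", "172.16.90.133", "172.16.90.134", "172.16.90.135",
    "172.16.90.196", "172.16.90.197", "172.16.90.198", "172.16.90.199",
    "172.16.92.4",  # witness
]


def vmkping_commands(ips, vmk="vmk2", payload=1472, count=3):
    """Yield (source_ip, command) for every ordered pair of addresses.

    vmk is a placeholder: use the vmkernel interface that carries the
    traffic being tested on the source host.
    """
    for src, dst in permutations(ips, 2):
        # -I selects the interface, -d sets don't-fragment, -s the payload size.
        yield src, f"vmkping -I {vmk} -d -s {payload} -c {count} {dst}"


if __name__ == "__main__":
    for src, cmd in vmkping_commands(VSAN_IPS):
        print(f"[run on {src}] {cmd}")
```

The script only prints the commands; each one still has to be run on the host that owns the source address.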
So I may be missing something here, but it's quite disappointing that vSAN is not able to tell me exactly what the problem is; instead it just fails on random tests / items.
Thanks for reading, and for any ideas.
Hello,
After digging into this with VMware support, we fixed the issue by recreating the cluster and the vSAN cluster, plus fixing some "writers" that I'm not able to explain.
But for anyone coming here in the future: the root cause has not been clearly identified, but there is something like a 90% chance that it was caused by my upgrading from 6.7 to 6.7 U1 in the wrong order.
I upgraded the vSAN witness, then the ESXi hosts, then vCenter. This is NOT what you should do.
Correct upgrade path: vCenter first, then the vSAN witness and the ESXi hosts.
Hello Sharantyr3,
With regard to your first issue (Network Diagnostic Mode), this is a known issue:
The other issues, though, do sound like host-vCenter or vCenter-host communication issues (are they transient?). How is your management traffic configured, and is there any possibility that vSAN traffic is actually being routed through the Management network and causing contention issues?
Bob
Hello,
Thanks for the KB link, but I think it's related to all my other problems, because when I was testing in all-L2 mode I didn't have this problem. Also, this does not impact only the witness.
I don't know how to answer your question; the network is as I described. Each ESXi host has 4 vmkernel interfaces: 1 for management, 1 for vMotion, 1 for vSAN replication traffic and 1 for vSAN witness traffic.
Tell me what information you would need to better understand my environment.
Is vCenter supposed to reach the ESXi hosts on their vSAN replication and/or witness vmkernel IPs?
How is it possible that vCenter complains about connectivity issues whereas esxcli on the ESXi hosts shows no network issues?
Also, I don't believe in contention issues, as there are no VMs running (apart from 3 VMs for testing) and all hosts are on a 10Gb network.
Thanks
Edit:
I added a NIC to the VCSA in the siteA vSAN witness VLAN: 172.16.90.0/27
I used 172.16.90.10 as it's free, and added these routes:
route add -net 172.16.90.0 netmask 255.255.255.224 gw 172.16.90.1 dev eth1
route add -net 172.16.90.64 netmask 255.255.255.224 gw 172.16.90.1 dev eth1
route add -net 172.16.90.128 netmask 255.255.255.224 gw 172.16.90.1 dev eth1
route add -net 172.16.90.192 netmask 255.255.255.224 gw 172.16.90.1 dev eth1
route add -net 172.16.92.0 netmask 255.255.255.248 gw 172.16.90.1 dev eth1
The VCSA can now reach any IP on the vSAN networks: replication and witness, for siteA, B and C.
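One detail in that route list may be worth a second look. Assuming eth1 is configured as 172.16.90.10/27 (the prefix length is inferred from the siteA witness VLAN, it is not stated explicitly), the first route points at the subnet eth1 is already attached to, so it is probably redundant. The Python sketch below classifies the five routes that way; `classify` is a hypothetical helper and nothing here runs on the VCSA itself.

```python
from ipaddress import ip_interface, ip_network

# eth1 address from the post; the /27 prefix is an assumption (see above).
ETH1 = ip_interface("172.16.90.10/27")

# (network, netmask) exactly as passed to `route add -net ... netmask ...`.
ROUTES = [
    ("172.16.90.0", "255.255.255.224"),
    ("172.16.90.64", "255.255.255.224"),
    ("172.16.90.128", "255.255.255.224"),
    ("172.16.90.192", "255.255.255.224"),
    ("172.16.92.0", "255.255.255.248"),
]


def classify(routes, iface):
    """Label each route as on-link (directly attached) or via gateway."""
    result = []
    for net, mask in routes:
        # ip_network accepts dotted netmasks and normalises them to a prefix.
        network = ip_network(f"{net}/{mask}")
        kind = "on-link" if iface.ip in network else "via gateway"
        result.append((str(network), kind))
    return result


if __name__ == "__main__":
    for network, kind in classify(ROUTES, ETH1):
        print(f"{network:18} {kind}")
```

Only 172.16.90.0/27 comes out as on-link; the other four networks do need the gateway at 172.16.90.1.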
Still no success, so my problem might be somewhere else.
The log files are so verbose, and there are so many of them, that I don't know which one to look into. If anyone has a hint...
Edit 2: the issue seems transient: