VMware Cloud Community
Sharantyr3
Enthusiast

stretched cluster networking issues

Hello there,

I'm running a 4+4+1 stretched cluster on 6.7 U1, with L3 everywhere for vSAN replication and witness traffic. It is currently in a testing phase.

In my previous testing phase, I was using L2 everywhere (not supported) but wasn't having any networking issues.

Now that I have moved to L3 everywhere, strange things have been happening, and I had such a hard time recovering the cluster that I had to erase my vSAN and recreate it from scratch.

But the issues still persist. So before opening an SR, maybe someone can give me hints.

To summarize

administration vlan : 192.168.49.0/24, L2 style, all ESXis and witness have an IP in this network, dns names resolve in this network

siteA, vsan witness vlan : 172.16.90.0/27 ; gateway 172.16.90.1

siteB, vsan witness vlan : 172.16.90.64/27 ; gateway 172.16.90.65

siteA, vsan replication vlan : 172.16.90.128/27 ; gateway 172.16.90.129

siteB, vsan replication vlan : 172.16.90.192/27 ; gateway 172.16.90.193

siteC, vsan vlan : 172.16.92.0/29 ; gateway 172.16.92.1
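
The addressing plan above can be sanity-checked offline. The following is a minimal Python sketch, not something taken from the environment itself: it uses the standard `ipaddress` module to confirm that the five subnets do not overlap and that each gateway is a usable host address inside its own subnet. The dictionary keys and the `check_plan` helper are names invented for this sketch; the addresses are the ones listed above.

```python
import ipaddress
from itertools import combinations

# Subnets and gateways as listed in the post; the keys are labels chosen for this sketch.
PLAN = {
    "siteA-witness": ("172.16.90.0/27", "172.16.90.1"),
    "siteB-witness": ("172.16.90.64/27", "172.16.90.65"),
    "siteA-replication": ("172.16.90.128/27", "172.16.90.129"),
    "siteB-replication": ("172.16.90.192/27", "172.16.90.193"),
    "siteC-vsan": ("172.16.92.0/29", "172.16.92.1"),
}


def check_plan(plan):
    """Return a list of problems found; an empty list means the plan is consistent."""
    problems = []
    nets = {name: ipaddress.ip_network(cidr) for name, (cidr, _) in plan.items()}

    # Each gateway must be a usable host address inside its own subnet.
    for name, (_, gw) in plan.items():
        net = nets[name]
        gw_ip = ipaddress.ip_address(gw)
        if gw_ip not in net or gw_ip in (net.network_address, net.broadcast_address):
            problems.append(f"{name}: gateway {gw} is not a host address in {net}")

    # No two subnets may overlap, otherwise the static routes become ambiguous.
    for (a, net_a), (b, net_b) in combinations(nets.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{a} ({net_a}) overlaps {b} ({net_b})")

    return problems


if __name__ == "__main__":
    issues = check_plan(PLAN)
    print("plan looks consistent" if not issues else "\n".join(issues))
```

A clean result only says the plan is internally consistent on paper; it says nothing about what the physical routers actually do.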

on siteA, I added the following routes :

esxcli network ip route ipv4 add -n 172.16.90.64/27 -g 172.16.90.1

esxcli network ip route ipv4 add -n 172.16.90.192/27 -g 172.16.90.129

esxcli network ip route ipv4 add -n 172.16.92.0/29 -g 172.16.90.1

on siteB, I added the following routes :

esxcli network ip route ipv4 add -n 172.16.90.0/27 -g 172.16.90.65

esxcli network ip route ipv4 add -n 172.16.90.128/27 -g 172.16.90.193

esxcli network ip route ipv4 add -n 172.16.92.0/29 -g 172.16.90.65

on siteC (witness) I added the following routes :

esxcli network ip route ipv4 add -n 172.16.90.64/27 -g 172.16.92.1

esxcli network ip route ipv4 add -n 172.16.90.192/27 -g 172.16.92.1

esxcli network ip route ipv4 add -n 172.16.90.0/27 -g 172.16.92.1

esxcli network ip route ipv4 add -n 172.16.90.128/27 -g 172.16.92.1
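
The static routes above follow a simple rule: every remote vSAN network is reached through the local gateway of the same traffic type, and the witness reaches everything through its single gateway. As a hedged illustration of that rule (the table layout and the `route_commands` helper are assumptions made for this sketch, not a VMware tool), the commands can be generated from the plan and compared with what was typed on each host:

```python
# Networks owned by each site, keyed by traffic type (values from the post).
NETWORKS = {
    "siteA": {"witness": "172.16.90.0/27", "replication": "172.16.90.128/27"},
    "siteB": {"witness": "172.16.90.64/27", "replication": "172.16.90.192/27"},
    "siteC": {"witness": "172.16.92.0/29"},
}

# Local gateway each site uses for a given traffic type (values from the post).
# The witness site has a single vSAN network, so both types share one gateway.
LOCAL_GATEWAYS = {
    "siteA": {"witness": "172.16.90.1", "replication": "172.16.90.129"},
    "siteB": {"witness": "172.16.90.65", "replication": "172.16.90.193"},
    "siteC": {"witness": "172.16.92.1", "replication": "172.16.92.1"},
}


def route_commands(site):
    """Build the esxcli static-route commands a host in `site` would need."""
    commands = []
    for remote_site, networks in NETWORKS.items():
        if remote_site == site:
            continue  # local networks are directly connected, no route needed
        for traffic, cidr in networks.items():
            gateway = LOCAL_GATEWAYS[site][traffic]
            commands.append(f"esxcli network ip route ipv4 add -n {cidr} -g {gateway}")
    return commands


if __name__ == "__main__":
    for site in NETWORKS:
        print(f"# {site}")
        print("\n".join(route_commands(site)))
```

Generating the commands from one table makes it harder for a single host to end up with a mistyped gateway, which is one of the first things worth ruling out here.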

vmkping works on all ESXi from/to all ESXi/witness.

esxcli vsan health cluster list gives green status on all health checks.

Health Test Name                                    Status
--------------------------------------------------  ----------
Overall health                                      green (OK)
Cluster                                             green
  ESXi vSAN Health service installation             green
  vSAN Health Service up-to-date                    green
  Advanced vSAN configuration in sync               green
  vSAN CLOMD liveness                               green
  vSAN Disk Balance                                 green
  Resync operations throttling                      green
  Software version compatibility                    green
  Disk format version                               green
Network                                             green
  Hosts disconnected from VC                        green
  Hosts with connectivity issues                    green
  vSAN cluster partition                            green
  All hosts have a vSAN vmknic configured           green
  vSAN: Basic (unicast) connectivity check          green
  vSAN: MTU check (ping with large packet size)     green
  vMotion: Basic (unicast) connectivity check       green
  vMotion: MTU check (ping with large packet size)  green
  Network latency check                             green
Data                                                green
  vSAN object health                                green
Limits                                              green
  Current cluster situation                         green
  After 1 additional host failure                   green
  Host component limit                              green
Physical disk                                       green
  Operation health                                  green
  Disk capacity                                     green
  Congestion                                        green
  Component limit health                            green
  Component metadata health                         green
  Memory pools (heaps)                              green
  Memory pools (slabs)                              green
Performance service                                 green
  Stats DB object                                   green
  Stats master election                             green
  Performance data collection                       green
  All hosts contributing stats                      green
  Stats DB object conflicts                         green

In the web client, health is not OK.

First, the "vsan cluster configuration consistency" check is yellow: two hosts and the witness are in a warning state.

So I used the "remediate inconsistent configuration" action icon, and the resulting "remediate vsan cluster" task has now been running for a long time, stuck at 81%.

pastedImage_0.png

pastedImage_1.png

Edit: it has now successfully failed after 1 hour. You can also see another failed task that I can't explain:

pastedImage_0.png

Also "Hosts with connectivity issues" is red with 2 hosts showing not ok :

pastedImage_2.png

The Flash client is just... lost?? You can see fields that are empty and should not be:

pastedImage_3.png

And it reports hosts as being pre-6.5?? What the??

All ESXi hosts, vCenter, and the witness host are on 6.7 U1.

Configuration Assist shows only one problem (one ESXi host has a faulty network card, so only 1 uplink is enabled):

pastedImage_7.png

Html5 client works, or not, depends on the mood :

pastedImage_5.png

same view, flash, not better :

pastedImage_10.png

All in all, I think my problems are network related, but I can't find out what the problem is.

As you can see, there is only 1 network partition :

pastedImage_11.png

esxcli unicastagent list, from SA-ESXi-01 :

[root@sa-esxi-01:~] esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness  Supports Unicast  IP Address      Port  Iface Name
------------------------------------  ---------  ----------------  -------------  -----  ----------
5bd9adf2-8ad7-5458-4d76-246e96d75924          0              true  172.16.90.134  12321
5bd9b2d5-ed54-0642-5345-246e96d78d24          0              true  172.16.90.135  12321
5b87dbbc-3a01-062c-b9d1-e4434b0139a8          0              true  172.16.90.197  12321
5bd8969c-9f9f-2d08-a5aa-246e96d755c4          0              true  172.16.90.133  12321
5b896631-415e-6c96-05bf-e4434b013828          0              true  172.16.90.198  12321
5b896aa1-32a5-7ad8-d1bb-e4434b013318          0              true  172.16.90.199  12321
5b7ea1a4-4bee-88f6-b5bc-e4434b0132d8          0              true  172.16.90.196  12321
5c1107b6-6197-81be-e9ff-005056927cdc          1              true  172.16.92.4    12321

esxcli unicastagent list, from SB-ESXi-01 :

[root@SB-ESXi-01:~] esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness  Supports Unicast  IP Address      Port  Iface Name
------------------------------------  ---------  ----------------  -------------  -----  ----------
5bd8969c-9f9f-2d08-a5aa-246e96d755c4          0              true  172.16.90.133  12321
5bd9adf2-8ad7-5458-4d76-246e96d75924          0              true  172.16.90.134  12321
5b896631-415e-6c96-05bf-e4434b013828          0              true  172.16.90.198  12321
5b87dbbc-3a01-062c-b9d1-e4434b0139a8          0              true  172.16.90.197  12321
5b896aa1-32a5-7ad8-d1bb-e4434b013318          0              true  172.16.90.199  12321
5bd0aaec-711d-c56a-0be1-246e96d75264          0              true  172.16.90.132  12321
5bd9b2d5-ed54-0642-5345-246e96d78d24          0              true  172.16.90.135  12321
5c1107b6-6197-81be-e9ff-005056927cdc          1              true  172.16.92.4    12321
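
Comparing the two unicastagent lists by eye is error-prone, so here is a small Python sketch that parses both outputs and cross-checks them. It assumes the column layout shown above (NodeUuid, IsWitness, Supports Unicast, IP Address, Port); the `parse_unicastagent` helper and variable names are invented for this illustration. Each host's list is expected to omit only the host itself, so the two lists should differ by exactly one entry each and agree on every shared IP.

```python
import re

# Rows are recognised by a leading 8-4-4-4-12 hex UUID; headers and prompts are skipped.
UUID_RE = re.compile(r"^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$")

FROM_SA_ESXI_01 = """
5bd9adf2-8ad7-5458-4d76-246e96d75924          0              true  172.16.90.134  12321
5bd9b2d5-ed54-0642-5345-246e96d78d24          0              true  172.16.90.135  12321
5b87dbbc-3a01-062c-b9d1-e4434b0139a8          0              true  172.16.90.197  12321
5bd8969c-9f9f-2d08-a5aa-246e96d755c4          0              true  172.16.90.133  12321
5b896631-415e-6c96-05bf-e4434b013828          0              true  172.16.90.198  12321
5b896aa1-32a5-7ad8-d1bb-e4434b013318          0              true  172.16.90.199  12321
5b7ea1a4-4bee-88f6-b5bc-e4434b0132d8          0              true  172.16.90.196  12321
5c1107b6-6197-81be-e9ff-005056927cdc          1              true  172.16.92.4    12321
"""

FROM_SB_ESXI_01 = """
5bd8969c-9f9f-2d08-a5aa-246e96d755c4          0              true  172.16.90.133  12321
5bd9adf2-8ad7-5458-4d76-246e96d75924          0              true  172.16.90.134  12321
5b896631-415e-6c96-05bf-e4434b013828          0              true  172.16.90.198  12321
5b87dbbc-3a01-062c-b9d1-e4434b0139a8          0              true  172.16.90.197  12321
5b896aa1-32a5-7ad8-d1bb-e4434b013318          0              true  172.16.90.199  12321
5bd0aaec-711d-c56a-0be1-246e96d75264          0              true  172.16.90.132  12321
5bd9b2d5-ed54-0642-5345-246e96d78d24          0              true  172.16.90.135  12321
5c1107b6-6197-81be-e9ff-005056927cdc          1              true  172.16.92.4    12321
"""


def parse_unicastagent(text):
    """Map NodeUuid -> (ip, is_witness) from unicastagent list output."""
    nodes = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 5 and UUID_RE.match(fields[0]):
            nodes[fields[0]] = (fields[3], fields[1] == "1")
    return nodes


sa = parse_unicastagent(FROM_SA_ESXI_01)
sb = parse_unicastagent(FROM_SB_ESXI_01)

only_in_sa = {uuid: sa[uuid][0] for uuid in sa.keys() - sb.keys()}
only_in_sb = {uuid: sb[uuid][0] for uuid in sb.keys() - sa.keys()}
ip_mismatches = {u for u in sa.keys() & sb.keys() if sa[u] != sb[u]}
witnesses = {ip for ip, is_witness in {**sa, **sb}.values() if is_witness}

if __name__ == "__main__":
    print("total distinct nodes:", len(sa.keys() | sb.keys()))
    print("only in SA list:", only_in_sa)
    print("only in SB list:", only_in_sb)
    print("IP mismatches:", ip_mismatches)
    print("witness IPs:", witnesses)
```

For the two outputs above, the lists differ only by the entry for the other queried host, which is consistent with a 4+4+1 membership and a single witness, so the unicast configuration itself does not look like the culprit.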

The HTML5 and Flash interfaces have been really slow to browse in the vSAN tabs since I enabled the stretched cluster.

My guess is that vCenter is trying to communicate with the ESXi hosts on the vSAN networks, but the vSAN networks are not reachable from vCenter (why would they be?? vCenter is supposed to communicate over the administration network). And if that were the problem, why would only a few hosts show up in an error state?

I say it again: using vmkping, from each host to each host, same site or different site, with the correct vmkX specified with -I, every network path works flawlessly.
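
For reference, the kind of vmkping test described above can be written down explicitly. This is a minimal sketch that only builds the command strings; the vmk numbering (`vmk2`, `vmk3`) and the 1500-byte MTU are assumptions, since the thread does not state them. The `-d` flag sets the don't-fragment bit and `-s` sets the ICMP payload size, so the payload is the MTU minus 28 bytes of IPv4 and ICMP headers.

```python
IP_HEADER = 20   # bytes, IPv4 header without options
ICMP_HEADER = 8  # bytes, ICMP echo header


def vmkping_command(vmk, target, mtu=1500, count=3):
    """Build a vmkping command that tests a full-size, unfragmented packet."""
    payload = mtu - IP_HEADER - ICMP_HEADER
    return f"vmkping -I {vmk} -d -s {payload} -c {count} {target}"


# Hypothetical interface mapping: the post does not say which vmk carries which traffic.
VMK_FOR = {"replication": "vmk2", "witness": "vmk3"}

# Example targets as seen from SA-ESXi-01: a siteB replication IP and the witness IP.
TARGETS = {"replication": ["172.16.90.196"], "witness": ["172.16.92.4"]}

if __name__ == "__main__":
    for traffic, ips in TARGETS.items():
        for ip in ips:
            print(vmkping_command(VMK_FOR[traffic], ip))
```

A plain vmkping with the default small payload can succeed while full-size packets are dropped somewhere along a routed path, so testing at the full MTU with `-d` is worth doing before concluding that every path works.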

So I may be missing something here, but it's quite a disappointment that vSAN is not able to tell me exactly what the problem is; instead it just fails on random tests / items.

Thanks for reading, and for any ideas.


Accepted Solutions
Sharantyr3
Enthusiast

Hello,

After digging into this with VMware support, we fixed the issue by recreating the cluster and vsan cluster + fixing some "writers" that I'm not able to explain.

But for anyone coming here in the future : The root cause has not been clearly identified but it's like 90% chance the cause was that I upgraded from 6.7 to 6.7 U1 in the wrong order.

I upgraded vsan witness, then ESXi, then vcenter. This is NOT what you should do.

Correct upgrade path : vcenter first, then vsan witness and the ESXis.

Replies
TheBobkin
Champion

Hello Sharantyr3,

With regard to your first issue (Network Diagnostic Mode), this is a known issue:

VMware Knowledge Base

The other issues though, this does sound like host-vCenter or vCenter-host communication issues (is it transient?) - how is your management traffic configured and is there any possibility that vSAN traffic is actually being routed through the Management network and causing contention issues?

Bob

Sharantyr3
Enthusiast

Hello,

Thanks for the KB link, but I think it's related to all my other problems, because when I was testing in all-L2 mode I didn't have this problem. Also, this does not impact only the witness.

I don't know how to answer your question; the network is as I described. Each ESXi host has 4 VMkernel adapters: 1 for management, 1 for vMotion, 1 for vSAN replication traffic and 1 for vSAN witness traffic.

Tell me what information you would need to better understand my environment.

Is vCenter supposed to reach the ESXi hosts on the vSAN replication and/or witness VMkernel IPs?

How is it possible that vCenter complains about connectivity issues whereas esxcli on the ESXi hosts shows no network issues?

Also, I don't believe in contention issues, as there are no VMs running (apart from 3 VMs for testing) and all hosts are on a 10Gb network.

Thanks

Edit :

I added a NIC on vcsa into siteA, vsan witness vlan : 172.16.90.0/27

I used 172.16.90.10 as it's free. I added routes :

route add -net 172.16.90.0 netmask 255.255.255.224 gw 172.16.90.1 dev eth1

route add -net 172.16.90.64 netmask 255.255.255.224 gw 172.16.90.1 dev eth1

route add -net 172.16.90.128 netmask 255.255.255.224 gw 172.16.90.1 dev eth1

route add -net 172.16.90.192 netmask 255.255.255.224 gw 172.16.90.1 dev eth1

route add -net 172.16.92.0 netmask 255.255.255.248 gw 172.16.90.1 dev eth1
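
The vcsa routes above use dotted netmasks while the ESXi routes use prefix lengths, so it is easy to lose track of whether they describe the same networks. As a small hedged check (the variable names are invented for this sketch), Python's `ipaddress` module can convert each netmask form to CIDR and flag which destination already contains the vcsa's own address 172.16.90.10:

```python
import ipaddress

# The vcsa address on eth1, as stated in the post.
VCSA_IP = ipaddress.ip_address("172.16.90.10")

# (network, netmask) pairs copied from the route add commands above.
ROUTES = [
    ("172.16.90.0", "255.255.255.224"),
    ("172.16.90.64", "255.255.255.224"),
    ("172.16.90.128", "255.255.255.224"),
    ("172.16.90.192", "255.255.255.224"),
    ("172.16.92.0", "255.255.255.248"),
]


def to_cidr(network, netmask):
    """Convert a dotted network/netmask pair into an IPv4Network object."""
    return ipaddress.ip_network(f"{network}/{netmask}")


cidrs = [to_cidr(network, netmask) for network, netmask in ROUTES]

# Destinations that contain the vcsa's own IP are directly connected on eth1.
on_link = [str(cidr) for cidr in cidrs if VCSA_IP in cidr]

if __name__ == "__main__":
    for cidr in cidrs:
        print(cidr)
    print("directly connected:", on_link)
```

The first destination is the subnet the new NIC already sits in. A directly connected network is typically reachable without a gateway route, so that first route is probably unnecessary, although it is unlikely to be the cause of the health-check failures.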

The vcsa can now reach any IP on vsan networks : replication and witness for siteA, B and C.

Still no success, so my problem might be somewhere else.

The log files are so verbose, and there are so many of them, that I don't know which one to look into. If anyone has a hint...

Edit 2: the issue seems transient:

pastedImage_1.png
