3 Replies Latest reply on Jan 30, 2019 1:00 AM by Sharantyr3

    stretched cluster networking issues

    Sharantyr3 Enthusiast

      Hello there,

       

      I'm running a 4+4+1 stretched cluster on 6.7 U1, with L3 everywhere for vSAN replication and witness traffic, currently in a testing phase.

       

      In my previous testing phase, I was using L2 everywhere (not supported) but wasn't having any networking issues.

      Now I have moved to L3 everywhere, and strange things started happening. I had such a hard time recovering the cluster that I had to erase my vSAN and recreate it from scratch.

       

      But the issues persist. So before opening an SR, maybe someone can give me some hints.

       

      To summarize

       

      administration vlan : 192.168.49.0/24, plain L2; all ESXi hosts and the witness have an IP in this network, and DNS names resolve to addresses in this network

       

      siteA, vsan witness vlan : 172.16.90.0/27 ; gateway 172.16.90.1

      siteB, vsan witness vlan : 172.16.90.64/27 ; gateway 172.16.90.65

      siteA, vsan replication vlan : 172.16.90.128/27 ; gateway 172.16.90.129

      siteB, vsan replication vlan : 172.16.90.192/27 ; gateway 172.16.90.193

      siteC, vsan vlan : 172.16.92.0/29 ; gateway 172.16.92.1
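A quick way to sanity-check an addressing plan like this one is to confirm that no two subnets overlap and that each gateway is a usable host address inside its own subnet. The following is a minimal sketch using Python's standard `ipaddress` module; the short labels in `PLAN` and the helper names are made up for illustration, and the administration gateway is left as `None` because the post does not give it.

```python
import ipaddress
from itertools import combinations

# (network, gateway) pairs copied from the post; labels are illustrative only.
PLAN = {
    "admin":             ("192.168.49.0/24", None),  # gateway not stated in the post
    "siteA-witness":     ("172.16.90.0/27", "172.16.90.1"),
    "siteB-witness":     ("172.16.90.64/27", "172.16.90.65"),
    "siteA-replication": ("172.16.90.128/27", "172.16.90.129"),
    "siteB-replication": ("172.16.90.192/27", "172.16.90.193"),
    "siteC-vsan":        ("172.16.92.0/29", "172.16.92.1"),
}

def usable_hosts(cidr):
    """Number of assignable host addresses (network and broadcast excluded)."""
    return ipaddress.ip_network(cidr).num_addresses - 2

def check_plan(plan):
    """Return a list of human-readable problems; an empty list means the plan is consistent."""
    problems = []
    nets = {name: ipaddress.ip_network(cidr) for name, (cidr, _) in plan.items()}
    # No two subnets may overlap.
    for (a, net_a), (b, net_b) in combinations(nets.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{a} overlaps {b}")
    # Each gateway must be a usable host address inside its own subnet.
    for name, (_, gw) in plan.items():
        if gw is None:
            continue
        net, gw_ip = nets[name], ipaddress.ip_address(gw)
        reserved = (net.network_address, net.broadcast_address)
        if gw_ip not in net or gw_ip in reserved:
            problems.append(f"{name}: gateway {gw} is not a usable address in {net}")
    return problems

for label, (cidr, _) in PLAN.items():
    print(f"{label:18} {cidr:18} usable hosts: {usable_hosts(cidr)}")
print(check_plan(PLAN) or "plan looks consistent")
```

This only validates the arithmetic of the plan; it says nothing about what the physical routers actually do.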

       

      on siteA, I added the following routes :

      esxcli network ip route ipv4 add -n 172.16.90.64/27 -g 172.16.90.1

      esxcli network ip route ipv4 add -n 172.16.90.192/27 -g 172.16.90.129

      esxcli network ip route ipv4 add -n 172.16.92.0/29 -g 172.16.90.1

       

      on siteB, I added the following routes :

      esxcli network ip route ipv4 add -n 172.16.90.0/27 -g 172.16.90.65

      esxcli network ip route ipv4 add -n 172.16.90.128/27 -g 172.16.90.193

      esxcli network ip route ipv4 add -n 172.16.92.0/29 -g 172.16.90.65

       

       

      on siteC (witness) I added the following routes :

      esxcli network ip route ipv4 add -n 172.16.90.64/27 -g 172.16.92.1

      esxcli network ip route ipv4 add -n 172.16.90.192/27 -g 172.16.92.1

      esxcli network ip route ipv4 add -n 172.16.90.0/27 -g 172.16.92.1

      esxcli network ip route ipv4 add -n 172.16.90.128/27 -g 172.16.92.1
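The static routes above can be cross-checked mechanically: every gateway has to sit on a subnet the site is directly attached to, and every remote vSAN subnet needs a route. Below is a minimal sketch of that check, assuming the subnets and routes exactly as listed in this post; `check_routes` and the dictionary layout are illustrative names, not anything from esxcli.

```python
import ipaddress

# Subnets each site is directly attached to on its vSAN vmkernel ports (from the post).
LOCAL = {
    "siteA": ["172.16.90.0/27", "172.16.90.128/27"],
    "siteB": ["172.16.90.64/27", "172.16.90.192/27"],
    "siteC": ["172.16.92.0/29"],
}

# Static routes as (destination, gateway), copied from the esxcli commands in the post.
ROUTES = {
    "siteA": [("172.16.90.64/27", "172.16.90.1"),
              ("172.16.90.192/27", "172.16.90.129"),
              ("172.16.92.0/29", "172.16.90.1")],
    "siteB": [("172.16.90.0/27", "172.16.90.65"),
              ("172.16.90.128/27", "172.16.90.193"),
              ("172.16.92.0/29", "172.16.90.65")],
    "siteC": [("172.16.90.64/27", "172.16.92.1"),
              ("172.16.90.192/27", "172.16.92.1"),
              ("172.16.90.0/27", "172.16.92.1"),
              ("172.16.90.128/27", "172.16.92.1")],
}

def check_routes(local, routes):
    """Return a list of problems; an empty list means the route set is self-consistent."""
    problems = []
    all_nets = {n for nets in local.values() for n in nets}
    for site, site_routes in routes.items():
        local_nets = [ipaddress.ip_network(n) for n in local[site]]
        for dest, gw in site_routes:
            # The next hop must be reachable on a directly connected subnet.
            if not any(ipaddress.ip_address(gw) in n for n in local_nets):
                problems.append(f"{site}: gateway {gw} is not on a local subnet")
            # A static route to a directly connected subnet is suspicious.
            if any(ipaddress.ip_network(dest).overlaps(n) for n in local_nets):
                problems.append(f"{site}: route to {dest} targets a local subnet")
        # Every vSAN subnet of the other sites should have a route.
        routed = {dest for dest, _ in site_routes}
        for missing in sorted(all_nets - set(local[site]) - routed):
            problems.append(f"{site}: no route to {missing}")
    return problems

print(check_routes(LOCAL, ROUTES) or "routes look consistent")
```

With the values from this post the check finds nothing wrong, which matches the vmkping results: the route tables themselves look coherent.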

       

       

      vmkping works from every ESXi host to every other ESXi host and to the witness, and back.

       

      esxcli vsan health cluster list gives green status on all health checks.

       

      Health Test Name                                    Status
      --------------------------------------------------  ----------
      Overall health                                      green (OK)
      Cluster                                             green
        ESXi vSAN Health service installation             green
        vSAN Health Service up-to-date                    green
        Advanced vSAN configuration in sync               green
        vSAN CLOMD liveness                               green
        vSAN Disk Balance                                 green
        Resync operations throttling                      green
        Software version compatibility                    green
        Disk format version                               green
      Network                                             green
        Hosts disconnected from VC                        green
        Hosts with connectivity issues                    green
        vSAN cluster partition                            green
        All hosts have a vSAN vmknic configured           green
        vSAN: Basic (unicast) connectivity check          green
        vSAN: MTU check (ping with large packet size)     green
        vMotion: Basic (unicast) connectivity check       green
        vMotion: MTU check (ping with large packet size)  green
        Network latency check                             green
      Data                                                green
        vSAN object health                                green
      Limits                                              green
        Current cluster situation                         green
        After 1 additional host failure                   green
        Host component limit                              green
      Physical disk                                       green
        Operation health                                  green
        Disk capacity                                     green
        Congestion                                        green
        Component limit health                            green
        Component metadata health                         green
        Memory pools (heaps)                              green
        Memory pools (slabs)                              green
      Performance service                                 green
        Stats DB object                                   green
        Stats master election                             green
        Performance data collection                       green
        All hosts contributing stats                      green
        Stats DB object conflicts                         green

       

      In the web client, health is not OK.

      First, the "vSAN cluster configuration consistency" check is yellow: two hosts and the witness are in a warning state.

      So I used the "remediate inconsistent configuration" action icon, and I now have a "remediate vsan cluster" task that has been running for a long time, stuck at 81%.

       

       

       

      Edit: it has now "successfully" failed after 1 hour. You can also see another failed task that I can't explain:

       

       

      Also "Hosts with connectivity issues" is red with 2 hosts showing not ok :

       

       

      The Flash client is just... lost?? You can see empty fields that should not be empty:

      It also reports hosts as being pre-6.5?? What the??

      All ESXi hosts and vCenter are on 6.7 U1, as is the witness host.

       

      Configuration Assist shows only one problem (one ESXi host has a faulty network card, so only 1 uplink is enabled):

       

       

      The HTML5 client works, or not, depending on its mood:

       

       

      Same view in the Flash client, no better:

       

       

       

       

      All in all, I think my problems are network related, but I can't find out what the problem is.

      As you can see, there is only 1 network partition:

       

      esxcli unicastagent list, from SA-ESXi-01 :

       

      [root@sa-esxi-01:~] esxcli vsan cluster unicastagent list
      NodeUuid                              IsWitness  Supports Unicast  IP Address      Port  Iface Name
      ------------------------------------  ---------  ----------------  -------------  -----  ----------
      5bd9adf2-8ad7-5458-4d76-246e96d75924          0              true  172.16.90.134  12321
      5bd9b2d5-ed54-0642-5345-246e96d78d24          0              true  172.16.90.135  12321
      5b87dbbc-3a01-062c-b9d1-e4434b0139a8          0              true  172.16.90.197  12321
      5bd8969c-9f9f-2d08-a5aa-246e96d755c4          0              true  172.16.90.133  12321
      5b896631-415e-6c96-05bf-e4434b013828          0              true  172.16.90.198  12321
      5b896aa1-32a5-7ad8-d1bb-e4434b013318          0              true  172.16.90.199  12321
      5b7ea1a4-4bee-88f6-b5bc-e4434b0132d8          0              true  172.16.90.196  12321
      5c1107b6-6197-81be-e9ff-005056927cdc          1              true  172.16.92.4    12321

      esxcli unicastagent list, from SB-ESXi-01 :

      [root@SB-ESXi-01:~] esxcli vsan cluster unicastagent list
      NodeUuid                              IsWitness  Supports Unicast  IP Address      Port  Iface Name
      ------------------------------------  ---------  ----------------  -------------  -----  ----------
      5bd8969c-9f9f-2d08-a5aa-246e96d755c4          0              true  172.16.90.133  12321
      5bd9adf2-8ad7-5458-4d76-246e96d75924          0              true  172.16.90.134  12321
      5b896631-415e-6c96-05bf-e4434b013828          0              true  172.16.90.198  12321
      5b87dbbc-3a01-062c-b9d1-e4434b0139a8          0              true  172.16.90.197  12321
      5b896aa1-32a5-7ad8-d1bb-e4434b013318          0              true  172.16.90.199  12321
      5bd0aaec-711d-c56a-0be1-246e96d75264          0              true  172.16.90.132  12321
      5bd9b2d5-ed54-0642-5345-246e96d78d24          0              true  172.16.90.135  12321
      5c1107b6-6197-81be-e9ff-005056927cdc          1              true  172.16.92.4    12321
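Comparing these lists by eye gets tedious with nine nodes. Here is a minimal sketch that parses the `esxcli vsan cluster unicastagent list` output and reports any peer a host is missing. It assumes that a host does not list itself, and therefore that sa-esxi-01 is 172.16.90.132 and SB-ESXi-01 is 172.16.90.196 (the only data addresses absent from their own lists); the function names are illustrative.

```python
SA_ESXI_01 = """\
5bd9adf2-8ad7-5458-4d76-246e96d75924          0              true  172.16.90.134  12321
5bd9b2d5-ed54-0642-5345-246e96d78d24          0              true  172.16.90.135  12321
5b87dbbc-3a01-062c-b9d1-e4434b0139a8          0              true  172.16.90.197  12321
5bd8969c-9f9f-2d08-a5aa-246e96d755c4          0              true  172.16.90.133  12321
5b896631-415e-6c96-05bf-e4434b013828          0              true  172.16.90.198  12321
5b896aa1-32a5-7ad8-d1bb-e4434b013318          0              true  172.16.90.199  12321
5b7ea1a4-4bee-88f6-b5bc-e4434b0132d8          0              true  172.16.90.196  12321
5c1107b6-6197-81be-e9ff-005056927cdc          1              true  172.16.92.4    12321
"""

SB_ESXI_01 = """\
5bd8969c-9f9f-2d08-a5aa-246e96d755c4          0              true  172.16.90.133  12321
5bd9adf2-8ad7-5458-4d76-246e96d75924          0              true  172.16.90.134  12321
5b896631-415e-6c96-05bf-e4434b013828          0              true  172.16.90.198  12321
5b87dbbc-3a01-062c-b9d1-e4434b0139a8          0              true  172.16.90.197  12321
5b896aa1-32a5-7ad8-d1bb-e4434b013318          0              true  172.16.90.199  12321
5bd0aaec-711d-c56a-0be1-246e96d75264          0              true  172.16.90.132  12321
5bd9b2d5-ed54-0642-5345-246e96d78d24          0              true  172.16.90.135  12321
5c1107b6-6197-81be-e9ff-005056927cdc          1              true  172.16.92.4    12321
"""

# Expected membership, taken as the union of the two lists in the post.
EXPECTED_DATA = {f"172.16.90.{n}" for n in (132, 133, 134, 135, 196, 197, 198, 199)}
WITNESS = "172.16.92.4"

def parse_unicastagent(text):
    """Return {ip: is_witness} for every data row; header, separator and prompt lines are skipped."""
    peers = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows look like: uuid, IsWitness (0/1), Supports Unicast, IP, port
        if len(fields) >= 5 and fields[1] in ("0", "1") and fields[2] in ("true", "false"):
            peers[fields[3]] = fields[1] == "1"
    return peers

def missing_peers(output, own_ip):
    """Peers that should be listed on this host (everyone but itself) and are not."""
    expected = (EXPECTED_DATA | {WITNESS}) - {own_ip}
    return sorted(expected - set(parse_unicastagent(output)))

print("sa-esxi-01 missing:", missing_peers(SA_ESXI_01, "172.16.90.132"))
print("SB-ESXi-01 missing:", missing_peers(SB_ESXI_01, "172.16.90.196"))
```

Under those assumptions both hosts list every other data node plus the witness, so the unicast membership itself does not explain the warnings in vCenter.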

       

      The vSAN tabs in both the HTML5 and Flash interfaces have been really slow to browse since I enabled the stretched cluster.

       

      My guess is that vCenter is trying to communicate with the ESXi hosts over the vSAN networks, but the vSAN networks are not reachable from vCenter (why would they be?? vCenter is supposed to communicate over the administration network). And if that were the problem, why would only a few hosts show up in an error state?

       

      I'll say it again: using vmkping from each host to each host, same site or different site, with the correct vmkX specified with -I, every network path works flawlessly.
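For anyone wanting to repeat this full-mesh test, the commands can be generated rather than typed by hand. The snippet below is only a sketch: `vmk2` is a hypothetical vmkernel interface name (the post does not say which vmkX carries replication traffic), and the peer list is the set of replication addresses from the unicastagent output, assuming the local host is 172.16.90.132.

```python
def vmkping_commands(vmk, peers):
    """Build one vmkping command per peer, forcing the outgoing vmkernel interface with -I."""
    return [f"vmkping -I {vmk} {ip}" for ip in peers]

# Replication addresses of the other seven data nodes, as seen from 172.16.90.132 (assumed local host).
REPLICATION_PEERS = [
    "172.16.90.133", "172.16.90.134", "172.16.90.135",
    "172.16.90.196", "172.16.90.197", "172.16.90.198", "172.16.90.199",
]

# vmk2 is a placeholder; substitute the vmkernel port tagged for vSAN on the host.
for cmd in vmkping_commands("vmk2", REPLICATION_PEERS):
    print(cmd)
```

The same helper can be reused with the witness-traffic vmkernel port and the witness address to cover the path to siteC.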

       

      So I may be missing something here, but it's quite a disappointment that vSAN is not able to tell me exactly what the problem is; instead it just fails on random tests / items.

       

       

      Thanks for reading, and for any ideas.

        • 1. Re: stretched cluster networking issues
          TheBobkin Virtuoso
          vExpert, VMware Employees

          Hello Sharantyr3,

           

           

          With regard to your first issue (Network Diagnostic Mode), this is a known issue:

          VMware Knowledge Base

           

          The other issues, though, do sound like host-vCenter or vCenter-host communication issues (are they transient?). How is your management traffic configured, and is there any possibility that vSAN traffic is actually being routed through the Management network and causing contention issues?

           

           

          Bob

          • 2. Re: stretched cluster networking issues
            Sharantyr3 Enthusiast

            Hello,

             

            Thanks for the KB link, but I think it's related to all my other problems, because when I was testing in all-L2 mode I didn't have this problem. Also, it does not affect only the witness.

             

            I don't know how to answer your question; the network is as I described. Each ESXi host has 4 vmkernel ports: 1 for management, 1 for vMotion, 1 for vSAN replication traffic and 1 for vSAN witness traffic.

            Tell me what information you would need to better understand my environment.

             

            Is vCenter supposed to reach the ESXi hosts on their vSAN replication and/or witness vmkernel IPs?

             

            How is it possible that vCenter complains about connectivity issues whereas esxcli on the ESXi hosts shows no network issues?

             

            Also, I don't believe contention is the issue, as there are no VMs running (apart from 3 VMs for testing) and all hosts are on a 10Gb network.

             

            Thanks

             

             

            Edit :

            I added a NIC to the VCSA in the siteA vSAN witness VLAN : 172.16.90.0/27

            I used 172.16.90.10 as it's free. I added these routes :

             

             

            route add -net 172.16.90.0 netmask 255.255.255.224 gw 172.16.90.1 dev eth1

            route add -net 172.16.90.64 netmask 255.255.255.224 gw 172.16.90.1 dev eth1

            route add -net 172.16.90.128 netmask 255.255.255.224 gw 172.16.90.1 dev eth1

            route add -net 172.16.90.192 netmask 255.255.255.224 gw 172.16.90.1 dev eth1

            route add -net 172.16.92.0 netmask 255.255.255.248 gw 172.16.90.1 dev eth1

             

            The VCSA can now reach any IP on the vSAN networks: replication and witness for siteA, B and C.

             

            Still no success, so my problem might be somewhere else.

             

            The log files are so verbose, and there are so many of them, that I don't know which one to look into. If anyone has a hint...

             

             

            Edit 2: the issue seems transient :

             

            • 3. Re: stretched cluster networking issues
              Sharantyr3 Enthusiast

              Hello,

               

              After digging into this with VMware support, we fixed the issue by recreating the cluster and the vSAN cluster, plus fixing some "writers" that I'm not able to explain.

               

              But for anyone coming here in the future: the root cause has not been clearly identified, but there is roughly a 90% chance that it was because I upgraded from 6.7 to 6.7 U1 in the wrong order.

               

              I upgraded the vSAN witness, then the ESXi hosts, then vCenter. This is NOT what you should do.

              Correct upgrade order: vCenter first, then the vSAN witness and the ESXi hosts.