9 Replies Latest reply on Mar 26, 2020 1:29 AM by depping

    vsan stretched cluster - multiple HA isolation address

    Sharantyr3 Enthusiast

      Hello there,

       

      I understand it is recommended to have 2 isolation addresses for HA, one per site in the case of a stretched cluster (Advanced Options | vSAN Stretched Cluster Guide | VMware).

      So I configured an IP on site 1 (preferred) and another IP on site 2 (secondary).

       

      I crash-tested vSAN and HA by shutting down the replication link between site 1 and site 2.

      vSAN and HA worked as they should: VMs were powered off on site 2 and restarted on site 1. OK.

       

      But the vCenter web interface was full of errors claiming that HA could not restart VMs on site 2 hosts (insufficient resources).

      Only a graphical glitch I guess, so not that bad as VMs were restarted on site 1 where the storage was available.

       

      But I was wondering about our case, where ESXi hosts on site 1 can reach isolation address 1 but not isolation address 2,

      and ESXi hosts on site 2 can reach isolation address 2 but not isolation address 1.

       

      How is HA supposed to handle this ?

        • 1. Re: vsan stretched cluster - multiple HA isolation address
          Nawals Hot Shot

          Are both sites' isolation addresses reachable from each other? If not, please check network connectivity between those IPs.

          Nawals
          Please mark helpful or correct if your issue resolved.
          • 2. Re: vsan stretched cluster - multiple HA isolation address
            Sharantyr3 Enthusiast

            I don't understand your question. From what point of view are you asking?

             

            HA IP 1 is on Site 1 witness subnet

            HA IP 2 is on Site 2 witness subnet

             

            Both IPs are reachable from all ESXi on Site 1 and Site 2

             

            I was just wondering what the intended mechanism is behind having 2 isolation addresses when one site can reach only one IP and the other site can reach only the other IP.

            I can't find any doc explaining how HA should make a decision in that case.

            Does isolation address 1 take precedence over isolation address 2?

             

            Take the following case :

            Site 1 and 2 : 192.168.0.0/24 : "vsan replication network"

            Site 1 : 192.168.1.0/24 : "witness network 1", 192.168.1.1 router IP on site 1, used as isolation address 1 - HA IP 1

            Site 2 : 192.168.2.0/24 : "witness network 2", 192.168.2.1 router IP on site 2, used as isolation address 2 - HA IP 2

            Site 3 (witness) : 192.168.3.0/24 : "witness network 3"

             

            if there is a network outage between site 1 and site 2 :

            Site 1 and Site 2 cannot replicate anymore

            Site 1 and Site 2 can reach site 3 (witness)

            Site 1 can reach HA IP 1, not HA IP 2

            Site 2 can reach HA IP 2, not HA IP 1
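To make the scenario concrete, here is a minimal Python sketch of which isolation addresses each site can still ping. The das.* keys are the vSphere HA advanced option names; ADDRESS_SITE and reachable_addresses are hypothetical names made up for illustration, and the model assumes that the other site's address is reachable only over the inter-site link, as in the case above.

```python
# HA advanced options as they might be set for the example above.
# The option names are vSphere HA settings; the values come from the example.
ha_advanced_options = {
    "das.usedefaultisolationaddress": "false",
    "das.isolationaddress0": "192.168.1.1",  # HA IP 1, router on site 1
    "das.isolationaddress1": "192.168.2.1",  # HA IP 2, router on site 2
}

# Hypothetical model: the site each isolation address lives on.
ADDRESS_SITE = {"192.168.1.1": "site1", "192.168.2.1": "site2"}


def reachable_addresses(host_site, inter_site_link_up):
    """Return the isolation addresses a host on host_site can ping."""
    addresses = [
        value
        for key, value in sorted(ha_advanced_options.items())
        if key.startswith("das.isolationaddress")
    ]
    # Assumption: the remote address is only reachable across the inter-site link.
    return [
        addr
        for addr in addresses
        if inter_site_link_up or ADDRESS_SITE[addr] == host_site
    ]


for site in ("site1", "site2"):
    print(site, reachable_addresses(site, inter_site_link_up=False))
```

With the link down, each site still reaches exactly one address, so a ping test alone never marks either site as isolated.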

             

            I tested that and found that HA restarted VMs on Site 1, not because it was aware of the vSAN "preferred" site, but only because storage (vSAN) was accessible on site 1.

            But it did raise many alarms and errors complaining about not being able to restart VMs on site 2 (insufficient resources).

             

            Anyway, my main question is more about understanding how HA is supposed to handle multiple isolation addresses on multiple sites (specifically for vSAN stretched clusters).

            • 3. Re: vsan stretched cluster - multiple HA isolation address
              Nawals Hot Shot

              Follow this link for more understanding. Advanced Options | vSAN Stretched Cluster Guide | VMware 

              Nawals
              Please mark helpful or correct if your issue resolved.
              • 4. Re: vsan stretched cluster - multiple HA isolation address
                Sharantyr3 Enthusiast

                Sorry, but I don't think you understood me; maybe my English is that bad.

                Anyway, the link you provided me is just an "intro" to my topic, but you led me on the right path.

                 

                I found this:

                "When the master host stops receiving these heartbeats from a slave host, it checks for host liveness before declaring the host to have failed."

                "Host network isolation occurs when a host is still running, but it can no longer observe traffic from vSphere HA agents on the management network. If a host stops observing this traffic (1st action), it attempts (2nd action) to ping the cluster isolation addresses."

                Source
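The order of checks in the quoted text can be sketched in Python as follows. This is only an illustrative model, not the HA agent's actual logic: host_declares_isolation and its parameters are hypothetical names, and it assumes (as I understand it) that a host considers itself isolated only when none of the configured isolation addresses answers.

```python
def host_declares_isolation(observes_ha_traffic, ping_results):
    """Model of the two-step check described in the quote.

    observes_ha_traffic: True if the host still sees vSphere HA agent
        traffic on the management network.
    ping_results: dict mapping isolation address -> True if it answered.
    """
    # 1st action: while HA agent traffic is visible, the host is not
    # isolated and the isolation addresses are never consulted.
    if observes_ha_traffic:
        return False
    # 2nd action: ping the isolation addresses. Assumption: the host treats
    # itself as isolated only if none of them answers.
    return not any(ping_results.values())


# A host that lost all HA traffic but still reaches one address:
print(host_declares_isolation(False, {"192.168.1.1": True, "192.168.2.1": False}))
```

In this model, a single reachable isolation address is enough to keep a host from declaring itself isolated.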

                 

                So communication with the HA master node is taken into account before the isolation address, which makes it more important. It is the master that determines which site should stay up when there is a cross-site link failure.

                 

                In my case, I noticed, before my test, that HA master was on Site 2.

                 

                So that explains why HA first tried to restart on Site 2 regardless of vSAN availability on site 1: the master was on site 2!

                But then the "smart" mechanics of HA found out the storage was available on site 1, and the HA master moved to site 1 (I just checked, and the HA master is indeed on site 1).

                Too bad I don't have the fdm.log from the time of this test; it would have been interesting to validate this.

                • 5. Re: vsan stretched cluster - multiple HA isolation address
                  MikeStoica Expert
                  vExpert

                  Do you have a Witness set up? Did you follow the steps in Creating a New vSAN Stretched Cluster | vSAN Stretched Cluster Guide | VMware when creating the stretched cluster?

                  • 6. Re: vsan stretched cluster - multiple HA isolation address
                    depping Champion
                    VMware EmployeesUser Moderators

                    Actually, what you are stating is not correct. There are two things here:

                    1. Availability of vSAN components

                    2. HA

                     

                    If the connection between the locations is gone (between data locations), each location will end up with its own master, as an election will happen! I described those HA details here:

                    Clustering Deep Dive eBook

                     

                    From a VM point of view, the VMs which reside in the "secondary" location (which you specified during creation of the stretched cluster) will lose access to disk when the connection between data locations is impacted. This is because the Witness will bind itself to the preferred location. You can find all those details here:

                    vSAN Stretched Cluster Guide | VMware
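A simplified way to picture why the secondary location loses disk access is a majority check. The Python sketch below is an assumption-laden toy model: it gives each object one equally weighted component per site (preferred, secondary, witness), which is not how real vSAN vote counts are necessarily laid out, and both function names are hypothetical.

```python
def partitions_after_data_link_failure():
    """Partitions once the link between the data locations is down.

    The witness joins the preferred location, as described above.
    """
    return [{"preferred", "witness"}, {"secondary"}]


def object_accessible(partition, component_sites=("preferred", "secondary", "witness")):
    """Toy rule: an object is accessible where more than half of its
    components are present (one equally weighted component per site)."""
    present = sum(1 for site in component_sites if site in partition)
    return present * 2 > len(component_sites)


for partition in partitions_after_data_link_failure():
    print(sorted(partition), object_accessible(partition))
```

In this model, the preferred location plus the witness holds two of three components and keeps access, while the secondary location holds one and loses it.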

                    • 7. Re: vsan stretched cluster - multiple HA isolation address
                      Sharantyr3 Enthusiast

                      Hello Mr Depping,

                       

                      I should have asked you live at VMworld Barcelona! Thanks for the link to the Clustering Deep Dive; it's nice to have something to read while locked at home because of COVID.

                       

                      "If the connection between the locations is gone (between data locations), each location will end up with its own master, as an election will happen!"

                       

                      OK, that seems logical. Just to be sure, did you try it with this setup?

                      Stretched cluster

                      HA master running on secondary site

                      shut the vsan data link (not witness link and not vcenter<->esxis management link)

                       

                      I got many alarms from vcenter stating that it could not restart VMs on secondary site like this :

                      Target: my-vm

                      Previous Status: Green

                      New Status: Red

                       

                      Alarm Definition:

                      ([Event alarm expression: Insufficient resources for vSphere HA to start the VM. Reason: {reason.@enum.fdm.placementFault}; Status = Red] OR [Event alarm expression: vSphere HA failed to restart a network isolated virtual machine; Status = Red] OR [Event alarm expression: VM powered on; Status = Green] OR [Event alarm expression: vSphere HA restarted a virtual machine; Status = Green])

                       

                      Event details:

                      Insufficient resources to fail over my-vm in Cluster-1 that resides in Datacenter. vSphere HA will retry the fail over when enough resources are available. Reason: The host(s) cannot access virtual machine components

                       

                      I don't understand which part of my message about the alarms is incorrect, because to me it's clear that HA tried to restart on the vSAN secondary site (all VMs had an HA warning raised).

                       

                      Also, my main question was more about multiple HA isolation addresses. Best practices state that you should have one on each site, but when the inter-site link is down, each site ends up with one reachable HA isolation address, so from HA's point of view no site is isolated if it relies only on the isolation addresses.

                      That's why I assumed that HA master had a major role, and why I got these messages.

                       

                      I'll read your bible on HA to see if I can find my answer.

                      • 8. Re: vsan stretched cluster - multiple HA isolation address
                        depping Champion
                        VMware EmployeesUser Moderators

                        Yes, I have tested this many times. What you are seeing are false positive warnings; this is just a UI artefact, nothing to worry about.

                        • 9. Re: vsan stretched cluster - multiple HA isolation address
                          depping Champion
                          VMware EmployeesUser Moderators

                          Also, when it comes to the isolation address keep in mind that the following happens:

                           

                          1. Master in Site A
                          2. Network fails between data locations
                          3. Master observes no traffic from Site B
                          4. Hosts in Site B observe no traffic from Master
                          5. Site A will form a "sub cluster"
                          6. Site B will trigger a master election process
                          7. Site B will form a "sub cluster" with a master in Site B

                           

                          As communication is still possible between the nodes within each sub cluster, an "isolation" can never be declared; the isolation address doesn't have much to do with that either way.
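The sequence above can be sketched in Python as a partition followed by a per-partition election. This is a toy model under stated assumptions: the host names are invented, sub_clusters and elect_master are hypothetical helpers, and the election rule (highest host name) is an arbitrary placeholder rather than the real vSphere HA election criteria.

```python
def sub_clusters(hosts_by_site, inter_site_link_up):
    """Group hosts into the sets that can still exchange HA traffic."""
    if inter_site_link_up:
        return [sorted(h for hosts in hosts_by_site.values() for h in hosts)]
    return [sorted(hosts) for _, hosts in sorted(hosts_by_site.items())]


def elect_master(sub_cluster, current_master=None):
    """Keep the current master if it is in this sub cluster, else elect one.

    The rule used here (highest host name) is a placeholder, not the
    real vSphere HA election criteria.
    """
    if current_master in sub_cluster:
        return current_master
    return max(sub_cluster)


hosts = {"A": ["esx-a1", "esx-a2"], "B": ["esx-b1", "esx-b2"]}
partitions = sub_clusters(hosts, inter_site_link_up=False)
masters = [elect_master(sc, current_master="esx-a1") for sc in partitions]

# Every host still sees at least one peer in its own sub cluster,
# so in this model no host is left alone.
hosts_alone = [sc[0] for sc in partitions if len(sc) == 1]
print(masters, hosts_alone)
```

Site A keeps its existing master, Site B elects a new one, and no host is alone in its partition, which matches the point that an isolation is never declared in this scenario.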