4 Replies Latest reply on Mar 5, 2019 2:24 AM by depping

    How to avoid Split Brain in a stretched Cluster

    Dormelchen2 Novice

      Hello,

       

      I have been searching the internet for an answer, but I am stuck.

      Here is what we have and what happened.

      We have a cluster with 4 hosts.

      2 on side A and 2 on side B

      2 dark fibre links between A and B

      The hosts are connected to the network with 3 physical adapters and to the storage (a DataCore cluster) via 4 FC HBAs

       

      What happened? Both dark fibre links were cut (we do not know by what).

      On the vSphere server on side A we saw 2 hosts disconnected and 50% of the VMs disconnected.

      I was able to connect to side B through a backdoor, open the web interface of the ESXi hosts, and see that those 50% of the VMs (the ones shown as disconnected on side A) were running fine.

       

      So I shut down the VMs on side B, removed the disconnected hosts on side A, registered the disconnected VMs there, and powered them on.

       

      That is not a good solution.

       

      How can we set this up so that one side shuts down its remaining running VMs and the other side powers them on?

        • 1. Re: How to avoid Split Brain in a stretched Cluster
          sk84 Expert
          vExpert

          What did you expect from this failure scenario?

           

          If HA had worked automatically, site A would have started the VMs from site B and site B the VMs from site A, since these hosts are in the same situation. In the end all VMs would have run twice and you would have had a real split brain scenario and far more problems.
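          The double-start effect described above can be illustrated with a toy model. This is only a sketch in plain Python, not anything vSphere HA actually runs: the VM names and the `naive_restart` rule are made up for illustration, and the model assumes each site simply restarts every VM it can no longer see.

```python
# Toy model: two sites lose the inter-site link and each applies a naive
# "restart whatever I cannot see" rule. No VMware API is involved.

def naive_restart(local_vms, all_vms):
    """Return the VMs a site would start: everything it cannot see running."""
    return set(all_vms) - set(local_vms)

all_vms = {"vm1", "vm2", "vm3", "vm4"}             # hypothetical VM names
running = {"A": {"vm1", "vm2"}, "B": {"vm3", "vm4"}}

# Each site applies the rule in isolation; the VMs on the other site
# keep running because nothing has actually failed there.
after = {site: vms | naive_restart(vms, all_vms) for site, vms in running.items()}

# VMs that now run on both sites at the same time.
duplicates = after["A"] & after["B"]
print(sorted(duplicates))
```

          With a symmetric rule and no tie-breaker, every VM ends up in `duplicates`, which is the split brain being warned about.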

           

          If your infrastructure is set up properly, a failure of the dark fiber connections won't be a problem because the VMs on both sites can continue to run and function. Only management functions would have been limited for the time of the dark fiber cut.

           

          And in this scenario with a Metro Cluster and 2 sites, there can only be a manual failover in which a human decides on which site all VMs should be started (in most cases).

           

          More information can be found in the Metrocluster Best Practice guide:

          VMware vSphere® Metro Storage Cluster Recommended Practices

           

          For the sake of completeness I have to mention that there are also Active-Active storage solutions with automatic failover on the market:

          VMware Knowledge Base

          • 2. Re: How to avoid Split Brain in a stretched Cluster
            depping Champion
            VMware Employees, User Moderators

            Look at the recommendations in the whitepaper I wrote, which is mentioned above. Since you are using DataCore, you may also want to look at their best practices. Normally you would see an APD or PDL being triggered. If you have the automatic response to a PDL or APD enabled in vSphere HA, the VMs that lost access to storage should be restarted in the remote location AND powered off in the "offline" location.
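            As a rough illustration of that behaviour, here is a small decision table in Python. It is a sketch only: the `Storage` enum, the `ha_action` function and the returned strings are hypothetical names, and the real vSphere HA logic (timeouts, response options) is more involved than this.

```python
from enum import Enum

class Storage(Enum):
    OK = "ok"
    APD = "all paths down"
    PDL = "permanent device loss"

def ha_action(storage, response_enabled):
    """Hypothetical decision table, not vSphere HA's actual algorithm."""
    if storage is Storage.OK:
        return "keep running"
    if not response_enabled:
        # Without a configured response, nothing restarts the VM elsewhere.
        return "no automatic action"
    # With the response enabled, the VM is powered off where storage is
    # gone and restarted on a host that still has storage access.
    return "power off here, restart on a host with storage access"

for state in Storage:
    print(state.name, "->", ha_action(state, response_enabled=True))
```

            The point of the sketch is that HA only acts once the host actually sees an APD or PDL condition, so whether the storage layer raises one on the losing site is what matters.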

            • 3. Re: How to avoid Split Brain in a stretched Cluster
              Dormelchen2 Novice

              Hello sk84,

               

              The problem is:

              [quote]If your infrastructure is set up properly, a failure of the dark fiber connections won't be a problem because the VMs on both sites can continue to run and function. Only management functions would have been limited for the time of the dark fiber cut.

              [/quote]

               

              Yes - they are running on both sites and that is the problem.

              If the dark fibre "breaks" we need to run ALL VMs in the other location.

              So we need to shut down Location A and everything should run in Location B.

               

              And this is what I am looking for. Some automation like:

               

              Location A is not reachable from Location B and vice versa:

              - shut down all VMs on A

              - stop DataCore on A

              - start up all VMs on B (which were on A)

               

              I don't know how to implement this. The most important thing is that Location B keeps running, because Location B is a managed datacenter.

              • 4. Re: How to avoid Split Brain in a stretched Cluster
                depping Champion
                VMware Employees, User Moderators

                Normally you use the APD or PDL response that you can configure in vSphere HA for this, Marcus. Whether this will work depends on how DataCore has implemented their solution. Most stretched cluster solutions these days use a "witness" in a 3rd location. When a split brain (sometimes also called a site partition) occurs, the solution declares one location the winner for each of the presented stretched datastores. The other location then goes into either a PDL or an APD state, and HA can take action (when configured) based on that.
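                To make the witness idea concrete, here is a generic sketch in Python. It does not describe how DataCore or any other vendor implements arbitration: the per-datastore preferred site, the function names and the datastore names are all assumptions made for illustration.

```python
def pick_winner(datastore_preference, reachable_from_witness):
    """For each stretched datastore, pick the site that keeps it online.

    datastore_preference: dict of datastore -> preferred site (assumed setting)
    reachable_from_witness: set of sites the witness can still reach
    """
    result = {}
    for ds, preferred in datastore_preference.items():
        if preferred in reachable_from_witness:
            result[ds] = preferred
        else:
            # Preferred site is gone: fall back to any site still reachable.
            others = sorted(reachable_from_witness)
            result[ds] = others[0] if others else None
    return result

def site_states(winner, sites=("A", "B")):
    """The site that lost arbitration sees the datastore as PDL or APD."""
    return {s: ("online" if s == winner else "PDL/APD") for s in sites}

# Inter-site link is cut, but the witness still reaches both sites.
winners = pick_winner({"ds1": "B", "ds2": "B"}, {"A", "B"})
print(winners)
print(site_states(winners["ds1"]))
```

                In this model, preferring Location B for every datastore makes Location A lose access on a link cut, which is the condition the HA response needs in order to power off VMs on A and restart them on B.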

                 

                So talk to DataCore, or dig up a document that describes to the letter what they do in this scenario, as that will tell you what you should be seeing and whether vSphere HA can even respond to this failure.