4 Replies Latest reply on Jul 6, 2019 11:34 AM by depping

    Stretched cluster failure scenario

    therealhostman Novice

      Hi,

       

      I'm after some clarification, please, regarding a stretched cluster and a specific failure scenario.  If the 10Gb link between the two clusters used for synchronous replication goes dark for a period of time, what is the outcome?  Some documentation I've read says VMs will move from the secondary to the preferred cluster (if running active/active, for example).  Is it not possible for VMs to continue running active/active, albeit in a state where synchronous replication will not restart until the 10Gb link is back online?

       

      Thanks.

        • 1. Re: Stretched cluster failure scenario
          TheBobkin Virtuoso
          vExpert, VMware Employees

          In a standard stretched cluster configuration (RAID-1, FTT=1 across sites), if the inter-site connection is broken, VMs will fail over to whichever site is configured as Preferred (this can of course be changed should that site be down or impaired).

          VMs won't run on both sides simultaneously. That wouldn't make sense, as the cluster would then be split-brained, and which set of data would you use following the outage?

          Once the connection between the sites is re-established, the delta data from the Preferred site is synced to the other site.

           

          More information regarding failure scenarios and required HA settings etc. can be found here:

          https://storagehub.vmware.com/t/vmware-vsan/vsan-stretched-cluster-guide/

           

           

          Bob

          • 2. Re: Stretched cluster failure scenario
            depping Champion
            VMware Employees, User Moderators

            And we (the VMware vSAN product team) know this is a problem, and we are looking to fix it in the future. The reason you end up in this situation today is that the Witness VM binds itself to one location. This means the other location loses quorum, so all stretched VMs there lose access to their storage objects, and those VMs are killed by vSAN automatically.

             

            Again, this is a known concern, and the team has it listed as an issue we need to solve in the future. Unfortunately, I can't comment on when this will be.

            • 3. Re: Stretched cluster failure scenario
              therealhostman Novice

              Provided communication with the witness is still live during a scenario where the replication link has failed, I would have thought VMs could remain running 50/50, with of course the ability to fail over across clusters disabled until replication is re-enabled and the data resynced.  I understand the concept of a split brain, but this is what the witness server is essentially supposed to prevent.

               

              From what depping is saying, this is how the vSAN development guys want it to work, but it needs development work to facilitate it?

              • 4. Re: Stretched cluster failure scenario
                depping Champion
                VMware Employees, User Moderators

                The problem is that the Witness Appliance is not a witness itself; it is a host that holds witness objects. A host can only be part of one cluster, or one partition in this case, so the witness host binds itself to the preferred location, which causes the secondary location to lose quorum.
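
                To make the quorum point concrete, here is a rough sketch of the majority check for a single stretched object. It is a simplified model and not vSAN's actual implementation: the `Component` class, the `has_quorum` function, and the one-vote-per-component assumption are all made up for illustration, and real vote assignment is more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """Hypothetical model of one component of a stretched object."""
    name: str
    location: str  # "preferred", "secondary", or "witness"
    votes: int = 1  # assumption: every component carries a single vote


def has_quorum(components, partition):
    """Return True if the locations in `partition` hold more than half of all votes."""
    total = sum(c.votes for c in components)
    reachable = sum(c.votes for c in components if c.location in partition)
    return 2 * reachable > total


# One RAID-1 object stretched across sites: a replica per site plus a witness component.
obj = [
    Component("replica-a", "preferred"),
    Component("replica-b", "secondary"),
    Component("witness", "witness"),
]

# Inter-site link is down. The witness host can only be in one partition,
# and it joins the preferred site.
preferred_partition = {"preferred", "witness"}
secondary_partition = {"secondary"}

print(has_quorum(obj, preferred_partition))  # True: 2 of 3 votes
print(has_quorum(obj, secondary_partition))  # False: 1 of 3 votes
```

                In this model the secondary site holds only one of three votes once the witness sides with the preferred site, so its copy of the object becomes inaccessible even though the secondary site can still reach the witness over the network.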

                 

                Yes, this needs development work, and it is being looked at.
