Dormelchen2
Contributor

How to avoid Split Brain in a stretched Cluster

Hello,

I have been searching the internet for an answer, but I am stuck.

What we have, and what happened:

We have a cluster with 4 hosts.

2 at site A and 2 at site B

2 dark fibre links between A and B

The hosts are connected to the network with 3 physical adapters, and to the storage via 4 FC HBAs to a DataCore cluster.

What happened? Both dark fibre links were cut (by what, we don't know).

So we saw 2 hosts disconnected and 50% of the VMs disconnected (as seen from the vSphere server at site A).

I could connect via a backdoor to site B, open the web UI of the ESXi hosts, and see that those 50% of VMs (the ones shown as disconnected at site A) were running fine.

So I shut down the VMs at site B, removed the disconnected hosts from the inventory at site A, re-registered the disconnected VMs there, and powered them up.

Obviously, that's not a good solution.

How can we arrange it so that one site shuts down its running VMs and the other site powers them up?

4 Replies
sk84
Expert

What did you expect from this failure scenario?

If HA had worked automatically, site A would have started the VMs from site B and site B the VMs from site A, since these hosts are in the same situation. In the end all VMs would have run twice and you would have had a real split brain scenario and far more problems.

If your infrastructure is set up properly, a failure of the dark fiber connections won't be a problem because the VMs on both sites can continue to run and function. Only management functions would have been limited for the time of the dark fiber cut.

And in this scenario, with a metro cluster and 2 sites, in most cases there can only be a manual failover, where a human decides at which site all VMs should be started.

More information can be found in the Metrocluster Best Practice guide:

VMware vSphere® Metro Storage Cluster Recommended Practices

For the sake of completeness I have to mention that there are also Active-Active storage solutions with automatic failover on the market:

VMware Knowledge Base

--- Regards, Sebastian VCP6.5-DCV // VCP7-CMA // vSAN 2017 Specialist Please mark this answer as 'helpful' or 'correct' if you think your question has been answered correctly.
depping
Leadership

Look at the recommendations in the whitepaper I wrote, which is mentioned above. But considering you are using DataCore, you may also want to look at their best practices. Normally what you would see is that an APD or PDL is triggered. If you have the automatic response to a PDL or APD enabled in vSphere HA, then the VMs that lost access to storage should be restarted in the remote location AND powered off in the "offline" location.
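To make the APD/PDL response described above concrete, here is a small illustrative sketch of the decision logic. This is NOT the actual vSphere HA implementation and `vmcp_action` is a hypothetical function; only the state names (APD, PDL) and the response values (`disabled`, `restartConservative`, `restartAggressive`) mirror what the vSphere VM Component Protection settings expose.

```python
# Illustrative simplification of the VM Component Protection (VMCP) decision
# described above -- not the real vSphere HA code. When a host loses access to
# a datastore (APD or PDL) and the matching automatic response is enabled, the
# affected VMs are terminated locally so HA can restart them on a host that
# still has storage access.

def vmcp_action(storage_state: str, apd_response: str, pdl_response: str) -> str:
    """Return the action HA-style logic would take for a VM on affected storage."""
    if storage_state == "PDL":  # Permanent Device Loss: array reports the device as gone
        if pdl_response == "restartAggressive":
            return "terminate VM here, restart on a host with storage access"
        return "issue event only"
    if storage_state == "APD":  # All Paths Down: storage unreachable, cause unknown
        if apd_response in ("restartConservative", "restartAggressive"):
            return "terminate VM here, restart on a host with storage access"
        return "issue event only"
    return "no action"  # storage is still accessible
```

With the response left at `disabled` (the default in some versions), only an event is issued and the VMs stay where they are, which matches the behaviour the original poster observed.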

Dormelchen2
Contributor

Hello sk84,

The problem is:

[quote]If your infrastructure is set up properly, a failure of the dark fiber connections won't be a problem because the VMs on both sites can continue to run and function. Only management functions would have been limited for the time of the dark fiber cut.

[/quote]

Yes - they are running on both sites and that is the problem.

If the dark fibre breaks, we need to run ALL VMs at the other location.

So we need to shut down location A, and everything should run at location B.

And that is exactly what I am looking for. Some automation like:

If location A is not reachable from location B and vice versa:

- shut down all VMs at A

- stop DataCore at A

- start up all VMs at B (the ones that were at A)

I don't know how to realize this. The most important thing is that location B keeps running, because location B is a managed datacenter.

depping
Leadership

Normally you use the APD or PDL response that you can configure in vSphere HA for this, Marcus. Whether this will work depends on how DataCore has implemented their solution. For most stretched cluster solutions these days, vendors use a "witness" in a 3rd location. When a split brain has occurred (also called a site partition), they will declare one location the winner for each of the presented stretched datastores. The other location will then go into either a PDL or an APD state, and HA can take action based on that (when configured to do so).
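The witness/tiebreaker idea described above can be sketched roughly as follows. This is a hypothetical simplification, not DataCore's actual algorithm; `declare_winner` and its tie-breaking behaviour are assumptions for illustration only.

```python
# Hypothetical sketch of a witness-based tiebreaker for one stretched
# datastore -- not any vendor's real implementation. On a site partition, the
# site that can still reach the witness keeps the datastore; the losing
# site's copy is taken offline (surfacing as PDL/APD to its hosts), so
# vSphere HA can restart the affected VMs at the winning site.

def declare_winner(site_a_sees_witness: bool, site_b_sees_witness: bool) -> str:
    """Decide which site keeps a stretched datastore after a partition."""
    if site_a_sees_witness and not site_b_sees_witness:
        return "A"      # A has quorum; B's copy goes PDL/APD
    if site_b_sees_witness and not site_a_sees_witness:
        return "B"      # B has quorum; A's copy goes PDL/APD
    if not site_a_sees_witness and not site_b_sees_witness:
        return "none"   # total isolation: no winner, manual intervention needed
    # Both sites see the witness but not each other; real products usually
    # break this tie with a configured "preferred site" setting.
    return "both"
```

The point of the sketch: the witness sits in a 3rd failure domain, so a cut of the inter-site links alone never leaves both sites believing they won.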

So talk to DataCore, or dig up a document that describes exactly, to the letter, what they do in this scenario, as that will tell you what you should be seeing and whether vSphere HA can even respond to this failure.
