VMware Cloud Community
Dormelchen2
Contributor

vSphere HA question regarding a failure and how to prevent it

Hello,

I have a question:

We have 2 datacenters, connected by 2 cables.

Last week we had a failure: both cables were "down" at once.

So in datacenter 1 we had 2 hosts OK and 2 hosts disconnected, and the same on the other side.

As a result, we had a lot of disconnected VMs on both sides.

We have synchronized storage between the two datacenters, so the VMs on each side could keep running.

The problem was that I had to shut down all machines on side 2, remove the disconnected hosts on side 1,

re-register the VMs, etc.

How can I prevent this? Is there any way to say:

datacenter 1 and its hosts are the "master"; if datacenter 2 fails (the hosts, network, etc.), shut down all machines in datacenter 2

and start them in datacenter 1?

Thank you

6 Replies
ThompsG
Virtuoso

Hi Dormelchen2,

I think you probably need to explain your environment a little more.

From what you have written, it seems to me that you have some sort of Metro cluster between DC A and DC B, meaning that you have a single vCenter cluster that contains hosts from both sites?

Before trying to offer advice I would like to know a little more about the infrastructure and what is being attempted :)

Kind regards.

Dormelchen2
Contributor

Hello ThompsG,

What we have is an ESXi cluster with (for example) 4 hosts:

2 in DC A and 2 in DC B

One vCenter

Storage (datastores) clustered via DataCore, mirrored synchronously

Now, what happened: DC B was down (cables), and all VMs and hosts were disconnected.

We had to manually shut them down in DC B, deregister them, register them in DC A, and start them up.

This is what we want to prevent: having to do these things manually.

But I don't know how to achieve this.

regards

ThompsG
Virtuoso

Hi Dormelchen2,

Okay, sorry, more questions to help get the full picture so that I don't make any assumptions :)

  • Do you have a "witness" node so that storage is able to make a decision on which side is surviving?
  • What is your Host Isolation response?
  • Do you have an HA isolation address configured?
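
To make the last two questions more concrete, here is a minimal Python sketch of how a host might decide whether it is isolated. It is a simplified model, not VMware's actual implementation: `HostView`, `classify`, and the state names are hypothetical and chosen only for illustration. The assumption is that a host counts as isolated only when it sees no other HA agents and none of its configured isolation addresses (for example, addresses set through an HA advanced option such as `das.isolationaddress0`) answer a ping.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HostView:
    """What one host can observe on its management network (hypothetical model)."""
    sees_master: bool            # heartbeats from the current HA master arrive
    sees_other_agents: bool      # traffic from at least one other HA agent arrives
    isolation_pings: List[bool]  # one ping result per configured isolation address


def classify(view: HostView) -> str:
    """Classify a host as 'connected', 'partitioned', or 'isolated'.

    Simplified assumption: the isolation response only applies in the
    'isolated' state, i.e. no other agents and no isolation address reachable.
    """
    if view.sees_master:
        return "connected"
    if view.sees_other_agents or any(view.isolation_pings):
        # The host lost the master but still has some network:
        # it belongs to a separate partition, it is not isolated.
        return "partitioned"
    return "isolated"
```

Under this model, the two hosts in each datacenter in this thread could presumably still reach each other after the cables failed, so they would count as partitioned rather than isolated. That is one reason why the storage-side behaviour matters as much as the isolation response.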

Kind regards.

Dormelchen2
Contributor

Hello ThompsG,

No, we do not have this in place. I don't know if DataCore has such a feature.

Host Isolation is not active at the moment,

because I don't know what would happen (regarding my first question).

ThompsG
Virtuoso

Hi Dormelchen2,

So the first thing I would confirm is if DataCore supports the concept of a witness, i.e. how does the storage determine which side is not available. Without this you could get into a split brain scenario and have both sides hosting the same VMs.
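
As a rough illustration of why a witness matters, the sketch below models two mirrored storage sites and an optional third-site arbiter. This is a generic tie-breaker model and says nothing about how DataCore actually behaves; `Witness`, `request_ownership`, and `surviving_sites` are hypothetical names, and the rule that the witness grants ownership to the first site that asks is an assumption made for the example.

```python
from typing import Dict, List, Optional


class Witness:
    """Hypothetical third-site arbiter that grants ownership to exactly one site."""

    def __init__(self) -> None:
        self.owner: Optional[str] = None

    def request_ownership(self, site: str) -> bool:
        # First requester wins; every later request from another site is refused.
        if self.owner is None:
            self.owner = site
        return self.owner == site


def surviving_sites(link_up: bool,
                    witness: Optional[Witness],
                    can_reach_witness: Dict[str, bool]) -> List[str]:
    """Return the sites that keep serving I/O after a check of the mirror link."""
    sites = ["A", "B"]
    if link_up:
        return sites  # mirror is healthy, both sides serve I/O
    if witness is None:
        # No tie-breaker: each side assumes the other is dead and carries on.
        # Both sides serving independently is the split-brain case.
        return sites
    return [s for s in sites
            if can_reach_witness.get(s, False) and witness.request_ownership(s)]
```

With no witness and the link down, both sites are returned, which is the split-brain situation described above. With a witness, at most one site keeps serving, and a site that cannot reach the witness stops.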

Once you have determined a way for the storage to stay consistent between both sites and to work out which site is down, you can start to look at the possibility of implementing Host Isolation. Here are a couple of good blog articles that explain the concept:

So using something that both sides of your ESXi cluster can see will allow them to work out which side is surviving. From there, the other side will power down its VMs, and they will then be started on hosts which are not isolated. One problem I can see, however, is that replication will also be severed between the two storage arrays, so what is the arrays' response to this situation?

  • Does it break replication therefore allowing host access to continue?
  • Does it stop the source machines from talking to the array until replication is resumed?

This also needs to be determined so you know where your data is and what point of time the VMs are at.

Kind regards.

depping
Leadership

You should start with reading this KB:

VMware Knowledge Base

This explains the DataCore stretched cluster technology. It also points to the correct documentation for installing and configuring the vSphere environment. On top of that, it points to a different document which explains how to configure HA for stretched clusters.

https://urldefense.proofpoint.com/v2/url?u=http-3A__datacore.custhelp.com_app_answers_detail_a-5Fid_...

In this document VMCP is described and it is recommended to configure it for APD and PDL. (Guessing the scenario you describe leads to an APD). More details on how this works are to be found here:

VMware vSphere® Metro Storage Cluster Recommended Practices
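
To give a feel for the APD case mentioned above, here is a small Python sketch of the timeline between a host losing all paths to a datastore and VMCP acting on the affected VMs. It is only an illustration: `apd_stage` and the stage names are hypothetical, and the two default values (140 seconds for the APD timeout, 180 seconds for the VMCP failover delay) are assumptions that should be checked against the documentation for your vSphere version.

```python
APD_TIMEOUT_S = 140     # assumed ESXi default for the APD timeout
VMCP_APD_DELAY_S = 180  # assumed default "delay for VM failover for APD"


def apd_stage(seconds_since_path_loss: float,
              apd_timeout: float = APD_TIMEOUT_S,
              vmcp_delay: float = VMCP_APD_DELAY_S) -> str:
    """Return the stage of an APD event at a given time after all paths were lost.

    'apd-started'   : the host still retries I/O to the datastore
    'apd-timeout'   : the APD timeout has expired, VMCP is waiting out its delay
    'vmcp-response' : VMCP applies the configured response to affected VMs
    """
    if seconds_since_path_loss < apd_timeout:
        return "apd-started"
    if seconds_since_path_loss < apd_timeout + vmcp_delay:
        return "apd-timeout"
    return "vmcp-response"
```

With these assumed defaults, VMs on a host that lost its storage would not be restarted elsewhere until a little over five minutes after the paths went down, so a short cable outage may pass without any VMCP action at all.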
