VMware Cloud Community
Sharantyr3
Enthusiast

vSAN stretched cluster - multiple HA isolation addresses

Hello there,

I understand it is recommended to have two isolation addresses for HA, one per site in our stretched cluster case (Advanced Options | vSAN Stretched Cluster Guide | VMware).

So I configured an IP on site 1 (preferred) and another IP on site 2 (secondary).
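
For readers following along, here is a minimal Python sketch of what that configuration amounts to, assuming the HA advanced options `das.usedefaultisolationaddress`, `das.isolationaddress0` and `das.isolationaddress1`. The helper `ha_isolation_options` is hypothetical and only builds the key/value pairs; it does not talk to vCenter. The two addresses are the example router IPs given later in this thread.

```python
def ha_isolation_options(site_addresses):
    """Build HA advanced options for site-local isolation addresses.

    Hypothetical helper for illustration only: it returns the key/value
    pairs you would enter as cluster advanced options, nothing more.
    """
    # Assumption: the default gateway is not wanted as an isolation address.
    options = {"das.usedefaultisolationaddress": "false"}
    for index, address in enumerate(site_addresses):
        options[f"das.isolationaddress{index}"] = address
    return options


# Example values taken from the network layout described later in the thread.
options = ha_isolation_options(["192.168.1.1", "192.168.2.1"])
for key in sorted(options):
    print(f"{key} = {options[key]}")
```

One address per data site keeps each site able to test its own local network after an inter-site failure.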

I crash-tested vSAN and HA by shutting down the replication link between site 1 and site 2.

vSAN and HA worked as they should: VMs on site 2 were powered off and restarted on site 1. OK.

But the vCenter web interface was full of errors claiming that HA could not restart VMs on site 2 hosts (insufficient resources).

Only a graphical glitch I guess, so not that bad as VMs were restarted on site 1 where the storage was available.

But I was wondering about our case, where ESXi hosts on site 1 can reach isolation address 1 but not isolation address 2, and ESXi hosts on site 2 can reach isolation address 2 but not isolation address 1.

How is HA supposed to handle this?


Accepted Solutions
depping
Leadership

Actually, what you are stating is not correct. There are two things here:

1. Availability of vSAN components

2. HA

If the connection between the data locations is gone, each location will end up with its own master, as an election will happen! I described those HA details here:

Clustering Deep Dive eBook

From a VM point of view, the VMs which reside in the "secondary" location (which you specified during creation of the stretched cluster) will lose access to disk when the connection between data locations is impacted. This is because the Witness will bind itself to the preferred location. You can find all those details here:

vSAN Stretched Cluster Guide | VMware

11 Replies
Nawals
Expert

Are the isolation addresses of both sites reachable from each other? If not, please check network connectivity between those IPs.

NKS Please Mark Helpful/correct if my answer resolve your query.
Sharantyr3
Enthusiast

I don't understand your question. From what point of view are you asking?

HA IP 1 is on Site 1 witness subnet

HA IP 2 is on Site 2 witness subnet

Both IPs are reachable from all ESXi on Site 1 and Site 2

I was just wondering what the intended mechanism is behind having two isolation addresses when one site can reach one IP and the other site can reach the other IP.

I can't find any doc explaining how HA should make a decision in that case.

Does isolation address 1 take precedence over isolation address 2?

Take the following case :

Site 1 and 2 : 192.168.0.0/24 : "vsan replication network"

Site 1 : 192.168.1.0/24 : "witness network 1", 192.168.1.1 router IP on site 1, used as isolation address 1 - HA IP 1

Site 2 : 192.168.2.0/24 : "witness network 2", 192.168.2.1 router IP on site 2, used as isolation address 2 - HA IP 2

Site 3 (witness) : 192.168.3.0/24 : "witness network 3"

If there is a network outage between site 1 and site 2:

Site 1 and Site 2 cannot replicate anymore

Site 1 and Site 2 can reach site 3 (witness)

Site 1 can reach HA IP 1, not HA IP 2

Site 2 can reach HA IP 2, not HA IP 1

I tested that and found that HA restarted VMs on Site 1, not because it was aware of the vSAN "preferred" site, but only because storage (vSAN) was accessible on Site 1.

But it did raise many alarms and errors complaining about not being able to restart VMs on Site 2 (insufficient resources).

Anyway, my main question is more about understanding how HA is supposed to handle multiple isolation addresses on multiple sites (specifically for vSAN stretched clusters).

Nawals
Expert

Follow this link for more understanding. Advanced Options | vSAN Stretched Cluster Guide | VMware 

Sharantyr3
Enthusiast

Sorry, but I don't think you understood me; maybe my English is that bad :|

Anyway, the link you provided is just an "intro" to my topic, but it led me down the right path.

I found this:

"When the master host stops receiving these heartbeats from a slave host, it checks for host liveness before declaring the host to have failed."

"Host network isolation occurs when a host is still running, but it can no longer observe traffic from vSphere HA agents on the management network. If a host stops observing this traffic (1st action), it attempts (2nd action) to ping the cluster isolation addresses."

Source
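
The order of checks in that quote can be written down as a small Python sketch. This is a simplified reading of the quoted documentation, not the actual FDM agent logic: the function name `declares_isolation` is made up, and it assumes that a single answering isolation address is enough for a host not to declare itself isolated.

```python
def declares_isolation(observes_ha_traffic, ping_results):
    """Return True if a host would declare itself network isolated.

    Simplified model: isolation addresses are only consulted once HA agent
    traffic is no longer observed, and the host counts as isolated only
    when none of the addresses answers (assumption, see lead-in).
    """
    if observes_ha_traffic:
        # 1st check: still hearing other HA agents, so no isolation.
        return False
    # 2nd check: ping the cluster isolation addresses.
    return not any(ping_results.values())


# Site 1 host after the inter-site link is cut: only the local address answers.
site1_pings = {"192.168.1.1": True, "192.168.2.1": False}

print(declares_isolation(True, site1_pings))    # still sees site 1 peers
print(declares_isolation(False, site1_pings))   # no peers, local address answers
print(declares_isolation(False, {"192.168.1.1": False, "192.168.2.1": False}))
```

Under this model only the last case is an isolation, which matches the observation that neither site considers itself isolated when the inter-site link alone is down.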

So communication with the HA master node is taken into account before the isolation addresses (it is more important). The master is what determines which site should stay up when there is a cross-site link failure.

In my case, I noticed before my test that the HA master was on Site 2.

So that explains why HA first tried to restart on Site 2 regardless of vSAN availability on Site 1: because the master was on Site 2!

But then the "smart" mechanics of HA found out the storage was available on Site 1, and the HA master moved to Site 1 (I just checked and that is the case, the HA master is on Site 1).

Too bad I don't have the fdm.log from the time of this test; it would have been interesting to validate this.

MikeStoica
Expert

Do you have a Witness set up? Did you follow the steps in Creating a New vSAN Stretched Cluster | vSAN Stretched Cluster Guide | VMware when creating the stretched cluster?

Sharantyr3
Enthusiast

Hello Mr Depping,

I should have asked you live at VMworld Barcelona! :D Thanks for the link to the Clustering Deep Dive, nice to have some things to read while locked at home because of COVID :)

"If the connection between the locations is gone (between data locations), each location will end up with its own master as an election will happen!"

OK, that seems logical. Just to know, did you try it?

Stretched cluster

HA master running on secondary site

Shut the vSAN data link (not the witness link and not the vCenter<->ESXi management link)

I got many alarms from vcenter stating that it could not restart VMs on secondary site like this :

Target: my-vm

Previous Status: Green

New Status: Red

Alarm Definition:

([Event alarm expression: Insufficient resources for vSphere HA to start the VM. Reason: {reason.@enum.fdm.placementFault}; Status = Red] OR [Event alarm expression: vSphere HA failed to restart a network isolated virtual machine; Status = Red] OR [Event alarm expression: VM powered on; Status = Green] OR [Event alarm expression: vSphere HA restarted a virtual machine; Status = Green])

Event details:

Insufficient resources to fail over my-vm in Cluster-1 that resides in Datacenter. vSphere HA will retry the fail over when enough resources are available. Reason: The host(s) cannot access virtual machine components

I don't understand which part of my message about the alarms is not correct, because to me it's clear that HA tried to restart VMs on the vSAN secondary site (all VMs had an HA warning raised).

Also, my main question was more about multiple HA isolation addresses. Best practices state that you should have one on each site, but when the inter-site link is shut, each site ends up with one reachable HA isolation address, so from HA's point of view no site is isolated if it relies only on isolation addresses.

That's why I assumed the HA master had a major role, and why I got these messages.

I'll read your bible on HA to see if I find my answer :)

depping
Leadership

Yes, I have tested this many times. What you are seeing are false-positive warnings; this is just a UI artefact, nothing to worry about :)

depping
Leadership

Also, when it comes to the isolation address keep in mind that the following happens:

  1. Master in Site A
  2. Network fails between data locations
  3. Master observes no traffic from Site B
  4. Hosts in Site B observe no traffic from Master
  5. Site A will form a "sub cluster"
  6. Site B will trigger a master election process
  7. Site B will form a "sub cluster" with a master in Site B

As communication is still possible between the nodes in each sub cluster, an "isolation" can never be declared; the isolation address doesn't have much to do with that either way.
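
The sequence above can be sketched as a toy Python model. Everything here is illustrative: the host names are invented, `form_sub_clusters` is a hypothetical function, and picking the first host by name is an arbitrary stand-in for the real HA election rules, which this thread does not describe.

```python
def form_sub_clusters(hosts_by_site, current_master, inter_site_link_up):
    """Return the HA sub clusters that exist for a given link state."""
    if inter_site_link_up:
        members = sorted(h for hosts in hosts_by_site.values() for h in hosts)
        return [{"site": "all", "members": members, "master": current_master}]

    sub_clusters = []
    for site in sorted(hosts_by_site):
        members = sorted(hosts_by_site[site])
        # The partition that still holds the old master keeps it; the other
        # partition elects a new one (arbitrary stand-in rule: first by name).
        master = current_master if current_master in members else members[0]
        sub_clusters.append({"site": site, "members": members, "master": master})
    return sub_clusters


# Invented host names for illustration.
hosts = {"A": ["esx-a1", "esx-a2"], "B": ["esx-b1", "esx-b2"]}
partitions = form_sub_clusters(hosts, current_master="esx-a1", inter_site_link_up=False)
for partition in partitions:
    print(partition["site"], partition["master"], partition["members"])
```

Each sub cluster still has hosts talking to each other, so in this model no host ever reaches the point of pinging an isolation address.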

Sharantyr3
Enthusiast

Hello there,

A few follow-up questions for you, @depping.

From your blog:

"Note, this is not the case for vSAN, with vSAN automatically the vSAN network is used by HA."

I have one vmkernel interface for vSAN replication and one for vSAN witness traffic, on different subnets.

If I choose an isolation address reachable from either the vSAN replication vmk or the vSAN witness vmk, would both work?

Would it be a good idea to add a 3rd isolation address that resides on the 3rd site (witness site), reachable via the specific default gateway of the vSAN witness vmkernel?

Would HA know that this remote address is reachable using the default gateway on the vSAN witness vmkernel, or should I add a static route using esxcli?

Thanks!

depping
Leadership

I have not tested using a 3rd isolation address, and I am not sure it makes sense either, to be honest. The isolation address should be site local, as it helps the hosts in each site determine whether they are isolated. Note that an isolation address is used by the host which is isolated from the network. If the local isolation address isn't reachable, it is very unlikely that the isolation address in the remote witness location is reachable. Sure, you could create a 3rd, but I have personally not done this, and I am not sure which network it would use, as I have not tested it. That is definitely something I think you should test before implementing.
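
For anyone who wants to run that test, here is a minimal Python sketch that only prepares the two commands involved: adding a static route with `esxcli` and checking reachability from a specific vmkernel interface with `vmkping`. The interface name `vmk2` and the target `192.168.3.1` are assumptions (the thread gives neither); the network and gateway come from the example layout earlier in the thread. Whether HA actually uses such a route for its isolation pings is exactly the untested part.

```python
import ipaddress


def static_route_command(network_cidr, gateway):
    """Build an esxcli command that adds an IPv4 static route."""
    network = ipaddress.ip_network(network_cidr)   # raises ValueError if malformed
    gateway_ip = ipaddress.ip_address(gateway)
    return f"esxcli network ip route ipv4 add --network {network} --gateway {gateway_ip}"


def reachability_check_command(vmk_interface, address):
    """Build a vmkping command that pings out of one vmkernel interface."""
    return f"vmkping -I {vmk_interface} {ipaddress.ip_address(address)}"


# Witness network and site 1 router from the example layout in this thread.
route_cmd = static_route_command("192.168.3.0/24", "192.168.1.1")
# "vmk2" and "192.168.3.1" are hypothetical placeholders.
ping_cmd = reachability_check_command("vmk2", "192.168.3.1")
print(route_cmd)
print(ping_cmd)
```

Run the printed commands on a test host first, then repeat the link-failure test and check which addresses the host can still reach.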
