Network Partition and VMware HA


I have a question about Network Partition and HA.

This is the situation:

  • vSphere 5.5
  • 1 Cluster
  • 4 ESXi hosts (2 in Datacenter1 and 2 in Datacenter2)
  • Gateway address of the management network (and the only Isolation Address configured in the cluster) in Datacenter1
  • HA Master in Datacenter1
  • vCenter in Datacenter1
  • HA Isolation Response: Shut down

Datacenter1                            Datacenter2
ESXi01 - Master (VM1, VM2, vcenter)    ESXi03 - Slave (VM5, VM6)
ESXi02 - Slave (VM3, VM4)              ESXi04 - Slave (VM7, VM8)

If the communication between the 2 Datacenters fails (only the LAN; the storage is OK):

  • ESXi03 and ESXi04 elect a new master
  • The virtual machines remain powered on in the 2 Datacenters
  • 1 cluster with 2 partitions

Datacenter1                            Datacenter2
ESXi01 - Master (VM1, VM2, vcenter)    ESXi03 - Master (VM5, VM6)
ESXi02 - Slave (VM3, VM4)              ESXi04 - Slave (VM7, VM8)

Is it possible to configure the cluster to restart the virtual machines automatically in case of a network partition, so that only 1 Datacenter stays active? I assume that the storage is OK.

The goal is this situation:

Datacenter1                                      Datacenter2
ESXi01 - Master (VM1, VM2, VM5, VM6, vcenter)    ESXi03 - No Virtual Machines
ESXi02 - Slave (VM3, VM4, VM7, VM8)              ESXi04 - No Virtual Machines

I think that, because the hosts in Datacenter2 cannot ping the Isolation Address in Datacenter1, the Isolation Response is triggered.

Any help?


3 Replies

Is this a Metro Cluster? Is it a uniform or non-uniform cluster and does the storage system support metro clusters?

I don't think you can achieve your desired goal with the standard functionality; you may need a third-party product such as EMC VPLEX. NetApp has something similar, as do some other storage vendors. HA is smart enough to know the link has gone down, and the partition that lost contact with the master elects its own. But since the shared storage is still online, the vmdk locks are still in place, so HA realizes that those machines can't be restarted and leaves them online.

EMC VPLEX was designed to achieve the goal you've outlined. You set a preferred site for the shared datastores/LUNs in the software, and when the link goes down, only the preferred site keeps read/write access to the datastores/LUNs. After a bit of time the locks on the vmdks expire, and then HA can do its thing and start the VMs up at the preferred site.

I suppose you could hack something together in PowerCLI to monitor the link and shut down the VMs at Datacenter2, so that they could then be powered on at Datacenter1, but it could be clunky. Those products from EMC and NetApp were designed from the start to do exactly that: to remove read/write access from the non-preferred site so the other site can take over.
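To give an idea of what such a hack would have to decide, here is a rough sketch of the monitoring logic only. It is written in Python rather than PowerCLI, and everything in it is hypothetical: `watch_link`, the probe results and the `shut_down_vm` callable are stand-ins, not real vSphere or PowerCLI calls. It assumes you only want to act after several consecutive failed probes, so a single lost ping doesn't shut down a whole site.

```python
from typing import Callable, Iterable, List


def watch_link(probe_results: Iterable[bool],
               failures_needed: int,
               shut_down_vm: Callable[[str], None],
               vms: List[str]) -> bool:
    """Shut down `vms` once `failures_needed` probes fail in a row.

    probe_results: stream of link checks (True = Datacenter1 reachable).
    shut_down_vm:  hypothetical stand-in for a real guest shutdown call.
    Returns True if the shutdown was triggered, False otherwise.
    """
    consecutive = 0
    for ok in probe_results:
        # Any successful probe resets the failure counter.
        consecutive = 0 if ok else consecutive + 1
        if consecutive >= failures_needed:
            for vm in vms:
                shut_down_vm(vm)
            return True
    return False


# Stand-in data: the link is fine once, then fails three times in a row.
stopped: List[str] = []
triggered = watch_link([True, False, False, False], 3,
                       stopped.append, ["VM5", "VM6", "VM7", "VM8"])
```

In a real script this would have to run at Datacenter2, and something would still have to power the VMs on at Datacenter1 afterwards, which is part of why it gets clunky.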

Here's some good reading material:

VMware vSphere® Metro Storage Cluster

vSphere Metro Storage Cluster - Uniform vs Non-Uniform

vSphere vMSC solutions, what's supported and whats not?

HA demo using vMSC with EMC VPLEX Metro


"I think that, because the hosts in Datacenter2 cannot ping the Isolation Address in Datacenter1, the Isolation Response is triggered."

The HA host isolation response can't be triggered as long as there are still 2 live hosts in DC2 that can communicate with each other.

A host is declared "isolated" only when both of the following hold:

1. the HA FDM agent can't communicate with any other host of the cluster

2. having failed to reach any other host, the FDM agent pings the isolation address(es) and gets no reply either

Therefore, as long as a host can still exchange HA heartbeats with another host through the HA network, the isolation address and isolation response are entirely irrelevant.
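As a rough illustration of that ordering, here is a simplified model in Python. It is not how FDM is implemented: `classify_host` and the scenario data are made up for this example and only mirror the two checks described above.

```python
def classify_host(reachable_peers, isolation_address_pingable):
    """Simplified model of the two-step isolation check (illustrative only)."""
    # Step 1: any HA heartbeat exchange with another host rules out isolation.
    if reachable_peers:
        return "not isolated"
    # Step 2: the isolation address is consulted only when no peer is reachable.
    if isolation_address_pingable:
        return "not isolated"
    return "isolated"


# Scenario from the question: the LAN between the datacenters is down and
# the only isolation address sits in Datacenter1.
# Each entry is (peers still reachable, isolation address pingable).
scenario = {
    "ESXi01": (["ESXi02"], True),
    "ESXi02": (["ESXi01"], True),
    "ESXi03": (["ESXi04"], False),
    "ESXi04": (["ESXi03"], False),
}

states = {host: classify_host(peers, ping)
          for host, (peers, ping) in scenario.items()}
```

In this model ESXi03 and ESXi04 come out as "not isolated" even though neither can ping the isolation address, because they still see each other. ESXi03 would only be classified as isolated if it lost ESXi04 as well.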


The key thing here is that you are describing a partition but mixing it up with the response to an isolation event...