rleon_vm
Enthusiast

Intelligence in a VSAN Stretched Cluster to vMotion VMs to the other site under certain conditions

Hi all,

Suppose in a VSAN Stretched Cluster:

  • A VMDK's storage profile is PFTT=1, SFTT=1.
  • Using VM-to-Host affinity "should" rules to bind the VM to the Preferred-Site.

Just want to see if it is possible for a VSAN Stretched Cluster to automatically vMotion a VM to the other site when for example:

  • SFTT in the Preferred-Site becomes non-compliant. E.g., 2 out of 3 components become unavailable (less than 50% votes).
  • Or, all 3 components in the Preferred-Site become unavailable (0% votes).
  • ...but at least one ESXi host in the Preferred-Site is still running.

Normally, with the affinity rule in effect, HA would restart the VM on the surviving ESXi host in the Preferred-Site, even if that means its storage traffic has to traverse the inter-site link (ISL) to the Secondary-Site (where SFTT is still compliant).

Is there a way to make it so that, in such scenarios, DRS would ignore the affinity "should" rule and vMotion the VM across to the other site, where the VM's storage I/O could be served locally?

Thanks!

6 Replies
rleon_vm
Enthusiast

Sorry for the bump. I will rephrase my question to make it clearer.

We have:

Preferred-Site: 2 of 3 hosts failed

Secondary-Site: 3 of 3 hosts still healthy

What happens at the VMDK level:

VMs can only access their VMDKs using the Secondary-Site's component copies.

What happens at the VM runtime level:

VMs can still run on the Preferred-Site's remaining host, but access to their VMDKs will have to go through the ISL to the Secondary-Site.

What I'm asking:

In the above scenario, it would be better for HA (or DRS) to simply restart (or vMotion) all affected VMs to the Secondary-Site, ignoring the affinity rules. Any way to do this?

rleon_vm
Enthusiast

Sorry for the necro thread bump.

I created this thread based on VSAN 6.7U3.
Judging by the lack of replies, I reckon the problem probably didn't have a solution at the time.

Just bumping this topic up again to see if there is a solution / fix / workaround for this problem in 7.0U1.

depping
Leadership

"Should" rules are only ignored when HA cannot restart the VM in the "site" you defined, or when DRS sees a significant imbalance. There's nothing else right now, unfortunately.

I have filed a few feature requests around this in the past, and hopefully those will make it in over time.

rleon_vm
Enthusiast

Thanks for confirming. I might have to come up with some PowerCLI workaround for now, at least until this feature is built into HA + DRS + VSAN.

I'm thinking of a script that periodically checks whether - due to an HA restart and the DRS affinity "should" rules - a VM is running in a site where its VSAN disk components are not actually accessible, meaning its disk I/O is going across the inter-site link *shudders*.
If it finds one, it would just vMotion the VM to the other site so the VM regains some kind of "disk read/write locality".
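As a rough illustration of that check, here is a minimal Python sketch of the decision logic only. All the data structures and site names are hypothetical stand-ins; a real implementation would pull the placement and component-accessibility data from vCenter via PowerCLI (e.g. Get-VM) and trigger the migration with Move-VM:

```python
# Sketch of the periodic check described above. The dictionaries below are
# hypothetical stand-ins for data a PowerCLI script would query from
# vCenter / vSAN; only the decision logic is shown.

def plan_migrations(vm_site, accessible_sites):
    """Return {vm: target_site} for VMs running in a site from which
    their vSAN components are not accessible."""
    plan = {}
    for vm, site in vm_site.items():
        sites = accessible_sites.get(vm, set())
        if sites and site not in sites:
            # Components unreachable locally: pick a site that still
            # holds an accessible copy (here, the single surviving site).
            plan[vm] = next(iter(sites))
    return plan

# Example: vm1 runs in Preferred, but its components survive only in Secondary.
vm_site = {"vm1": "Preferred", "vm2": "Secondary"}
accessible = {"vm1": {"Secondary"}, "vm2": {"Secondary"}}
print(plan_migrations(vm_site, accessible))  # → {'vm1': 'Secondary'}
```

The empty-set guard matters: if a VM's components are accessible from no site at all, migrating it would not help, so the script should leave it alone (and probably alert instead).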

depping
Leadership

Sure, but if the rules are still in place, then the VM may migrate back, so that is something to take into consideration!

rleon_vm
Enthusiast

Right.

Note to self: make the PowerCLI script remove affected VMs from the affinity rule after the cross-site vMotion, then send a log entry or email describing what it did so it can be manually undone later once things are fixed.
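That cleanup step can be sketched the same way. This minimal Python mock-up (all names hypothetical) just computes the new rule membership and the audit-log lines; in PowerCLI the actual rule edit would be done with something like Get-DrsVMHostRule / Set-DrsVMHostRule:

```python
# Sketch of the "note to self" above: after the cross-site vMotions, drop
# the migrated VMs from the "should" rule's membership and record what was
# done so it can be re-added by hand once the site recovers. The rule
# membership list is a hypothetical stand-in for the real DRS rule object.

def update_rule(rule_vms, migrated):
    """Return (remaining_vms, log_lines) after removing migrated VMs."""
    remaining = [vm for vm in rule_vms if vm not in migrated]
    log = [f"Removed {vm} from affinity rule after cross-site vMotion; "
           f"re-add manually once the site recovers."
           for vm in sorted(migrated)]
    return remaining, log

remaining, log = update_rule(["vm1", "vm2", "vm3"], {"vm1"})
print(remaining)  # → ['vm2', 'vm3']
print(log[0])
```

Emitting the log before applying the rule change is the safer order: even if the script dies mid-run, there is a record of which VMs were pulled out of the rule.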

 
