VMware Cloud Community
grilled_cheese
Contributor

DRS vMotion of a large VM during the failure of an ESXi host

Does anyone know how this scenario would be handled?

I have 3 ESXi hosts, each with 500GB of RAM.

Host A has 1 VM (VM_A) with 450GB of RAM.

Host B has various VMs using a total of 200GB.

Host C has various VMs using a total of 200GB.

If Host A fails, VM_A would need to fail over to either Host B or Host C, but neither has enough free memory for it.
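
To put numbers on it, here is a minimal sketch of the arithmetic in Python. The function names are made up for illustration, and it treats "using 200GB" as memory that cannot be shared, which would only strictly hold if those VMs had full memory reservations.

```python
# Figures taken from the scenario above (all in GB).
HOST_CAPACITY_GB = 500
used_gb = {"Host B": 200, "Host C": 200}
VM_A_GB = 450


def free_gb(host):
    """Memory left on a surviving host before VM_A arrives."""
    return HOST_CAPACITY_GB - used_gb[host]


def shortfall_gb(host, vm_gb=VM_A_GB):
    """How much would have to be freed on the host before the VM fits (0 if it fits)."""
    return max(0, vm_gb - free_gb(host))


for host in used_gb:
    print(host, "free:", free_gb(host), "GB, shortfall for VM_A:", shortfall_gb(host), "GB")
```

Each surviving host has 300GB free, so 150GB of smaller VMs would have to move off one of them before VM_A fits without overcommitting memory.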

Would the vMotion:

A. start VM_A on Host B and after a period of time migrate the existing VMs to Host C, since Host B is now overloaded?

or

B. migrate VMs away from Host B to Host C until there are enough resources for VM_A, then start VM_A on Host B?

I know you can choose that VM_A does not power on until there are sufficient resources, but is vMotion smart enough to make room for it before it is powered on?

I don't want to get into a situation where it starts on a host which is overloaded, even for a short while, or where it doesn't power on at all and requires manual intervention.

1 Solution

Accepted Solutions
ZibiM
Enthusiast

Hi,

First of all, please read the resources recommended by Scott.

Second, you need to decide whether you are thinking about host evacuation (really a DRS event) or host failure (an HA event).

Third, there are lots of options that affect the outcome.

A few remarks:

Traditional scenario, with no admission control and no reservations:

After Host A fails, HA will power on the big VM on the first available host (B or C), and then DRS will start moving the small VMs to the other host.

Non-traditional scenario, with reservations:

HA won't be able to power on the big VM at the first attempt, but it will notify DRS about that.

DRS will in turn attempt to make enough room by migrating VMs (for example from B to C).

HA checks regularly and makes several attempts to power on the VMs from failed hosts.

AFAIR the last HA attempt is made about 30 minutes after the failure, in order to give DPM time to power on hosts that are in standby.

Sooner or later your hosts will have enough room to fit the big VM, and it will be powered on.
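
The retry behaviour described above can be sketched as a toy simulation. This is not how HA or DRS are implemented: the per-VM sizes, the "one migration per attempt" rule, and the function name are all assumptions made up for illustration, and every VM is assumed to reserve all of its memory.

```python
HOST_CAPACITY_GB = 500
VM_A_RESERVATION_GB = 450


def simulate_restart(host_b_vms, host_c_vms, max_attempts=5):
    """Toy model: return (attempt on which VM_A starts on Host B, or None)
    together with the list of VM sizes migrated from B to C along the way."""
    b, c = list(host_b_vms), list(host_c_vms)
    migrations = []
    for attempt in range(1, max_attempts + 1):
        # "HA" attempt: is there enough unreserved memory on Host B?
        if HOST_CAPACITY_GB - sum(b) >= VM_A_RESERVATION_GB:
            return attempt, migrations
        # "DRS" step: move the largest VM that still fits on Host C.
        movable = [vm for vm in sorted(b, reverse=True)
                   if sum(c) + vm <= HOST_CAPACITY_GB]
        if not movable:
            return None, migrations
        b.remove(movable[0])
        c.append(movable[0])
        migrations.append(movable[0])
    return None, migrations


# Hypothetical layout: four 50GB VMs on each surviving host.
print(simulate_restart([50, 50, 50, 50], [50, 50, 50, 50]))
```

With that hypothetical layout, three 50GB VMs move from B to C and VM_A starts on the fourth attempt. If Host C had no room to take anything, the function returns None, which corresponds to the "requires manual intervention" case.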

If you use reservations, consider enabling admission control based on resource percentage.

It will prevent you from shooting yourself in the foot.
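
As a rough illustration of what the percentage-based check looks at, here is a small sketch. It assumes the memory failover capacity is calculated as (total host memory - total reserved memory) / total host memory, ignores per-VM memory overhead, and assumes the VMs on Hosts B and C reserve all 400GB they use. The function names and the 33% / 50% settings are examples only.

```python
def failover_capacity_pct(total_gb, reserved_gb):
    """Share of cluster memory still unreserved, as a percentage."""
    return 100.0 * (total_gb - reserved_gb) / total_gb


def power_on_allowed(total_gb, reserved_gb, new_reservation_gb, configured_pct):
    """True if powering on the new VM keeps unreserved capacity at or above
    the configured failover percentage."""
    remaining = failover_capacity_pct(total_gb, reserved_gb + new_reservation_gb)
    return remaining >= configured_pct


TOTAL_GB = 3 * 500          # three 500GB hosts
ALREADY_RESERVED_GB = 400   # assumption: VMs on Hosts B and C fully reserved

print(failover_capacity_pct(TOTAL_GB, ALREADY_RESERVED_GB + 450))
print(power_on_allowed(TOTAL_GB, ALREADY_RESERVED_GB, 450, configured_pct=33))
print(power_on_allowed(TOTAL_GB, ALREADY_RESERVED_GB, 450, configured_pct=50))
```

Under these assumptions about 43% of cluster memory stays unreserved, so VM_A with a 450GB reservation is admitted at a 33% setting and refused at 50%. Note that this is a cluster-wide total: it does not say that any single host has 450GB free, which is where the DRS migrations come in.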

6 Replies
scott28tt
VMware Employee

Your scenario isn’t specific to vMotion at all; you’re actually asking about HA (failover) and DRS (compute resource management).

vMotion is merely the live migration mechanism used by the dynamic balancing function of DRS.


-------------------------------------------------------------------------------------------------------------------------------------------------------------

Although I am a VMware employee I contribute to VMware Communities voluntarily (ie. not in any official capacity)
VMware Training & Certification blog
scott28tt
VMware Employee

This may help: Using vSphere HA and DRS Together

As the article mentions the priority is availability, so that’s HA.

Unless your VMs are set with memory reservations, there will be “room” for your big VM to fail over; it will just contend with the other VMs on whichever host it fails over onto.

It will then be down to DRS to balance the VMs across the 2 remaining hosts (using vMotion as necessary).
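
A quick worked example of what that contention and rebalancing look like with the numbers from the question. This is only arithmetic, with made-up function names, and it uses configured memory, not what the guests are actively touching.

```python
HOST_CAPACITY_GB = 500


def overcommit_ratio(configured_gb):
    """Configured VM memory divided by physical host memory."""
    return configured_gb / HOST_CAPACITY_GB


# Right after failover: VM_A (450GB) lands next to Host B's existing 200GB.
host_b_after_failover = 450 + 200
print(overcommit_ratio(host_b_after_failover))

# If DRS then moved all of Host B's smaller VMs over to Host C:
host_b_balanced = 450
host_c_balanced = 200 + 200
print(overcommit_ratio(host_b_balanced), overcommit_ratio(host_c_balanced))
```

So the host that receives VM_A is overcommitted by a factor of about 1.3 until DRS moves things, and in the best case both hosts end up below 1.0 (450GB and 400GB on 500GB hosts).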


grilled_cheese
Contributor

Assuming we use memory reservations, I'm wondering if we would run into a situation where it simply fails because 'not enough resources available' as opposed to moving things around to allow enough resources.

I guess the real answer is "test it and see"

scott28tt
VMware Employee

HA would have prevented you from powering on all your VMs in the first place (even with all hosts available) if their failover could not be guaranteed.

See: vSphere HA Admission Control


grilled_cheese
Contributor

OK, thanks, this is what I was looking for!

"Considering non traditional scenario with reservation

HA won't be able to power on big VM at the first attempt, but will notify DRS about that.

DRS in turn will attempt to make enough room by making the migrations (like from B to C)

HA makes regular checks and makes several attempts to power on VMs from failed hosts.

AFAIR last HA attempt is made like 30 min after failure in order to allow DPM to power on suspended hosts from stand by.

Sooner or later your hosts will have enough room to fit this big VM and power it on."
