atmosphere23
Contributor

Admission Control - [Sanity Check] Unbalanced cluster w/ no VM reservations

I've been spending time reviewing Admission Control policy and am looking for a "sanity check" that the calculations and findings for my environment are accurate.  I currently manage an unbalanced cluster in which one host (host 5) has significantly more memory than the rest.  I've read in multiple sources, including the 5.1 deepdive from Duncan and Frank, that the "host failures the cluster tolerates" policy will waste resources in this scenario to ensure the largest host is protected.  However, the math below shows that this policy is almost dead-on with what the percentage-based policy would cover in an N+1 scenario (again ensuring the largest host is protected).

The breakdown is:

              Host 1   Host 2   Host 3   Host 4   Host 5   Host 6   Host 7   Totals
Memory (GB)   48       72       64       64       128      64       72       512
Memory %      9.38     14.06    12.5     12.5     25       12.5     14.06
CPU (GHz)     18.08    19.12    19.12    18.08    19.12    19.12    19.12    131.76
CPU %         13.78    14.5     14.5     13.78    14.5     14.5     14.5

In an N+1 scenario, the recommended percentages work out to 25% for memory ((128 / 512) * 100) and 15% for CPU ((19.12 / 131.76) * 100, rounded up), so that the largest host is protected.  I realize we could reduce waste by lowering these values, but that would ultimately defeat the point of HA if the largest host failed (granted, it would need to be heavily utilized) and there were not enough resources for the HA-initiated restarts.
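The N+1 percentages can be recomputed with a short script. This is only a sketch using the per-host figures from the breakdown above; the function and variable names are illustrative and not part of any VMware tooling.

```python
import math

# Per-host capacities from the breakdown above (hosts 1 through 7).
HOST_MEMORY_GB = [48, 72, 64, 64, 128, 64, 72]
HOST_CPU_GHZ = [18.08, 19.12, 19.12, 18.08, 19.12, 19.12, 19.12]


def largest_host_share(capacities):
    """Return the largest host's share of total capacity, as a percentage."""
    return max(capacities) / sum(capacities) * 100


mem_pct = largest_host_share(HOST_MEMORY_GB)
cpu_pct = largest_host_share(HOST_CPU_GHZ)

# Round up so the reserved percentage always covers the largest host.
print(f"memory: {mem_pct:.2f}% -> reserve {math.ceil(mem_pct)}%")
print(f"cpu:    {cpu_pct:.2f}% -> reserve {math.ceil(cpu_pct)}%")
```

Rounding up rather than to the nearest integer is a deliberate choice here: rounding down would reserve slightly less than the largest host provides.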

The policy currently in use is "host failures the cluster tolerates", with the default slot sizes (32 MHz for CPU and 217 MB of memory overhead) and no VM reservations.  We do use RP reservations, but they don't come into play in the HA slot size calculation, if I understand correctly.  Our main cluster currently has 2210 total slots (1540 available) given the hosts and resources above; 138 slots are used by powered-on VMs and 532 are reserved for failover.  For memory, this works out to about 112.7 GB ((532 * 217) / 1024), which is 22% of the total memory in the cluster.  For CPU, it works out to about 17.02 GHz ((532 * 32 MHz) / 1000), which is roughly 13% of the total CPU in the cluster.

So if I switched to the percentage-based policy with best practices in mind, I would reserve slightly more memory and CPU than I do now (25% versus 22% for memory, 15% versus 13% for CPU).  Are there any underlying issues with this configuration that I'm missing, or is my analysis accurate given the configuration?  Or would you recommend implementing the percentage-based policy and setting the reserved percentages below the size of the largest host in the cluster to gain available resources?  It's a "catch-22", because we obviously want to maximize available resources in the cluster while still ensuring adequate failover resources.
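The slot arithmetic above can be rechecked with a few lines of Python. This is a sketch that only restates the numbers quoted in the post (slot sizes, slot counts, cluster totals); the variable names are made up for illustration.

```python
# Figures quoted in the post; nothing here is read from a real cluster.
SLOT_MEM_MB = 217        # default memory slot size (overhead only)
SLOT_CPU_MHZ = 32        # default CPU slot size
TOTAL_SLOTS = 2210
AVAILABLE_SLOTS = 1540
USED_SLOTS = 138
FAILOVER_SLOTS = 532
TOTAL_MEM_GB = 512
TOTAL_CPU_GHZ = 131.76

# The three slot buckets should add up to the cluster total.
assert AVAILABLE_SLOTS + USED_SLOTS + FAILOVER_SLOTS == TOTAL_SLOTS

# Resources represented by the failover slots.
reserved_mem_gb = FAILOVER_SLOTS * SLOT_MEM_MB / 1024
reserved_cpu_ghz = FAILOVER_SLOTS * SLOT_CPU_MHZ / 1000

# The same amounts as a share of the whole cluster.
mem_share = reserved_mem_gb / TOTAL_MEM_GB * 100
cpu_share = reserved_cpu_ghz / TOTAL_CPU_GHZ * 100

print(f"memory: {reserved_mem_gb:.2f} GB ({mem_share:.1f}% of cluster)")
print(f"cpu:    {reserved_cpu_ghz:.2f} GHz ({cpu_share:.1f}% of cluster)")
```

With these inputs the memory figure comes to about 112.7 GB (roughly 22% of 512 GB) and the CPU figure to about 17.0 GHz (roughly 13% of 131.76 GHz).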

depping
Leadership

Take a step back first and look at the number of slots you have, then compare that to the number of virtual machines you want to run. Do you really expect to power on another 1500 virtual machines (that is how many slots you have available)?

Realistically speaking, that won't happen: you would be overcommitting to the point where most VMs become unusable because they would be swapping memory constantly.

I can understand your question, but in reality, when you have this amount of resources, are NOT using any reservations, and don't have one really large virtual machine that throws off the algorithm, you have nothing to worry about. So at this point I wouldn't worry. You obviously have a good understanding of how Admission Control works and what actually causes the number of slots to go up (or down), and as such I would say you are well prepared.
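To illustrate the point about one large virtual machine throwing off the algorithm, here is a rough sketch of slot counting. It assumes a simplified model in which each host contributes the smaller of its memory-based and CPU-based slot counts, and failover capacity is the total minus the host with the most slots. The 8 GB reservation is a purely hypothetical value, and the function name is invented for this example.

```python
# Host capacities from the original post, converted to MB and MHz.
HOST_MEM_MB = [gb * 1024 for gb in (48, 72, 64, 64, 128, 64, 72)]
HOST_CPU_MHZ = [18080, 19120, 19120, 18080, 19120, 19120, 19120]


def cluster_slots(slot_mem_mb, slot_cpu_mhz):
    """Return (total slots, slots left after losing the largest host).

    Simplified model: ignores any capacity the hosts keep for themselves.
    """
    per_host = [
        min(mem // slot_mem_mb, cpu // slot_cpu_mhz)
        for mem, cpu in zip(HOST_MEM_MB, HOST_CPU_MHZ)
    ]
    total = sum(per_host)
    return total, total - max(per_host)


# Default slot size from the post: 217 MB and 32 MHz, no reservations.
default_total, default_usable = cluster_slots(217, 32)

# Hypothetical: a single VM with an 8 GB memory reservation sets the slot size.
big_total, big_usable = cluster_slots(8192, 32)

print(f"default slots:     {default_total} total, {default_usable} after failover")
print(f"8 GB reservation:  {big_total} total, {big_usable} after failover")
```

Because this model ignores whatever capacity the hosts hold back for themselves, it is not expected to reproduce the 2210 slots reported above. The contrast is what matters: under the same model, one hypothetical 8 GB reservation shrinks the cluster from thousands of slots to 64.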

Thanks for reading and buying the book by the way, much appreciated!

atmosphere23
Contributor

Thanks for the quick response.

atmosphere23
Contributor

Good article, and thanks again for the detailed follow-up.  After your initial response I was debating using VM-level reservations, with the idea that the slot sizes would then reflect more realistic data, but the idea of managing such reservations is very unappealing, and given the dynamic nature of RP reservations I will continue to use those in the cluster.
