Solved: 'Performance degradation VMs tolerate' setting: based on which memory metric?

roland_geiser · ‎12-11-2018

Hello

I am trying to find out which memory metric the 'Performance degradation VMs tolerate' setting in the Admission Control menu is based on.

Let's do an example:

I have 10 ESXi hosts, each with 512 GB RAM, resulting in a total of 5 TB of RAM. Used memory in the cluster is 2.87 TB (if I sum up the memory consumed value of all VMs manually, I get virtually the same amount).

Some VMs do have reservations, most VMs don't. The total amount of reservations in the cluster is 1.2 TB.

Admission Control is configured like this:

Normally, I set 'Host failures cluster tolerates' to five hosts, so 50 percent of my cluster, which is stretched over two sites, can fail. In this example let's set it to six hosts. This results in this graphic:

OK, above we see the gray shaded bar, which is my 60 % failover capacity. I also see the reserved memory in blue. So in this state, every VM with a reservation of course receives its reservation, and the rest of the available memory (for the VMs with no or only partial reservations) is the light gray bar. This is approximately (5 TB * 0.4 - 1.2 TB) = 0.8 TB RAM.

From this it follows that

in the normal state (10 hosts available) the VMs are using 2.87 TB (of which 1.2 TB is reserved) --> unreserved capacity occupied by all VMs: 1.67 TB
in a failover state (4 hosts available) the VMs can only use 2 TB (of which 1.2 TB is still reserved) --> unreserved capacity available to all VMs: 0.8 TB
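
The two figures above can be reproduced with a few lines of arithmetic. This is only a sketch of the calculation in the post, not of anything vSphere does internally; the function and constant names are made up for illustration, and the values are the rounded ones from the post (5 TB taken as 5000 GB).

```python
# Figures from the post, in GB. The thread rounds 10 x 512 GB to 5 TB.
TOTAL_GB = 5000
HOSTS = 10
RESERVED_GB = 1200
CONSUMED_GB = 2870


def capacity_after_failures(total_gb, hosts, failed_hosts):
    """Memory left when `failed_hosts` of `hosts` equal-sized hosts are gone."""
    return total_gb * (hosts - failed_hosts) // hosts


def unreserved(amount_gb, reserved_gb):
    """Part of `amount_gb` that is not covered by reservations."""
    return amount_gb - reserved_gb


# Normal state: unreserved memory the VMs currently occupy.
normal_unreserved_in_use = unreserved(CONSUMED_GB, RESERVED_GB)

# Failover state with 6 hosts lost: unreserved memory still available.
failover_unreserved = unreserved(
    capacity_after_failures(TOTAL_GB, HOSTS, 6), RESERVED_GB
)

print(normal_unreserved_in_use, failover_unreserved)  # 1670 800
```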

So in a failover state (4 hosts running), the VMs without reservations have to squeeze into less memory. But it seems that there is still enough capacity, and there is no warning that the running VMs' utilization cannot satisfy the configured failover resources on the cluster. Although the VMs have less memory available, there is not yet any performance degradation!

OK, let's try with 'Host failures cluster tolerates' set to 7 hosts. Of course, the gray shaded bar gets longer, the blue bar for reserved memory stays the same, and the gap between them (the non-reserved memory) gets very small, only approximately 0.3 TB. So now, after waiting a few minutes, the cluster complains about insufficient failover resources...

So I am curious which memory metric is used for this calculation. I don't think it is only memory active, because a calculation based only on memory active seems too aggressive in my opinion: if only memory active is taken into consideration, one can expect some performance degradation. Is it perhaps memory active plus a certain percentage as a buffer?

Does anyone here know more about how this calculation works? I would like to get a feeling for how my VMs will perform in a failover state (apart from the algorithm telling me that the VMs don't have memory issues and that "the same performance is guaranteed...").

Best regards

Roland

stgepopp · ‎12-14-2018

Hi Roland,

According to page 133 of the Clustering Deep Dive book, which you can hopefully still download here (http://pages.rubrik.com/clustering-deep-dive-ebook.html?utm_campaign=authors ), the metric used is: memory actively used

This metric is a statistical one, which means it is based on a small sample out of a large set of data (and is sometimes not absolutely reliable). But shortly after a restart of tens of VMs on a new host, this metric (in conjunction with your setting) is good enough to decide whether VMs will run with degraded performance or not.

Erich

MattMeyer · ‎12-11-2018

I believe this is using consumed memory + memory overhead. That is the same metric that is used for the graph in the first screenshot as you already noted. The number of available resources is calculated by removing N hosts set in the Admission Control setting. If the sum of consumed memory exceeds cluster resources after N failures are removed, you get the warning. This warning is not controlled by reservations.

roland_geiser · ‎12-11-2018

Hello

No, I don't think it is based on consumed memory. When I set 'Host failures cluster tolerates' to 6, my available memory in a failure state is 5 TB * 0.4 = 2 TB. Memory consumed in the cluster is 2.87 TB. But as mentioned: in this configuration the warning does not appear! I have to set the setting to 7 hosts, where the available non-reserved memory is only about 0.3 TB. Only then does the warning pop up.
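
This objection can be checked numerically. The sketch below tests the consumed-memory hypothesis as described in the previous reply (warn when consumed memory exceeds what is left after N host failures). It is an illustration with made-up names, not VMware's implementation, and it uses the rounded values from the thread.

```python
# Values from the thread, in GB (5 TB rounded to 5000 GB).
TOTAL_GB = 5000
HOSTS = 10
CONSUMED_GB = 2870


def capacity_after_failures(failed_hosts):
    """Cluster memory left after `failed_hosts` equal-sized hosts fail."""
    return TOTAL_GB * (HOSTS - failed_hosts) // HOSTS


def warns_if_consumed_based(failed_hosts):
    """Hypothesis under test: warn when consumed memory exceeds capacity."""
    return CONSUMED_GB > capacity_after_failures(failed_hosts)


for n in (6, 7):
    print(n, capacity_after_failures(n), warns_if_consumed_based(n))
# 6 2000 True
# 7 1500 True
```

Under this hypothesis a warning would already be expected at 6 hosts, which does not match the observed behavior (no warning at 6, warning at 7).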

So VMware is probably really calculating with 'memory active' (plus a small percentage of buffer, maybe), and if a VM gets its memory active in the failure state, that is considered fine and "there is no performance degradation"... ?

I know that this setting has nothing to do with reservations. It is rather made for people who don't work with reservations for every VM, to get a feeling for how their VMs will perform in a disaster. Because we all know that without reservations we can start an almost unlimited number of VMs, but they will perform poorly and swapping will occur when the hosts are heavily overloaded...

MikeStoica · ‎12-12-2018

"Memory is calculated by taking the total amount of resources in a cluster and from this the virtualization overhead like agents and the VMkernel is subtracted" vSphere HA admission control calculation for memory - VMware vSphere Blog

roland_geiser · ‎12-12-2018

This is true for the percentage-based failover capacity. But that is not the same as the 'Performance degradation VMs tolerate' setting.

roland_geiser · ‎12-14-2018

Hello Erich

Thank you for your input. I saw that information too. It seems that "memory actively used" is indeed the value that VMware uses for its calculation. Because I have to set 'Host failures cluster tolerates' to 7, which means there is only a very small amount of (non-reserved) memory available, there is evidence that they use memory active.

I know what memory active means. Several tools, for example the "Oversized VM Report" from Veeam ONE, base their calculations upon memory active. So I think Veeam calculates memory active over a certain period, takes the peak value and proposes that peak plus a buffer of ten or twenty percent as the recommended memory configuration. I don't know how this is calculated by vCOPS; that would be interesting to know.
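
The "peak plus buffer" idea guessed at above could look roughly like the following. This is only a sketch of that guess, not Veeam's or VMware's actual algorithm; the function name, the sample values and the 20 % default buffer are assumptions for illustration.

```python
def recommended_memory_gb(active_samples_gb, buffer_fraction=0.2):
    """Peak of the observed 'memory active' samples plus a safety buffer.

    `active_samples_gb` is a hypothetical list of memory-active readings
    (in GB) collected over some observation period.
    """
    if not active_samples_gb:
        raise ValueError("need at least one memory-active sample")
    peak = max(active_samples_gb)
    return peak * (1 + buffer_fraction)


# Hypothetical readings for a VM whose memory active hovers around 1 GB.
samples = [0.8, 1.0, 0.9]
print(recommended_memory_gb(samples))
print(recommended_memory_gb(samples, buffer_fraction=0.1))
```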

For me, if I see a VM with for example 16 GB of RAM and I see that memory active is constantly about 1 GB, I know I can configure the VM with less memory without any trouble. I will then probably go to 8 GB, or to 4 GB or even 2 GB, but the more aggressively I decrease the memory, the more closely I observe the VM for a while (probably by means of the guest operating system's memory counters) to make sure the VM still performs well.

I think if I configure a VM with exactly the (peak) amount of memory active, there is a good chance of performance degradation. But if VMware really does calculate with "memory actively used" = memory active, in their eyes the same performance is guaranteed after a VM restart...? And we know that memory active is also very high for a couple of minutes during a VM restart...

So if I check my VCSA: configured memory 16 GB, memory active = 2 GB. Memory utilization according to the VAMI Memory Utilization Trending chart: 50%. Would anybody really configure the memory to 2 GB and expect the same performance? I am not sure whether this calculation is too aggressive.

roland_geiser · ‎12-19-2018

I have checked the memory active on the cluster: 'Active Guest Memory' is somewhat over 300 GB. So, as described above, the warning 'Running VMs utilization cannot satisfy the configured failover resources on the cluster' pops up when the available unreserved memory for the VMs is approximately 0.3 TB (Host failures cluster tolerates = 7). If I have roughly 0.8 TB of unreserved memory available (Host failures cluster tolerates = 6), the warning does not appear. So this confirms the theory that VMware is in fact working with memory active.
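
The same check can be written down for the memory-active hypothesis. In this sketch the warning is assumed to fire when active guest memory exceeds the unreserved memory left after N host failures, which is how the comparison is described in this thread; it is not taken from VMware documentation. `ACTIVE_GB = 310` is an assumed stand-in for "somewhat over 300 GB", and the tolerated degradation percentage is not modeled because its value is not given here.

```python
# Values from the thread, in GB (5 TB rounded to 5000 GB).
TOTAL_GB = 5000
HOSTS = 10
RESERVED_GB = 1200
ACTIVE_GB = 310  # assumption: "somewhat over 300 GB"


def unreserved_after_failures(failed_hosts):
    """Unreserved memory left after `failed_hosts` equal-sized hosts fail."""
    remaining = TOTAL_GB * (HOSTS - failed_hosts) // HOSTS
    return remaining - RESERVED_GB


def first_warning_setting():
    """Smallest 'Host failures cluster tolerates' value that would warn."""
    for n in range(1, HOSTS):
        if ACTIVE_GB > unreserved_after_failures(n):
            return n
    return None


print(unreserved_after_failures(6), unreserved_after_failures(7))  # 800 300
print(first_warning_setting())  # 7
```

With these numbers the first setting that triggers the warning is 7, which matches what was observed on the cluster.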

depping · ‎12-19-2018

Just to be clear, the content in the vSphere 6.7 Clustering Deep Dive book was reviewed by the HA/DRS developers, so it is good to see they provided us with the correct info.

roland_geiser · ‎12-20-2018

The Clustering Deep Dive eBook is really a great read. Compliments!
