Solved: Reported Memory usage on SQL VMs

Algon2 · ‎11-03-2021

Hi

I'm just trying to be clear on this. We've got a number of Windows SQL VMs running in our environment. The memory usage reported within the OS is completely different from what ESXi/vCenter is reporting.

I initially assumed that this was because SQL Server reserves all the memory it can, up to any limits set within SQL Server itself, which is why the OS reports what it does. I therefore took the active memory reported by vCenter for the VM to be more accurate, because the hypervisor can see exactly what physical memory is being used.

However, the memory figure shown in vCenter is sometimes so small that it's hard to believe the SQL VM isn't using more than what's shown.

Is there something I'm missing here or is my understanding correct?

Thanks in advance

vbondzio · ‎11-03-2021

They are not the same. Active / touched on ESXi is a sample-based estimate: every minute, 100 different random pages are checked for whether they are actually read from / written to. Guest reported sum of all working sets (in OS "used memory", guest virtual to guest physical mapping) is usually waaaaay above that. The active metric was only ever intended for entitlement calculation, i.e. a black-box approach to determining which VM's memory was "busier" (actively reading from / writing to memory). It somehow ended up in the UI, and now we have what is a very common misunderstanding.

Algon2 · ‎11-03-2021

Thanks vbondzio

"Guest reported sum of all working sets (in OS "used memory", guest virtual to guest physical mapping) is usually waaaaay above that"

Do you mean the counters taken from within the guest OS? Are you implying it would be more accurate to obtain mem usage from within the guest itself?

vbondzio · ‎11-04-2021

I'm saying that vSphere-level counters measure different things than the in-guest counters. They are both accurate in their own right. Our activity / touched metric is sample-based: every minute, 100 random pages are unmapped, reads and writes to them are trapped and counted, and the untouched remainder is remapped. Then another 100 random pages across the VM's mapped memory are unmapped and the cycle begins anew.
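
The sampling cycle described above can be approximated with a toy simulation. This is only a sketch, not ESXi's implementation: the function name `estimate_active_fraction`, the set of touched page numbers, and the use of Python's `random` module are illustrative assumptions, and the real hypervisor traps accesses rather than consulting a precomputed set.

```python
import random

def estimate_active_fraction(touched_pages, total_pages, sample_size=100, rng=None):
    """Estimate the fraction of mapped pages touched during one sampling period.

    touched_pages: set of page numbers the guest read or wrote in the period
                   (hypothetical stand-in for trapped accesses).
    total_pages:   number of pages mapped to the VM.
    """
    rng = rng or random.Random()
    # Pick distinct random pages across the VM's mapped memory.
    sample = rng.sample(range(total_pages), min(sample_size, total_pages))
    # Count how many of the sampled pages were touched.
    hits = sum(1 for page in sample if page in touched_pages)
    return hits / len(sample)


# Usage: a VM with 1,000,000 mapped pages where only the first 5% are touched.
total = 1_000_000
touched = set(range(total // 20))
estimate = estimate_active_fraction(touched, total, rng=random.Random(0))
print(f"estimated active fraction: {estimate:.2f}")
```

The result is an estimate of how busy memory is, not a count of what the guest has allocated, which is why it can sit far below the in-guest figure.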

What you see in the guest is what applications requested from the OS. An app doesn't have to touch everything at all times; some data can just be resident in memory without being read or written on a regular basis. For example, imagine calc.exe wanted 2GB of memory at some point in the past. If there was enough memory available (free / zero / standby), it will have gotten it. Now, an hour later, it might only touch 100KB of those pages for upkeep while it is running, outside of whatever task required the 2GB. If it never yielded / gave up the memory (that's up to the application), Windows won't take it away either. Windows will only reclaim / (guest) swap to disk when available memory is nearly exhausted (let's ignore modern Windows memory compression, because MSFT made fun of us when we introduced it years ago).

At that point, it will look at calc.exe's memory, notice that most pages haven't been touched in forever, and swap out the stale memory so another process can get the pages after they have been freed and zeroed. It has to swap because you can't just drop memory / information an application might expect to be there. So which counter would have been "more accurate" here? The guest ones that showed 2GB, or (assuming you could get to that level of granularity) vSphere, which showed only 100KB active / touched?
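
To put rough numbers on the granularity caveat: with 100 sampled pages per period, each sampled page stands for about 1% of mapped memory. The sketch below assumes, purely for illustration, that the VM has exactly 2GB mapped, that sampling is uniform, and that GB/KB are binary units; the helper names are hypothetical.

```python
def sample_resolution_bytes(mapped_bytes, sample_size=100):
    """Memory represented by one sampled page in a single sampling period."""
    return mapped_bytes / sample_size

def expected_hits(touched_bytes, mapped_bytes, sample_size=100):
    """Expected number of sampled pages that land in the touched region."""
    return sample_size * touched_bytes / mapped_bytes

GIB = 1024 ** 3
KIB = 1024

mapped = 2 * GIB      # assumption: the whole VM maps 2GB
touched = 100 * KIB   # the 100KB calc.exe still touches

print(sample_resolution_bytes(mapped) / (1024 ** 2))  # MiB represented per sampled page
print(expected_hits(touched, mapped))                  # far below one hit per period
```

Under these assumptions a 100KB hot region would almost always register as zero in any single period, so the 2GB vs. 100KB comparison is a thought experiment rather than something the counter could actually display.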

It really depends on the application. Do you just have a greedy / opportunistic app that gobbles up all the memory it wants? The OS will let it until something else needs it. Do you have an application that is super latency-sensitive and has a somewhat random memory access pattern that is not very "active" but hurts tremendously if pages are (guest) swapped out? Then trusting "active" to tell you how much memory the guest needs / how much can be reclaimed (even just with the balloon driver) might not be the best idea.

TL;DR:
You usually don't need all of "consumed" (vSphere level counter, real machine memory mapped to the guest, can include guest zero / free / standby / cache / stale etc.)
You usually need less than "memory usage" (Windows guest counter, doesn't include zero / free / standby, does include all actively mapped working sets so also "stale" application memory)
---> sweet spot for "right sizing" is usually here
You usually need _more_ than "active / touched" (vSphere level counter, heuristic, doesn't care about in guest memory metrics)
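
The ordering in the TL;DR can be written down as a simple check. This is a minimal sketch with hypothetical function names and made-up numbers; it only encodes the "usually" rules of thumb above and is not a sizing formula.

```python
def right_size_window(active_mb, guest_used_mb):
    """Return (lower, upper) bounds in MB for a candidate memory size.

    lower: vSphere "active / touched" (you usually need more than this)
    upper: in-guest "memory usage"    (you usually need less than this)
    """
    if active_mb > guest_used_mb:
        raise ValueError("expected active <= guest used memory")
    return active_mb, guest_used_mb

def classify(candidate_mb, active_mb, guest_used_mb):
    """Place a candidate VM memory size relative to the window."""
    lower, upper = right_size_window(active_mb, guest_used_mb)
    if candidate_mb <= lower:
        return "likely too small"
    if candidate_mb >= upper:
        return "likely more than needed"
    return "in the usual sweet-spot window"


# Hypothetical SQL VM: 2 GB active, 24 GB used in the guest, 16 GB candidate size.
print(classify(16384, active_mb=2048, guest_used_mb=24576))
```

Where inside the window to land still depends on the application, e.g. how badly it suffers when stale pages are swapped out.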

We do have a metric in vROps, Guest Memory Needed, that uses a private API to read guest counters via Tools. It is guest "used" memory + some cache / standby. IMO it's not aggressive enough, esp. since IO is getting cheaper and cheaper. Just my 2 cents though.

I attached the slides of the 2018 PBP session. The recording is sadly no longer available, but I hope the PDF helps you understand some of what I have written above.

Algon2 · ‎12-08-2021

I meant to reply to this when it was initially posted, so apologies for not posting my reply sooner.

Thanks loads for taking the time to write this detailed response. It was well explained and I found it extremely helpful.

Cheers 🙂
