We are running ESX 3.0.2 without DRS, and we monitor free memory to determine whether VMs need to be manually moved to another ESX host or whether we need to add a new one. Our limit is 90% memory used.
Now we are deploying ESX 4.0 U2 with manual DRS, and we notice that memory is fully used by ESX, even when the guest operating systems have not allocated it. The reason is explained here:
On this page, one can see:

"[…] planning requires a thorough understanding of the average and the peak memory demands of each virtual machine and resource pool, as well as the implications of their reservation, limit, and share settings. The risk and reward of performance and memory overcommitment also must be closely […] You should base your capacity planning decisions on the examination of multiple memory performance metrics and not the host memory usage."
So, my question is: how can we know when we will start to see performance issues because memory is too heavily overcommitted?
==> What are the "multiple memory performance metrics" ?
Are there any alarms that could be set? For example "host swap page write", but what are the right triggers?
high : no memory overcommitment
soft : ballooning
hard : host swapping
Is there a way to get a warning if ballooning becomes too high, or if swapping starts?
Thanks for any help.
Do your hosts support RVI or EPT? If so, it's most likely that your VMs use "all" of their assigned RAM now, because large pages are used whenever possible in vSphere, and thus transparent page sharing doesn't kick in until the host starts swapping.
I would suggest you test your VMs by limiting their memory until it hurts performance, then remove the limit and set a reservation for that amount instead. That way you guarantee that your VMs will always have as much memory as they need, and ballooning or swapping cannot hurt them. It also lets you overcommit as much memory as possible. Don't rely on manually watching for alarms and taking action.
In response to the first part of your post, I think Martin gave a good potential answer. Do your hosts have the new Nehalem processors (Intel Xeon 55xx series) or the newer AMD chips? If so, vSphere takes advantage of this and uses "large memory pages" by default. It's a nice performance gain (20 to 25 percent), but it causes the VM to consume all resources that are allocated to it (similar to what you might see with a SQL Server box). This may seem like a bad thing at first, but ESX will reclaim memory from these VMs on an as-needed basis with little to no suffering from your VMs. To see if this is what is happening on your end, go into your vSphere client, look at the memory performance for one of your VMs, and check the "memory consumed" metric. If we're on the right track, you'll see that it is flat-lining at or close to 100% of the allocated memory value.
As for memory overcommitment, you will not encounter a performance issue just because you allocate too much memory to VMs on a particular host. Pretend you have a host with 4GB memory and you put two VMs on that host, each with a 4GB allocation. If each VM actively uses (looking at the "memory active" metric) 1GB at peak workload, you will never run into a performance issue, because your host will never go above 50% utilization of memory resources. When the sum of memory active across all VMs begins to get close to the total memory on the host, that's the time to be cautious.
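The rule of thumb above can be sketched in a few lines. This is only an illustration of the arithmetic; the function name, the 85% warning ratio, and the sample numbers are assumptions, not anything from vSphere itself:

```python
# Hypothetical sketch: decide whether a host's active memory is approaching
# capacity. Host size and per-VM "memory active" peaks are in MB; the
# warn_ratio of 0.85 is an illustrative safety threshold.

def memory_pressure(host_mem_mb, vm_active_peaks_mb, warn_ratio=0.85):
    """Return (utilization, warning) for the sum of peak active memory."""
    total_active = sum(vm_active_peaks_mb)
    utilization = total_active / host_mem_mb
    return utilization, utilization >= warn_ratio

# The 4GB host with two VMs each actively using 1GB at peak:
util, warn = memory_pressure(4096, [1024, 1024])
print(util, warn)  # 0.5 False -- the host never goes above 50% active
```

Note that the 4GB allocations per VM never enter the calculation: only active memory matters for this check.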
The primary memory metrics you will want to look at are: memory active, memory ballooning, and memory swapping. Memory active peaks are a decent way of finding out what a VM actually uses/needs from the host. Memory ballooning lets you check when ESX is stealing memory pages from one VM to give to another. Keep in mind that although you want to be aware of ballooning, the balloon driver does a good job of stealing unused or idle memory pages, so it usually doesn't affect VM performance except in the worst situations - i.e. when a host needs to reclaim more memory from a VM and the VM doesn't have any more idle/unused memory pages to spare.

This "soft state" takes place at 4% free host memory, but ballooning starts before this threshold is reached, in order to prevent the host from crossing the 4% line and heading towards a "hard state". The hard state is reached at 2% free memory, at which point the hypervisor begins to swap. Hypervisor swapping is significantly worse than swapping within the VM/OS, so it's something to avoid.
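The state transitions above can be sketched as a simple classifier. This is only an illustration of the 4%/2% thresholds quoted in the post; ESX's real reclamation logic (mem.minfree and its sliding thresholds) is more involved, and the sample numbers are made up:

```python
# Sketch of the free-memory states described above: "high" (no
# reclamation pressure), "soft" (ballooning, <= 4% free) and "hard"
# (hypervisor swapping, <= 2% free). Illustrative only.

def host_memory_state(free_mb, total_mb):
    free_pct = 100.0 * free_mb / total_mb
    if free_pct <= 2.0:
        return "hard"   # hypervisor swapping kicks in
    if free_pct <= 4.0:
        return "soft"   # ballooning under real pressure
    return "high"       # no reclamation needed

print(host_memory_state(300, 32768))   # ~0.9% free -> "hard"
print(host_memory_state(1000, 32768))  # ~3.1% free -> "soft"
print(host_memory_state(8000, 32768))  # ~24% free  -> "high"
```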
Please forgive my wall of text and let me know if that helps.
VKernel Systems Engineer
Keep in mind, though, that ballooning can severely hurt some workloads, for example JVMs.
I would suggest reading this document too (most of it is the same, though): http://www.vmware.com/files/pdf/techpaper/vsp_41_perf_memory_mgmt.pdf
Good call. As with anything, there are exceptions to the norm. JVMs, for example, have their own best practices: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100848.... Any others you can think of? That's the only one that comes to my mind.
Thanks for the info on JVMs.
But it's not the answer I was looking for.
I have already read the memory management doc. It explains well how it works, but it does not include the info from KB 1021896.
It also does not explain how to follow capacity planning, or how to decide when a new host must be added because resources (memory in particular) are too heavily overcommitted.
For instance, I can set an alarm if CPU usage is above 95% for 5 minutes.
If I understood correctly, could I set the same at 94% for 5 minutes on host memory?
We have hundreds of VMs, and I would like to know which metrics must be considered to automatically send an alarm if ESX begins to swap because of memory overcommitment.
The reason we answered the way we did is that there is no simple answer to when you have overcommitted memory "too much", which means setting an alarm is of little use. It all depends on the virtual machines and what they're running.
But if you don't set any limits, then no virtual machine will start swapping until the host is using at least the amount of RAM specified in the documents provided in this thread. If I'm not mistaken, there is also a default alarm in vCenter for when the host is using 90% or so of its RAM, which means it will send an alarm some time before there is any risk of swapping.
Every environment is different and yours might be more/less dynamic than someone else's. Here's some thoughts that may help you achieve your objective:
The KB article you mentioned affects the memory consumed/usage metrics. In other words, memory consumed/usage is always going to show at or near 100%, making it of little use for capacity monitoring. The metric you want to use is memory active, which represents the memory pages your hosts/VMs are actively using. For the time being, ignore memory overallocation/overcommitment, because this doesn't hurt performance by itself.
To monitor capacity, you want to look at the peak memory active/usage over the past day/week/month to see how high it is reaching. You can do this on the host or cluster level. Not knowing the dynamics/growth of your environment, I would say that if you want to play it safe, make sure this peak usage doesn't go above 80 or 85 percent. This will give you a buffer in case of unexpected VM spikes, but still allow you to have decent ROI on the hardware.
For capacity planning, depending on how long the process of acquiring new hardware is, you may want to be more proactive. For example, once your host memory active peak begins to reach 70 or 75 percent (or sooner), it may be a good trigger to start the purchasing process.
With these thresholds, it doesn't hurt to play a little on the conservative side at first and then adjust them as needed.
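The two thresholds above can be combined into a single check. This is a hedged sketch of the rule of thumb, not a vSphere feature: the function name, the exact 70%/85% values, and the sample figures are all assumptions you would tune to your own environment:

```python
# Sketch of the capacity-planning rule above: compare peak "memory active"
# over your chosen window (day/week/month) against a purchase trigger
# (~70-75%) and a safety ceiling (~80-85%). Values are illustrative.

PURCHASE_TRIGGER = 0.70
SAFETY_CEILING = 0.85

def capacity_advice(peak_active_mb, host_total_mb):
    """Classify a host's peak active-memory ratio against both thresholds."""
    ratio = peak_active_mb / host_total_mb
    if ratio >= SAFETY_CEILING:
        return "over ceiling: move VMs or add a host now"
    if ratio >= PURCHASE_TRIGGER:
        return "start the purchasing process"
    return "ok"

print(capacity_advice(20000, 32768))  # ~61% -> "ok"
print(capacity_advice(24000, 32768))  # ~73% -> "start the purchasing process"
```

The same check applies at the cluster level by summing host totals and peaks.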
Alerting is a little bit trickier, because there's no way to do a memory active alert on hosts and memory usage (consumed) won't get the job done (as mentioned earlier). Your best bet might be to set an alert for "Host Swap Pages Write". This would track any hypervisor swapping, which is pretty nasty for performance and only takes place as a last resort (hard state).
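Since hypervisor swapping only happens as a last resort, any sustained nonzero swap-write rate is a reasonable trigger. A minimal sketch of that logic, assuming made-up sample rates and an arbitrary "3 consecutive samples" persistence rule (tune both to your polling interval):

```python
# Hedged sketch of a "Host Swap Pages Write" style check: fire only when
# the hypervisor swap-out rate stays nonzero for several consecutive
# samples, to avoid alarming on a single transient spike.

def swap_alarm(swap_write_rates, sustain_intervals=3):
    """Fire if the swap-out rate is nonzero for N consecutive samples."""
    streak = 0
    for rate in swap_write_rates:
        streak = streak + 1 if rate > 0 else 0
        if streak >= sustain_intervals:
            return True
    return False

print(swap_alarm([0, 0, 12, 30, 25]))  # True: three nonzero samples in a row
print(swap_alarm([0, 5, 0, 7, 0]))     # False: spikes, but never sustained
```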
Please let us know if this helps or if I'm still missing the mark.
Thanks for the answers. It seems a new alarm could be created to monitor the active memory usage of the VMs.
You said to put a trigger on "Host Swap Pages Write". But at which level? For how long?
As far as I understood, if a VM is sized with too little memory, it may start to swap inside the guest even if the host has enough memory. It's important to detect that too.