We have had many issues with memory ballooning and Linux memory usage. We have machines doing batch processing. They would sit idle, so the balloon would take memory from them. They would then kick off a large batch job, trying to utilize a significant portion of the VM's memory. The VM would hang as Linux tried to reclaim memory, and/or the OOM killer would kick in.
We gave up and set reservations on all our VMs. We also limited how much memory the balloon driver could take to a small number (10%, I think), so that even if we forgot to set a reservation on a VM, we would be OK.
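If you want to set this outside the GUI, I believe the relevant .vmx parameters are sched.mem.min (the reservation) and sched.mem.maxmemctl (the balloon cap), both in MB; the numbers below are just an example for a 2 GB VM, so check them against your ESX version's docs:

sched.mem.min = "2048"
sched.mem.maxmemctl = "205"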
One thing to note is that this does not sit well with HA admission control; we had to shut that off. (Admission control takes the highest reservation and assumes all VMs will need that much memory, so it tries to reserve more than it needs.)
I also recall reading somewhere that you should make sure the swap space in the VM is greater than what the balloon could take away, so that the machine could at least swap while the balloon was inflating. But this was unacceptable for us (Linux swap performance being what it is), so we reserve full memory for all our machines.
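If you want to sanity-check swap against the balloon, here's a rough comparison, assuming the ESX default where the balloon can take up to 65% of guest memory (adjust if you've capped it lower):

# rough check: worst-case balloon vs. configured swap
MEM_MB=$(free -m | awk '/^Mem:/ {print $2}')
SWAP_MB=$(free -m | awk '/^Swap:/ {print $2}')
echo "balloon worst case: $((MEM_MB * 65 / 100)) MB, swap: ${SWAP_MB} MB"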
We are having the same problem and have the same workaround (avoid memory ballooning at all costs). It looks like Red Hat has a couple of options that may help. Bear in mind that the commands need to be added to your rc.local or some other init script to persist across reboots.
The first is from a Red Hat KB article: "Why is the OOM killer killing random processes when there appears to be plenty of memory in my Red Hat Enterprise Linux 4 Update 4 system? The command top shows a lot of memory is being cached and swap is hardly being used." The suggested fix:
echo 100 > /proc/sys/vm/lower_zone_protection
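To persist that across reboots (per the rc.local note above), append the command to your init script:

# make the setting survive a reboot
echo 'echo 100 > /proc/sys/vm/lower_zone_protection' >> /etc/rc.d/rc.local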
The second workaround required a kernel update (2.6.9-67.0.7.ELsmp in our case):
"The out of memory killer is designed to kill processes to recover memory under very severe out of memory conditions. The out of memory killer can now be disabled; as such, when an out of memory condition occurs, a kernel panic will occur. To disable the out of memory killer, use the command:

echo 1 > /proc/sys/vm/panic_on_oom

Note that the out of memory killer is still enabled by default."
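Same persistence caveat applies here, and if you do go the panic route you probably want the VM to reboot itself instead of hanging. A sketch, assuming your kernel exposes the standard kernel.panic sysctl (the 30-second timeout is an arbitrary example):

# disable the OOM killer in favor of a panic
echo 1 > /proc/sys/vm/panic_on_oom
# optional: auto-reboot 30 seconds after a panic instead of hanging
echo 30 > /proc/sys/kernel/panic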
I have shut off the OOM killer on one VM, but have not had memory ballooning since the OOM killer killed Oracle several times in one week.
Has anyone else made either of these changes in an environment where memory ballooning is happening?
That second option doesn't sound too good; cause a kernel panic?
I've seen the same ballooning/oom-kill problem, and found that doing an "echo 0 > /proc/sys/vm/oom-kill" helps.
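For what it's worth, you can check the current value before flipping it; on our RHEL 4 boxes it defaults to 1 (enabled). Assuming your kernel has the same sysctl:

# 1 = OOM killer enabled (default), 0 = disabled
cat /proc/sys/vm/oom-kill
echo 0 > /proc/sys/vm/oom-kill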
Yeah, I agree it isn't the perfect solution, but the premise is that the system is relatively healthy when the OOM killer is invoked. I'm going to shut the OOM killer off in a bunch of our Linux VMs and aggravate the balloon driver some time this week. I'll post what I find.
Didn't know about the second solution; looks like I can shut off the OOM killer without a kernel upgrade (woot).
Reading up on Linux virtual memory models reminds me of just how much I don't know about how these things actually work.
Well, I've shut off the OOM killer in our dev and test environments and reduced resources to force the balloon driver to inflate, without crashing anything (to my knowledge). My guess is that we will continue to reserve memory equal to the total memory allocated to our production Linux VMs for the sake of performance, but it will be nice to be able to remove an ESX server from the cluster for maintenance without having random applications get killed.
My understanding of the way the OOM killer should work when it is shut off via /proc/sys/vm/oom-kill being set to 0 is that when the OOM killer would normally have been invoked, the event is logged but no processes are killed. I have not seen the event logged yet, so I'm not sure we are out of the woods just yet. I am also concerned that we will see performance-related issues.
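If anyone wants to watch for the same thing, this is roughly what I'm checking, assuming OOM events land in /var/log/messages on your distro:

# look for logged OOM events
grep -i 'oom\|out of memory' /var/log/messages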
We ended up doubling memory in all ESX servers to 32 GB and have not seen the OOM killer since. Some apps are more sensitive than others and will not swap when the balloon bumps them; these apps need reservations equal to allocated memory.
I know this is an old post, but I just ran into this issue for the first time after we placed an ESX server into maintenance mode due to a failed hard drive. I found the following KB article that tells you exactly how to address this problem: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1003586 .
I would add that better memory allocation on the VM/applications should also help. In our case the affected server is running a J2EE web server that uses an average of 214 MB of RAM, but the application developers gave it 1 GB of memory that is pinned. Since there is a large amount of granted memory that is not active, it becomes a target for memory ballooning; once the balloon hit ~1 GB, the OOM killer would terminate the J2EE web server.
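In a case like that, sizing the JVM heap closer to actual usage shrinks the granted-but-inactive memory that makes the VM a ballooning target. A sketch with made-up numbers (webserver.jar and the heap sizes are hypothetical; tune to your app's real footprint):

# start the app with a heap near its observed ~214 MB usage instead of 1 GB
java -Xms256m -Xmx384m -jar webserver.jar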