VMware Cloud Community
sharvey2
Contributor

OOM killer shut down my Oracle. Related to memory ballooning

This has to be a bug. I was adding memory to the ESX servers in an 8-blade cluster, using DRS to migrate VMs off each physical box before shutting it down. The cluster is not very stressed according to the performance charts, and it was after hours, so I had two ESX servers down at the same time. Suddenly I got a page: production Oracle running on RHEL 4 was down, killed by the Out Of Memory (OOM) killer. This was the second time this had happened and we had been trying to blame the DBAs, but this one was obviously related to the maintenance work, so we dug in and found that balloon memory had spiked on the Oracle VM. Swap utilization inside the VM had also kicked in, but nothing dramatic, about 30%.

Obviously, DRS had migrated some VMs onto the ESX server that hosted the Oracle VM, but the memory shares for that VM were set to High. It should not have been touched.

Our workaround is to set a memory reservation equal to the memory allocation on the Oracle VM (this is wasteful and cannot become a trend), and to set up a DRS anti-affinity rule to keep the big Oracle VMs from sharing an ESX server.
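For anyone curious, that reservation shows up in the VM's .vmx as sched.mem.min (value in MB, equal to memsize when you reserve the full allocation). A sketch, with sizes made up for illustration rather than copied from our Oracle VM:

memsize = "4096"
sched.mem.min = "4096"
sched.mem.shares = "high"

Normally you would set this through the VI Client rather than editing the .vmx by hand, but it is the same setting underneath.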

But questions remain: why did this happen? Why didn't the shares setting protect the VM? Why didn't guest swap max out before the OOM killer kicked in? It looks like memory ballooning has some serious issues.

Comments? Has this happened to you?

7 Replies
Aladen
Enthusiast

We have had many issues with memory ballooning and Linux memory usage. We have machines that do batch processing. They would sit idle, so the balloon would take memory from them; then they would kick off a large batch job and try to use a significant portion of the VM's memory. The VM would hang while Linux tried to reclaim memory, and/or the OOM killer would kick in.

We gave up and set reservations on all our VMs. We also set the limit on how much memory the balloon driver could take to a small number (10%, I think), so that even if we forgot to set a reservation we would be OK.
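For what it's worth, the per-VM balloon cap is the sched.mem.maxmemctl setting (value in MB). I don't have our exact lines handy, but for a 4 GB VM capped at roughly 10% it would look something like this in the .vmx (numbers illustrative):

sched.mem.maxmemctl = "410"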

One thing to note is that this does not sit well with HA admission control; we had to shut that off. (Admission control takes the highest reservation, assumes all VMs will need that much memory, and so tries to reserve more than it needs.)

I also recall reading somewhere that you should make sure the swap space in the VM is greater than what the balloon could take away, so that the machine can at least swap when the balloon inflates. But that was unacceptable for us (Linux swap performance being what it is), so we reserve full memory for all our machines.
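If you do go the swap route instead, my recollection is that the balloon can take up to about 65% of configured memory by default, so a quick check inside the guest would be something like this (the 65% ceiling is from memory, so verify it against the docs):

MEM_MB=$(free -m | awk '/^Mem:/ {print $2}')
SWAP_MB=$(free -m | awk '/^Swap:/ {print $2}')
BALLOON_MAX=$((MEM_MB * 65 / 100))
echo "RAM: ${MEM_MB} MB  swap: ${SWAP_MB} MB  balloon ceiling: ~${BALLOON_MAX} MB"
[ "$SWAP_MB" -ge "$BALLOON_MAX" ] || echo "swap is smaller than the balloon could get"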

eahatch
Enthusiast

We are having the same problem and have the same workaround (avoid memory ballooning at all costs). It looks like Red Hat has a couple of options that may help. Bear in mind that the commands need to be added to your rc.local or some other init script to persist across a reboot.

Issue: Why is the OOM killer killing random processes when there appears to be plenty of memory in my Red Hat Enterprise Linux 4 Update 4 system?

Resolution:

Release Found: Red Hat Enterprise Linux 4 Update 4

Symptom: The command top shows a lot of memory being cached and swap hardly being used.

bash-0.9# echo 100 > /proc/sys/vm/lower_zone_protection

The second workaround requires a kernel update (2.6.9-67.0.7.ELsmp in our case):

  • The out of memory killer is designed to kill processes to recover memory under very severe out of memory conditions. The out of memory killer can now be disabled; as such, when an out of memory condition occurs, a kernel panic will occur instead. To disable the out of memory killer, use the command:

echo 1 > /proc/sys/vm/panic_on_oom

Note that the out of memory killer is still enabled by default.
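Since both of these live in /proc they reset at boot, so the lines I'm putting in /etc/rc.d/rc.local look roughly like this (keep whichever of the two approaches your kernel supports, and treat this as a sketch rather than our exact script):

# re-apply the Red Hat OOM workarounds at boot
echo 100 > /proc/sys/vm/lower_zone_protection
# on the updated kernel you could instead panic rather than let the OOM killer pick a victim:
# echo 1 > /proc/sys/vm/panic_on_oom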

I have shut off the OOM killer on one VM, but we have not had any memory ballooning since the week the OOM killer killed Oracle several times.

Has anyone else made either of these changes in an environment where memory ballooning is happening?

Thanks.

Alan

dxb
Enthusiast

That second option doesn't sound too good; cause a kernel panic?

I've seen the same ballooning/oom-kill problem, and found that doing an "echo 0 > /proc/sys/vm/oom-kill" helps.
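One caveat: I believe that oom-kill knob is a Red Hat addition rather than a stock kernel feature, so it is worth checking that it exists before scripting it:

# only some RHEL kernels expose this switch
if [ -f /proc/sys/vm/oom-kill ]; then
    echo 0 > /proc/sys/vm/oom-kill    # 0 disables the OOM killer
else
    echo "/proc/sys/vm/oom-kill not present on this kernel"
fi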

eahatch
Enthusiast

Yeah, I agree it isn't the perfect solution, but the premise is that the system is relatively healthy when the OOM killer is invoked. I'm going to shut the OOM killer off in a bunch of our Linux VMs and aggravate the balloon driver some time this week. I'll post what I find.

Didn't know about that second option; looks like I can shut off the OOM killer without a kernel upgrade (woot).

Reading up on Linux virtual memory models reminds me of just how much I don't know about how these things actually work.

eahatch
Enthusiast

Well, I've shut off the OOM killer in our dev and test environments and reduced resources to force the balloon driver, without crashing anything (to my knowledge). My guess is that we will continue to reserve memory equal to the total memory allocated to our production Linux VMs for the sake of performance, but it will be nice to be able to remove an ESX server from the cluster for maintenance without having random applications get killed.
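If anyone wants to repeat the test, this is roughly how I watched the balloon inflate; I'm going from memory, so the exact output and column names may differ by ESX and Tools version:

# inside the guest, with the VMware Tools vmmemctl module loaded:
cat /proc/vmmemctl
# on the ESX host: run esxtop and press 'm' for the memory view;
# MCTLSZ shows the current balloon size for each VM, MCTLTGT the target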

My understanding of the way the OOM killer behaves when it is shut off via /proc/sys/vm/oom-kill being set to 0 is that when it would normally have been invoked, the event is logged but no processes are killed. I have not seen the event logged yet, so I'm not sure we are out of the woods just yet. I am also concerned that we will see performance-related issues.
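For the logging side, I'm just grepping the kernel log for anything OOM-related; the exact message text varies by kernel, so the pattern is a rough net rather than a precise match:

grep -Ei 'out of memory|oom' /var/log/messages | tail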

sharvey2
Contributor

We ended up doubling the memory in all our ESX servers to 32 GB.

We have not seen the OOM killer since. Some apps are more sensitive than others and will not swap when the balloon bumps them; those apps need reservations equal to their allocated memory.

nkrick
Enthusiast

I know this is an old post, but I just ran into this issue for the first time after we placed an ESX server into maintenance mode due to a failed hard drive. I found the following KB article that tells you exactly how to address this problem: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100358... .

I would add that better memory allocation for the VM/application should also help. In our case the affected server runs a J2EE web server that uses an average of 214 MB of RAM, but the application developers gave it 1 GB of memory that is pinned. Since there is a large amount of granted memory that is not active, it becomes a target for memory ballooning; once the balloon hit ~1 GB, the OOM killer would terminate the J2EE web server.
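On the application side, the obvious fix is to let the JVM heap float closer to what the web server actually uses instead of grabbing the full gigabyte up front; a hypothetical example (the flags and sizes here are illustrative, not what our developers actually shipped):

java -Xms256m -Xmx1024m -jar webapp.jar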
