The fine art of allocating memory
Serial Killer
In issue 9, I talked about swap memory and how it should be used as a buffer to protect your servers from running out of memory [1]. Swapping is nearly always detrimental to performance, but its presence provides the system with a last chance for a soft landing before more drastic action is taken. This month, I examine the darker side of the picture: Swap hits 100%, and hard out-of-memory errors are appearing in your logs in the form of killed processes (Figure 1). In most cases, the performance degradation and furious disk thrashing caused by highly active swap areas will alert you well in advance of your logs.
A system that does not have swap space configured can still push memory out to disk – the filesystem cache, shared libraries, and program text are file backed and can be evicted as memory pressure mounts – it just has fewer options for doing so. The Linux kernel's defaults allow memory to be overcommitted at allocation time. Only memory pages in actual use ("dirty") are backed by physical RAM, so the program shown in Listing 1 will have no trouble allocating 3GB of memory on any current machine, almost irrespective of actual system capacity, because the memory is only being allocated, not used. It will instead run into the limits of the process address space, hitting a wall at 3056MB of allocation – the maximum allowed in a single 32-bit process, because the balance of the address space is reserved for the kernel.
Listing 1: Memory Allocation
01 #include <stdio.h>
02 #include <stdlib.h>
03 #include <string.h>
04
05 int main(int argc, char *argv[])
06 {
07   char *newblock = NULL;
08
09   for (int allocation = 0; newblock = (void *) malloc(1024 * 1024); allocation++)
10   {
11     //for (int i=0; i < (1024 * 1024); i+= 4096) newblock[i] = 'Y';
12     printf("Allocated %d MB\n", allocation);
13   }
14 }
Things are more interesting when memory is actually being used, and uncommenting line 11 does just that. The OOM error in Figure 1 was the result of using stress [2] alongside this program to withdraw enough resources from the system and force the OOM Killer to intervene.
Major distribution kernels set the default /proc/sys/vm/overcommit_memory value to 0, tuning the kernel overcommit behavior to use the predefined heuristics [3]. The other options are listed in Table 1 – always overcommitting is a dangerous choice. The limit option is tuned by /proc/sys/vm/overcommit_ratio, which is expressed as a percentage and defaults to 50. If you have as much swap as you do RAM, this setting will effectively turn overcommit off, but an error in configuring this facility could result in RAM going unused. The current overcommit limit and the amount committed are viewable in /proc/meminfo as CommitLimit and Committed_AS. Comparing Committed_AS with the output of free will showcase the difference between what programs allocate and what they actually use.
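If you want to keep an eye on these counters from a program of your own, a few lines of C will do. The following is a minimal sketch that simply echoes the two fields from /proc/meminfo (the field names are those reported by current kernels):

#include <stdio.h>
#include <string.h>

/* Print the overcommit-related counters from /proc/meminfo. */
int main(void)
{
  char line[256];
  FILE *fp = fopen("/proc/meminfo", "r");

  if (fp == NULL) {
    perror("/proc/meminfo");
    return 1;
  }

  while (fgets(line, sizeof(line), fp) != NULL) {
    if (strncmp(line, "CommitLimit:", 12) == 0 ||
        strncmp(line, "Committed_AS:", 13) == 0)
      fputs(line, stdout);
  }

  fclose(fp);
  return 0;
}

Running it side by side with free makes the gap between promised and touched memory easy to see.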
Table 1: /proc/sys/vm/overcommit_memory

Value | Action
0     | Predefined overcommit heuristics (default)
1     | Always overcommit
2     | Overcommit up to the limit set by (swap + RAM × overcommit_ratio/100)
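As a worked example of the limit in the last row (assuming the usual formula of swap plus overcommit_ratio percent of RAM), a hypothetical machine with 2GB of RAM, 2GB of swap, and the default ratio of 50 ends up with a 3GB commit limit:

#include <stdio.h>

/* Commit limit under overcommit_memory=2:
 * limit = swap + RAM * overcommit_ratio / 100
 * The figures below are hypothetical. */
int main(void)
{
  long ram_mb = 2048;           /* 2GB of RAM */
  long swap_mb = 2048;          /* 2GB of swap */
  int overcommit_ratio = 50;    /* kernel default */

  long limit_mb = swap_mb + ram_mb * overcommit_ratio / 100;
  printf("CommitLimit: %ld MB\n", limit_mb);  /* prints 3072 MB */
  return 0;
}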
This controversial design choice distinguished Linux from other kernels, such as Solaris, more than a decade ago. Although it leads to better resource utilization, it has consequences of the same kind as overselling the seats on an airplane does.
The most common trigger for the OOM Killer is a situation in which the overcommit policy approves a memory request, but no RAM or swap is available when a program starts to make use of its pre-existing allocation. One or more processes must be killed to make room, and a "badness" score is maintained for each process in /proc/pid/oom_score for this purpose. The process with the highest score is killed with signal 9, but a process can be protected by setting /proc/pid/oom_adj to -17 (OOM_DISABLE), by manipulating /proc/pid/oom_score_adj, or by configuring the Control Groups memory resource controller [4], depending on the kernel version. The badness score calculation for the popular 2.6.32 kernel [5] follows this logic: Any process caught in the swapoff() system call is killed first, followed by a score evaluation that baselines on memory size (total_vm). This initial score is adjusted to account for half the memory used by any child process, and it doubles if the process has been niced. Long-running processes are somewhat protected, as are superuser processes. Finally, the kernel makes an effort to avoid crossing cpuset boundaries or killing processes with direct hardware access. After an adjustment reflecting oom_adj, badness scores are ready for their grim use.
Newer kernels, like the 3.2 kernel in Ubuntu 12.04, push more of this logic to userspace via the /proc/pid/oom_score_adj tunable (ranging from -1000 to +1000), effectively providing a mechanism for pre-designating victims as well as for protecting key processes [6]. Let me stress once again that you should look at your specific kernel version; the LXR project (see references) gives you a convenient way to compare differences.
In closing, remember that ulimit -v can be used to limit the maximum size of a process's virtual memory on many *NIX systems, and the relatively new Linux cgroups facility might also help if shell users spawning processes are a concern.
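The same cap that ulimit -v imposes from the shell can also be applied programmatically before doing anything risky; a minimal sketch using setrlimit(), the system call behind that shell builtin:

#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

/* Cap this process's virtual address space at 512MB, the equivalent of
 * "ulimit -v 524288" in the shell; allocations beyond the limit will
 * fail outright instead of being overcommitted. */
int main(void)
{
  struct rlimit rl = { 512UL * 1024 * 1024, 512UL * 1024 * 1024 };

  if (setrlimit(RLIMIT_AS, &rl) != 0) {
    perror("setrlimit");
    return EXIT_FAILURE;
  }

  /* A 1GB request now fails immediately. */
  if (malloc(1024UL * 1024 * 1024) == NULL)
    printf("large allocation refused, as expected\n");

  return EXIT_SUCCESS;
}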