NUTS AND BOLTS Cgroups

Cgroups for resource management in Linux

Take Control

The new cgroups feature gives administrators a way to restrict resource use, which is particularly interesting for virtualized systems. By Ralf Spenneberg

A couple of years ago, I was teaching Linux at a large-scale IT service provider's office. Their administrators had plenty of experience with commercial Unix variants, such as HP-UX, and they asked me how they could implement resource management and controls on Linux. How could an administrator restrict the amount of RAM used by a single process or a group of processes?

At the time, I had to admit that Linux didn't offer this feature. But in 2006, Rohit Seth started to develop this functionality, and as of kernel 2.6.24, administrators can use it. Originally referred to as "process containers," control groups (cgroups for short) [1] can restrict, account for (e.g., for billing purposes), and isolate the use of resources (e.g., RAM, CPU, I/O).

Although many administrators will not need this functionality on a standard server, it is very interesting in environments with KVM-based virtualization. Cgroups let you restrict the resources used by a virtual guest or prioritize its resource use relative to other guests.

A cgroup lets the administrator group multiple processes and then set parameters of specific subsystems for these processes and all of their child processes. A subsystem is typically a resource controller, such as the memory controller, which limits the amount of RAM available to the group.

To use cgroups, you first need to define a hierarchy in which the groups will be managed. To do so, you edit the /etc/cgconfig.conf file, which you can see in Listing 1.

Listing 1: /etc/cgconfig.conf

mount {
        cpuset  = /cgroup/cpuset;
        cpu     = /cgroup/cpu;
        cpuacct = /cgroup/cpuacct;
        memory  = /cgroup/memory;
        devices = /cgroup/devices;
        freezer = /cgroup/freezer;
        net_cls = /cgroup/net_cls;
        ns      = /cgroup/ns;
        blkio   = /cgroup/blkio;
}

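If this file doesn't exist, you will need to install the package that provides the cgroup tools (libcgroup on Red Hat-based distributions). The file creates a separate hierarchy for each subsystem, below which you can then define your cgroups. The /cgroup/cpu hierarchy lets you manage CPU shares, whereas /cgroup/net_cls tags network packets with a class identifier that the traffic control (tc) tools can use to manage network bandwidth.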
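To make the mount section of Listing 1 less opaque: it roughly amounts to mounting one cgroup filesystem per controller by hand. The following sketch only prints the corresponding mkdir and mount commands, so you can review them before piping them to sh as root. The function name print_cgroup_mounts is hypothetical, and the exact steps the cgconfig service performs may differ.

```shell
# Hypothetical helper: print the commands that would create one cgroup
# hierarchy per controller below a root directory (e.g., /cgroup).
# It only prints; nothing is mounted until you pipe the output to sh as root.
print_cgroup_mounts() {
    root=$1
    shift
    for ctrl in "$@"; do
        # Each controller gets its own mount point and thus its own hierarchy.
        printf 'mkdir -p %s/%s && mount -t cgroup -o %s %s %s/%s\n' \
            "$root" "$ctrl" "$ctrl" "$ctrl" "$root" "$ctrl"
    done
}

# Example: print_cgroup_mounts /cgroup cpu memory blkio
```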

Starting the cgconfig service creates the directories and mounts the cgroup filesystems. The lssubsys command lets you verify that the hierarchies have been created correctly (Listing 2).

Listing 2: lssubsys

# lssubsys -am
cpuset /cgroup/cpuset
cpu /cgroup/cpu
cpuacct /cgroup/cpuacct
memory /cgroup/memory
devices /cgroup/devices
freezer /cgroup/freezer
net_cls /cgroup/net_cls
ns /cgroup/ns
blkio /cgroup/blkio
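If lssubsys is not at hand, a rough cross-check is to look for mounts of type cgroup in /proc/mounts, where the mount options list the attached controllers. The function cgroup_mount_of below is a hypothetical sketch that reads mount-table lines on standard input and prints the mount point of the hierarchy carrying a given controller.

```shell
# Hypothetical helper: print the mount point of the cgroup (v1) hierarchy
# that carries the given controller, reading mount-table lines from stdin.
# Usage: cgroup_mount_of memory < /proc/mounts
cgroup_mount_of() {
    awk -v c="$1" '
        $3 == "cgroup" {
            # Field 4 holds the comma-separated mount options, which
            # include the controller names (e.g., "rw,relatime,blkio").
            n = split($4, opt, ",")
            for (i = 1; i <= n; i++)
                if (opt[i] == c) { print $2; exit }
        }'
}
```

An empty result means no hierarchy with that controller is mounted.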

You can then create your control groups by issuing the cgcreate command:

cgcreate -g blkio:/dd

The command in Listing 3 tells you which parameters are available for the Block I/O subsystem.

Listing 3: Block I/O Subsystem

# cgget -g blkio /dd
/dd:
blkio.reset_stats=
blkio.io_queued=Total 0
blkio.io_merged=Total 0
blkio.io_wait_time=Total 0
blkio.io_service_time=Total 0
blkio.io_serviced=Total 0
blkio.io_service_bytes=Total 0
...
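Because each parameter is simply a file in the cgroup's directory, a similar overview can be produced with a small loop. The sketch below (the function show_params is hypothetical) prints name=value pairs for all files with a given prefix, keeping only the first line of multi-line values; it is meant as an illustration, not a replacement for cgget.

```shell
# Hypothetical helper: print every <prefix>.* parameter of a cgroup directory
# as name=value, using only the first line of multi-line values.
# Usage: show_params /cgroup/blkio/dd blkio
show_params() {
    dir=$1
    prefix=$2
    for f in "$dir"/"$prefix".*; do
        [ -f "$f" ] || continue   # the glob matched nothing
        # Write-only files such as blkio.reset_stats yield an empty value.
        printf '%s=%s\n' "${f##*/}" "$(head -n 1 "$f" 2>/dev/null)"
    done
}
```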

As of kernel 2.6.37, this subsystem also supports the blkio.throttle.* parameters, which let you restrict the maximum read and write bandwidth of a process group.

To test this, you need the major and minor numbers of the device whose bandwidth you want to restrict. If this is /dev/sda1, you can determine them with a simple ls:

# ls -l /dev/sda1
brw-rw----. 1 root disk 8, 1 10. Oct 08:32 /dev/sda1

Here, you can see the device major and minor numbers 8 and 1, respectively.
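The throttle files expect these numbers in decimal major:minor form. As a small sketch (the function dev_numbers is hypothetical), GNU stat can report them directly: its %t and %T format sequences print the major and minor numbers in hexadecimal, which printf converts to decimal.

```shell
# Hypothetical helper: print "major:minor" in decimal for a device node,
# the form expected by the blkio.throttle.* files.
# GNU stat's %t and %T report the major and minor numbers in hexadecimal.
dev_numbers() {
    printf '%d:%d\n' "0x$(stat -c %t "$1")" "0x$(stat -c %T "$1")"
}

# Example: dev_numbers /dev/sda1   # prints 8:1 on the system shown above
```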

To restrict the write bandwidth of the control group to 1MB/s (1,048,576 bytes per second), you can run cgset or simply use the echo command:

echo "8:1  1048576" > /cgroup/blkio/dd/blkio.throttle.write_bps_device
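Wrapped in a function, the same write becomes reusable for other groups and devices. The sketch below uses a hypothetical name, throttle_write, and takes the cgroup directory, the major:minor pair, and the rate in bytes per second; note that 1MB/s here means 1024 × 1024 bytes. With libcgroup, cgset -r blkio.throttle.write_bps_device="8:1 1048576" dd should have the same effect.

```shell
# Hypothetical helper: limit the write bandwidth of one cgroup on one device.
# Arguments: cgroup directory, "major:minor", bytes per second.
# Usage: throttle_write /cgroup/blkio/dd 8:1 1048576
throttle_write() {
    cg=$1 dev=$2 bps=$3
    # One "major:minor bytes_per_second" rule per write.
    echo "$dev $bps" > "$cg/blkio.throttle.write_bps_device"
}
```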

Now, you can launch dd for a test.

dd if=/dev/zero of=/tmp/test & pid=$!

Initially, the dd process runs in the root cgroup, which has no restrictions. You can check its throughput by sending it a SIGUSR1, which makes dd print its current I/O statistics:

# kill -USR1 $pid
578804+0 records in
578804+0 records out
296347648 bytes (296 MB) copied, 7.00803 s, 42.3 MB/s

To move the process to the dd cgroup, you could use the echo command:

# echo $pid > /cgroup/blkio/dd/tasks

Now, when you send dd another USR1 signal, you will see that the throughput drops dramatically, because the process is no longer allowed to write more than 1MB/s.

Instead of capping the maximum bandwidth, you can also prioritize bandwidth between groups with the blkio.weight parameter. The default value is 500, so if you give a group a value of 1000, its processes receive roughly twice as much disk time as those in groups with the default weight when both compete for the same device.
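The arithmetic behind "twice as often": a group's share is its weight divided by the sum of the weights competing for the device. The hypothetical function set_weight below writes the weight and prints the expected integer percentage against one competing group. It assumes both groups keep the device busy and that the proportional policy is active, which on kernels of this generation typically means the CFQ I/O scheduler.

```shell
# Hypothetical helper: set a cgroup's blkio weight and print the percentage
# of disk time it can expect against one competing group, assuming both
# groups keep the device busy and proportional weighting is in effect.
# Usage: set_weight /cgroup/blkio/dd 1000 500
set_weight() {
    cg=$1 mine=$2 other=$3
    echo "$mine" > "$cg/blkio.weight"
    # Share = own weight / sum of competing weights (integer percent).
    echo $(( 100 * mine / (mine + other) ))
}
```

With 1000 against the default of 500, the function prints 66: about two thirds of the disk time versus one third for the other group.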

Instead of using the echo command, you can also assign running processes to groups with the cgclassify command (for example, cgclassify -g blkio:/dd $pid).

If you want to launch a process directly in a specific group, use the cgexec command, passing the command and its arguments as separate words rather than as one quoted string:

cgexec -g blkio:dd dd if=/dev/zero of=/tmp/test
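The idea behind launching a process directly in a group can be sketched in plain shell: a new shell first writes its own PID to the group's tasks file and then replaces itself with the real command via exec, so the command keeps that PID and starts inside the group. The function run_in_cgroup is hypothetical and only mimics the principle; cgexec itself does more (for example, handling several controllers at once).

```shell
# Hypothetical helper mimicking the principle of cgexec: put a new process
# into a cgroup first, then replace it with the real command, so the command
# and everything it spawns start inside the group.
# Usage: run_in_cgroup /cgroup/blkio/dd dd if=/dev/zero of=/tmp/test
run_in_cgroup() {
    cg=$1
    shift
    # Inside sh -c, $$ is the PID of that new shell; exec keeps the same PID.
    sh -c 'echo $$ >> "$1/tasks" || exit 1; shift; exec "$@"' sh "$cg" "$@"
}
```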

Automatic

Assigning processes to groups manually can be tiresome and error-prone. It makes far more sense to let the cgrulesengd daemon handle these assignments automatically. For this to work, the service needs the /etc/cgrules.conf file, which specifies which processes of which users belong in which control group. The file has a fairly simple syntax:

<user>[:<process>] <controllers> <destination>

Using the example with the dd command, the rule would look like this:

*:dd blkio /dd

This rule places the dd processes of all users in the /dd control group of the blkio controller.
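To illustrate how such rules are evaluated, here is a deliberately simplified matcher (the function match_rule is hypothetical). Given a rules file, a user name, and a process name, it prints the controllers and destination of the first matching rule. The real cgrulesengd understands more, such as @group entries and continuation lines, so treat this purely as a model of the basic <user>[:<process>] lookup.

```shell
# Hypothetical, simplified model of a cgrules.conf lookup: print
# "<controllers> <destination>" of the first rule matching USER and PROCESS.
# Usage: match_rule /etc/cgrules.conf alice dd
match_rule() {
    awk -v u="$2" -v p="$3" '
        /^[[:space:]]*#/ || NF == 0 { next }   # skip comments and blank lines
        {
            # Field 1 is <user> or <user>:<process>; "*" matches any user.
            split($1, a, ":")
            ru = a[1]
            rp = (2 in a) ? a[2] : ""
            if ((ru == "*" || ru == u) && (rp == "" || rp == p)) {
                print $2, $3
                exit
            }
        }' "$1"
}
```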

Hierarchies

Thus far, I have only looked at individual, isolated control groups; however, you can create hierarchies of groups to add more structure.

Specifically, you can create additional cgroups within a control group, for example with cgcreate -g blkio:/dd/user1.

The new cgroups appear as subdirectories and inherit the properties of the parent control group. All child cgroups then compete for the resources assigned to the parent cgroup.

If the parent cgroup is only allowed to write at 1MB/s, all of the child groups together cannot exceed this maximum.

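That is the principle of hierarchical resource assignment; however, as of this writing, the blkio controller does not yet support such hierarchies, so the blkio example above describes the intended behavior rather than the current one. The other controllers, such as cpu and memory, already support hierarchies.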

Virtualization

Where does it make sense to deploy cgroups? Some specialized applications benefit from cgroups in day-to-day operation, but in most cases it makes more sense to let the Linux kernel manage resources on its own rather than imposing limits. If you run a virtualization solution like KVM, however, with multiple guests on a single host, it can be very useful to restrict, prioritize, and measure each guest's resource use. Cgroups are an ideal tool for this.

To do this, you need to manage your guests through Libvirt, using either LXC containers or QEMU/KVM. When a guest starts, the libvirtd daemon creates a separate cgroup named after the guest. The group is located in the libvirt/qemu/<guest name> (or libvirt/lxc/<guest name>) hierarchy of each controller, so you can manage and prioritize the resources of each guest individually.

To allow a guest to use twice as much CPU time as a second guest when both compete for the CPU, modify the cpu.shares parameter of the cpu controller: raise it from its default value of 1024 to 2048. You can take a similar approach to RAM and network bandwidth, using the memory controller or the net_cls controller in combination with the tc command.
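As a sketch of this step, the hypothetical function set_guest_shares below builds the guest's path from the libvirt layout described above and writes the new cpu.shares value. It assumes the cpu hierarchy is mounted at the directory you pass in (/cgroup/cpu with Listing 1) and refuses to create a cgroup that libvirtd has not set up.

```shell
# Hypothetical helper: set cpu.shares for a libvirt-managed guest.
# Assumes the cpu hierarchy is mounted at $1 and libvirt's layout
# libvirt/<driver>/<guest>, where <driver> is qemu or lxc.
# Usage: set_guest_shares /cgroup/cpu qemu guest1 2048
set_guest_shares() {
    dir="$1/libvirt/$2/$3"
    # Only modify groups that libvirtd has already created.
    [ -d "$dir" ] || { echo "no cgroup for guest $3 at $dir" >&2; return 1; }
    echo "$4" > "$dir/cpu.shares"
}
```

With 2048 for one busy guest and the default 1024 for another, the first receives about two thirds of the CPU time; when the CPU is otherwise idle, shares impose no limit.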

Note that only recent Libvirt versions support the net_cls controller. This controller differs from all the others in that it only sets a classID; the administrator then manages the actual bandwidth with the tc command (see the "Bandwidth Management" box).

Bandwidth Management

The net_cls controller lets you assign a classID to all processes in a cgroup. Their packets are tagged with this ID, which tc can then use to shape the group's traffic. Start by defining the classID for the cgroup:

echo 0x00100001 > /cgroup/net_cls/libvirt/qemu/guest/net_cls.classid

This hexadecimal number consists of two parts, 0xAAAABBBB: the AAAA digits are the major number of the classID, and the BBBB digits are the minor number. Leading zeros may be omitted (0x100001 is equivalent), but the minor number always occupies the last four hex digits. To use the classID, you now need to install a class-based queueing discipline (qdisc) on the outbound network interface (e.g., eth0). The qdisc decides when to send a packet; class-based qdiscs let you sort packets into different classes and then prioritize and limit these classes. A classic qdisc for restricting network traffic is the Hierarchical Token Bucket (HTB). To set it up, delete any existing root qdisc and attach HTB:

tc qdisc del dev eth0 root 2>/dev/null
tc qdisc add dev eth0 root handle 10: htb default 2

The next step is to create the classes.

tc class add dev eth0 parent 10: classid 10:1 htb rate 10mbit
tc class add dev eth0 parent 10: classid 10:2 htb rate 20mbit ceil 100mbit

These two lines create two classes. The first is limited to 10Mbps. The second is guaranteed 20Mbps and may borrow unused bandwidth up to a ceiling of 100Mbps if no other class needs it. The default 2 option given when creating the HTB qdisc sends all unclassified traffic to the second class, 10:2.

For the kernel to evaluate the net_cls classID, you need to add a cgroup filter:

tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup

From now on, the net_cls classID is automatically used by the kernel to allocate packets to HTB classes. The Libvirt guest can now use a maximum transmission speed of 10Mbps.
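The link between the tc handle 10:1 and the value 0x00100001 written earlier is plain bit packing: tc writes handles in hexadecimal, and the classID places the major number in the upper 16 bits and the minor number in the lower 16 bits. The two hypothetical helpers below convert in both directions; the reverse direction is handy because, according to the kernel documentation, reading net_cls.classid returns the value in decimal.

```shell
# Hypothetical helpers converting between tc class handles and net_cls
# classIDs. tc handles are hexadecimal: "10:1" means major 0x10, minor 0x1.

# Build the classID to write into net_cls.classid (format 0xAAAABBBB).
# Usage: classid_hex 10 1   -> 0x00100001
classid_hex() {
    printf '0x%04x%04x\n' "0x$1" "0x$2"
}

# Turn a decimal classID (as read back from net_cls.classid) into a handle.
# Usage: classid_handle 1048577   -> 10:1
classid_handle() {
    printf '%x:%x\n' $(( $1 >> 16 )) $(( $1 & 0xffff ))
}
```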

You can't use the blkio controller with Libvirt at this time because blkio doesn't yet support the hierarchies that libvirtd creates. The kernel developers are already working on a solution [2].

If you want to bill for the CPU time used by individual virtual guests, you can use the cpuacct controller. It records the CPU time each guest has actually consumed, in nanoseconds, in /cgroup/cpuacct/libvirt/qemu/guest/cpuacct.usage.
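Because the counter is in nanoseconds, a billing script needs to scale it. The hypothetical function guest_cpu_seconds below reads a guest's cpuacct.usage, assuming the hierarchy root and libvirt layout described above, and prints the value in seconds.

```shell
# Hypothetical helper: report a guest's accumulated CPU time in seconds.
# cpuacct.usage holds nanoseconds, so divide by 10^9.
# Usage: guest_cpu_seconds /cgroup/cpuacct qemu guest1
guest_cpu_seconds() {
    awk '{ printf "%.3f\n", $1 / 1e9 }' "$1/libvirt/$2/$3/cpuacct.usage"
}
```

For billing a period, read the value at its start and end; the difference is the CPU time the guest consumed in between.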

Threads

Cgroups currently operate at the level of threads: each thread of a process can be placed in a separate cgroup. Keep this in mind when you use echo to assign a process that is already running. Writing its PID to the tasks file moves only that one thread, so you need to assign every running thread listed under /proc/pid/task/ to the desired cgroup.

The cgexec command simplifies this task: it launches the process in the cgroup, and all of its child processes and threads inherit the group membership.
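For a process that is already running, the thread-by-thread assignment can be sketched as a loop over /proc/<pid>/task. The function move_all_threads is hypothetical; threads created while the loop runs may be missed, so it is a best-effort illustration rather than an atomic move.

```shell
# Hypothetical helper: move every thread of a running process into a cgroup.
# Writing an ID to "tasks" moves only that one thread, so walk /proc/<pid>/task.
# Usage: move_all_threads /cgroup/blkio/dd "$pid"
move_all_threads() {
    cg=$1 pid=$2
    for t in /proc/"$pid"/task/*; do
        [ -d "$t" ] || continue   # process vanished or the glob matched nothing
        # The directory name is the thread ID.
        echo "${t##*/}" >> "$cg/tasks" || return 1
    done
}
```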

Conclusions

Unfortunately, only the very latest distributions support cgroups, and some functions are available only in the latest Linux kernels. Administrators therefore need to check which features their systems support. Once they have done so, cgroups offer very powerful functionality, especially in virtualization environments, for controlling processes and managing guest resources.