Nuts and Bolts /proc Filesystem

Kernel and driver development for the Linux kernel

Core Technology

The /proc filesystem facilitates the exchange of current data between the system and user. To access the data, you simply read and write to a file. This mechanism is the first step for understanding kernel programming.

By Jürgen Quade, Eva-Katharina Kunst

In keeping with the central Unix philosophy that everything is a file, Linux systems publish system information through the virtual filesystems /proc and /sys (see the "Proc Filesystem" box). This brilliant mechanism gives the user read and write access to system internals with the read() and write() functions.

Proc Filesystem

The /proc virtual filesystem folders and files are not stored on a hard disk; rather, the kernel creates them dynamically on access. The term "proc" derives from "processes"; thus, it is clear that the /proc filesystem primarily provides information about computing processes (Figure 1).

Figure 1: Directories and files in the /proc directory.

For each computational process, the kernel creates a new directory with the process identification number as its name. Below the directory, you find extensive information about the job, including the call parameters (cmdline), shared file descriptors (fd) and environment variables (environ), and process statistics.

In addition to information about computing processes, the kernel uses the virtual filesystem to share information with userspace and to receive configuration settings. All data for the CPU is in /proc/cpuinfo, interrupt sources and the frequency of their occurrence are under /proc/interrupts, and information on active device drivers with their device numbers is under /proc/devices. Writing 1 to /proc/sys/net/ipv4/ip_forward enables routing in the Linux kernel; writing to /proc/sys/kernel/watchdog configures the watchdog feature. Navigating the /proc filesystem is both useful and instructive. The name of a proc file usually reveals its purpose. Hardware-related information is located, with the exception of CPU information, in the /sys filesystem. This is useful for driver programmers as a platform for exchanging information.

The shell, in turn, maps these access functions to the commands cat (read) and echo (write). To read the number of interrupts that have been triggered since the last boot, then, you can simply enter the cat /proc/interrupts command in a terminal. To pass network traffic through, on the other hand, the superuser only needs to write 1 to the /proc/sys/net/ipv4/ip_forward file:

echo 1 | sudo tee /proc/sys/net/ipv4/ip_forward

Such functions are very easy to use in scripts.

The /proc filesystem is also useful for your first steps in kernel programming: With fewer than 50 lines of code, you can make the compiler generate a module that outputs the famous "Hello World" string when accessing a proc file (Listing 1).

Listing 1: Simple Proc File

01 #include <linux/module.h>
02 #include <linux/proc_fs.h>
03 #include <linux/seq_file.h>
04
05 #define PROC_FILE_NAME  "Hello_World"
06 static struct proc_dir_entry *proc_file;
07 static char *output_string;
08
09 static int prochello_show( struct seq_file *m, void *v )
10 {
11     int error = 0;
12
13     error = seq_printf( m, "%s\n", output_string);
14     return error;
15 }
16
17 static int prochello_open(struct inode *inode, struct file *file)
18 {
19     return single_open(file, prochello_show, NULL);
20 }
21
22 static const struct file_operations prochello_fops = {
23     .owner  = THIS_MODULE,
24     .open   = prochello_open,
25     .release= single_release,
26     .read   = seq_read,
27 };
28
29 static int __init prochello_init(void)
30 {
31     output_string = "Hello World";
32     proc_file= proc_create_data( PROC_FILE_NAME, S_IRUGO, NULL, &prochello_fops, NULL);
33     if (!proc_file)
34         return -ENOMEM;
35     return 0;
36 }
37
38 static void __exit prochello_exit(void)
39 {
40     if( proc_file )
41         remove_proc_entry( PROC_FILE_NAME, NULL );
42 }
43
44 module_init( prochello_init );
45 module_exit( prochello_exit );
46 MODULE_LICENSE("GPL");

Modules are the familiar way of extending the Linux kernel and are always constructed in a similar way (Figure 2). Apart from the functions that manage the module itself, a module needs two things: routines that implement its functionality and a data structure that holds the addresses of the main routines.

Figure 2: Kernel modules are divided into three categories.

The two management functions are <xxx>_init() and <xxx>_exit(), where <xxx> is the module name or an arbitrary name. The <xxx>_init() function runs when the insmod command loads the module into the kernel, and <xxx>_exit() runs when the module is unloaded using rmmod.

The main task of the <xxx>_init() function is to pass the data structure containing the addresses of the relevant module routines to a kernel subsystem. In other words, the module registers itself with the desired subsystem(s). In this way, the kernel knows the relevant module routines and can activate them. Similarly, <xxx>_exit() unregisters the module again.

In Listing 1, the three areas of a typical Linux module are clearly visible: the administrative functions prochello_init() and prochello_exit() (lines 29 and 38), the module routines prochello_show() and prochello_open() (lines 9 and 17), and the data structure used for registering with the kernel, prochello_fops (line 22).

Transferring Access Rights

Within the prochello_init() function, the module registers with the Linux kernel's proc subsystem by calling the proc_create_data() function. It passes in the name of the proc file and the permissions for this file. Because the access rights are encoded as bit patterns, you use defines here (S_IRUGO). In the example, the owner (user), the group, and all others are given read permissions.

The third parameter specifies where in the directory tree of the /proc filesystem the file is to be generated. NULL creates the file below /proc. Finally, the fourth parameter transfers the data structure that contains the access functions to the new proc file. A total of three functions are needed for read access: open(), read(), and release().

However, you only need to implement the open routine yourself (prochello_open()). If you want, you can also implement the other routines (read() and release()), but this is complicated and only useful in exceptional cases. The kernel developers have built a fault-resistant intermediate layer for this, which reduces the complexity of the data exchange to something like printf. The basic idea of the intermediate layer is that the module programmer writes all the data to a sufficiently large memory area (Figure 3).

Figure 3: The show function uses seq_printf() to write the data to RAM; the single-file subsystem copies the relevant data to the user.

Further processing of the data – in particular, transferring data to the proc file users – is handled by the intermediate layer. This transfer is more complicated than it appears at first glance because the user might not read all the data, just a subset. The user might retrieve the data via multiple read calls.

The intermediate layer is called the "single-file subsystem" and is a special form of sequence file. Above all, it implements the read and release functions (the latter corresponds to the close() system call), so they can be used unchanged when accessing the proc file. To create the single-file instance for the module, programmers call the single_open() function and pass it the address of a routine, typically called show(). This show routine receives the address of a memory area, to which it writes the data to be output with seq_printf(); seq_printf() can be called a more or less arbitrary number of times within the show function.

Limited Single Files

Single-file instances are not intended for writing large amounts of data. Instead, the space is limited to 64KB, but that's enough for most tasks. In addition, seq_printf() checks for a buffer overflow on every call. If one occurs, the subsystem automatically enlarges the memory buffer (e.g., by creating a new one and copying) and then writes the data. If the buffer cannot be enlarged, seq_printf() returns a negative error code. Careful programmers evaluate the return value, of course. The single_open() function, which receives the address of the show function, is called in the open function for the proc file (Listing 1, line 19).

Figure 4 shows how you can compile the source code in Listing 1 using the Makefile (Listing 2) followed by insmod, which loads the generated module into the kernel, and finally cat, which tests the results. If you don't want to create the proc file directly below /proc but in a subfolder, you can use proc_mkdir to create a directory. The function returns a pointer to a data structure of the type proc_dir_entry, which represents the newly created directory.

Figure 4: Generating and using the kernel module.

Listing 2: Make

ifneq ($(KERNELRELEASE),)
obj-m    := prochello.o

else
KDIR    := /lib/modules/$(shell uname -r)/build
PWD     := $(shell pwd)

default:
	$(MAKE) -C $(KDIR) M=$(PWD) modules
endif

The proc_mkdir() function expects two parameters. The first contains the directory name as a string, and the second is an identifier that says where to create the new directory. If the second parameter is equal to NULL, the new folder is created below the /proc directory.

A cardinal error is forgetting to delete the proc directories and files you created when you no longer need them or when you unload the module. The remove_proc_entry() function takes care of the deletion. In addition to the name of the file to remove, it takes an identifier for the directory that contains the proc file.

Hands On

The single-file subsystem does not support transferring configuration data to the kernel by writing to a proc file. This is where the module programmer needs to lend a hand. To start, you need a memory buffer that holds the data written to the kernel until it is evaluated.

If enough memory is available, you copy the data to be written from userspace to the memory buffer. You can then analyze the data and perform the action requested by the user. It is important not to transfer more data than you can buffer between userspace and kernel-space. The min() function ensures that this does not happen.

Additionally, module programmers must constantly be prepared to deal with users trying to overwrite kernel memory by providing incorrect address information.

The transfer function copy_from_user() also watches out for this and only copies acceptable memory addresses.

Listing 3 shows an extension of the proc file example that includes a write function. If you write the keyword deutsch to the proc file, the string returned on read access switches to the German version: Hallo Welt. If you write english to the proc file, the original Hello World appears again the next time you read. In addition to entering the code from Listing 3, the original code requires a few further changes.

Listing 3: Extended Proc File

static ssize_t prochello_write( struct file *instanz, const char __user *buffer,
    size_t max_bytes_to_write, loff_t *offset )
{
    size_t to_copy;
    unsigned long not_copied;

    /* Never accept more than the buffer holds; keep one byte for '\0'. */
    to_copy = min( max_bytes_to_write, sizeof(kernel_buffer) - 1 );

    /* copy_from_user() returns the number of bytes it could not copy. */
    not_copied = copy_from_user(kernel_buffer, buffer, to_copy);
    if (not_copied)
        return -EFAULT;
    kernel_buffer[to_copy] = '\0';  /* terminate before treating it as a string */

    printk(KERN_INFO "kernel_buffer: \"%s\"\n", kernel_buffer);
    if (strncmp( "deutsch", kernel_buffer, 7)==0) {
        output_string = TEXT_GERMAN;
    }
    if (strncmp( "english", kernel_buffer, 7)==0) {
        output_string = TEXT_ENGLISH;
    }
    return to_copy;
}

First, you need to add the write function to the data structure. To do this, you add the following line to the struct file_operations prochello_fops structure:

.write = prochello_write,

To allow write access, you then need to modify the access privileges on calling proc_create_data: Instead of S_IRUGO, you need S_IRUGO | S_IWUGO.

Finally, four lines need to be added to the program header:

#include <linux/uaccess.h>
static char kernel_buffer[256];
#define TEXT_GERMAN  "Hallo Welt"
#define TEXT_ENGLISH "Hello World"

After making these changes and compiling, then unloading the old module version and reloading the driver, write access should be possible.

The code for implementing proc files is a good template for your own development. Essentially, you need to change the name of the proc file and the show function.

If you have extensive output that changes frequently, you will probably want to look to sequence files instead. Unfortunately, earlier writings on this topic are not really that useful as a guide (see the box "Changes in Kernel 3.10").

Changes in Kernel 3.10

Kernel 3.10 no longer supports the create_proc_entry() function:

proc_file = create_proc_entry("example_file", S_IRUGO, proc_dir );
if (proc_file) {
    proc_file->read_proc = proc_read;
    proc_file->data = NULL;
}

Instead, programmers use the function presented in this article, proc_create_data(), which takes different parameters. Whereas individual elements of the proc_dir_entry data structure had to be initialized in earlier versions of the kernel, developers working with a more recent kernel version define a structure that is already familiar from driver development, struct file_operations, and assign the access methods (open(), read(), release()):

static struct file_operations example_proc_fops = {
    .owner = THIS_MODULE,
    .open = example_proc_open,
    .read = example_proc_read,
    .release = example_proc_release,
};
[...]
static int __init example_proc_init(void)
{
    proc_file = proc_create_data( "example_file", S_IRUGO, proc_dir,
        &example_proc_fops, NULL );
[...]

The read() and write() access methods differ compared with previous versions. The former peof parameter, which used to indicate that the system had read all the data, no longer exists. In contrast, the current version writes data to the memory address passed in as a parameter and then returns the number of characters written (if the single-file subsystem is not used, unlike the description in this article).