Tools SystemTap Lead image: © boing, Photocase.com

Tracing applications with OProfile and SystemTap

Data on Tap

Does your application data take ages to creep off your disk or your network card, even if no noticeable activity is taking place? Tools such as OProfile and SystemTap help you find out why. By Thorsten Scherf

Experienced administrators tend to use tools such as ps, vmstat, or the like when they need statistics for individual subsystems such as the network, memory, or block I/O. These tools can help identify hardware or software bottlenecks, and they are indisputably useful for a general appraisal, but if you want to delve deeper, you need something with more punch.

Again, the standard toolbox offers a couple of utilities. For example, the popular strace traces applications. In the simplest cases, the tool lists all the system calls (syscalls) with their arguments and return codes for a specific application. Setting options allows for highly selective Strace output. For example, if you need to investigate whether an application is parsing the configuration file that you painstakingly put together, you can call Strace:

strace -e trace=open -o mutt.trace mutt

This command line sends all open syscalls for the Mutt application to the /tmp/mutt.trace output file. Then, you can easily grep the configuration file from the results.

Profiling applications, including the popular OProfile tool http://1, take this a step further by giving you details of the performance of individual applications, the kernel, or the complete system (see Figure 1). For this to happen, OProfile accesses the CPU performance counters on state-of-the-art hardware. The counters have information on how often a specific event has occurred. In this context, an event can be RAM access or the number of interrupts. This information is very useful for identifying bottlenecks or debugging the system.

To install OProfile, you need the kernel-debuginfo http://2 package, which provides the symbol to machine code mappings. Note that the kernel-debuginfo version must match your kernel version. To install the package easily, use your distribution's standard repository. On Fedora, you would do this with Yum:

yum install oprofile kernel-debuginfo

A call to RPM should confirm that the kernel and kernel-debuginfo packages have the same version number:

rpm -q kernel-PAE kernel-PAE-debuginfo oprofilekernel-PAE-2.6.32.10-90.fc12.i686
kernel-PAE-debuginfo-2.6.32.10-90.fc12.i686
oprofile-0.9.5-4.fc12.i686

To profile the kernel, you need to tell OProfile where the kernel image is located with the --vmlinux option:

opcontrol --setup --vmlinux=/usr/lib/debug/lib/ modules/`uname -r`/vmlinux

In normal use, you can omit the image details. The following command gives you an overview of the events that OProfile can enumerate (see Listing 1):

Listing 1: opcontrol Events

opcontrol --list-events
oprofile: available events for CPU type "Core 2"
See Intel Architecture Developer's Manual Volume 3B, Appendix A and Intel Architecture Optimization Reference Manual (730795-001)
INST_RETIRED_ANY_P: (counter: all))
    number of instructions retired (min count: 6000)
L2_RQSTS: (counter: all))
    number of L2 cache requests (min count: 500)
    Unit masks (default 0x7f)
    ----------
    0xc0: core: all cores
    0x40: core: this core
    0x30: prefetch: all inclusive
    0x10: prefetch: Hardware prefetch only
    0x00: prefetch: exclude hardware prefetch
    0x08: (M)ESI: Modified
    0x04: M(E)SI: Exclusive
    0x02: ME(S)I: Shared
    0x01: MES(I): Invalid
LLC_MISSES: (counter: all))
   L2 cache demand requests from this core that missed the L2 (min count:6000)
           Unit masks (default 0x41)
           ----------
           0x41: No unit mask
[...]

opcontrol --list-events

The events will differ depending on the CPU you use. The /usr/share/oprofile directory has lists for the various architectures. An event comprises a symbolic name (L2_RQSTS), a counter (500), and an optional mask (0xc0). The counter defines the accuracy of a profile. The lower the value, the more often the event will be queried. Special properties of an event are available by query with the mask. For example, the L2_RQSTS event tells you how many requests have been made to the CPU's L2 cache. When called with a mask of 0xc0, OProfile returns the value for all the available CPUs; if you set 0x40, you get the value only for the CPU actually running the OProfile process. The command in Listing 2 monitors a specific event.

Listing 2: opcontrol Events

opcontrol --vmlinux=/usr/lib/debug/lib/modules/`uname -r`/vmlinux --event L2_RQSTS:500
# opcontrol --reset
# opcontrol --start
Using 2.6+ OProfile kernel interface.
Reading module info.
Using log file /var/lib/oprofile/samples/oprofiled.log
Daemon started.
Profiler running.

The --event option can occur multiple times to monitor more than one event. To make sure that the results are not distorted by historical data, the --reset option deletes them before collecting fresh data with --start. After a while, the --stop option stops monitoring the system. The data collected in this way are now available in the /var/lib/oprofile/samples directory. For a general overview, you can access the data with opreport – either the data for the complete system or the data for a specific application (Listing 3).

Listing 3: opreport for Mutt

  opreport -l /usr/bin/mutt
CPU: Core 2, speed 2401 MHz (estimated)
Counted L2_RQSTS events (number of L2 cache requests) with a unit mask of 0x7f
(multiple flags) count 500
samples  %        image name      symbol name
414      10.8377  mutt            imap_exec_msgset
162       4.2408  mutt            parse_set
161       4.2147  mutt            mutt_buffer_add
145       3.7958  mutt            mutt_extract_token
126       3.2984  mutt            ascii_strncasecmp
124       3.2461  mutt            imap_read_headers
[...]

Depending on the event selected, this information will give you a pretty clear picture of what is happening on your system. For more details of OProfile, check out the highly informative website for the tool http://1.

SystemTap

The SystemTap http://3 tool aims to combine the functionality of classical tracing and profile tools such as Strace and OProfile while providing a simple but powerful interface for the user. SystemTap was originally developed for monitoring the Linux kernel, although more recent versions also let you monitor userspace applications.

SystemTap builds on the kprobes kernel subsystem. It lets the user insert arbitrary program code before any event in kernel space – for example, for kernel functions (kernel.function("function")) inside of kernel modules (module("module").function("function")) or system calls (syscall.system_call). The injected program monitors the event and collects information in the process. Thus, far more precise results can be achieved here than with OProfile, which only queries an event periodically.

SystemTap has also supported querying static tracepoints in the kernel (kernel.trace("tracepoint")) and, more recently, in userspace applications. Developers build static tracepoints into the program code at important locations. Because developers know their program better than anybody else, this kind of information is a big help.

SystemTap programs are written in a language that is similar to Awk. A parser checks the script for syntax errors before converting it to the faster C language, which is then loaded as a kernel module (Figure 2). Using the module on another system that doesn't have a compiler is not a problem, though. SystemTap lets you build modules for kernel versions besides the kernel on your own system. Then you can copy the module to the target system and run it with staprun – more on this subject later.

Figure 2: After the syntax of the SystemTap script is checked, the script is converted to C and loaded as a kernel module.

Because all the major Linux distributions support SystemTap, you can easily install from the standard software repository. The important thing is to install the kernel-debuginfo package along with kernel-devel:

yum install kernel-debuginfo kernel-devel systemtap systemtap-runtime

The latest versions are available from the project's Git repository:

git clone http://sources.redhat.com/git/systemtap.git systemtap

Assuming the installation is successful, you can use the following one-liner to check that SystemTap is working properly:

stap -v -e 'probe vfs.read {printf("Reading data from disk.\n"); exit()}'

If this accesses the kernel's VFS subsystem, stap will send a message to standard output and terminate. The ubiquitous "Hello World" program for SystemTap is shown in Listing 4.

Listing 4: "Hello world" in SystemTap

01 #!/usr/bin/stap
02 probe begin {printf("Hello, world!\n");}
03 probe timer.sec(5) {exit();}
04 probe end {printf("Good-bye, world!\n");}
05
06 # stap helloword.stp
07 Hello, world!
08 <5 seconds later>
09 Good-bye, world!

The "Hello World" example is a great demonstration of the generic structure of a SystemTap script. A script always comprises two parts: an event and a handler – typically preceded by a probe instruction. In this example, the event is the read.vfs function and the handler is the printf command that outputs text to stdout. The handler is always executed when the specified event occurs. Events can be kernel functions, syscalls, or, as in this example, prebuilt tapsets – that is, prebuilt code blocks for specific kernel functions and system calls.

Tapsets are easy to integrate with your own scripts. The templates are typically located below /usr/share/systemtap/tapsets. Besides these synchronous events, other asynchronous events are not bound to a specific event in the kernel or program code, and typically they are used when you need to create a header or footer for your script. They are also suitable for running specific events multiple times.

Listing 5 shows a simple example with two probes, each with an asynchronous and a synchronous event. The first outputs a header at one-second intervals, the second calls the prebuilt tcp.receive tapset, which is defined in Listing 6. This example shows the extent to which the use of tapsets reduces the complexity of your own scripting. When you launch the script from Listing 1, typing stap tcpdump.stp lets you see the network packets arriving at one-second intervals with various other pieces of information. If you omit timer.s(1) in the first event, the header is only output before outputting the first network packet.

Listing 5: tcpdump via SystemTap

01 #!/usr/bin/stap
02
03 // A TCP dump like example
04
05 probe begin, timer.s(1) {
06   printf("-------------------------------------------          ----------------------\n")
07   printf("       Source IP         Dest IP  SPort  DPort  U  A  P  R  S  F\n")
08   printf("-------------------------------------------          ----------------------\n")
09 }
10
11 probe tcp.receive {
12   printf(" %15s %15s  %5d  %5d  %d  %d  %d  %d  %d
13            %d\n", saddr, daddr, sport, dport, urg,
14          ack, psh, rst, syn, fin)
15 }

Listing 6: Tapset tcp.stp

01 probe tcp.receive = kernel.function("tcp_v4_rcv") {
02         iphdr = __get_skb_iphdr($skb)
03         saddr = ip_ntop(__ip_skb_saddr(iphdr))
04         daddr = ip_ntop(__ip_skb_daddr(iphdr))
05         protocol = __ip_skb_proto(iphdr)
06
07         tcphdr = __get_skb_tcphdr($skb)
08         dport = __tcp_skb_dport(tcphdr)
09         sport = __tcp_skb_sport(tcphdr)
10         urg = __tcp_skb_urg(tcphdr)
11         ack = __tcp_skb_ack(tcphdr)
12         psh = __tcp_skb_psh(tcphdr)
13         rst = __tcp_skb_rst(tcphdr)
14         syn = __tcp_skb_syn(tcphdr)
15         fin = __tcp_skb_fin(tcphdr)
16 }

The handler, also known as a body, supports instructions that will be familiar from various programming languages. For example, you can initialize variables and arrays, call functions, and query positioning parameters $ (integer) or @ (string). Of course, you wouldn't want to do without loops (while, until, for, if/else), which give you useful flow control options for the script.

Instead of searching through a mass of data for the required information, which is the case with Strace, SystemTap lets you output the information after exceeding a specific threshold value, or when a specific event occurs. Thanks to the functional scope of the language, the choice of language constructs is more than adequate.

Listing 7 shows another example that uses the vfs.read tapset. The global variable totals is an associative array in this case. It contains the process names and process IDs for all the applications that access the VFS subsystem to read data from disk. The counter is incremented each time it's accessed.

Listing 7: Finding I/O-Intensive Apps

01 #!/usr/bin/stap
02
03 global totals;
04 probe vfs.read
05
06 {
07     totals[execname(), pid()]++
08 }
09
10 probe end
11 {
12     printf("** Summary ** \n")
13     foreach ([name,pid] in totals-)
14     printf("%s (%d): %d \n", name, pid, totals[name,pid])
15 }

If you are interested in a specific userspace program, you'll need to install the matching debuginfo package for the application. To make things easy, I will look at the ls tool as an example. To perform a trace here, you'll need the coreutils-debuginfo package. Calling stap as in Listing 8 gives you an overview of the functions in a specific process.

Listing 8: SystemTap Tracing a Userspace App

stap -e 'probe process("ls").function("*").call {log (pp())}' -c 'ls -l'
total 20
-rw-rw-r--. 1 tscherf tscherf 17347 2010-04-12 08:43 systemtap.txt
process("/bin/ls").function("main@/usr/src/debug/coreutils-7.6/src/ls.c:1225").call
process("/bin/ls").function("set_program_name@/usr/src/debug/coreutils-7.6/lib/progname.c:35").call
process("/bin/ls").function("human_options@/usr/src/debug/coreutils-7.6/lib/human.c:462").call
process("/bin/ls").function("clone_quoting_options@/usr/src/debug/coreutils-7.6/lib/quotearg.c:99").call
process("/bin/ls").function("xmemdup@/usr/src/debug/coreutils-7.6/lib/xmalloc.c:107").call
process("/bin/ls").function("xmalloc@/usr/src/debug/coreutils-7.6/lib/xmalloc.c:43").call
process("/bin/ls").function("get_quoting_style@/usr/src/debug/coreutils-7.6/lib/quotearg.c:110").call
process("/bin/ls").function("clone_quoting_options@/usr/src/debug/coreutils-7.6/lib/quotearg.c:99").call
process("/bin/ls").function("free_pending_ent@/usr/src/debug/coreutils-7.6/src/ls.c:1132").call
[...]
process("/bin/ls").function("close_stdout@/usr/src/debug/coreutils-7.6/lib/closeout.c:107").call
process("/bin/ls").function("close_stream@/usr/src/debug/coreutils-7.6/lib/close-stream.c:57").call
process("/bin/ls").function("close_stream@/usr/src/debug/coreutils-7.6/lib/close-stream.c:57").call

If the parameters of a specific function are also of interest, you can change the call to stap as shown in Listing 9.

Listing 9: Parameter Tracing

stap -e 'probe process("ls").function("clone_quoting_options").call {log (probefunc() . " " . $$parms) }' -c '/bin/ls -l'
total 20
-rw-rw-r--. 1 tscherf tscherf 18216 2010-04-12 09:02 systemtap.txt
clone_quoting_options o=0x0
clone_quoting_options o=0x0

Cross-Compiling

If you want to run a SystemTap script on multiple systems, you will probably prefer not to have to install the compiler and the kernel debug information on all of these machines. In fact, you only need to do so on a build system. The target systems only require the systemtap-runtime RPM and the staprun program it contains. The following command creates a prebuilt binary kernel module for the target system:

stap -r kernel-PAE-2.6.31.12-174.2.22 capt-io.stp -m read-io

The build system also needs the kernel-debuginfo package to match the target system version, and you must ensure that the build and target systems have the same hardware architecture. After creating a kernel module, copy it to the target system and launch it with staprun:

staprun capt-io.ko

If you want non-root users to load this kernel module, they need to be members of the stapusr group; members of the stapdev group can additionally compile their own scripts.

Conclusions

The SystemTap tracing and profiling tool lets regular users perform detailed analyses of kernel and userspace programs without rebooting the whole system. Thanks to the comprehensive tapset library, this can be done without serious programming skills. Advanced users will enjoy the flexible, Awk-like scripting language that gives them the freedom to create highly complex tracing and profiling scripts. The SystemTap FAQ http://4 and the language reference http://5 are useful ports of call for more help.