Tracing applications with OProfile and SystemTap
Data on Tap
Experienced administrators tend to use tools such as ps
, vmstat
, or the like when they need statistics for individual subsystems such as the network, memory, or block I/O. These tools can help identify hardware or software bottlenecks, and they are indisputably useful for a general appraisal, but if you want to delve deeper, you need something with more punch.
Again, the standard toolbox offers a couple of utilities. For example, the popular strace
traces applications. In the simplest cases, the tool lists all the system calls (syscalls) with their arguments and return codes for a specific application. Setting options allows for highly selective Strace output. For example, if you need to investigate whether an application is parsing the configuration file that you painstakingly put together, you can call Strace:
strace -e trace=open -o mutt.trace mutt
This command line sends all open
syscalls for the Mutt application to the /tmp/mutt.trace
output file. Then, you can easily grep the configuration file from the results.
Profiling applications, including the popular OProfile tool http://1, take this a step further by giving you details of the performance of individual applications, the kernel, or the complete system (see Figure 1). For this to happen, OProfile accesses the CPU performance counters on state-of-the-art hardware. The counters have information on how often a specific event has occurred. In this context, an event can be RAM access or the number of interrupts. This information is very useful for identifying bottlenecks or debugging the system.
To install OProfile, you need the kernel-debuginfo http://2 package, which provides the symbol to machine code mappings. Note that the kernel-debuginfo version must match your kernel version. To install the package easily, use your distribution's standard repository. On Fedora, you would do this with Yum:
yum install oprofile kernel-debuginfo
A call to RPM should confirm that the kernel and kernel-debuginfo packages have the same version number:
rpm -q kernel-PAE kernel-PAE-debuginfo oprofilekernel-PAE-2.6.32.10-90.fc12.i686 kernel-PAE-debuginfo-2.6.32.10-90.fc12.i686 oprofile-0.9.5-4.fc12.i686
To profile the kernel, you need to tell OProfile where the kernel image is located with the --vmlinux
option:
opcontrol --setup --vmlinux=/usr/lib/debug/lib/ modules/`uname -r`/vmlinux
In normal use, you can omit the image details. The following command gives you an overview of the events that OProfile can enumerate (see Listing 1):
Listing 1: opcontrol Events
opcontrol --list-events oprofile: available events for CPU type "Core 2" See Intel Architecture Developer's Manual Volume 3B, Appendix A and Intel Architecture Optimization Reference Manual (730795-001) INST_RETIRED_ANY_P: (counter: all)) number of instructions retired (min count: 6000) L2_RQSTS: (counter: all)) number of L2 cache requests (min count: 500) Unit masks (default 0x7f) ---------- 0xc0: core: all cores 0x40: core: this core 0x30: prefetch: all inclusive 0x10: prefetch: Hardware prefetch only 0x00: prefetch: exclude hardware prefetch 0x08: (M)ESI: Modified 0x04: M(E)SI: Exclusive 0x02: ME(S)I: Shared 0x01: MES(I): Invalid LLC_MISSES: (counter: all)) L2 cache demand requests from this core that missed the L2 (min count:6000) Unit masks (default 0x41) ---------- 0x41: No unit mask [...]
opcontrol --list-events
The events will differ depending on the CPU you use. The /usr/share/oprofile
directory has lists for the various architectures. An event comprises a symbolic name (L2_RQSTS
), a counter (500
), and an optional mask (0xc0
). The counter defines the accuracy of a profile. The lower the value, the more often the event will be queried. Special properties of an event are available by query with the mask. For example, the L2_RQSTS
event tells you how many requests have been made to the CPU's L2 cache. When called with a mask of 0xc0
, OProfile returns the value for all the available CPUs; if you set 0x40
, you get the value only for the CPU actually running the OProfile process. The command in Listing 2 monitors a specific event.
Listing 2: opcontrol Events
opcontrol --vmlinux=/usr/lib/debug/lib/modules/`uname -r`/vmlinux --event L2_RQSTS:500 # opcontrol --reset # opcontrol --start Using 2.6+ OProfile kernel interface. Reading module info. Using log file /var/lib/oprofile/samples/oprofiled.log Daemon started. Profiler running.
The --event
option can occur multiple times to monitor more than one event. To make sure that the results are not distorted by historical data, the --reset
option deletes them before collecting fresh data with --start
. After a while, the --stop
option stops monitoring the system. The data collected in this way are now available in the /var/lib/oprofile/samples
directory. For a general overview, you can access the data with opreport
– either the data for the complete system or the data for a specific application (Listing 3).
Listing 3: opreport for Mutt
opreport -l /usr/bin/mutt CPU: Core 2, speed 2401 MHz (estimated) Counted L2_RQSTS events (number of L2 cache requests) with a unit mask of 0x7f (multiple flags) count 500 samples % image name symbol name 414 10.8377 mutt imap_exec_msgset 162 4.2408 mutt parse_set 161 4.2147 mutt mutt_buffer_add 145 3.7958 mutt mutt_extract_token 126 3.2984 mutt ascii_strncasecmp 124 3.2461 mutt imap_read_headers [...]
Depending on the event selected, this information will give you a pretty clear picture of what is happening on your system. For more details of OProfile, check out the highly informative website for the tool http://1.
SystemTap
The SystemTap http://3 tool aims to combine the functionality of classical tracing and profile tools such as Strace and OProfile while providing a simple but powerful interface for the user. SystemTap was originally developed for monitoring the Linux kernel, although more recent versions also let you monitor userspace applications.
SystemTap builds on the kprobes
kernel subsystem. It lets the user insert arbitrary program code before any event in kernel space – for example, for kernel functions (kernel.function("function")
) inside of kernel modules (module("module").function("function")
) or system calls (syscall.system_call
). The injected program monitors the event and collects information in the process. Thus, far more precise results can be achieved here than with OProfile, which only queries an event periodically.
SystemTap has also supported querying static tracepoints in the kernel (kernel.trace("tracepoint")
) and, more recently, in userspace applications. Developers build static tracepoints into the program code at important locations. Because developers know their program better than anybody else, this kind of information is a big help.
SystemTap programs are written in a language that is similar to Awk. A parser checks the script for syntax errors before converting it to the faster C language, which is then loaded as a kernel module (Figure 2). Using the module on another system that doesn't have a compiler is not a problem, though. SystemTap lets you build modules for kernel versions besides the kernel on your own system. Then you can copy the module to the target system and run it with staprun
– more on this subject later.
Because all the major Linux distributions support SystemTap, you can easily install from the standard software repository. The important thing is to install the kernel-debuginfo package along with kernel-devel:
yum install kernel-debuginfo kernel-devel systemtap systemtap-runtime
The latest versions are available from the project's Git repository:
git clone http://sources.redhat.com/git/systemtap.git systemtap
Assuming the installation is successful, you can use the following one-liner to check that SystemTap is working properly:
stap -v -e 'probe vfs.read {printf("Reading data from disk.\n"); exit()}'
If this accesses the kernel's VFS subsystem, stap
will send a message to standard output and terminate. The ubiquitous "Hello World" program for SystemTap is shown in Listing 4.
Listing 4: "Hello world" in SystemTap
01 #!/usr/bin/stap 02 probe begin {printf("Hello, world!\n");} 03 probe timer.sec(5) {exit();} 04 probe end {printf("Good-bye, world!\n");} 05 06 # stap helloword.stp 07 Hello, world! 08 <5 seconds later> 09 Good-bye, world!
The "Hello World" example is a great demonstration of the generic structure of a SystemTap script. A script always comprises two parts: an event and a handler – typically preceded by a probe
instruction. In this example, the event is the read.vfs
function and the handler is the printf
command that outputs text to stdout. The handler is always executed when the specified event occurs. Events can be kernel functions, syscalls, or, as in this example, prebuilt tapsets – that is, prebuilt code blocks for specific kernel functions and system calls.
Tapsets are easy to integrate with your own scripts. The templates are typically located below /usr/share/systemtap/tapsets
. Besides these synchronous events, other asynchronous events are not bound to a specific event in the kernel or program code, and typically they are used when you need to create a header or footer for your script. They are also suitable for running specific events multiple times.
Listing 5 shows a simple example with two probes, each with an asynchronous and a synchronous event. The first outputs a header at one-second intervals, the second calls the prebuilt tcp.receive
tapset, which is defined in Listing 6. This example shows the extent to which the use of tapsets reduces the complexity of your own scripting. When you launch the script from Listing 1, typing stap tcpdump.stp
lets you see the network packets arriving at one-second intervals with various other pieces of information. If you omit timer.s(1)
in the first event, the header is only output before outputting the first network packet.
Listing 5: tcpdump via SystemTap
01 #!/usr/bin/stap 02 03 // A TCP dump like example 04 05 probe begin, timer.s(1) { 06 printf("------------------------------------------- ----------------------\n") 07 printf(" Source IP Dest IP SPort DPort U A P R S F\n") 08 printf("------------------------------------------- ----------------------\n") 09 } 10 11 probe tcp.receive { 12 printf(" %15s %15s %5d %5d %d %d %d %d %d 13 %d\n", saddr, daddr, sport, dport, urg, 14 ack, psh, rst, syn, fin) 15 }
Listing 6: Tapset tcp.stp
01 probe tcp.receive = kernel.function("tcp_v4_rcv") { 02 iphdr = __get_skb_iphdr($skb) 03 saddr = ip_ntop(__ip_skb_saddr(iphdr)) 04 daddr = ip_ntop(__ip_skb_daddr(iphdr)) 05 protocol = __ip_skb_proto(iphdr) 06 07 tcphdr = __get_skb_tcphdr($skb) 08 dport = __tcp_skb_dport(tcphdr) 09 sport = __tcp_skb_sport(tcphdr) 10 urg = __tcp_skb_urg(tcphdr) 11 ack = __tcp_skb_ack(tcphdr) 12 psh = __tcp_skb_psh(tcphdr) 13 rst = __tcp_skb_rst(tcphdr) 14 syn = __tcp_skb_syn(tcphdr) 15 fin = __tcp_skb_fin(tcphdr) 16 }
The handler, also known as a body, supports instructions that will be familiar from various programming languages. For example, you can initialize variables and arrays, call functions, and query positioning parameters $ (integer)
or @ (string)
. Of course, you wouldn't want to do without loops (while
, until
, for
, if
/else
), which give you useful flow control options for the script.
Instead of searching through a mass of data for the required information, which is the case with Strace, SystemTap lets you output the information after exceeding a specific threshold value, or when a specific event occurs. Thanks to the functional scope of the language, the choice of language constructs is more than adequate.
Listing 7 shows another example that uses the vfs.read
tapset. The global variable totals
is an associative array in this case. It contains the process names and process IDs for all the applications that access the VFS subsystem to read data from disk. The counter is incremented each time it's accessed.
Listing 7: Finding I/O-Intensive Apps
01 #!/usr/bin/stap 02 03 global totals; 04 probe vfs.read 05 06 { 07 totals[execname(), pid()]++ 08 } 09 10 probe end 11 { 12 printf("** Summary ** \n") 13 foreach ([name,pid] in totals-) 14 printf("%s (%d): %d \n", name, pid, totals[name,pid]) 15 }
If you are interested in a specific userspace program, you'll need to install the matching debuginfo package for the application. To make things easy, I will look at the ls
tool as an example. To perform a trace here, you'll need the coreutils-debuginfo package. Calling stap
as in Listing 8 gives you an overview of the functions in a specific process.
Listing 8: SystemTap Tracing a Userspace App
stap -e 'probe process("ls").function("*").call {log (pp())}' -c 'ls -l' total 20 -rw-rw-r--. 1 tscherf tscherf 17347 2010-04-12 08:43 systemtap.txt process("/bin/ls").function("main@/usr/src/debug/coreutils-7.6/src/ls.c:1225").call process("/bin/ls").function("set_program_name@/usr/src/debug/coreutils-7.6/lib/progname.c:35").call process("/bin/ls").function("human_options@/usr/src/debug/coreutils-7.6/lib/human.c:462").call process("/bin/ls").function("clone_quoting_options@/usr/src/debug/coreutils-7.6/lib/quotearg.c:99").call process("/bin/ls").function("xmemdup@/usr/src/debug/coreutils-7.6/lib/xmalloc.c:107").call process("/bin/ls").function("xmalloc@/usr/src/debug/coreutils-7.6/lib/xmalloc.c:43").call process("/bin/ls").function("get_quoting_style@/usr/src/debug/coreutils-7.6/lib/quotearg.c:110").call process("/bin/ls").function("clone_quoting_options@/usr/src/debug/coreutils-7.6/lib/quotearg.c:99").call process("/bin/ls").function("free_pending_ent@/usr/src/debug/coreutils-7.6/src/ls.c:1132").call [...] process("/bin/ls").function("close_stdout@/usr/src/debug/coreutils-7.6/lib/closeout.c:107").call process("/bin/ls").function("close_stream@/usr/src/debug/coreutils-7.6/lib/close-stream.c:57").call process("/bin/ls").function("close_stream@/usr/src/debug/coreutils-7.6/lib/close-stream.c:57").call
If the parameters of a specific function are also of interest, you can change the call to stap
as shown in Listing 9.
Listing 9: Parameter Tracing
stap -e 'probe process("ls").function("clone_quoting_options").call {log (probefunc() . " " . $$parms) }' -c '/bin/ls -l' total 20 -rw-rw-r--. 1 tscherf tscherf 18216 2010-04-12 09:02 systemtap.txt clone_quoting_options o=0x0 clone_quoting_options o=0x0
Cross-Compiling
If you want to run a SystemTap script on multiple systems, you will probably prefer not to have to install the compiler and the kernel debug information on all of these machines. In fact, you only need to do so on a build system. The target systems only require the systemtap-runtime RPM and the staprun
program it contains. The following command creates a prebuilt binary kernel module for the target system:
stap -r kernel-PAE-2.6.31.12-174.2.22 capt-io.stp -m read-io
The build system also needs the kernel-debuginfo package to match the target system version, and you must ensure that the build and target systems have the same hardware architecture. After creating a kernel module, copy it to the target system and launch it with staprun
:
staprun capt-io.ko
If you want non-root users to load this kernel module, they need to be members of the stapusr
group; members of the stapdev
group can additionally compile their own scripts.
Conclusions
The SystemTap tracing and profiling tool lets regular users perform detailed analyses of kernel and userspace programs without rebooting the whole system. Thanks to the comprehensive tapset library, this can be done without serious programming skills. Advanced users will enjoy the flexible, Awk-like scripting language that gives them the freedom to create highly complex tracing and profiling scripts. The SystemTap FAQ http://4 and the language reference http://5 are useful ports of call for more help.