First steps in system-wide Linux tracing

Motivation

Today at Red Hat we have a “Learn something new” day. After so many years of doing software development, I’m quite well versed in tools like strace, gdb, or good old printf() debugging to examine how an individual process (mis)behaves. But I occasionally run into situations where a system-wide scope of examination is necessary.

For example, when I was working on optimizing Ubuntu’s power usage ages ago, I wanted to hunt down processes which were opening/reading/writing files and thus waking up the hard disk. I discovered Linux’s fanotify API and wrote a little tool called fatrace around it, which has found its way into pretty much all distros these days. But the fanotify API is quite complicated and brittle to use.

Last week I also looked into why a lot of our Cockpit integration tests for NetworkManager checkpoints are so horribly flaky. For these, Cockpit has to deal with real-time deadlines (pretty much the only case that I’m aware of), so it’s inherently unreliable in a cloud test environment with lots of noisy neighbors. In these tests I see a lot of cases where NetworkManager is super-slow, taking maybe 5 seconds for an operation that should normally be done in the blink of an eye. My initial stab in the dark was clearly not effective enough, so I need a bigger hammer.

Linux tracing tools have come a long way in the recent years, so today was a perfect opportunity to finally learn about the current state of the art. I picked bpftrace. It takes a reasonably high-level program as input (feels similar to awk), compiles it into an EBPF program, and runs it in the kernel.

Initial difficulties

I wanted to try this on my Fedora laptop first. After installing the bpftrace package, not much was working though, even the simplest example one-liners complained about “Error attaching probe” – my system does not have /sys/kernel/debug/tracing at all. Some googling and thorough inspection of dmesg uncovered that the brand new kernel lockdown feature throws a spanner in the works here:

[    0.300029] Tracing disabled due to lockdown

I tried to boot my kernel with lockdown=integrity (as lockdown=confidentiality is the default), but that didn’t help, even though the exact same version in a VM does work. Turns out that having SecureBoot enabled automatically enables lockdown. The advertised SysRq+x does not work (SysRq+h does not even know about it, even though I enabled all SysRqs), and I don’t really feel like disabling SecureBoot.

So I did all my subsequent experiments in a Fedora 31 VM, which works just out of the box.

First steps

I started with reading the excellent documentation on Brendan Gregg’s homepage, the One Liner tutorial and the reference guide to get a feeling what’s possible – and there’s a lot! Way too much to grasp all of it in one day of course, but it’s good to keep a high-level overview in the back of your head.

So of course I mostly followed the tutorials and started with variations of simple one-liners, like showing all executed programs in the system:

bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf("%s: ", comm); join(args->argv) }'

One step further, tracing opening of files. This requires two probes, either using the syscall tracepoints:

bpftrace -e 't:syscalls:sys_enter_open,t:syscalls:sys_enter_openat {
      printf("%s(%d) opening %s flags %x mode %x", comm, pid, str(args->filename), args->flags, args->mode) }
  t:syscalls:sys_exit_open,t:syscalls:sys_exit_openat { printf(" = %d\n", args->ret) }'

Or directly hooking into the kernel functions (“kprobes”):

bpftrace -e 'kprobe:do_sys_open { printf("%s(%d) opening: %s flags %x mode %x", comm, pid, str(arg1), arg2, arg3); }
             kretprobe:do_sys_open { printf(" = %d\n", retval) }'

This is already quite useful to track down what happens on your system – think of it as strace -e execve -p heck_yeah_all_processes.

My impression is that it’s better to use the statically defined tracepoints when they are available, as they have much better type safety and are well self-documented. For example:

# cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_open/format
name: sys_enter_open
ID: 627
format:
	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
	field:int common_pid;	offset:4;	size:4;	signed:1;

	field:int __syscall_nr;	offset:8;	size:4;	signed:1;
	field:const char * filename;	offset:16;	size:8;	signed:0;
	field:int flags;	offset:24;	size:8;	signed:0;
	field:umode_t mode;	offset:32;	size:8;	signed:0;

print fmt: "filename: 0x%08lx, flags: 0x%08lx, mode: 0x%08lx", ((unsigned long)(REC->filename)), ((unsigned long)(REC->flags)), ((unsigned long)(REC->mode))

But of course kprobes open up a whole lot of new possibilities to peek into the kernel that’s far beyond the syscall userland interface and even the wider set of defined tracepoints. The price is that they are not stable, not self-descriptive, and it’s harder to pry information out of them.

Tracing file access

My morning goal was to find out whether and up to which degree I could replace fatrace with a bpftrace program. As a reference, I ran these two commands in bash:

head /etc/motd; echo hello > somefile

which looks like this with fatrace:

head(35929): RO /usr/bin/head
head(35929): RO /usr/lib64/ld-2.30.so
head(35929): CO /etc/ld.so.cache
head(35929): RO /usr/lib64/libc-2.30.so
head(35929): O /usr/lib/locale/locale-archive
head(35929): RO /etc/motd
head(35929): C /etc/motd
unknown(35929): C /usr/bin/head
unknown(35929): C /usr/lib64/ld-2.30.so
unknown(35929): C /usr/lib64/libc-2.30.so
unknown(35929): C /usr/lib/locale/locale-archive
bash(35903): + /home/admin
bash(35903): CWO /home/admin/somefile

After some tinkering, I came up with this bpftrace program fatrace.bt:

#!/usr/bin/bpftrace
#include <fcntl.h>
#include <limits.h>

/* remember our own pid, so that we can filter it out from the handlers */
BEGIN { @mypid = pid }

tracepoint:syscalls:sys_enter_open /pid != @mypid/ {
     printf("%s(%d) open: %s flags %x mode %x", comm, pid, str(args->filename), args->flags, args->mode);
}

tracepoint:syscalls:sys_enter_openat /pid != @mypid/ {
     printf("%s(%d) openat: ", comm, pid);
     if (args->dfd < INT_MAX) { /* ought to be != AT_FDCWD, but that does not work !?!? */
         printf("[at fd %d]", args->dfd);
     }
     printf("%s flags %x mode %x", str(args->filename), args->flags, args->mode);
}

tracepoint:syscalls:sys_exit_open,
tracepoint:syscalls:sys_exit_openat /pid != @mypid/ {
     printf(" = %d\n", args->ret);
}

tracepoint:syscalls:sys_enter_read /pid != @mypid/ {
     printf("%s(%d) read fd %d size %d\n", comm, pid, args->fd, args->count);
}

tracepoint:syscalls:sys_enter_write /pid != @mypid/ {
     printf("%s(%d) write fd %d size %d\n", comm, pid, args->fd, args->count);
}

END { clear(@mypid) }

which yields this result:

bash(35903) read fd 0 size 1
bash(35903) write fd 2 size 1
bash(35926) read fd 4 size 1
head(35926) openat: /etc/ld.so.cache flags 80000 mode 0 = 3
head(35926) openat: /lib64/libc.so.6 flags 80000 mode 0 = 3
head(35926) read fd 3 size 832
head(35926) read fd 3 size 784
head(35926) read fd 3 size 32
head(35926) read fd 3 size 68
head(35926) read fd 3 size 784
head(35926) read fd 3 size 32
head(35926) read fd 3 size 68
head(35926) openat: /usr/lib/locale/locale-archive flags 80000 mode 0 = 3
head(35926) openat: /etc/motd flags 0 mode 0 = 3
head(35926) read fd 3 size 8192
head(35926) write fd 1 size 18
head(35926) read fd 3 size 8192
bash(35903) openat: somefile flags 241 mode 1b6 = 4
bash(35903) write fd 1 size 6
bash(35903) write fd 1 size 15
bash(35903) write fd 2 size 14

Pros:

fatrace.bt is able to collect a lot more information, such as read/write block sizes or file open flags
fatrace.bt is more robust – with fanotify, you don’t get the comm name or file paths for free, but have to jump through hoops to look them up after the event happened – that causes this bunch of “unknown” process names
fatrace.bt is a lot shorter – 1.1 kB as opposed to the 18.1 kB of fatrace.c

Cons:

Output formatting in bpftrace is really limited – e. g. it would make sense to format the mode and flags values as octal numbers, but printf() can’t do that.
For the read/write traces, mapping the fd to a file name seems really hard. There’s no readlink(), so you can’t just look at /proc/<pid>/fd/<fd>. It probably could maintain a huge lookup table of (pid, fd) → filename, but that seems brittle.
“enter” and “exit” are separate events which don’t necessarily alternate exactly. So the sys_exit_open handler is probably broken in a multi-CPU environment when multiple open()s are called concurrently. (I ran that on a VM with a single CPU). In fact, my first version used kprobes, and I got a bunch of kretprobe:do_sys_open events without a previous kprobe:do_sys_open. I don’t see how to associate an _exit to the corresponding _enter event, there’s no event ID or something similar.

The first two points can easily be mitigated with some post-processing, of course. The third point seems like rather serious to me, but I hope that it’s just my ignorance.

Finding slow processes

In the afternoon I took a stab at my second motivational use case above. I know next to nothing about Linux scheduling, but with the help of the example programs and a lot of experimentation, I had something which seemed at least marginally useful:

#!/usr/bin/bpftrace

/* Check for processes which want to run with a schedule latency of at least 5ms.
   Show a histogram of the slow schedulings per process and pid.
*/

#include <linux/sched.h>

BEGIN { printf("Tracing for slowly scheduled processes; press ^C to finish\n"); }

tracepoint:sched:sched_wakeup,
tracepoint:sched:sched_wakeup_new {
    @lastrunning[args->pid] = nsecs;
}

tracepoint:sched:sched_switch
{
    if (args->prev_state == TASK_RUNNING) {
        @lastrunning[args->prev_pid] = nsecs;
    }
    
    // when did we see the next scheduled process the last time?
    $last_ns = @lastrunning[args->next_pid];
    delete(@lastrunning[args->next_pid]);
    if ($last_ns) {
        $delta_ms = (nsecs - $last_ns) / 1000000;
        if ($delta_ms > 5) {
            @slowprocs[args->next_comm, args->next_pid] = hist($delta_ms);
        }
    }
}

END { clear(@lastrunning) }

On a quiet system with just at most one busy process, it doesn’t show anything at all. With generating some load on my single-CPU VM

cat /dev/urandom > /dev/null &
dd if=/dev/urandom of=/dev/null &
grep -r . / > /dev/null &

running that program for a few seconds showed:

# ./slowprocs.bt
Attaching 5 probes...
Tracing for slowly scheduled processes...

@slowprocs[grep, 41356]:
[4, 8)               249 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8, 16)                5 |@                                                   |

@slowprocs[cat, 41350]:
[4, 8)               315 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8, 16)               18 |@@                                                  |

@slowprocs[dd, 41351]:
[4, 8)               428 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8, 16)               17 |@@

I. e. this filters out all the background noise, and shows me exactly which processes compete for CPU, i. e. they could be scheduled, but are not. The left numbers are scheduling latency ranges in ms, the right numbers are histogram counters.

I didn’t yet try to apply that to my original “flaky test case” problem, I ran out of time today.

Conclusion

I only just dipped my toe in, and I’m already really amazed how powerful and simple Linux tracing has become! (Aside from my initial difficulties, which I’d still like to figure out). A huge thanks to Brendan Gregg and the Linux kernel community at large for this work!

If you have comments or questions, please put them on Twitter.