bpftrace: needing to cast when filtering on negative numbers and why
A few months ago I was using bpftrace to implement a program to monitor writes to a specific directory, and while working on this program I encountered some noteworthy behavior around bpftrace and its filters that I thought I should share.
To monitor writes, my program watched uses of the openat system call via tracepoint:syscalls:sys_enter_openat - specifically uses where the dfd parameter is set to AT_FDCWD. Since AT_FDCWD has the value -100 on Linux, my program looked something like this:
tracepoint:syscalls:sys_enter_openat /comm=="test-program" && args.dfd == -100/ {
  printf("openat(AT_FDCWD, ...) detected!\n");
}
...yet when I tested this out, I got no output - my filter did not seem to match! I did an strace test-program and confirmed that, yes, my test program was indeed calling openat(AT_FDCWD, ...) - so what's going on here?
I removed the args.dfd == -100 part of the filter and added some diagnostic printf calls - I noticed the value 4294967196 (which I'll refer to via its hex notation 0xffffff9c for the rest of the post) in the output, which just so happens to be what -100 looks like if you were to interpret its bit pattern as an unsigned 32-bit integer! So now I tried this:
tracepoint:syscalls:sys_enter_openat /comm=="test-program" && args.dfd == 0xffffff9c/ {
  printf("openat(AT_FDCWD, ...) detected!\n");
}
... and that worked! So did ((int32) args.dfd) == -100, so the tl;dr here is:

If you're filtering bpftrace probes based on negative numbers and it's not working, cast the argument to the signed integer type you'd expect
Now, I had a fix for my script, but I was left wanting - why doesn't the filter without the cast work? So I continued to dig.
The next thing I tried was a program with the args.dfd == -100 filter, but not the comm == "test-program" filter - sadly, I can't remember what inspired me to try this! But I started to see some output for this program, which further piqued my curiosity. So I wrote a new program to bucket hits by comm:
tracepoint:syscalls:sys_enter_openat /args.dfd == -100/ {
  @signed_hits[comm] = count();
}

tracepoint:syscalls:sys_enter_openat /args.dfd == 0xffffff9c/ {
  @unsigned_hits[comm] = count();
}

// exit after gathering results for one minute - bpftrace will automatically dump `@signed_hits` and `@unsigned_hits` upon exit
interval:s:60 {
  exit();
}
...and I noticed that the programs showing up in @signed_hits were all written in Go! And - interestingly enough - Prometheus showed up in both @signed_hits as well as @unsigned_hits.
Digging into that, I created a bpftrace program to bucket Prometheus' openat calls based on args.filename, and discovered the "unsigned" value for args.dfd was used for opening /etc/hosts. Now, if you're unfamiliar with Go, you might not know that Go is mostly implemented in itself, but for some things like networking, it can use facilities from your C library. If you want to completely remove any dependency your program has on libc, you can use a pure Go networking implementation by building your program with go build -tags netgo - you can also tell Go whether to use its own resolver or the C library's via GODEBUG=netdns=go or GODEBUG=netdns=cgo, respectively. So I wrote a dumb little demo Go program:
package main

import (
	"fmt"
	"net"
)

func main() {
	_, err := net.Dial("tcp", "example.com:80")
	if err != nil {
		fmt.Printf("failed to connect: %v\n", err)
	}
}
...and observed my bpftrace program's output when I ran my Go program with GODEBUG=netdns=go and GODEBUG=netdns=cgo - sure enough, the cgo runs ended up in the 0xffffff9c bucket, and the go runs ended up in the -100 bucket.
So why the difference here? What appears to be happening is that Go programs invoke the openat system call with args.dfd as 0xffff ffff ffff ff9c, but glibc-based programs use 0x0000 0000 ffff ff9c instead!
If you're unfamiliar with how userspace programs make syscalls, or if you need a refresher, here's a high-level overview: the program sets some registers, and then invokes a special processor instruction (INT 0x80, SYSENTER, or SYSCALL) to transfer control to the kernel, which examines the registers and takes action on the userspace program's behalf. (I'm glossing over many of the details - here is a really good article with more!)
The Linux x86_64 ABI says that the first integer argument - dfd in this case - should be passed in via the RDI register. So I wrote a little Go program that calls openat(AT_FDCWD, ...):
package main

import "os"

func main() {
	os.Open("/dev/null") // this happens to use openat under the hood
}
...and an analogous one in C:
#include <fcntl.h>

int
main(void)
{
	openat(AT_FDCWD, "/dev/null", 0);
	return 0;
}
...and ran these both under GDB, which I set up to print the value of the RDI register whenever the openat system call was called:
(gdb) display $rdi
(gdb) catch syscall openat
With this setup, I only ever saw $rdi = 4294967196 when the catchpoint was hit for the C program - but when running the Go program, I first saw $rdi = 4294967196 while Go was doing some sort of hugepage detection at its startup, and then I saw $rdi = -100!
The reason for this seems to be that if you look at man 2 openat, the signature for that function uses int for dfd, and if you look at bpftrace -lv tracepoint:syscalls:sys_enter_openat, dfd is listed as an int as well. But in C, an int is 32 bits on x86_64 Linux, while in bpftrace, ints are always 64 bits! So it seems that when Go makes a system call, it sign-extends the argument and sets all 64 bits of RDI, but glibc passes dfd as a 32-bit int - and on x86_64, writing to the lower 32 bits of a register zero-extends into the upper 32 bits. bpftrace then compares all 64 bits of the argument against the sign-extended literal -100, so the comparison fails for glibc's zero-extended value.