bpftrace: needing to cast when filtering on negative numbers and why

A few months ago I was using bpftrace to implement a program to monitor writes to a specific directory, and while working on this program I encountered some noteworthy behavior around bpftrace and its filters that I thought I should share.

To monitor writes, my program was monitoring uses of the openat system call via tracepoint:syscalls:sys_enter_openat - specifically uses where the dfd parameter is set to AT_FDCWD. Since AT_FDCWD has the value -100 on Linux, my program looked something like this:

tracepoint:syscalls:sys_enter_openat /comm=="test-program" && args.dfd == -100/ {
  printf("openat(AT_FDCWD, ...) detected!\n");
}

...yet when I tested this out, I got no output - my filter did not seem to match! I did an strace test-program and confirmed that, yes, my test program was indeed calling openat(AT_FDCWD, ...) - so what's going on here?

I removed the args.dfd == -100 part of the filter and added some diagnostic printf calls - I noticed the value 4294967196 (which I'll refer to via its hex notation 0xffffff9c for the rest of the post) in the output, which just so happens to be what -100 looks like if you were to interpret its bit pattern as an unsigned 32-bit integer! So now I tried this:

tracepoint:syscalls:sys_enter_openat /comm=="test-program" && args.dfd == 0xffffff9c/ {
  printf("openat(AT_FDCWD, ...) detected!\n");
}

... and that worked! So did ((int32) args.dfd) == -100, so the tl;dr here is:

If you're filtering bpftrace probes based on negative numbers and it's not working, cast the argument to the signed integer type you'd expect
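As an aside, the equivalence between -100 and 0xffffff9c is easy to check outside of bpftrace. Here's a quick Go illustration (my own snippet, not part of the original script) of reinterpreting the same 32-bit pattern both ways:

```go
package main

import "fmt"

func main() {
	d := int32(-100) // AT_FDCWD
	// The same 32 bits, read as unsigned:
	fmt.Println(uint32(d))         // 4294967196
	fmt.Printf("%#x\n", uint32(d)) // 0xffffff9c
	// ...and back to signed:
	u := uint32(0xffffff9c)
	fmt.Println(int32(u)) // -100
}
```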

Now, I had a fix for my script, but I was left wanting - why doesn't the filter without the cast work? So I continued to dig.

The next thing I tried was a program with the args.dfd == -100 filter, but without the comm == "test-program" filter - sadly, I can't remember what inspired me to try this! But I started to see some output for this program, which further piqued my curiosity. So I wrote a new program to bucket hits by comm:

tracepoint:syscalls:sys_enter_openat /args.dfd == -100/ {
  @signed_hits[comm] = count();
}

tracepoint:syscalls:sys_enter_openat /args.dfd == 0xffffff9c/ {
  @unsigned_hits[comm] = count();
}

// exit after gathering results for one minute - bpftrace will automatically dump `@signed_hits` and `@unsigned_hits` upon exit
interval:s:60 {
  exit();
}

...and I noticed that the programs showing up in @signed_hits were all written in Go! And - interestingly enough - Prometheus showed up in both @signed_hits and @unsigned_hits.

Digging into that, I created a bpftrace program to bucket Prometheus' openat calls based on args.filename, and discovered that the "unsigned" value for args.dfd was being used when opening /etc/hosts. Now, if you're unfamiliar with Go, you might not know that Go is mostly implemented in itself, but for some things - like networking - it can use facilities from your C library. If you want to completely remove your program's dependency on libc, you can use the pure Go networking implementation by building your program with go build -tags netgo - and you can tell Go whether to use its own resolver or the C library's via GODEBUG=netdns=go or GODEBUG=netdns=cgo, respectively. So I wrote a dumb little demo Go program:

package main

import (
    "fmt"
    "net"
)

func main() {
    _, err := net.Dial("tcp", "example.com:80")
    if err != nil {
        fmt.Printf("failed to connect: %v\n", err)
    }
}

...and observed my bpftrace program's output when I ran my Go program with GODEBUG=netdns=go and GODEBUG=netdns=cgo - sure enough, the cgo runs ended up in the 0xffffff9c bucket, and the go runs ended up in the -100 bucket.

So why the difference here? What appears to be happening is that Go programs invoke the openat system call with args.dfd as 0xffff ffff ffff ff9c, while glibc-based programs use 0x0000 0000 ffff ff9c instead!
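The two register values can be reproduced in a few lines of Go (purely an illustration of the integer conversions, nothing syscall-specific): sign-extending the 32-bit value -100 to 64 bits yields the former, while zero-extending its bit pattern yields the latter.

```go
package main

import "fmt"

func main() {
	d := int32(-100) // AT_FDCWD as a 32-bit int
	// Sign-extend to 64 bits: the full-register value Go appears to pass.
	fmt.Printf("0x%016x\n", uint64(int64(d))) // 0xffffffffffffff9c
	// Zero-extend the same bit pattern: what's left in the register if
	// only the bottom 32 bits are written.
	fmt.Printf("0x%016x\n", uint64(uint32(d))) // 0x00000000ffffff9c
}
```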

If you're unfamiliar with how userspace programs make syscalls, or if you need a refresher, here's a high-level overview: the program places the system call number and its arguments in registers, then invokes a special processor instruction (INT 0x80, SYSENTER, or SYSCALL) to transfer control to the kernel, which examines those registers and acts on the userspace program's behalf. (I'm glossing over many of the details - here is a really good article with more!)

The Linux x86_64 ABI says that the first integer argument - dfd in this case - should be passed in via the RDI register. So I wrote a little Go program that calls openat(AT_FDCWD, ..):

package main

import "os"

func main() {
    os.Open("/dev/null") // this happens to use openat under the hood
}

...and an analogous one in C:

#include <fcntl.h>

int
main(void)
{
    openat(AT_FDCWD, "/dev/null", 0);
    return 0;
}

...and ran these both under GDB, which I set up to print the value of the RDI register whenever the openat system call was called:

(gdb) display $rdi
(gdb) catch syscall openat

With this setup, I only ever saw $rdi = 4294967196 when the catchpoint was hit for the C program - but when running the Go program, I first saw $rdi = 4294967196 while Go was doing some sort of hugepage detection at startup, and then I saw $rdi = -100!

The reason for this seems to be the following: man 2 openat declares dfd as an int, and bpftrace -lv tracepoint:syscalls:sys_enter_openat lists dfd as an int as well. But a C int is only 32 bits on x86_64 Linux, while bpftrace works with 64-bit integers! So when Go makes a system call, it sets all 64 bits of RDI, but glibc only sets the bottom 32 bits of that 64-bit register (a 32-bit register write on x86_64 zeroes the upper half), and bpftrace gets mixed up by glibc's behavior.
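Modeling bpftrace's comparison in Go - a sketch under the assumption that args.dfd is widened to a 64-bit integer before comparing - shows both why the original filter missed and why the cast fixes it:

```go
package main

import "fmt"

func main() {
	// Register contents after a glibc-style 32-bit write: the top
	// half is zero.
	glibcReg := int64(0x00000000ffffff9c)
	// Register contents after Go's full-width, sign-extended write.
	goReg := int64(-100)

	fmt.Println(glibcReg == -100)        // false: the filter misses
	fmt.Println(goReg == -100)           // true
	fmt.Println(int32(glibcReg) == -100) // true: the (int32) cast fix
}
```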

Published on 2024-07-02