Hunted By a Leak - Part Two

In my previous post, I talked about a slowdown in a Perl 6 process I was fixing, and how I discovered that the cause was really a memory leak. Instead of looking for the memory leak in Proc::Async, I decided to look in run, which also spawns children and exhibited the leak, but works in a synchronous manner. To find the Perl 6 code that was causing the leak, I wrote some code that would call run repeatedly:

for ^100_000 {
    run('true');
}

...along with a script to monitor the RSS of a child program once a second:

#!/usr/bin/env perl

use strict;
use warnings;
use autodie;
use feature qw(say);

use Fcntl;
use File::Slurper qw(read_text);

my @command = @ARGV;

my $pid = fork();

my ( $sentinel_read, $sentinel_write );

pipe $sentinel_read, $sentinel_write;

fcntl($sentinel_write, F_SETFD, FD_CLOEXEC);

if($pid) {
    close $sentinel_write;
    do {
        my $buffer;
        sysread $sentinel_read, $buffer, 1;
    };

    while(1) {
        my @statm = split /\s+/, read_text("/proc/$pid/statm");
        last if $statm[1] == 0; # will be 0 when the child has exited and
                                # is waiting for parent to ask for status
        say STDERR $statm[1];
        sleep 1;
    }
    waitpid $pid, 0;
} else {
    close $sentinel_read;
    exec @command;
    die "couldn't execute command";
}

So I started stripping away code from the body of run, until I ended up with the following definition for run:

sub run(*@args ($, *@),
    :$in = '-',
    :$out = '-',
    :$err = '-',
    Bool :$bin,
    Bool :$chomp = True,
    Bool :$merge,
    Str:D :$enc = 'utf8',
    Str:D :$nl = "\n",
    :$cwd = $*CWD,
    :$env)
{}

You may be put off by its complex signature, but don't let that distract you. The important part is that the subroutine has no body.

So wait...just calling a subroutine leaks memory? To confirm this as true (it better not be!), I tried calling a subroutine with no arguments in its signature; that stopped the leak. After playing around with the signature for a bit, I finally came across a condition that would and would not trigger the leak:

sub no-leak(*@args) {}
sub leak(*@args ($, *@)) {}

If you're not familiar with Perl 6's signatures, allow me to explain. The no-leak subroutine above has a slurpy argument named @args; that is, all extra arguments to the subroutine go into @args. leak has a slurpy @args as well, but the difference is that leak's @args has what's called a subsignature. A subsignature places constraints on what the shape of the argument can be; in this case, ($, *@) just means that it must have at least one value in it. So leak needs to take at least one argument.

Digging into the code that handles subsignatures, I discovered that MVMCallCapture objects in MoarVM were creating new MVMCallsite structs, but not freeing them when the GC comes calling MVMCallCapture structs are used for signature checking and binding parameters to arguments, and this logic is reused for subsignature checking MVMCallCapture structs are used for signature checking and binding parameters to arguments, and this logic is reused for subsignature checking . I naïvely free'd the callsites in the GC handler, hoping that the solution would be that simple; unfortunately, I immediately started seeing failures involving double frees. I was not surprised to learn that call capture objects can share their callsites with one another; however, it was clear that capture objects that create a callsite always outlive the captures that take a reference to the callsite. So I managed to add a flag to tell captures whether they owned their callsite, or were just borrowing them. This plugged the leak! This made me ecstatic, since I was leaving for my honeymoon in Japan the next day, and now I would be able to have two weeks away from the code without having to worry. If only life were so simple...

For better or for worse, Nicholas Clark discovered a use-after-free bug in my fix by running the entire Perl 6 test suite (aka roast) under ASAN (Google's Address Sanitizer). Unfortunately, I had to leave for Japan a few hours later, so I did what I could: I reverted the change I made, started a branch with the fix restored as a reminder, and tried not to think about it during my two weeks in Japan.

In the next installment, I'll cover the nature of that use-after-free bug, and how I managed to fix that with the help of ASAN.

Published on 2015-12-15