Hunted By a Leak - Part Two
In my previous post,
I talked about a slowdown in a Perl 6 process I was fixing, and how I
discovered that the cause was really a memory leak. Instead of looking for the
memory leak in Proc::Async, I decided to look in run, which also spawns
children and exhibited the leak, but works in a synchronous manner. To find
the Perl 6 code that was causing the leak, I wrote some code that would call
run repeatedly:
for ^100_000 { run('true'); }
…along with a script to monitor the RSS of a child program once a second:
#!/usr/bin/env perl

use strict;
use warnings;
use autodie;
use feature qw(say);

use Fcntl;
use File::Slurper qw(read_text);

my @command = @ARGV;

# Create the sentinel pipe before forking so that parent and child share it.
my ( $sentinel_read, $sentinel_write );
pipe $sentinel_read, $sentinel_write;
# The write end closes automatically once the child's exec succeeds.
fcntl($sentinel_write, F_SETFD, FD_CLOEXEC);

my $pid = fork();

if($pid) {
    close $sentinel_write;

    # Block until the child has exec'd; we then see EOF on the pipe.
    do {
        my $buffer;
        sysread $sentinel_read, $buffer, 1;
    };

    while(1) {
        # The second field of statm is the resident set size, in pages.
        my @statm = split /\s+/, read_text("/proc/$pid/statm");
        last if $statm[1] == 0; # will be 0 when the child has exited and
                                # is waiting for parent to ask for status
        say STDERR $statm[1];
        sleep 1;
    }
    waitpid $pid, 0;
} else {
    close $sentinel_read;
    exec @command;
    die "couldn't execute command";
}
So I started stripping away code from the body of run, until I ended up with the
following definition for run:
sub run(*@args ($, *@),
        :$in = '-', :$out = '-', :$err = '-',
        Bool :$bin, Bool :$chomp = True, Bool :$merge,
        Str:D :$enc = 'utf8', Str:D :$nl = "\n",
        :$cwd = $*CWD, :$env) {}
You may be put off by its complex signature, but don't let that distract you. The important part is that the subroutine has no body.
So wait… just calling a subroutine leaks memory? To confirm that this was true (it had better not be!), I tried calling a subroutine with no parameters in its signature; that stopped the leak. After playing around with the signature for a bit, I finally came across a pair of signatures, one that does not trigger the leak and one that does:
sub no-leak(*@args) {}
sub leak(*@args ($, *@)) {}
If you're not familiar with Perl 6's signatures, allow me to explain. The
no-leak subroutine above has a slurpy argument named @args; that is,
all extra arguments to the subroutine go into @args. leak has a slurpy
@args as well, but the difference is that leak's @args has what's
called a subsignature. A subsignature places constraints on what the shape
of the argument can be; in this case, ($, *@) just means that it must have
at least one value in it. So leak needs to take at least one argument.
Digging into the code that handles subsignatures, I discovered that
MVMCallCapture objects in MoarVM were creating new MVMCallsite structs,
but not freeing them when the GC comes calling 1). I naïvely free'd the callsites
in the GC handler, hoping that the solution would be that simple;
unfortunately, I immediately started seeing failures involving double
frees. I was not surprised to learn that call capture objects can share
their callsites with one another; however, it was clear that capture objects
that create a callsite always outlive the captures that take a reference to the
callsite. So I added a flag to tell captures whether they owned their
callsite or were just borrowing it. This plugged the leak! This made me
ecstatic, since I was leaving for my honeymoon in Japan the next day, and now I
would be able to have two weeks away from the code without having to worry. If
only life were so simple…
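The owner-versus-borrower idea described above can be sketched in C roughly as follows. This is only an illustration of the technique, not MoarVM's actual code: the `Callsite` and `Capture` structs, the `owns_callsite` field, and all of the function names are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a callsite; the real struct holds much more. */
typedef struct {
    int arg_count;
} Callsite;

/* Hypothetical stand-in for a call capture object. */
typedef struct {
    Callsite *callsite;
    int       owns_callsite; /* 1: created the callsite and must free it; 0: borrowed */
} Capture;

static int callsites_freed = 0; /* bookkeeping for the demo only */

/* Create a capture that allocates, and therefore owns, a fresh callsite. */
Capture capture_create(int arg_count) {
    Capture c;
    c.callsite = malloc(sizeof(Callsite));
    assert(c.callsite != NULL);
    c.callsite->arg_count = arg_count;
    c.owns_callsite = 1;
    return c;
}

/* Create a capture that shares another capture's callsite without owning it. */
Capture capture_borrow(const Capture *src) {
    Capture c;
    c.callsite = src->callsite;
    c.owns_callsite = 0;
    return c;
}

/* What a GC free handler might do: only the owner releases the callsite. */
void capture_free(Capture *c) {
    if (c->owns_callsite) {
        free(c->callsite);
        callsites_freed++;
    }
    c->callsite = NULL;
}

/* Frees a borrower, then its owner; returns how many callsites were freed. */
int demo_free_count(void) {
    Capture owner    = capture_create(2);
    Capture borrower = capture_borrow(&owner);
    callsites_freed = 0;
    capture_free(&borrower); /* borrower: leaves the callsite alone */
    capture_free(&owner);    /* owner: frees it exactly once */
    return callsites_freed;
}
```

Without the flag, both calls to `capture_free` would release the same pointer, which is the double free described above. The sketch relies on the same assumption as the text: the owner is freed after every borrower. If an owner were freed first, a borrower would be left holding a dangling pointer.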
For better or for worse, Nicholas Clark discovered a use-after-free bug in my fix by running the entire Perl 6 test suite (aka roast) under ASAN (Google's AddressSanitizer). Unfortunately, I had to leave for Japan a few hours later, so I did what I could: I reverted the change I made, started a branch with the fix restored as a reminder, and tried not to think about it during my two weeks in Japan.
In the next installment, I'll cover the nature of that use-after-free bug, and how I managed to fix that with the help of ASAN.
1) MVMCallCapture structs are used for signature checking and binding parameters to arguments, and this logic is reused for subsignature checking.