Tracking Slowdowns in Creating Child Processes - Part One

When I use Perl 5, I do a lot of shelling out to external commands; after all, that's what Perl is good at! So it should come as no surprise that I do the same with Perl 6, which means a lot of the time and effort I spend helping with Perl 6 is improving the interaction with external commands.

While at the Swiss Perl Workshop back in August, I did a bit of work chasing down a race condition in Perl 6's Proc::Async, which is Perl 6's higher-level asynchronous API for handling child processes. While performing a test I wrote repeatedly to verify that I had fixed the race condition, I noticed that individual invocations of child processes started to feel slower. I added some timing code, and sure enough; over thousands of invocations, Proc::Async would gradually slow down. Here's the code I wrote that demonstrated the slowdown:

loop {
    my $start = now;
    my $child ='true');
    await $child.start;
    say now - $start;

Why the slowdown?

To find the source of the slowdown, I sprinkled some printf statements over MoarVM's 1) asyncronous process API, which drives Proc::Async. My initial assumptions were that constant creation and destruction of objects was causing heap fragmentation, and thus malloc was just taking longer to find free chunks of memory. However, the printfs quickly brought me down to libuv 2), and diving into libuv quickly brought me to the true culprit: fork. This surprised me, as it turns out that even though fork will save time copying memory by using copy-on-write pages in most kernels, on Linux, the time to successfully fork a process increases as the parent process grows in RSS, as demonstrated by this program and a plotting of its output:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define NUM_GROWTHS 100
#define FORK_COUNT 10000

static double
timespec_to_nanos(struct timespec *ts)
    double nanos = ts->tv_sec * 1e9;
    nanos += ts->tv_nsec;

    return nanos;

static double
delta_timespec_ms(struct timespec *before, struct timespec *after)
    return (timespec_to_nanos(after) - timespec_to_nanos(before)) / 1e6;

static void
    struct timespec before, after;
    int i;

    assert(clock_gettime(CLOCK_MONOTONIC, &before) == 0);
    for(i = 0; i < FORK_COUNT; i++) {
        pid_t child;
        int status;

        child = fork();

        assert(child != -1);

        if(child) {
            waitpid(child, &status, 0);
        } else {
    assert(clock_gettime(CLOCK_MONOTONIC, &after) == 0);

    printf("%.5f\n", delta_timespec_ms(&before, &after) / FORK_COUNT);
    fflush(stdout); /* flush so that the child process doesn't get a copy of
                       buffered output */

    int i;

    for(i = 0; i < NUM_GROWTHS; i++) {
        assert(malloc(1024 * 1024));

    return 0;

I haven't dug into the kernel source code yet to figure out exactly why this is happening, but that would make an interesting future blog post! So now, in order to fix the slowdown, I would first have to hunt down a memory leak, which I'll cover in the next post.

  1. 1) The flagship virtual machine for Rakudo, which is the canonical Perl 6 compiler
  2. 2) MoarVM uses libuv as an OS abstraction library