Finding the most common n-grams in Russian using Perl 6 and HabraHabr

I've been teaching myself Russian for some time; in fact, I would probably be a lot better at it if I spent time actually learning Russian instead of thinking of ways of hacking my language learning process…which is exactly what we'll be doing here.

¯\_(ツ)_/¯

Since most of my communications in Russian are text-based, I would really like to increase my typing speed. I figured that if I could train my muscle memory to type common patterns, this would help do just that. We can do this by finding the most common n-grams in the Russian language. Fortunately, this is easy to do with the power of Perl!

Getting articles from HabraHabr

HabraHabr is a Russian tech blog site, and should serve as a good corpus of data. So let's write some shell code to get the words used in the top twenty pages:

touch habrahabr-links # This is necessary if you have noclobber on like I do
for i in {1..20}; do
  mojo get http://habrahabr.ru/page$i/ 'a.post_title' attr href >> habrahabr-links
  sleep 10 # be a good netizen
done

touch habrahabr-words
for link in $(cat habrahabr-links); do
  mojo get $link | get-html-body | perl6 -e '
    my @words = slurp.words;
    for @words -> $word {
      next unless $word ~~ /<:Cyrillic>/;
      say $word.subst(/<-:Cyrillic>+$/, q{}).subst(/^ <-:Cyrillic>+/, q{});
    }' >> habrahabr-words
  sleep 10 # be a good netizen
done

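This is pretty straightforward shell code, but I want to go over some of the specifics. mojo get is the command-line client that ships with Mojolicious; given a URL and a CSS selector, it prints the matching elements, and attr href narrows that down to each link's href attribute. get-html-body is a helper script that isn't shown here. The Perl 6 one-liner keeps only the words that contain at least one Cyrillic character (<:Cyrillic> matches on the Unicode property), then strips any leading or trailing run of non-Cyrillic characters (<-:Cyrillic> is the negation), which gets rid of punctuation stuck to the word.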

Extracting n-grams

Now that we have our word list, let's extract our n-grams from it. We'll deal with n-grams for n up to 5; I figured this should be long enough. We'll even extract single letters:

for n in {1..5}; do
  perl6 -e '
    multi ngrams(Str $word, Int :$n where * == 1) {
        $word.comb(/./)
    }

    multi ngrams(Str $word, Int :$n) {
        return [] if $word.chars < $n;
        my @chars = $word.comb(/./);
        gather for ^(@chars - $n + 1) {
            # q{} is an empty string that is safe inside the shell single quotes
            take @chars.rotate($_)[^$n].join(q{})
        }
    }

    my $n = +@*ARGS.shift;
    for lines() -> $word {
        .say for ngrams($word.lc(), :$n).grep(/^ <:Cyrillic>+ $/)
    }
' $n habrahabr-words  >| habrahabr-${n}grams
done

One Perl 6 feature I would like to point out here is multi subs. You can see that the ngrams sub has two definitions with slightly different signatures; the where * == 1 on the top means that it will only be called if :n(1) is passed in. Otherwise, the more generic variant is called. This allows us to create more specialized variants of a sub for special values, or to break up the logic for an algorithm into nice, discrete chunks.
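
If the rotate-and-slice trick in the generic variant looks opaque, the underlying idea is just a sliding window of width n. Here is a rough shell sketch of that window using awk. The ngrams function name is my own invention for illustration, and I'm assuming ASCII input, because how awk counts multibyte characters (such as Cyrillic) depends on the implementation and locale:

```shell
# Hypothetical helper: print every n-gram of an ASCII word, one per line.
# $1 = word, $2 = n
ngrams() {
  awk -v word="$1" -v n="$2" 'BEGIN {
    # Slide a window of width n across the word. If the word is shorter
    # than n, the upper bound is below 1 and nothing is printed.
    for (i = 1; i <= length(word) - n + 1; i++)
      print substr(word, i, n)
  }'
}

ngrams hello 3
```

For n = 1 the same loop simply prints each character, so the sketch doesn't need a special case the way the Perl 6 version has one.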

Finding the most common n-grams

Since there are up to 33ⁿ possible n-grams (the Russian alphabet has 33 letters), I probably only want to focus on the most frequent 100 or so. Let's use a Perl 6 program to find the percentage breakdown of each n-gram:

percentages.pl
use v6;

my @occurrences =
    gather for lines() -> $line {
        take (+$0, ~$1) if $line ~~ /^ \s* (\d+) \s+ (.*) $/;
    };

my $total = [+] @occurrences».[0];

for @occurrences -> ($count, $text) {
    printf("%.3f %s\n", $count * 100 / $total, $text);
}

The notable Perl 6 feature here is gather/take; this creates a generator that allows you to avoid unnecessary intermediate variables. Other noteworthy features are the reduce metaoperator ([…]), which takes an operator and creates a list-reducing operator out of it (so in this code, [+] means “sum”), and the hyper metaoperator », which creates a vectorized version of an operator.
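
For comparison, the same computation can be sketched in plain shell with awk, where a running total plays the role of [+]. This is only an approximation of percentages.pl: the percentages function name is hypothetical, and unlike the (.*) capture in the Perl 6 version it keeps just the first whitespace-separated field after the count, which is enough for n-grams since they contain no spaces:

```shell
# Hypothetical awk stand-in for percentages.pl.
# Reads "count text" lines (the format uniq -c produces) on stdin.
percentages() {
  LC_ALL=C awk '
    # Remember every line and accumulate the sum of all counts.
    { count[NR] = $1; text[NR] = $2; total += $1 }
    # With the total known, print each entry as a percentage of it.
    END {
      for (i = 1; i <= NR; i++)
        printf "%.3f %s\n", count[i] * 100 / total, text[i]
    }'
}

printf '      1 b\n      3 a\n' | percentages
```

The sample input mimics two lines of uniq -c output, and the percentages it prints add up to 100.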

Now we can round it out with some shell-fu to finish the job:

sort /tmp/habrahabr-1grams | uniq -c | sort -n | perl6 /tmp/percentages.pl

Here's what it looks like for the various Russian characters:

  0.039 ъ
  0.131 ё
  0.323 щ
  0.349 э
  0.540 ш
  0.540 ц
  0.547 ф
  0.571 ю
  0.866 ж
  0.869 х
  1.207 й
  1.292 г
  1.377 ч
  1.773 з
  1.831 б
  1.905 ь
  1.934 ы
  1.982 я
  2.192 у
  2.728 д
  3.206 п
  3.333 м
  3.503 к
  3.961 л
  4.222 в
  5.024 с
  5.424 р
  6.325 н
  7.015 т
  7.576 и
  8.361 а
  8.586 е
  10.469 о

Compared to the frequency table on Wikipedia, it looks pretty close! There are some biases, however; I found that “февра”, as in февраль, the Russian word for “February”, occurred quite a bit, but these biases are probably small enough not to matter for my purposes. Speaking of my purposes, now that I have this data, what should I do with it? That, my dear reader, is a story for next week…