Finding the most common n-grams in Russian using Perl 6 and HabraHabr

I've been teaching myself Russian for some time; in fact, I would probably be a lot better at it if I spent time actually learning Russian instead of thinking of ways of hacking my language learning process…which is exactly what we'll be doing here.


Since most of my communications in Russian are text-based, I would really like to increase my typing speed. I figured that if I could train my muscle memory to type common patterns, this would help do just that. We can do this by finding the most common n-grams in the Russian language. Fortunately, this is easy to do with the power of Perl!

Getting articles from HabraHabr

HabraHabr is a Russian tech blog site, and should serve as a good corpus of data. So let's write some shell code to get the words used in the top twenty pages:

touch habrahabr-links # This is necessary if you have noclobber on like I do
for i in {1..20}; do
  mojo get$i/ 'a.post_title' attr href >> habrahabr-links
  sleep 10 # be a good netizen
done

touch habrahabr-words
for link in $(cat habrahabr-links); do
  mojo get $link | get-html-body | perl6 -e '
    my @words = slurp.words;
    for @words -> $word {
      next unless $word ~~ /<:Cyrillic>/;
      say $word.subst(/<-:Cyrillic>+$/, q{}).subst(/^ <-:Cyrillic>+/, q{});
    }' >> habrahabr-words
  sleep 10 # be a good netizen
done

This is pretty straightforward shell code, but I want to go over some of the specifics:
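The Cyrillic clean-up inside the loop can be tried in isolation. Here is a minimal sketch, with made-up sample tokens and a helper named `strip-non-cyrillic` that does not appear in the original pipeline:

```perl6
use v6;

# Hypothetical helper: trim any run of non-Cyrillic characters
# from the end and from the start of a token.
sub strip-non-cyrillic(Str $word) {
    $word.subst(/<-:Cyrillic>+ $/, "").subst(/^ <-:Cyrillic>+/, "")
}

# Sample tokens invented for illustration.
my @tokens = "Привет,", "(мир)", "hello", "«Хабр».";

for @tokens -> $word {
    # Skip tokens that contain no Cyrillic characters at all.
    next unless $word ~~ /<:Cyrillic>/;
    say strip-non-cyrillic($word);
}
```

With these tokens, "hello" is skipped and the remaining three are printed without their surrounding punctuation.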

Extracting n-grams

Now that we have our word list, let's extract our n-grams from it. We'll deal with n-grams for n up to 5; I figured this should be long enough. We'll even extract single letters:

for n in {1..5}; do
  perl6 -e '
    multi ngrams(Str $word, Int :$n where * == 1) {
        $word.comb  # for n = 1, the n-grams are just the individual characters
    }

    multi ngrams(Str $word, Int :$n) {
        return [] if $word.chars < $n;
        my @chars = $word.comb(/./);
        gather for ^(@chars - $n + 1) {
            take @chars.rotate($_)[^$n].join(q{})
        }
    }

    my $n = +@*ARGS.shift;
    for lines() -> $word {
        .say for ngrams($word, :$n).grep(/^ <:Cyrillic>+ $/)
    }
' $n habrahabr-words  >| habrahabr-${n}grams
done

One Perl 6 feature I would like to point out here is multi subs. You can see that the ngrams sub has two definitions with slightly different signatures; the where * == 1 on the top means that it will only be called if :n(1) is passed in. Otherwise, the more generic variant is called. This allows us to create more specialized variants of a sub for special values, or to break up the logic for an algorithm into nice, discrete chunks.
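To see the dispatch in action outside the shell loop, here is a standalone sketch. It is not the script above: the sub is renamed `ngrams-of`, it takes `$n` as a positional rather than a named parameter, and it slices the character array instead of rotating it. The sample words are arbitrary.

```perl6
use v6;

# Constrained candidate: chosen only when $n is exactly 1.
multi ngrams-of(Str $word, Int $n where * == 1) {
    $word.comb
}

# General candidate: slide a window of $n characters across the word.
multi ngrams-of(Str $word, Int $n) {
    my @chars = $word.comb;
    return () if @chars < $n;
    (^(@chars - $n + 1)).map({ @chars[$_ ..^ $_ + $n].join })
}

say ngrams-of("привет", 1).join(" ");   # п р и в е т
say ngrams-of("привет", 2).join(" ");   # пр ри ив ве ет
say ngrams-of("да", 3).elems;           # 0
```

The last call shows the guard for words shorter than the window: no n-grams are produced.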

Finding the most common n-grams

Since there are 33ⁿ possible n-grams (the Russian alphabet has 33 letters), which is over 39 million for n = 5, I only want to focus on the 100 or so most frequent ones. Let's use a Perl 6 program to find the percentage breakdown of each n-gram:

use v6;

my @occurrences =
    gather for lines() -> $line {
        take (+$0, ~$1) if $line ~~ /^ \s* (\d+) \s+ (.*) $/;
    }

my $total = [+] @occurrences».[0];

for @occurrences -> ($count, $text) {
    printf("%.3f %s\n", $count * 100 / $total, $text);
}

The notable Perl 6 feature here is gather/take; this creates a generator that allows you to avoid unnecessary intermediate variables. Other noteworthy features are the reduce metaoperator ([…]), which takes an operator and creates a list-reducing operator out of it (so in this code, [+] means “sum”), and the hyper metaoperator », which creates a vectorized version of an operator.
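These three features can be seen side by side on a tiny input. The sketch below uses made-up counts and letters in place of the real `uniq -c` output, and the `@counts` and `@percentages` variables are introduced only for illustration:

```perl6
use v6;

# gather/take builds the list without declaring an accumulator first.
# The three sample lines are invented.
my @occurrences = gather for "3 а", "5 о", "2 е" -> $line {
    take (+$0, ~$1) if $line ~~ /^ \s* (\d+) \s+ (.*) $/;
}

# » applies .[0] to every element, yielding just the counts.
my @counts = @occurrences».[0];

# [+] folds the + operator over the list, i.e. it sums the counts.
my $total = [+] @counts;

# Hyper operators also work between a list and a single value.
my @percentages = @counts »*» 100 »/» $total;

say $total;                  # 10
say @percentages.join(" ");  # 30 50 20
```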

Now we can round it out with some shell-fu to finish the job:

sort /tmp/habrahabr-1grams | uniq -c | sort -n | perl6 /tmp/

Here's what it looks like for the various Russian characters:

  0.039 ъ
  0.131 ё
  0.323 щ
  0.349 э
  0.540 ш
  0.540 ц
  0.547 ф
  0.571 ю
  0.866 ж
  0.869 х
  1.207 й
  1.292 г
  1.377 ч
  1.773 з
  1.831 б
  1.905 ь
  1.934 ы
  1.982 я
  2.192 у
  2.728 д
  3.206 п
  3.333 м
  3.503 к
  3.961 л
  4.222 в
  5.024 с
  5.424 р
  6.325 н
  7.015 т
  7.576 и
  8.361 а
  8.586 е
  10.469 о

Compared to the frequency table on Wikipedia, it looks pretty close! There are some biases, however; I found that “февра”, as in февраль, the Russian word for “February”, occurred quite a bit, but these biases are probably small enough not to matter for my purposes. Speaking of my purposes, now that I have this data, what should I do with it? That, my dear reader, is a story for next week…