Finding the most common n-grams in Russian using Perl 6 and HabraHabr
I've been teaching myself Russian for some time; in fact, I would probably be a lot better at it if I spent time actually learning Russian instead of thinking of ways of hacking my language learning process...which is exactly what we'll be doing here.
¯\_(ツ)_/¯
Since most of my communications in Russian are text-based, I would really like to increase my typing speed. I figured that if I could train my muscle memory to type common patterns, this would help do just that. We can do this by finding the most common n-grams in the Russian language. Fortunately, this is easy to do with the power of Perl!
Getting articles from HabraHabr
HabraHabr is a Russian tech blog site, and should serve as a good corpus of data. So let's write some shell code to get the words used in the top twenty pages:
touch habrahabr-links # This is necessary if you have noclobber on like I do
for i in {1..20}; do
    mojo get http://habrahabr.ru/page$i/ 'a.post_title' attr href >> habrahabr-links
    sleep 10 # be a good netizen
done
touch habrahabr-words
for link in $(cat habrahabr-links); do
    mojo get "$link" | get-html-body | perl6 -e '
        my @words = slurp.words;
        for @words -> $word {
            # keep only words that contain at least one Cyrillic character
            next unless $word ~~ /<:Cyrillic>/;
            # trim non-Cyrillic characters (punctuation etc.) from both ends
            say $word.subst(/<-:Cyrillic>+ $/, q{}).subst(/^ <-:Cyrillic>+/, q{});
        }' >> habrahabr-words
    sleep 10 # be a good netizen
done
This is pretty straightforward shell code, but I want to go over some of the specifics:
- The mojo command comes from Mojolicious, a web development framework which includes various utilities such as the fabulous mojo get. It allows us to download a page and extract exactly what we need via CSS3 selectors - perfect for scraping!
- The get-html-body command is a simple Perl script I wrote using Mojo::DOM, also from the Mojolicious framework. It simply reads in HTML, strips out any tags, and prints what's left.
- There's a multi-line Perl 6 oneliner here (what a fantastic oxymoron). Perl 6 preserves its heritage of the powerful oneliner!
- The Perl 6 code should be pretty straightforward; the <:Cyrillic> regex syntax may be unfamiliar, however. Perl 6 is fully Unicode-aware, and its regexes allow you to filter for Unicode properties, which is exactly what we're doing here!
Extracting n-grams
Now that we have our word list, let's extract our n-grams from it. We'll deal with n-grams for n up to 5; I figured this should be long enough. We'll even extract single letters:
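for n in {1..5}; do
    perl6 -e '
        # special case: 1-grams are just the individual characters
        multi ngrams(Str $word, Int :$n where * == 1) {
            $word.comb(/./)
        }
        # general case: slide a window of $n characters across the word
        multi ngrams(Str $word, Int :$n) {
            return [] if $word.chars < $n;
            my @chars = $word.comb(/./);
            gather for ^(@chars - $n + 1) {
                # .join defaults to an empty separator; this avoids nesting
                # single quotes inside the single-quoted shell string
                take @chars.rotate($_)[^$n].join
            }
        }
        my $n = +@*ARGS.shift;
        for lines() -> $word {
            .say for ngrams($word.lc(), :$n).grep(/^ <:Cyrillic>+ $/)
        }
    ' $n habrahabr-words >| habrahabr-${n}grams
done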
One Perl 6 feature I would like to point out here is multi subs. You can see that the ngrams sub has two definitions with slightly different signatures; the where * == 1 on the first one means that it will only be called if :n(1) is passed in. Otherwise, the more generic variant is called. This allows us to create more specialized variants of a sub for special values, or to break up the logic for an algorithm into nice, discrete chunks.
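To make the sliding window in the generic ngrams variant concrete, here is a minimal sketch of the same idea in plain shell with awk. It is not the code used for this post: the ngrams shell function is a name made up for illustration, and the sample words are ASCII on the assumption that not every awk counts multi-byte characters correctly.

```shell
# Print every n-gram of each input word, one per line (sliding window).
# ASSUMPTION: ASCII input; awk implementations differ in how they
# count multi-byte characters such as Cyrillic letters.
ngrams() {
    awk -v n="$1" '{
        # A word shorter than n yields nothing: the loop body never runs.
        for (i = 1; i + n - 1 <= length($0); i++)
            print substr($0, i, n)
    }'
}

printf 'hello\n' | ngrams 2    # prints he, el, ll, lo (one per line)
printf 'hi\n'    | ngrams 3    # prints nothing
```

A word of length k produces k - n + 1 n-grams, which is the same count the ^(@chars - $n + 1) range walks over in the Perl 6 version.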
Finding the most common n-grams
Since there are up to 33ⁿ possible n-grams (the Russian alphabet has 33 letters), I probably only want to focus on the most frequent 100 or so. Let's use a Perl 6 program to find the percentage breakdown of each n-gram, given lines that consist of a count followed by the n-gram:
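use v6;
# each input line is a count followed by an n-gram; collect (count, text) pairs
my @occurrences =
    gather for lines() -> $line {
        take (+$0, ~$1) if $line ~~ /^ \s* (\d+) \s+ (.*) $/;
    };
# sum the counts (element 0 of every pair)
my $total = [+] @occurrences».[0];
for @occurrences -> ($count, $text) {
    printf("%.3f %s\n", $count * 100 / $total, $text);
}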
The notable Perl 6 feature here is gather/take; this creates a generator that allows you to avoid unnecessary intermediate variables. Other noteworthy features are the reduce metaoperator ([...]), which takes an operator and creates a list-reducing operator out of it (so in this code, [+] means "sum"), and the hyper metaoperator » (which can also be written >>), which creates a vectorized version of an operator.
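For readers without Perl 6 at hand, the same two-step calculation (sum all the counts, then divide each count by the total) can be sketched in awk. This is a stand-in for illustration, not the script used in this post; the percentages function name and the sample input are made up, and unlike the Perl 6 version it assumes every input line is well-formed.

```shell
# Turn "COUNT TEXT" lines (as produced by uniq -c) into percentages of the total.
# ASSUMPTION: every line has exactly a count and one whitespace-free n-gram.
percentages() {
    LC_ALL=C awk '
        # remember each row and keep a running total of the counts
        { count[NR] = $1; text[NR] = $2; total += $1 }
        END {
            # second step: print each count as a share of the total
            for (i = 1; i <= NR; i++)
                printf "%.3f %s\n", count[i] * 100 / total, text[i]
        }'
}

printf '  1 c\n  1 b\n  2 a\n' | percentages
# prints:
# 25.000 c
# 25.000 b
# 50.000 a
```

The arrays play the role of @occurrences, and the running total stands in for the [+] reduction.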
Now we can round it out with some shell-fu to finish the job:
sort /tmp/habrahabr-1grams | uniq -c | sort -n | perl6 /tmp/percentages.pl
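To see what the counting half of that pipeline hands to the percentages script, here is a small sketch run on a made-up sample instead of the real n-gram files. The count_ngrams name is hypothetical, and forcing LC_ALL=C is my own assumption to keep the grouping byte-based and independent of the locale.

```shell
# Count how often each input line occurs, least frequent first.
# ASSUMPTION: LC_ALL=C makes sort and uniq compare raw bytes, which is
# enough to group identical lines regardless of the active locale.
count_ngrams() {
    LC_ALL=C sort | LC_ALL=C uniq -c | LC_ALL=C sort -n
}

# Hypothetical sample: five 1-grams, one per line.
printf 'о\nа\nо\nе\nо\n' | count_ngrams
# prints three lines: count 1 for а, count 1 for е, count 3 for о
# (the amount of padding before each count depends on the uniq implementation)
```

Because sort -n is ascending, the most frequent n-grams end up at the bottom of the list, so appending tail -n 100 would be one way to keep only the 100 or so most common ones.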
Here's what the result looks like for the individual Russian letters (the 1-grams); the first column is each letter's share of the total, as a percentage:
0.039 ъ
0.131 ё
0.323 щ
0.349 э
0.540 ш
0.540 ц
0.547 ф
0.571 ю
0.866 ж
0.869 х
1.207 й
1.292 г
1.377 ч
1.773 з
1.831 б
1.905 ь
1.934 ы
1.982 я
2.192 у
2.728 д
3.206 п
3.333 м
3.503 к
3.961 л
4.222 в
5.024 с
5.424 р
6.325 н
7.015 т
7.576 и
8.361 а
8.586 е
10.469 о
Compared to the frequency table on Wikipedia, it looks pretty close! There are some biases, however; I found that "февра", as in февраль, the Russian word for "February", occurred quite a bit, but these biases are probably small enough not to matter for my purposes. Speaking of my purposes, now that I have this data, what should I do with it? That, my dear reader, is a story for next week...
Published on 2016-03-05