N-grams, from Wikipedia:
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus.
Today we're going to look at hundreds of different n-gram distributions from various languages! But first,
N-grams are used all over the place! More generally, probabilistic language models are used in speech recognition, machine translation, and, among many other things, language detection.
Although there are more accurate and sophisticated methods for language detection, it is really amazing how far simple character frequencies can get us. In part two of this blog post, we'll look at actually doing the detection via machine learning. For now, we'll just visualize.
Character frequency analysis actually has a rich history, dating back to the 1950s. Claude Shannon, the father of information theory, published a very famous paper titled "Prediction and Entropy of Printed English" that is impossible not to mention here. Two other particularly seminal papers in the field of computational language identification include:
- N-Gram-Based Text Categorization - Cavnar and Trenkle (1994)
- Statistical Identification of Language - Dunning (1994)
And if you'd really like to get up to speed on what is happening, there is a great survey in Pethő & Mózes (2014). Alright, motivation aside, let's talk about the data.
I decided to do this little analysis on the Universal Declaration of Human Rights. A trivially small corpus indeed, but conveniently, one that has already been translated into more than 460 languages! We'll look at around 260 of those languages. Where do we get the data, you ask? Conveniently, the NLTK folx have already pulled it all together for us! You can download it easily with the NLTK corpus downloader.
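As a quick sketch of the setup: the corpus name `udhr` and the file id `'English-Latin1'` are NLTK's own identifiers, but to keep the example runnable offline I've swapped in a short stand-in string where the corpus text would go.

```python
from collections import Counter

# In practice the text comes from NLTK's UDHR corpus, e.g.:
#   import nltk
#   nltk.download('udhr')
#   from nltk.corpus import udhr
#   text = udhr.raw('English-Latin1')
# A stand-in string keeps this sketch self-contained (no download needed).
text = "All human beings are born free and equal in dignity and rights."

# Unigram (single character) counts -- the raw material for the histograms below.
unigrams = Counter(text.lower())
print(unigrams.most_common(5))
```

Note that space is counted as a character here too, which matters for the histogram in the next section.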
We'll start simply, by looking at a histogram of unigram (single character) frequencies in English. What does this tell us about the English language?
Immediately a few things pop out. First, you probably notice the giant green bar on the left: I intentionally kept space as a character. Next come the heights of the vowels E and U, followed by O, N, I, and A, which are all quite high as well. After that, we trickle down through R, H, and K. So this really just shows us our top letters, more or less. These distributions, despite their simplicity, are actually at the heart of (simple) decryption techniques and were used to break military communications in WWII.
Here is an example from Wikipedia. How can we decipher the following text using frequency analysis?
By replacing the most frequent letter of the encoded text (a cryptogram) with the most frequent letter in English, then the second most frequent letter with the second most frequent English letter, and so on, we can move through the stages of decryption (where lower case represents a guess at the true letter):
heVeTCSWPeYVaWHaVSReQMthaYVaOeaWHRtatePFaMVaWHKVSTYhtZetheKeetPeJV

A few more guesses:

hereTCSWPeYraWHarSseQithaYraOeaWHstatePFairaWHKrSTYhtmetheKeetPeJr

A few more guesses:

hereuponlegrandarosewithagraveandstatelyairandbroughtmethebeetlefr
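The first pass of that substitution can be sketched in a few lines. The ranking below is the conventional "etaoin shrdlu" ordering of English letters by frequency (not derived from this post's UDHR counts), and `frequency_guess` is just a hypothetical helper name:

```python
from collections import Counter

# Conventional English letters, ranked from most to least frequent.
ENGLISH_BY_FREQ = "etaoinshrdlcumwfgypbvkjxqz"

def frequency_guess(cryptogram: str) -> str:
    """First-pass guess: map each cipher letter to the English letter
    of the same frequency rank, leaving non-letters untouched."""
    counts = Counter(c for c in cryptogram if c.isalpha())
    ranked = [letter for letter, _ in counts.most_common()]
    mapping = dict(zip(ranked, ENGLISH_BY_FREQ))
    return "".join(mapping.get(c, c) for c in cryptogram)

# The most frequent cipher letter becomes 'e', the next 't', and so on.
print(frequency_guess("xxx yy z"))  # -> "eee tt a"
```

Real cryptograms then need the iterative "few more guesses" refinement shown above, since short texts rarely match the ideal ranking exactly.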
This post isn't about decryption per se, but it's really amazing to see how a simple unigram distribution can crack a secret code! These n-gram distributions actually hold a lot of information about how a language is used.
If we look at a few more languages, we might start to notice more features that help us separate one language from another. Next are Spanish, Hawaiian, Ukrainian, Uzbek, and Zulu (to pick a few). Be careful when reading these, however: colors don't necessarily map to the same character from one distribution to the next.
For example, it's easy to see Hawaiian's (relative) abundance of vowels.
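That vowel-heaviness is easy to quantify directly. A tiny sketch (`vowel_fraction` is a hypothetical helper; the sample phrase is an ASCII-fied rendering of the Hawaiian state motto, not UDHR text):

```python
def vowel_fraction(text: str) -> float:
    """Fraction of alphabetic characters that are vowels -- a crude proxy
    for the vowel-heaviness visible in the unigram histograms."""
    letters = [c for c in text.lower() if c.isalpha()]
    return sum(c in "aeiou" for c in letters) / len(letters) if letters else 0.0

# ASCII-fied Hawaiian state motto: roughly 70% vowels.
print(round(vowel_fraction("ua mau ke ea o ka aina i ka pono"), 2))
```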
When we look at bigram (two character) frequencies, we suddenly have considerably more combinations (imagine aa, ab, ac, ..., zx, zy, zz), so it becomes tedious to use a histogram for all of them. Rather than trying to make hundreds of histograms with hundreds of bins, we can take a different approach. When we get to classifiers, we'll see that the difference between unigram distributions and bigram distributions as features is quite noticeable.
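Counting bigrams is barely more code than counting unigrams; the blow-up is only in the number of distinct keys (26 letters alone already give 26 × 26 = 676 possible pairs, more once you include space). A minimal sketch:

```python
from collections import Counter

def bigram_counts(text: str) -> Counter:
    """Count overlapping two-character sequences in a string."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

print(bigram_counts("banana"))  # 'an' and 'na' each appear twice
```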
Here is an example of using principal component analysis (PCA) to project the data onto three dimensions. If you're unfamiliar with PCA, there is a great visual demonstration of the idea on setosa.io's Explained Visually. To summarize as succinctly as possible: every dimension of our dataset (each column of character counts) has some variance that explains the differences between rows, and by proxy, the differences between the languages. PCA projects all those dimensions onto their principal components, which, mathematically, are the directions that maximize the variance. So we get the three best dimensions with which to contrast our data points. Magic! If you want a nice introduction to PCA, there is one I particularly like here. This process is especially useful for visualizing data that lives in many-more-than-three dimensions.
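In code, the projection is a couple of lines with scikit-learn. The tiny count matrix below is made up purely to show the shapes involved; the real input is the full table of per-language character counts:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: rows = languages, columns = character counts.
X = np.array([
    [120.0, 30.0, 15.0, 2.0],   # language A
    [118.0, 28.0, 14.0, 3.0],   # language B (similar to A)
    [40.0, 90.0, 60.0, 25.0],   # language C
    [38.0, 95.0, 58.0, 22.0],   # language D (similar to C)
])

pca = PCA(n_components=3)
projected = pca.fit_transform(X)

print(projected.shape)                 # one 3-D point per language
print(pca.explained_variance_ratio_)   # variance captured by each component
```

Similar rows (A/B and C/D above) land near each other in the projected space, which is exactly the clustering the interactive plots show.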
In the following visualizations, try clicking, dragging, scrolling, and sliding to get a feel for the data. The axes are the principal components. Rather than trying to figure out what those represent, just imagine that you had to roll each of the histograms you saw above into a tiny ball and throw it into this three-dimensional space so that it landed close to the other languages that had similar histograms.
It should be easy, at least in a few of these, to spot the non-Latin character groups.
For these visualizations, I used the cufflinks library to bind plot.ly to my pandas DataFrames for ridiculously easy plotting. You can check out the code in the IPython notebook in the repo here if you're a nerd.
We aren't limited to PCA, though; there are lots of other approaches to dimensionality reduction. Just for fun (and comparison), I've included a few others below, but I don't find them as useful as the PCA reduction.
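Swapping in another reducer is nearly a one-line change with scikit-learn. Here's multidimensional scaling (MDS) as one example, again on a made-up count matrix purely for illustration:

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical per-language character counts (rows = languages).
X = np.array([
    [120.0, 30.0, 15.0, 2.0],
    [118.0, 28.0, 14.0, 3.0],
    [40.0, 90.0, 60.0, 25.0],
    [38.0, 95.0, 58.0, 22.0],
])

# MDS places points so that pairwise distances are (approximately) preserved,
# rather than maximizing variance along axes as PCA does.
embedding = MDS(n_components=3, random_state=0).fit_transform(X)
print(embedding.shape)  # one 3-D point per language
```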
Remember: we're looking at many columns of data that have been "squashed", in a sense, down to 3 dimensions. The axes don't align with any readily interpretable feature or concept per se, but this statistical transformation lets us see the data in a way we couldn't before, which is useful in its own right.
Pretty fun, right?
There are lots of types of dimensionality reduction. Scikit-learn has a nice comparison here if you're interested in learning more:
Next time, we'll look at building a classifier on top of these distributions to actually do the classification / language detection we alluded to earlier, with a larger dataset (albeit fewer languages). Stay tuned for part 2!
Also, thanks to JD Digiovanni for thoughts and feedback on this post.