What is perplexity? In information theory, perplexity is a measurement we can use to compare probability models. Why would you care? Perplexity is often used to compare different language models, which power applications like machine translation and speech recognition.

If you're reading up on recent papers in NLP, you'll see that perplexity is the standard metric for comparing your language model's results with others modeling the same dataset. For example, see this comparison.

As an aside, perplexity is an interesting metric in its own right. You'll often find it defined like so:

$$2^{H(p)}=2^{-\sum_x{p(x)\log_2 p(x)}}$$

This says that perplexity is the exponentiation of entropy. When used to evaluate a probability model (not just a distribution), it looks like:

$$ b^{-\frac{1}{N} \sum_{i=1}^{N}\log_b q(x_i)}$$

where $b$ is our base (generally 2) and $x_i$ represents a particular word or term we are guessing with our language model - that is, $q(x_i)$ is the probability the model assigns to a particular item in our validation or test set. If the model is good, it will assign higher probabilities to the words it sees, and end up with a lower perplexity.
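As a concrete sketch, this form can be computed directly from the probabilities a model assigned to each test item. The function name and the toy probabilities here are made up for illustration:

```python
import math

def perplexity(probs, base=2):
    """b ** (-1/N * sum of log_b q(x_i)) over the N test items."""
    n = len(probs)
    return base ** (-sum(math.log(p, base) for p in probs) / n)

# A model that assigns higher probabilities to the observed words
# ends up with a lower perplexity.
confident = perplexity([0.5, 0.4, 0.5])
uncertain = perplexity([0.1, 0.05, 0.1])
```

Note the sanity check built into the definition: a model that assigns probability 0.5 to every item has a perplexity of exactly 2.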

# Intuition

The intuition behind perplexity is that it represents how "surprised" we are to see the next item in the sequence, on average. This actually is a really nice way to think about how well a language model can predict language (which has obvious applications in things like speech recognition). For example, imagine that a speech recognition system was assigning probabilities to the last word of the following sentences:

- Lots of people have adopted *kittens*.
- Lots of people have adopted *mittens*.

We'd *hope* that a language model would be more surprised by *mittens*. A language model assigns probabilities to each candidate word, and would hopefully guess that *kittens* is the right one (we could imagine that the audio wasn't perfect, in this case). Thus, a language model that was *not* surprised by the first sentence (and gave a higher probability to the word that was more likely to occur) would be more useful and correct. When we use the model over a large dataset, we want to know how surprised it was *on average* across all of the words.

However, I personally don't find the mathematical definition above helpful for building that intuition. It isn't immediately clear how the exponentiation of entropy maps to the more straightforward intuition we can describe with language. However, I found that the same measurement can be written in a much simpler form - thanks to Dan Jurafsky's slides on language modeling:

$$P(W) = P(w_1, w_2 \ldots w_N)^{-\frac{1}{N}}$$

where he defines perplexity ($P(W)$) as the "inverse probability of the test set, normalized by the number of words". He goes on to simplify:

$$= \sqrt[^N]{\frac{1}{P(w_1, w_2 \ldots w_N)}} \\ = \sqrt[^N]{\prod_{i=1}^{N}\frac{1}{P(w_i|w_1 \ldots w_{i-1})}}$$

The first step moves the exponent into N-th root notation. The second step applies the chain rule - saying that the probability of the word we are looking at ($w_i$) is conditioned on the words that came before it. In practice, though, we end up using just a few previous words - that's the Markov assumption - which gives us:

$$P(W) = \sqrt[^N]{\prod_{i=1}^{N}\frac{1}{P(w_i|w_{i-n}...w_{i-1})}}$$

Finally! We have a pretty nice explanation for this equation. The perplexity of a language model ($P(W)$) is the *geometric* mean of the inverse probabilities. This sounds like a mouthful at first, but when we unpack it, it actually makes a lot of sense. The geometric mean is similar to the usual arithmetic mean. Consider the set $X$ of numbers 2, 20, 200 - the arithmetic mean just adds them up and divides by N (3), giving us (2 + 20 + 200) / 3 = 74:

$$ \frac{1}{N}\sum^N_{i=1}x_i$$

in contrast with the geometric mean, which would multiply all of these values together then take the N-th root (3, or cube root in this example), which would give us (2 * 20 * 200) ^ 1/3 = 20:

$$ \sqrt[^N]{\prod_{i=1}^{N}x_i}$$

which is a drastically different answer. As we can see, the geometric mean is less affected by the magnitude or scale of the different values.
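The two means above are quick to check directly (using the same three numbers from the text):

```python
import math

xs = [2, 20, 200]

# Arithmetic mean: sum, then divide by N.
arithmetic = sum(xs) / len(xs)              # (2 + 20 + 200) / 3 = 74.0

# Geometric mean: multiply, then take the N-th root.
geometric = math.prod(xs) ** (1 / len(xs))  # (2 * 20 * 200) ** (1/3) ≈ 20.0
```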

The other half of the equation is the idea of inverse probability. In this case, rather than thinking about how probable something is, we just think in terms of how improbable it is. It becomes simple then to connect the equation to the intuition - how surprised we are to see an item, *on average*.
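To make the conditional form concrete, here's a toy sketch: an unsmoothed maximum-likelihood bigram model built from a made-up corpus. The corpus and test sentence are invented for illustration, and a real model would need smoothing to avoid zero probabilities on unseen bigrams:

```python
import math
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p(w, prev):
    # Maximum-likelihood estimate of P(w_i | w_{i-1}).
    return bigrams[(prev, w)] / unigrams[prev]

test_sentence = "the cat sat".split()
probs = [p(w, prev) for prev, w in zip(test_sentence, test_sentence[1:])]

# Perplexity: N-th root of the product of inverse conditional probabilities.
pp = math.prod(1 / q for q in probs) ** (1 / len(probs))
```

Here P(cat|the) = 2/3 and P(sat|cat) = 1/2, so the perplexity works out to the square root of 3, about 1.73 - the model is "choosing" among fewer than two equally likely words on average.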

# Equating the Equations

What was weird to me at first glance is that these two equations are actually equivalent, though they don't look alike at all:

$$b^{-\frac{1}{N} \sum_{i=1}^{N}\log_b q(x_i)}=\sqrt[^N]{\prod_{i=1}^{N}\frac{1}{P(w_i|w_{i-1})}}$$

Let's see how we can equate them. First, let's replace the base $b$ with $e$, move the exponent down, and rewrite it with the exponential function:

$$\exp\left(-\frac{1}{N} \sum_{i=1}^{N}\ln q(x_i)\right)$$

since $\exp(x)$ is just another way of writing $e^{x}$, and $\ln$ is the natural log, $\log_e$. After that, we can move the log outside of the sum, since the log of a product is equivalent to the sum of the logs:

$$\exp\left(-\frac{1}{N} \ln\prod_{i=1}^{N} q(x_i)\right)$$

After that, we can remove the negative sign by turning $q(x_i)$ into a fraction, since $-\log(x) = \log\frac{1}{x}$:

$$\exp\left(\frac{1}{N} \ln\prod_{i=1}^{N} \frac{1}{q(x_i)}\right)$$

Then we can move the fraction into the exponent, since $y\log(x) = \log(x^y)$:

$$\exp\left( \ln\left(\prod_{i=1}^{N} \frac{1}{q(x_i)}\right)^\frac{1}{N}\right)$$

And finally, since we moved the natural log all the way to the front, it cancels with the base, since $e^{\ln(x)} = x$, giving us:

$$(\prod_{i=1}^{N} \frac{1}{q(x_i)})^\frac{1}{N}$$

Lastly, we just rewrite the exponent as an N-th root:

$$\sqrt[^N]{\prod_{i=1}^{N} \frac{1}{q(x_i)}}$$

which is what we wanted to equate - where $q(x_i)$ is the probability we get from some model (such as the bigram model $P(w_i|w_{i-1})$).
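A quick numerical check (with made-up probabilities) confirms that the two forms agree:

```python
import math

# Hypothetical probabilities a model assigned to four test items.
q = [0.25, 0.5, 0.1, 0.05]
N = len(q)

# Entropy form: exp of the average negative log-probability.
entropy_form = math.exp(-sum(math.log(p) for p in q) / N)

# Geometric-mean form: N-th root of the product of inverse probabilities.
root_form = math.prod(1 / p for p in q) ** (1 / N)
```

Both expressions evaluate to the same number (up to floating-point error), which is exactly the equivalence the derivation above shows.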

# Conclusion

Perplexity is a useful and important metric to know when doing computational linguistics, language modeling, and NLP, and being able to recognize it in its different forms is helpful (to me at least!). Hopefully reading this post helped you appreciate it, too.

If you're interested, Aaron Schumacher has a fabulous blog post on the topic, as well as an interactive game based on perplexity - test your perplexity. Pretty cool!

I think his explanation is really well done:

> The perplexity of whatever you're evaluating, on the data you're evaluating it on, sort of tells you "this thing is right about as often as an x-sided die would be."