Multinomial naive Bayes is an extension of the basic naive Bayes classifier, which is really just based on Bayes' theorem:

$$P(B|A) = \frac{P(B)P(A|B)}{P(A)}$$

where A and B are events, and P(A) and P(B) are the probabilities that those events happen. P(A|B) is the probability that A happens given that B already has, i.e., the conditional probability of A given that B is true.
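
To make that concrete, here's the formula evaluated on some made-up numbers (the probabilities below are arbitrary, chosen just to show the arithmetic):

```python
# Made-up probabilities, just to plug into Bayes' theorem:
p_b = 0.3          # P(B)
p_a_given_b = 0.8  # P(A|B)
p_a = 0.5          # P(A)

# P(B|A) = P(B) * P(A|B) / P(A)
p_b_given_a = p_b * p_a_given_b / p_a
print(p_b_given_a)  # 0.48
```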

So we can read that as: the probability of B given A is equal to the probability of B times the probability of A given B, divided by the probability of A. However, rather than just having one A and one B, we actually have a class, which we can call \(y\), and a set of data, which we'll call \(x_1\) through \(x_n\). We then reformulate naive Bayes to become:

$$P(y|x_1,...,x_n) = \frac{P(y)P(x_1,...,x_n|y)}{P(x_1,...,x_n)}$$

If we then make the naive assumption (which is where the name comes from) that these different \(x\)'s occur independently of one another given the class, we can simplify one term:

$$P(x_i| y, x_1,...,x_{i-1}, x_{i+1},...,x_n) = P(x_i|y)$$

How did we do that? Basically, we're saying that the probability that \(x_i\) occurs depends only on \(y\), and that all the other \(x\)'s have no influence on the value of a particular \(x_i\). That is the independence assumption. So, we can rewrite the earlier equation:

$$P(y|x_1,...,x_n) = \frac{P(y)\prod_{i=1}^nP(x_i|y)}{P(x_1,...,x_n)}$$

Meaning that, since each \(x_i\) is independent given \(y\), the individual probabilities can simply be multiplied together such that:

$$\prod_{i=1}^nP(x_i|y) = P(x_1,...,x_n|y)$$
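
As a quick sanity check of what that product looks like in practice, here's a small sketch with made-up conditional probabilities (the values are arbitrary):

```python
# Hypothetical values of P(x_i|y) for one class; under the independence
# assumption, the joint likelihood P(x_1,...,x_n|y) is just their product.
cond_probs = [0.1, 0.05, 0.2, 0.3]

joint = 1.0
for p in cond_probs:
    joint *= p
print(joint)  # 0.0003 -- already small after only four terms
```

Notice how quickly the product shrinks; that will matter again at the end.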

Furthermore, we know that \(P(x_1,...,x_n)\) is a constant given our input: each n-gram has a particular probability given the data we're working with. Since it is a constant, the following proportionality holds:

$$P(y|x_1,...,x_n) \propto P(y)\prod_{i=1}^nP(x_i|y)$$

Why does that work? Well, if, for example, \(y = kx\) and \(k\) is a constant, then \(y \propto x\). So finally, we can derive our \(\hat{y}\), or estimated class, by picking the class that maximizes that quantity:

$$\hat{y} = \arg\max_y P(y)\prod_{i=1}^nP(x_i|y)$$
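
Here's a minimal sketch of that arg max rule. The priors and conditional probabilities below are made-up numbers standing in for fitted values of \(P(y)\) and \(P(x_i|y)\); a real model would estimate them from counts in the training data.

```python
import numpy as np

# Made-up parameters for a two-class example:
priors = {"spam": 0.4, "ham": 0.6}         # P(y)
cond = {
    "spam": np.array([0.20, 0.05, 0.10]),  # P(x_1|spam), P(x_2|spam), P(x_3|spam)
    "ham":  np.array([0.02, 0.15, 0.08]),  # P(x_1|ham),  P(x_2|ham),  P(x_3|ham)
}

# Score each class with P(y) * prod_i P(x_i|y) and take the arg max.
scores = {c: priors[c] * np.prod(cond[c]) for c in priors}
y_hat = max(scores, key=scores.get)
print(scores, y_hat)  # spam wins: 0.0004 vs 0.000144
```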

However, since all these conditional probabilities are being multiplied, we can end up with floating point underflow (the number becomes too small to represent as a float). Instead, most implementations, including sklearn's, sum log probabilities.
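
As a sketch of what that looks like, here's the same two-class example as above rewritten in log space; since \(\log(ab) = \log a + \log b\), the product of probabilities becomes a sum of log probabilities and the arg max doesn't change:

```python
import numpy as np

# Same made-up parameters as the previous sketch:
priors = {"spam": 0.4, "ham": 0.6}
cond = {
    "spam": np.array([0.20, 0.05, 0.10]),
    "ham":  np.array([0.02, 0.15, 0.08]),
}

# log(P(y) * prod_i P(x_i|y)) = log P(y) + sum_i log P(x_i|y)
# Summing logs avoids the underflow that multiplying many tiny factors causes.
log_scores = {c: np.log(priors[c]) + np.log(cond[c]).sum() for c in priors}
y_hat = max(log_scores, key=log_scores.get)
print(log_scores, y_hat)  # same winner as before, just on a log scale
```

(If you poke at sklearn's `MultinomialNB`, its `predict_log_proba` method exposes these log probabilities, normalized over the classes.)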