Multinomial naive Bayes is an extension of the basic naive Bayes classifier, which is really just based on Bayes' theorem:

$$P(B|A) = \frac{P(B)P(A|B)}{P(A)}$$

where $A$ and $B$ are events, and $P(A)$ and $P(B)$ are the probabilities that those events actually happen. $P(A|B)$ is the probability that $A$ happens given that $B$ already has, or, the conditional probability of $A$ given that $B$ is true.
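As a quick numeric check of the formula (the probability values here are made up purely for illustration):

```python
# Toy values: P(B) = 0.3, P(A|B) = 0.5, P(A) = 0.4.
p_b = 0.3
p_a_given_b = 0.5
p_a = 0.4

# Bayes' theorem: P(B|A) = P(B) * P(A|B) / P(A)
p_b_given_a = p_b * p_a_given_b / p_a
print(p_b_given_a)  # close to 0.375
```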

So we can read that as: the probability of B given A is equal to the probability of B times the probability of A given B, divided by the probability of A. However, rather than just having one A and one B, we actually have a class, which we can call $y$, and a set of features, which we'll call $x_1$ through $x_n$. We then reformulate naive Bayes to become:

$$P(y|x_1,...,x_n) = \frac{P(y)P(x_1,...,x_n|y)}{P(x_1,...,x_n)}$$

If we then make the naive assumption (which is where the name comes from) that these different $x$'s occur independently given the class, we can simplify one term:

$$P(x_i| y, x_1,...,x_{i-1}, x_{i+1},...,x_n) = P(x_i|y)$$

How did we do that? Basically, we're saying that the probability that $x_i$ occurs depends only on $y$, and that all the other $x$'s have no influence on the value of a particular $x_i$. That is the independence assumption. So, we can rewrite the first equation:

$$P(y|x_1,...,x_n) = \frac{P(y)\prod_{i=1}^nP(x_i|y)}{P(x_1,...,x_n)}$$

Meaning that since each $x_i$ is conditionally independent given $y$, the probability of each $x_i$ can be multiplied such that:

$$\prod_{i=1}^nP(x_i|y) = P(x_1,...,x_n|y)$$
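In code, that factorization is just a running product over the per-feature conditionals (the values below are made-up toy probabilities):

```python
import math

# Toy per-feature conditional probabilities P(x_i | y), i = 1..4.
p_xi_given_y = [0.2, 0.5, 0.1, 0.4]

# Under the independence assumption, the joint likelihood
# P(x_1,...,x_4 | y) is just the product of the individual terms.
joint = math.prod(p_xi_given_y)
print(joint)  # close to 0.004
```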

Furthermore, we know that $P(x_1,...,x_n)$ is a constant given our input - each n-gram has a particular probability given the data we're working with. Since it is a constant, the following proportionality holds:

$$P(y|x_1,...,x_n) \propto P(y)\prod_{i=1}^nP(x_i|y)$$

Why does that work? Well, if, for example, $y = kx$ and $k$ is a constant, then $y \propto x$. So finally, we can derive our $\hat{y}$, or estimated class:

$$\hat{y} = \arg\max_y P(y)\prod_{i=1}^nP(x_i|y)$$
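A minimal sketch of that argmax rule, assuming two hypothetical classes ("spam" and "ham") with made-up priors and per-feature likelihoods:

```python
import math

# Made-up priors P(y) and per-feature likelihoods P(x_i | y)
# for the observed features x_1, x_2, x_3.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [0.05, 0.02, 0.1],
    "ham":  [0.01, 0.03, 0.2],
}

# Score each class by P(y) * prod_i P(x_i | y), then take the argmax.
scores = {y: priors[y] * math.prod(likelihoods[y]) for y in priors}
y_hat = max(scores, key=scores.get)
print(y_hat)  # "spam": 4.0e-05 beats ham's 3.6e-05
```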

However, since all these conditional probabilities are being multiplied, we can end up with a floating point underflow (the number becomes too small to represent). Instead, most implementations, including sklearn's, use a sum of log probabilities.
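A sketch of both the underflow problem and the log-space fix, reusing the same made-up toy classes as above. Because $\log$ is monotonically increasing, maximizing $\log P(y) + \sum_i \log P(x_i|y)$ picks the same class as maximizing the raw product:

```python
import math

# A long product of small probabilities underflows to exactly 0.0...
tiny = [0.01] * 2000
prod = math.prod(tiny)                    # 0.0 (underflow)
# ...but the equivalent sum of logs stays perfectly representable.
log_sum = sum(math.log(p) for p in tiny)  # about -9210.3

# The same toy classifier as before, scored in log space:
# log P(y) + sum_i log P(x_i | y).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {"spam": [0.05, 0.02, 0.1], "ham": [0.01, 0.03, 0.2]}
log_scores = {
    y: math.log(priors[y]) + sum(math.log(p) for p in likelihoods[y])
    for y in priors
}
y_hat = max(log_scores, key=log_scores.get)
print(y_hat)  # "spam", the same argmax as the raw product gives
```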