Have you ever wondered how Amazon's Alexa can understand speech? How Google Translate works? Maybe IBM's Watson?

Applications like these require that natural language be represented numerically - which is also the first step for many machine learning tasks like regression, classification, and clustering.

In this post, I'll cover the basics of vector semantics - a core pillar of contemporary natural language processing.

This post will discuss the basics of:

  • simple count matrices
  • n-grams
  • weighting
  • hashing vectorization


Many approaches to featurizing text rely on the distributional hypothesis:

You shall know a word by the company it keeps - Firth (1957)

The idea that words that have similar statistical distributions share similar meanings is the basis for vector semantics. There are a number of nearly synonymous terms: statistical semantics, distributional semantics, vector space models of semantics, word space models, and term vector models.

One of the simplest possible models in this area is called a bag-of-words (BoW) model, which largely disregards grammar and word ordering and just looks at some measure of occurrence. Even though we know grammar and word ordering are important (more so in some languages than others), the BoW model's simplicity is extremely useful. By simply looking at neighbors and co-occurrences, we can use this simple model to make inferences about new words, determine similarities, analyze sentiment, and cluster documents, among other applications.

Say there is a new word - Regenstiefel - here it is in a few sentences:

  • Regenstiefel can be purchased at certain stores.
  • I always wear long socks with my regenstiefel.
  • Regenstiefel are particularly useful when it's raining outside.

We could make some informed guesses about Regenstiefel, which is German for rain boots. Similarly, when using BoW models, we can lean on the existing distribution of words that co-occur with a new word, and so handle words we've never seen before.

I don't actually speak German.

Simple Count Matrices

Let's start with the simplest BoW model we can: just count up the words, and use the count vector as our representation of the text.

Let's take the following three sentences:

  • Red pandas are very cute.
  • Giant pandas are black and white.
  • Red pandas are more related to raccoons than giant pandas.

By simply counting up the words in each document, we end up with a numeric representation:

doc  and  are  black  cute  giant  more  pandas  raccoons  red  related  than  to  very  white
 0    1    1     1     0     1      0      1        0       0      0      0    0    0     1
 1    0    1     0     0     1      1      2        1       1      1      1    1    0     0
 2    0    1     0     1     0      0      1        0       1      0      0    0    1     0

The size of such a document-term matrix will be the number of distinct terms in the vocabulary $\vert{V}\vert$ (columns) by the number of documents (rows).
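A minimal sketch of building such a document-term matrix, using only the Python standard library and a deliberately naive tokenizer (note the row order below follows the original sentence list, so it may differ from the table above):

```python
from collections import Counter

docs = [
    "Red pandas are very cute.",
    "Giant pandas are black and white.",
    "Red pandas are more related to raccoons than giant pandas.",
]

def tokenize(text):
    # lowercase and strip periods -- a deliberately naive tokenizer
    return text.lower().replace(".", "").split()

# one Counter of term frequencies per document
counts = [Counter(tokenize(doc)) for doc in docs]

# the vocabulary is every distinct term, sorted for a stable column order
vocab = sorted(set().union(*counts))

# documents as rows, terms as columns
matrix = [[c[term] for term in vocab] for c in counts]
```

In practice you would reach for something like scikit-learn's CountVectorizer, which handles tokenization, vocabulary limits, and stop-word removal for you.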

In case you didn't know how cute they are.

It's easy to see a number of issues with these raw counts:

  • common words like the, of, and will have giant counts, and skew the results
  • it isn't obvious how to control the number of columns

Common ways to address this include:

  • removing words like the, of, and (called stop words) before processing
  • limiting valid terms to the N most frequent terms
  • using a weighting scheme to make more meaningful words "weigh" more

Simple unigram counts like this obviously have their limitations - there isn't much more to "context" than simply saying the word occurred. Next, we look at small word sequences that will provide a bit more context.


N-Grams

A different kind of n-gram.

N-grams are sequences of N consecutive words taken from a sliding window, with each sequence treated as a distinct vocabulary item. We say unigram, bigram, and trigram for N of 1, 2, and 3.
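The sliding window is only a few lines of Python - a minimal illustration, not a production tokenizer:

```python
def ngrams(tokens, n):
    # slide a window of size n across the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "red pandas are very cute".split()
bigrams = ngrams(tokens, 2)
# [('red', 'pandas'), ('pandas', 'are'), ('are', 'very'), ('very', 'cute')]
```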

You've likely seen the Google Ngram Viewer that allows you to see the popularity of a given ngram in contrast with another over time, using the Google Books data - in this example, comparing the popularity of the variant and (mis)spellings: connector, connecter, adaptor, adapter

In our previous example sentences, our new set of vocabulary items for a bigram model would be:

and white
are black
are more
are very
black and
giant pandas
more related
pandas are
raccoons than
red pandas
related to
than giant
to raccoons
very cute

with a matrix like:

doc  and white  are black  are more  are very  black and  giant pandas  more related  pandas are  raccoons than  red pandas  related to  than giant  to raccoons  very cute
 0       1          1         0         0          1           1             0            1             0            0           0           0            0           0
 1       0          0         1         0          0           1             1            1             1            1           1           1            1           0
 2       0          0         0         1          0           0             0            1             0            1           0           0            0           1

As we can see, the matrix is becoming more and more sparse. The sparsity grows so fast that, in practice, it's very unusual to see values of N past 5 or 6 - and even those only for very large datasets. This sparsity becomes a big problem when using our resulting model on new data, since the new data will likely contain n-grams our model has never seen. For example, Allison et al. (2006) show that the Gigaword corpus (1.7 billion words) only has 2/3 of all trigrams represented.

One approach to combat this sparsity is to use skipgrams, which skip over intermediate words when creating the n-grams. When creating trigrams from the following sentence:

  • I drank the black coffee

we would end up with a list of trigrams - I drank the, drank the black, and so on - without ever seeing "drank the coffee", which might actually be more important when creating our model. Interestingly, Guthrie et al. (2006) found that using skipgrams can be equivalent to quadrupling the corpus size!
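One way to sketch k-skip-n-grams in Python (NLTK ships a similar nltk.util.skipgrams helper; this standalone version assumes the common definition where the gaps between chosen tokens sum to at most k):

```python
from itertools import combinations

def skipgrams(tokens, n, k):
    # n-grams where the chosen tokens may skip up to k words in total
    grams = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n + k]  # allow up to k extra tokens in the window
        # fix the first token, then choose the remaining n-1 from the window
        for rest in combinations(range(1, len(window)), n - 1):
            grams.append((window[0],) + tuple(window[j] for j in rest))
    return grams

tokens = "I drank the black coffee".split()
trigrams = skipgrams(tokens, n=3, k=1)
# includes ('drank', 'the', 'coffee'), which plain trigrams never produce
```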


Weighting

In bag-of-words models, we need a way to account for extremely common words, which can be partially handled by different weighting schemes.


One of the most common approaches for document-term matrices is term frequency-inverse document frequency (tf-idf) weighting.

The idea here is to take the raw count (term frequency, or $tf$) and weight it by how often that word occurs across all of the documents (the inverse document frequency, or $idf$). This gives us a new weight $w_{ij}$ for the cell indexed by row $i$ and column $j$, where columns are our terms and rows are our documents. If $N$ is the number of documents and $df_j$ is the number of documents containing word $j$:

$$idf_j = \log\left(\frac{N}{df_j}\right)$$

$$w_{ij} = tf_{ij} \cdot idf_j$$

The effect of tf-idf weighting is that very common words such as the and of are penalized for having occurred in so many different documents. Additionally, if you have a topic-specific corpus, common (and thus less meaningful) words will also be penalized. There are a few other variants of the $idf$ calculation, but all are based on the same premise.
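The two formulas above translate directly into a few lines of plain Python - a toy sketch on made-up counts (libraries like scikit-learn use slightly smoothed variants of this idf):

```python
import math

# toy term counts: 3 documents (rows) x 3 terms (columns);
# the first term appears in every document, like "the" would
tf = [[2, 1, 0],
      [1, 0, 3],
      [4, 1, 0]]

N = len(tf)
n_terms = len(tf[0])

# df_j: number of documents containing term j
df = [sum(1 for row in tf if row[j] > 0) for j in range(n_terms)]

# idf_j = log(N / df_j)
idf = [math.log(N / df[j]) for j in range(n_terms)]

# w_ij = tf_ij * idf_j
w = [[tf[i][j] * idf[j] for j in range(n_terms)] for i in range(N)]

# the first term occurs in all N documents, so idf[0] = log(1) = 0
# and its weights are zeroed out regardless of its raw counts
```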

Hashing Vectorization

One of the problems that we run into with tf-idf weighting is that we need to keep track of the vocabulary as it grows. Eventually, this can become an issue, as it requires more and more memory. Additionally, the vocabulary needs to be stored as part of the model, say, if we wanted to score new documents to get their tf-idf vectors.

Hashing vectorization is based on the more general hashing trick - rather than storing the word, we store the hash of the word. The hash function takes a string $s$ and returns some number $i \in [0, N)$. The interesting part is that we get to constrain the size of $N$. By reducing $N$ to a value less than the number of distinct words, we force hash collisions. Hash functions are often designed to avoid collisions, but in this case the collisions are intentional, since we want to limit the number of columns. This trick has a few advantages:

  • stateless - we never need to keep track of the vocabulary
  • efficient - can run on an item-by-item basis, rather than having to see all the documents before calculating the weights/values
  • can be used in a streaming fashion

The disadvantages:

  • the columns cannot be interrogated for the word they represent, since we do not know how many terms may be conflated in a single column due to hash collisions

Because of this, hashing vectorization is a popular choice when we want to predict something with a model, rather than inspect how the model works or use its structure as a proxy for some other argument.

In scikit-learn and Spark's MLlib, the hashing vectorizer uses a version of MurmurHash for its hash function.
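A bare-bones version of the trick looks like this - scikit-learn and MLlib use MurmurHash, but here Python's hashlib stands in, just to show the mechanics:

```python
import hashlib

def hashing_vectorize(tokens, n_features=16):
    # fixed-width count vector: each token hashes into one of n_features columns
    vec = [0] * n_features
    for tok in tokens:
        # md5 is used here only because it is stable across runs and platforms
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        vec[h % n_features] += 1
    return vec

vec = hashing_vectorize("red pandas are very cute".split())
```

No vocabulary is stored anywhere, so two different tokens can land in the same column; with a small n_features like this, collisions are all but guaranteed on real text.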


In this post we talked about:

  • the intuitions behind vector semantics
  • converting natural language to numeric representations
  • what an n-gram is
  • what tf-idf weighting is
  • what a hashing vectorizer is, and why we would want to use it