Often, when preparing data for a machine learning model, we need to encode categorical values such as ["red", "green", "blue"] into numeric schemes so that they can be used with various algorithms. Different encoding schemes give weight to different aspects of the categories, sometimes for better and sometimes for worse. The motivation is often pragmatic: choosing the right encoding scheme can sometimes improve statistical power!
Categorical encoding can get a little tricky. Recently, some additional categorical encoding schemes were added to scikit-learn-contrib, so I figured now is as good a time as any to review their differences. The ones added (at present) include ordinal, one-hot, binary, various contrast encodings, and a hashing encoder.

I'll go over ordinal, one-hot, and binary in this post. Contrast and hashing encoders will get their own posts.
Ordinal encoding is probably the most naive approach here. Some might even argue that if ordinal encoding is appropriate, the variable wasn't really categorical to begin with and is, surprise, an ordinal variable. However, since ordinal encoding can be applied to textual data that is often treated as categorical, we might as well talk about it.
Give each distinct value a number. Boom.
The obvious problem with this is that some values get higher weights than others. If I were to choose, say, red = 1, green = 2, blue = 3, then blue would have more weight than red, even though the colors should be equal. Often, we would like to treat each categorical value with equal weight. If your data does have a natural ordering where some values are greater than others, then ordinal encoding can actually make sense (sizes like small < medium < large, for instance).
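Here is a minimal sketch of ordinal encoding with pandas; the size column and its explicit ordering are illustrative, not from the original post:

```python
# A minimal sketch of ordinal encoding with pandas; the "size" column
# and its ordering are illustrative examples.
import pandas as pd

df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# An explicit mapping preserves the natural order small < medium < large.
order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(order)
```

The explicit mapping dictionary is the important part: it lets you control the ordering instead of leaving it to whatever order the encoder happens to see the values in.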
One-hot encoding is often called dummy encoding, as in pandas' get_dummies function. It is one of the most popular and simplest categorical encoding schemes. For each distinct value of the category, create a new feature that equals 1 if the row takes that value, and 0 otherwise. With three colors, each category is then represented by a vector of 3 digits.
At first, it seems like we are done. But actually, there is something very insidious happening here. Each color is represented by a three-integer vector, and note that there is no all-zero example. What would [0, 0, 0] represent?
This is actually called the dummy variable trap! The problem most often occurs in non-regularized regression, e.g. in multivariable regression, where a column of ones is added to allow for the fitting of the intercept. Because the dummy columns always sum to that column of ones, the intercept and the dummy variables are collinear and the estimates we get are indeterminate. More technically, it's because the matrix we intend to invert becomes singular. There is a great walk-through example of why and how that happens here.
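The singularity is easy to see numerically. A small sketch with NumPy (toy data, not from the post):

```python
# Demonstrating the dummy variable trap with toy data: an intercept
# column plus ALL the dummy columns makes X'X singular, because the
# dummies sum exactly to the intercept column.
import numpy as np

dummies = np.array([[1, 0, 0],
                    [0, 1, 0],
                    [0, 0, 1],
                    [1, 0, 0]])
X = np.column_stack([np.ones(4), dummies])  # intercept + n dummies

gram = X.T @ X  # the matrix OLS would invert
rank = np.linalg.matrix_rank(gram)  # 3, not 4: gram is singular
```

Dropping any one dummy column from X restores full rank, which is exactly the fix discussed below.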
This "trap" has been known for quite some time, and thus most statistical software deals with it elegantly. Daniel Suits wrote about it in 1957 in a paper titled Use of Dummy Variables in Regression Equations, where he goes over the two easiest ways to fix it:
1.) Remove the constant: this can be bad, since we're not always sure that our data goes through the origin, and removing the constant disallows the fitting of the intercept.
2.) Remove one of the dummies. This is the easier route, and all the other dummy variables are then compared against this missing "base" case. If, for example, we had a dummy variable for male and one for female, we could just use one of them, since the other is redundant.
Sometimes people make the effort to distinguish between one-hot encoding and "dummy" encoding, where one-hot encoding uses n variables for n categories, and dummy encoding assumes you want to remove one of the dummies, giving you n-1 variables. Both scikit-learn's OneHotEncoder and pandas' get_dummies give the full set of variables. However, get_dummies has a drop_first option which can be set to True in order to get n-1 variables.
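A quick sketch of the difference (the color data is illustrative):

```python
# With drop_first=True, get_dummies drops the first category column,
# so a row of all zeros now unambiguously means the dropped category.
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue"]})
dummies = pd.get_dummies(df["color"], drop_first=True, dtype=int)
# Only "green" and "red" columns remain; "blue" is the base case,
# encoded as all zeros.
```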
Although dropping one variable is easier, and probably better than forgoing the intercept, make sure you know which one is dropped so that you know which category serves as the base case.
Binary encoding is a little bit easier. We start by encoding the data into an ordinal sequence, then convert each ordinal number to a binary number. For example, the number 40 would be represented as 101000.
Then we treat that binary sequence as a string and split it into one feature per digit. This distorts the distances a little, but reduces the dimensionality, i.e. the number of features. The example below has five category values, which can be encoded in only 3 features, rather than the 5 required by dummy encoding.
| category | first bit | second bit | third bit |
|----------|-----------|------------|-----------|
Journal of the American Statistical Association, Vol. 52, No. 280 (Dec. 1957), pp. 548-551 ↩︎