Categorical variables often hide a great deal of interesting information in a data set, and many times they may even be the most important variables in a model. A good data scientist should be capable of handling such variables effectively and efficiently: hunt down the categorical variables in the data set and dig out as much information as you can. The goal of this article is to present an advanced technique called ‘Entity Embeddings’ for dealing with categorical variables in neural networks. Neural networks with entity embeddings have been found to generate better results than tree-based methods when using the same set of features on structured data.

✏️ Table of Contents

  • Introduction
  • Structured vs Unstructured Data
  • Shortcoming of using One-Hot encoding in the context of Neural Networks
  • Definition of entity embedding
  • How to learn each value of an embedding vector?
  • Example
  • Hyperparameter: number of columns of the embedding matrix
  • Whenever possible, it’s best to treat things as categorical variables rather than as continuous variables
  • Since the size of the embedding matrix depends on the cardinality, is it possible to massively overfit the model?
  • What is an embedding layer?
  • Conclusion
  • References

🌟 Introduction

As we already know, neural networks have revolutionized computer vision, speech recognition and natural language processing (unstructured data), replacing the long-dominant methods in each field. However, they are not as prominent when dealing with machine learning problems on structured data. This can easily be seen from the fact that the top teams in many online machine learning competitions, such as those hosted on Kaggle, use tree-based methods more often than neural networks.

The reason is that a neural network can approximate any continuous or piecewise continuous function. During the training phase the continuity of the data guarantees the convergence of the optimisation, and during the prediction phase it ensures that slightly changing the input values keeps the output stable. A neural network is therefore not well suited to approximating arbitrary non-continuous functions. Tree-based methods, on the other hand, do not assume any continuity of the feature variables and can divide the states of a variable as finely as necessary.

🆚 Structured vs. Unstructured Data

Structured data are data collected and organised in a table format, with columns representing different features (variables) or target values and rows representing different samples. This makes them easily searchable by simple, straightforward search algorithms or other search operations. Unstructured data (“everything else”) comprise data that are usually not as easily searchable, including formats such as audio, video, and social media posts.

❌ Shortcoming of using One-Hot encoding in the context of Neural Networks

One-Hot encoding is a commonly used method for converting a categorical input variable into numeric form. For every level present, one new binary variable is created: presence of a level is represented by 1 and absence by 0.

One-Hot encoding of ’sex’ feature

This has two main shortcomings:

  1. One-hot encoding of high cardinality features often results in an unrealistic amount of computational resource requirement.
  2. It treats different values of categorical variables completely independent of each other and often ignores the informative relations between them.

Both shortcomings can be overcome using the entity embedding method.

Please remember that for tree-based algorithms it’s OK to encode categories as ordinal values (0, 1, 2, 3, etc.), while for an algorithm that learns a weight for each variable it’s not OK.
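To make the two encodings concrete, here is a minimal numpy sketch, using a hypothetical ‘sex’ feature as in the figure above (the variable names are my own for illustration):

```python
import numpy as np

# Hypothetical 'sex' feature with two levels.
sex = np.array(["male", "female", "female", "male"])

# Ordinal/label encoding: map each level to an integer index.
levels = sorted(set(sex))                          # ['female', 'male']
indices = np.array([levels.index(v) for v in sex])

# One-hot encoding: one new binary column per level.
one_hot = np.eye(len(levels))[indices]             # shape (4, 2)
```

The `indices` array alone would be fine for a tree-based model, while `one_hot` is what you would feed to a model that learns one weight per input column.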

📣 Definition of entity embedding

We map categorical variables in a function approximation problem into Euclidean spaces, which are the entity embeddings of the categorical variables. The mapping is learned by a neural network during the standard supervised training process. Entity embedding not only reduces memory usage and speeds up neural networks compared with one-hot encoding, but more importantly by mapping similar values close to each other in the embedding space it reveals the intrinsic properties of the categorical variables.

It can rapidly generate great results on structured data without having to resort to feature engineering or apply domain-specific knowledge. The technique is relatively straightforward: simply turn the categorical variables into numbers and then assign each value an embedding vector:

The advantage of doing this, compared to the traditional approach of creating dummy variables (i.e. one-hot encoding), is that each day can be represented by four numbers instead of one, hence we gain higher dimensionality and much richer relationships.

In other words, entity embedding automatically learns the representation of categorical features in multi-dimensional spaces, placing values with a similar effect on the target output close to each other and thereby helping the neural network solve the problem. Think of it this way: if we were embedding the states of a country for a sales problem, states that are similar in terms of sales would be closer to each other in this projected space.
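A toy sketch of that intuition, with invented (not learned) 2-D vectors for three hypothetical states: two states with a similar sales effect end up closer together than a dissimilar one.

```python
import numpy as np

# Invented embedding vectors, purely for illustration:
states = {
    "StateA": np.array([0.9, 0.1]),   # high-sales state
    "StateB": np.array([0.8, 0.2]),   # another high-sales state
    "StateC": np.array([-0.7, 0.9]),  # low-sales state
}

dist_ab = np.linalg.norm(states["StateA"] - states["StateB"])
dist_ac = np.linalg.norm(states["StateA"] - states["StateC"])
# Similar states sit closer in the embedding space: dist_ab < dist_ac
```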

❓ How to learn each value of an embedding vector?

First define a fully connected neural network and separate numerical and categorical variables.

For each categorical variable:

  1. Initialise a random embedding matrix of size m x D.

m: number of unique levels of categorical variable (Monday, Tuesday, …)

D: desired dimension of the representation, a hyperparameter which can be between 1 and m-1 (with 1 it amounts to label encoding; with m it amounts to one-hot encoding)

Fig 6. Embedding Matrix

2. Then on each forward pass through the neural network we do a lookup for the given level (e.g. Monday for “dow”) in the embedding matrix, which gives us a 1 x D vector.

Fig 7. Embedding Vector after lookup

3. Append this 1 x D vector to our input vector (the numerical vector). Think of this process as augmenting a matrix: for each row, we add an embedding vector for every embedded category by doing a lookup in its embedding matrix.

Fig 8. After adding embedding vectors

4. Usually inputs are not updated, but embedding matrices are a special case: we allow the gradient to flow all the way back to these mapped features and hence optimise them.
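The four steps above can be sketched in numpy for a single “dow” variable. The numerical values and the gradient are placeholders I made up for illustration; in a real network the gradient would come from backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)

m, D = 7, 4                                 # 7 levels of "dow", embedding dim 4
embedding = rng.normal(size=(m, D))         # step 1: random m x D embedding matrix

# step 2: forward-pass lookup for one sample, e.g. Monday -> index 0
dow_index = 0
emb_vector = embedding[dow_index]           # 1 x D vector

# step 3: append it to the numerical part of the input
numerical = np.array([21.5, 3.2])           # e.g. normalised temperature, distance
model_input = np.concatenate([numerical, emb_vector])

# step 4: during backprop, the gradient w.r.t. these D inputs flows back into
# embedding[dow_index], so the row is updated like any other weight:
grad_wrt_emb = np.full(D, 0.1)              # placeholder gradient
embedding[dow_index] -= 0.01 * grad_wrt_emb # SGD-style update
```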

✍️ Example

Continuous variables are fed directly into the neural network after normalising them (temperature and distance in the above figure), whereas categorical variables need special care. The categorical variable Day of Week we need to put through an embedding, so we create an embedding matrix of 7 by 4 (e.g. an embedding dimension of 4, a hyperparameter). For day of week 6 we look up the 6th row to get back four items, so the value 6 turns into a length-4 vector, which is then fed directly into the neural network.

We can think of this as a process which allows our categorical embeddings to be better represented at every iteration.


🔴 Hyperparameter: number of columns of the embedding matrix

For each categorical variable, the number of categories it has determines the number of rows of the embedding matrix. Then we define what embedding dimensionality we want.

If you are doing natural language processing, the number of dimensions you need to capture all the nuance of what a word means and how it’s used has been found empirically to be about 600. Human language is one of the most complex things we model, so no other categorical variable should need an embedding matrix with more than 600 dimensions.

At the other end, some things may have a pretty simple causality. So ideally, when you decide what embedding size to use, you would use your knowledge about the domain to judge how complex the relationship is and therefore how big an embedding you need. In practice, you almost never know that; you only know it when somebody else has previously done the research and figured it out, as in NLP. So in practice you probably need to use a rule of thumb, and having tried it, you could then go a little higher or a little lower and see what helps. It’s kind of experimental.

A simple rule of thumb is to look at how many discrete values the category has (i.e. the number of rows in the embedding matrix) and make the dimensionality of the embedding about half of that. So for day of week that gives eight rows and four columns. The formula is (c+1)//2: the cardinality plus one, divided by two. In addition you can set an upper limit on the dimensionality: min((c+1)//2, 50).
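The rule of thumb is a one-liner in Python. The cardinalities below are illustrative (day of week really has 7 levels; the other counts are made up):

```python
def embedding_dim(c, max_dim=50):
    """Rule-of-thumb embedding size for a categorical variable of cardinality c."""
    return min((c + 1) // 2, max_dim)

dims = {name: embedding_dim(c)
        for name, c in [("day_of_week", 7), ("month", 12), ("store_id", 1115)]}
# day_of_week -> 4, month -> 6, store_id capped at 50
```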

⚙️ Whenever possible, it’s best to treat things as categorical variables rather than as continuous variables

The reason is that when you feed something through an embedding matrix, every level can be treated totally differently.

For example, a few years ago there was a competition on Kaggle called Rossmann, a German grocery chain, where participants were asked to predict the sales of items in its stores. The data included a mixture of categorical and continuous variables. One feature represented how many months the competitor’s shop had been open, and in that case zero months and one month are really different.

If you fed that in as a continuous variable, it would be difficult for the neural net to find a functional form with that big a jump. It’s possible, because a neural net can do anything, but you are not making it easy for it. Whereas if you use an embedding and treat it as categorical, it will have a totally different vector for zero versus one. So, particularly as long as you’ve got enough data, treating columns as categorical variables where possible is a better idea.

When I say where possible, that basically means where the cardinality is not too high. If a column was something like a sales ID number that is unique on every row, you can’t treat it as a categorical variable, because it would require a huge embedding matrix.

The rule of thumb is to keep as categorical only the variables that don’t have very high cardinality. If a variable has unique levels for 90% of the observations, it wouldn’t be a very predictive variable anyway and we may very well get rid of it.

🤔 Since the size of the embedding matrix depends on the cardinality, is it possible to massively overfit the model?

In old machine learning, we used to control complexity by reducing the number of parameters. In modern machine learning, we control complexity by regularisation. Typically we are not concerned about overfitting, because the way we avoid it is not by reducing the number of parameters but by increasing the dropout or the weight decay.

Now having said that, there’s no point using more parameters for a particular embedding than it needs, because regularisation penalises a model either by injecting randomness into its data (dropout) or by actually penalising its weights (weight decay). So we’d rather not use more than we have to.

Thus the general rule of thumb for designing an architecture is to be generous with the number of parameters. If we then feel that a variable is not important, we can manually make its embedding smaller. Or if we find that there’s not enough data, or that we are using more regularisation than we are comfortable with, we might go back. But we would always start by being generous with parameters.

🌈 What is an embedding layer?

It is like multiplying a one-hot encoded vector by a set of coefficients, which is exactly the same thing as simply saying: grab the row where the one is. In other words, if we had stored 1000 as a zero, 0100 as a one and 0010 as a two, then it’s exactly the same as saying: look up that index in the array.

So we call that version an embedding. An embedding is a weight matrix you can multiply by a one-hot encoding; the lookup is just a computational shortcut, but it is mathematically the same.
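The equivalence is easy to check in numpy with a small weight matrix (the sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))               # weight/embedding matrix: 4 levels, dim 3

one_hot = np.array([0.0, 0.0, 1.0, 0.0])  # one-hot vector for level index 2

via_matmul = one_hot @ W                  # multiply the one-hot vector by W
via_lookup = W[2]                         # just grab row 2: the shortcut
# via_matmul and via_lookup are identical
```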

🤖 Conclusion

I hope you found this interesting, and please remember that entity embedding is a good and easy way to make data directly suitable as input to neural nets, with no feature engineering involved.

Thanks for reading; if you liked this article, please consider subscribing to my blog. That way I know my work is valuable to you, and I can also notify you of future articles.

💪💪💪💪 As always keep studying, keep creating 🔥🔥🔥🔥

🔘 References

Originally published at : on Oct 28, 2018.