A couple of months ago I asked myself the same question, so I thought of writing an article to summarize and document my understanding of an embedding layer.
Generally speaking, we use an embedding layer to compress the input feature space into a smaller one. Imagine that we have 80,000 unique words in a text classification problem and we choose to preprocess the text into a term-document matrix. This matrix will be sparse: the sequence ['i', 'love', 'you'] becomes an 80,000-dimensional vector that is all zeros except for the 3 elements that correspond to those words. If we pass this matrix as input to the model, it will need to learn a weight for each individual feature (80,000 in total). This approach is memory intensive.
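To make the sparsity concrete, here is a small sketch in plain NumPy (the word indices are made up for illustration) of what such a one-hot style vector for ['i', 'love', 'you'] looks like:

```python
import numpy as np

VOCAB_SIZE = 80_000  # number of unique words in the corpus

# Hypothetical indices for the three words in our sentence.
word_index = {'i': 10, 'love': 241, 'you': 77}

# One-hot style vector: all zeros except at the word positions.
one_hot = np.zeros(VOCAB_SIZE)
for word in ['i', 'love', 'you']:
    one_hot[word_index[word]] = 1.0

print(one_hot.shape)       # (80000,)
print(int(one_hot.sum()))  # 3 -> only 3 non-zero entries
```

Eighty thousand values stored to convey just three pieces of information: this is the waste an embedding layer avoids.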
One can imagine the Embedding layer as a simple matrix multiplication that transforms words into their corresponding word embeddings, or, equivalently, one that turns positive integers (indexes) into dense vectors of fixed size.
As shown above, the Embedding layer:
- can only be used as the first layer in a model
- has an input dimension equal to the number of unique words (plus one when index zero is reserved for padding rather than mapped to a word)
Keras tries to find the optimal values of the Embedding layer's weight matrix, which is of size (vocabulary_size, embedding_dimension), during the training phase.
The input is a sequence of integers which represent certain words (each integer being the index of a word in a word_map dictionary). The Embedding layer simply transforms each integer i into the i-th row of the embedding weights matrix.
In simple terms, an embedding layer tries to find the optimal mapping of each unique word to a vector of real numbers. The size of these vectors is equal to the embedding dimension we choose.
As shown above, each input integer of the sequence is used as an index to access a lookup table (the embedding weight matrix) that contains a vector for each word. When using an Embedding layer we have to specify the size of the vocabulary so that the table can be initialized with the right number of rows.
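The lookup can be sketched in a few lines of NumPy (the weight matrix here is random, standing in for the learned embedding weights):

```python
import numpy as np

vocabulary_size, embedding_dimension = 8, 3
rng = np.random.default_rng(0)
weights = rng.normal(size=(vocabulary_size, embedding_dimension))

sequence = np.array([1, 2, 3, 4, 5])  # integer-encoded words
embedded = weights[sequence]          # row lookup: one vector per word
print(embedded.shape)                 # (5, 3)
```

Each integer simply selects a row, so the 5-word sequence comes out as a 5 x 3 matrix of dense vectors.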
The most common application of an Embedding layer is for text processing. Let's strengthen our understanding with a simple example. Let's assume that our input contains two sentences, which we pad with zeros to a common maximum length:
Hope to see you soon
Nice meeting you
Let's encode these phrases by assigning each word a unique integer number.
[1, 2, 3, 4, 5] [6, 7, 4, 0, 0] #fill with zeros due to padding
Assuming that we want to train a neural network we specify our first layer which will be an embedding layer.
Embedding(8, 3, input_length=5)
The first argument (8) is the number of distinct index values in the training set (7 words plus the zero used for padding). The second argument (3) indicates the size of the embedding vectors. The input_length argument, of course, determines the size of each input sequence, which is the same as the max_length parameter that we used for padding.
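A minimal end-to-end sketch with tensorflow.keras (omitting the optional input_length argument) that feeds the two padded sequences through such a layer:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding

# The two integer-encoded, zero-padded sentences from above.
padded_sequences = np.array([[1, 2, 3, 4, 5],
                             [6, 7, 4, 0, 0]])

# Vocabulary of 8 indices, 3-dimensional embedding vectors.
model = Sequential([Embedding(input_dim=8, output_dim=3)])

output = model(padded_sequences).numpy()
print(output.shape)  # (2, 5, 3): 2 sentences, 5 words each, 3-dim vectors
```

Note that the weights are randomly initialized here; they only become meaningful once the model is trained on an actual task.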
Once the training is finished, we can get the weights of the embedding layer, which, as expected, is of size (8, 3) -> (number_unique_words, embedding_dimension). As the embedding vectors are updated during the training process of the deep neural network, we expect words with similar meaning to end up with similar representations in this multi-dimensional space.
+-------+-----------------+
| index | Embedding       |
+-------+-----------------+
| 0     | [1.2, 3.1, 2.5] |
| 1     | [0.1, 4.2, 1.5] |
| 2     | [1.0, 3.1, 2.2] |
| 3     | [0.3, 2.1, 2.0] |
| 4     | [2.2, 1.4, 1.2] |
| 5     | [0.7, 1.7, 0.5] |
| 6     | [4.1, 2.0, 4.5] |
| 7     | [3.1, 1.0, 4.0] |
+-------+-----------------+
Our second training example can now be represented as:
[[4.1, 2.0, 4.5], [3.1, 1.0, 4.0], [2.2, 1.4, 1.2], [1.2, 3.1, 2.5], [1.2, 3.1, 2.5]]
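We can verify this lookup directly: indexing the weight matrix from the table above with the second sequence [6, 7, 4, 0, 0] yields exactly those vectors.

```python
import numpy as np

# Embedding weight matrix from the table above (one row per index).
weights = np.array([[1.2, 3.1, 2.5],
                    [0.1, 4.2, 1.5],
                    [1.0, 3.1, 2.2],
                    [0.3, 2.1, 2.0],
                    [2.2, 1.4, 1.2],
                    [0.7, 1.7, 0.5],
                    [4.1, 2.0, 4.5],
                    [3.1, 1.0, 4.0]])

second_example = np.array([6, 7, 4, 0, 0])  # 'Nice meeting you' + padding
print(weights[second_example])
```

Note how the two padding zeros at the end both map to row 0 of the table.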
It is also worth mentioning that Keras optimizes these embedding vectors during the training phase.
Over the last couple of years it has become common practice, instead of initializing these embeddings with random numbers, to use embeddings learned by other methods/people in different domains.
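One hedged sketch of how this can be done in Keras: build the layer, load a pretrained matrix into it with set_weights, and optionally freeze it so the vectors stay fixed during training (a random matrix stands in here for real word2vec or GloVe vectors):

```python
import numpy as np
from tensorflow.keras.layers import Embedding

vocabulary_size, embedding_dimension = 8, 3

# Stand-in for pretrained vectors (e.g. word2vec / GloVe).
pretrained = np.random.default_rng(0).normal(
    size=(vocabulary_size, embedding_dimension))

# trainable=False freezes the layer: the pretrained vectors stay fixed.
layer = Embedding(vocabulary_size, embedding_dimension, trainable=False)
layer.build((None,))             # create the weight matrix
layer.set_weights([pretrained])  # load the pretrained vectors

output = layer(np.array([[1, 2, 3]])).numpy()
print(output.shape)  # (1, 3, 3)
```

Leaving trainable=True instead would fine-tune the pretrained vectors on the task at hand, which often works better when enough training data is available.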
Words with similar meaning tend to be represented by similar vectors. That is an interesting property one can take advantage of by using these pretrained word2vec embeddings. Note that such high-dimensional word embeddings can be plotted by using t-SNE, a dimensionality reduction technique.
The main advantages of using embeddings are:
- We choose the number of dimensions, whereas with one-hot encoding the dimensionality is fixed by the number of unique words.
- It has been shown that they can learn intrinsic properties of the words and group similar ones together.
- Instead of ending up with huge one-hot encoded vectors, we can use an embedding matrix to keep each vector much smaller. This is computationally efficient, especially when using very big datasets.
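The size difference is easy to quantify. The numbers below follow the 80,000-word example from earlier; the 300-dimensional embedding size is an assumption, typical of pretrained word2vec vectors:

```python
vocab_size = 80_000
embedding_dim = 300
sentence_length = 3  # 'i love you'

one_hot_floats = sentence_length * vocab_size      # 240,000 values
embedded_floats = sentence_length * embedding_dim  # 900 values
print(one_hot_floats // embedded_floats)           # 266: ~266x smaller
```

And unlike the one-hot representation, those 900 values actually carry semantic information.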
This brings us to the end of this article. I hope you got a basic understanding of how an Embedding layer is used, and that you'll remember to reach for it when dealing with text preprocessing tasks.
If you liked this article, please consider subscribing to my blog. That way I get to know that my work is valuable to you, and I can also notify you of future articles.
Thanks for reading and I am looking forward to hearing your questions :)
Stay tuned and Happy Machine Learning.