Deep neural networks have achieved remarkable results in several NLP tasks. One of them is text classification, i.e., assigning a set of pre-defined tags to a text based on its content. Much of that success can be attributed to the convolutional neural network, a special kind of neural network.
In this post, I will go a step further and explain what a multichannel convolutional neural network is and how it can be used for sentiment prediction.
Table of Contents
- Word2Vec Embeddings
- 1D Convolutions
- Why 1D convolution is better than bag of words?
- Code Implementation
Word2Vec Embeddings
In this technique, we map each word of a sentence to an embedding vector, which tends to be much smaller than a bag-of-words representation. In a bag-of-words representation, the length of the vector is determined by the number of unique words in the corpus.
Word2Vec embeddings are pre-trained embeddings that have been learned in an unsupervised manner. These vectors have a really nice property: words that appear in similar contexts tend to have vectors that are nearly collinear, i.e., they point in roughly the same direction.
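To make the "same direction" idea concrete, here is a minimal sketch that measures the angle between word vectors with cosine similarity. The three-dimensional vectors are hand-picked toy values for illustration only; they are not real Word2Vec embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, chosen by hand (assumption, not trained vectors).
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
sim_cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])
print(f"cat vs dog: {sim_cat_dog:.3f}")
print(f"cat vs car: {sim_cat_car:.3f}")
```

With these toy values, "cat" and "dog" come out almost collinear while "cat" and "car" do not, which is the behaviour we hope to see for words used in similar contexts.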
We can now replace each word with its embedding, which is the feature representation of that word. For a sentence representation, we can simply take the sum of those Word2Vec vectors. This gives you a solid baseline model that actually works pretty well.
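The sum-of-vectors baseline can be sketched in a few lines. The function name `sentence_vector`, the toy embeddings and the choice to skip unknown words are all assumptions made for this illustration.

```python
def sentence_vector(tokens, embeddings, dim):
    """Sum the embeddings of the known tokens; unknown tokens are skipped."""
    total = [0.0] * dim
    for token in tokens:
        vec = embeddings.get(token)
        if vec is None:
            continue  # assumption: out-of-vocabulary words contribute nothing
        total = [t + v for t, v in zip(total, vec)]
    return total


# Hypothetical toy embeddings (hand-picked, not real Word2Vec vectors).
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "sitting": [0.2, 0.1, 0.7],
    "there": [0.0, 0.3, 0.3],
}

features = sentence_vector("cat sitting there".split(), toy_embeddings, dim=3)
longer = sentence_vector("the cat sitting there quietly".split(), toy_embeddings, dim=3)
print(len(features), len(longer))
```

Whatever the sentence length, the result has the same size as a single embedding, so it can be fed directly to any classifier.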
1D Convolutions
We map each word to a vector: an embedding of length, let's say, 300 (the ML practitioner is free to choose the embedding length or to use pre-trained embeddings). Now let's think about how to feed this to a neural network and, more specifically, how to make use of 2-grams with this representation.
With a bag-of-words representation this is fairly simple, since each particular 2-gram has its own column. With word embeddings, we can analyze 2-grams by running a sliding window over pairs of neighbouring embeddings.
The green border in the picture below represents a convolutional filter. For example, if you take the embeddings that correspond to "cat sitting" and convolve them with that two-gram filter, the result is a high activation, simply because this particular filter is very similar to the word embeddings of that pair of words.
This convolutional filter has weights that the model optimizes during the training phase. The same philosophy applies here as in the 2D or 3D convolutional neural networks used in computer vision.
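As a rough illustration of the sliding two-gram window, the sketch below convolves a hand-built filter over a three-word sentence. The helper `conv1d_valid`, the toy embeddings and the filter values are assumptions; a real model learns the filter weights and usually adds a bias and a non-linearity.

```python
def conv1d_valid(sequence, kernel):
    """Slide `kernel` (n x d) over `sequence` (T x d); returns T - n + 1 activations."""
    n = len(kernel)
    activations = []
    for start in range(len(sequence) - n + 1):
        window = sequence[start:start + n]
        # Dot product between the filter and the embeddings under the window.
        act = sum(w * x
                  for krow, xrow in zip(kernel, window)
                  for w, x in zip(krow, xrow))
        activations.append(act)
    return activations


# Hypothetical toy embeddings (hand-picked for illustration).
emb = {
    "cat": [0.9, 0.8, 0.1],
    "sitting": [0.2, 0.1, 0.7],
    "there": [0.0, 0.3, 0.3],
}
sentence = [emb[w] for w in "cat sitting there".split()]

# A two-gram filter that happens to look like the pair "cat sitting".
cat_sitting_filter = [emb["cat"], emb["sitting"]]

activations = conv1d_valid(sentence, cat_sitting_filter)
print(activations)  # one value per two-gram window
```

The first window ("cat sitting") matches the filter and gets the largest activation; the second window ("sitting there") scores noticeably lower.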
Why 1D convolution is better than bag of words?
In the bag-of-words approach, each particular two-gram has its own column, so we would have to come up with a lot of convolutional filters to learn a representation of all those two-grams. With pre-trained Word2Vec embeddings, words with similar meanings are close in terms of cosine similarity. Cosine similarity is a normalized dot product, and at each window position a convolution is exactly a dot product between the filter and the embeddings under it.
Let's go back to our example and consider a different sentence like "dog resting here". You will find that cat and dog have similar Word2Vec embeddings because they are seen in similar contexts (e.g., in sentences like "my dog ran away" or "my dog ate my homework" you can replace dog with cat and still get a plausible, frequent sentence).
Convolutions are better because when you convolve two similar n-grams like "dog resting" and "cat sitting" with the same convolutional filter, you get a high activation value for both.
It turns out that if we have good embeddings, convolutions let us look at the higher-level meaning of a two-gram. For example, we can learn a convolutional filter whose core meaning is "animal sitting", so convolving it over two-grams like "cats eating", "dog resting", "cat resting" or "dog sitting" will produce high activations.
That is pretty cool, because you no longer need a column for every possible two-gram; you just look at pairs of word embeddings and learn convolutional filters that pick up meaningful features.
This technique can easily be extended to 3-grams, 4-grams or any other n-gram and, contrary to a bag-of-words representation, your feature matrix won't explode. The reason is that the feature matrix is fixed; you only need to change the size of the filter you convolve with.
As we discovered in image processing, one filter is not enough: we need to track many n-grams and many different meanings of those n-grams. That's why we need a lot of convolutional filters. These filters are called 1D convolutions because we slide the window in only one direction, contrary to image processing, where we slide the window in two or three directions.
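Here is a small sketch of the "animal sitting" idea. The filter is built by hand as the average of two animal vectors and two posture-verb vectors, purely as an assumption for illustration; in a trained network the filter would be learned from data. All embeddings are toy values.

```python
def bigram_activation(first, second, kernel):
    """Dot product between a two-word window and a 2 x d filter."""
    return sum(w * x
               for krow, xrow in zip(kernel, (first, second))
               for w, x in zip(krow, xrow))


def mean_vector(vectors):
    """Element-wise average of a list of equally sized vectors."""
    return [sum(column) / len(column) for column in zip(*vectors)]


# Hypothetical 4-dimensional toy embeddings (hand-picked).
emb = {
    "cat": [0.9, 0.8, 0.1, 0.0],
    "dog": [0.8, 0.9, 0.2, 0.0],
    "sitting": [0.0, 0.1, 0.9, 0.8],
    "resting": [0.1, 0.0, 0.8, 0.9],
    "stock": [-0.7, -0.6, 0.1, 0.0],
    "rising": [0.0, 0.1, -0.8, -0.7],
}

# Hand-built stand-in for a learned "animal sitting" filter.
animal_sitting = [
    mean_vector([emb["cat"], emb["dog"]]),
    mean_vector([emb["sitting"], emb["resting"]]),
]

scores = {
    pair: bigram_activation(emb[pair[0]], emb[pair[1]], animal_sitting)
    for pair in [("cat", "sitting"), ("dog", "resting"), ("cat", "resting"),
                 ("dog", "sitting"), ("stock", "rising")]
}
for pair, score in scores.items():
    print(pair, round(score, 3))
```

All four animal-plus-posture pairs fire strongly on the single filter, while the unrelated pair does not, so one filter covers what would be four separate bag-of-words columns.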
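A quick back-of-the-envelope comparison shows why the feature matrix does not explode. The vocabulary size of 10,000 is an assumption, and the bag-of-n-grams figure is an upper bound (in practice only n-grams that occur in the corpus get a column); the embedding length of 300 and the 100 filters follow the numbers used in this post.

```python
def bow_ngram_columns(vocab_size, n):
    """Upper bound on the number of distinct n-gram columns."""
    return vocab_size ** n


def conv_filter_weights(n, embedding_dim, num_filters):
    """Trainable weights for `num_filters` filters of width n (one bias each)."""
    return num_filters * (n * embedding_dim + 1)


VOCAB_SIZE = 10_000  # assumed vocabulary size
for n in (2, 3):
    print(n, bow_ngram_columns(VOCAB_SIZE, n), conv_filter_weights(n, 300, 100))
# 2 100000000 60100
# 3 1000000000000 90100
```

Going from 2-grams to 3-grams multiplies the bag-of-words upper bound by the vocabulary size, while the convolutional filters only grow by one extra embedding-length row each.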
Let's see an example. We add some padding so that the size of the output is the same as the size of the input, and we convolve.
The sliding window of size 3 will run vertically through the embedding matrix.
Continuing to run the sliding window over the rest of the matrix, we end up with:
The bad thing, however, is that the number of outputs is equal to the number of inputs. That means that for variable-length sentences we get a variable number of features, and we don't want that because the dense layers that follow expect a fixed-size input.
In a bag-of-words representation we lose the ordering of the words, since we are only interested in whether or not we have seen a two-gram meaning "animal sitting" in the sentence. Similarly, with a convolutional filter we don't care whether it fired at the beginning or at the end of the sentence. The only thing we care about is whether that combination was in the text or not, and we can check that simply by taking the maximum activation obtained with this convolutional filter.
So, we go through the whole text, convolving over neighbouring embeddings, and take the maximum value of the convolution. This is called max pooling over time, and it works just like max pooling in images.
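The effect of the padding can be checked with a small sketch. The helper `conv1d_same` is a hypothetical plain-Python stand-in that pads with zero vectors (splitting the padding as evenly as possible, with any extra row at the end); the filter and the two-dimensional embeddings are toy values.

```python
def conv1d_same(sequence, kernel):
    """1D convolution with zero padding so that len(output) == len(sequence)."""
    width, dim = len(kernel), len(kernel[0])
    pad_left = (width - 1) // 2
    pad_right = (width - 1) - pad_left
    zero = [0.0] * dim
    padded = [zero] * pad_left + list(sequence) + [zero] * pad_right
    outputs = []
    for start in range(len(sequence)):
        window = padded[start:start + width]
        outputs.append(sum(w * x
                           for krow, xrow in zip(kernel, window)
                           for w, x in zip(krow, xrow)))
    return outputs


kernel = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical width-3 filter
three_words = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
seven_words = three_words + three_words + [[0.7, 0.8]]

print(len(conv1d_same(three_words, kernel)))  # 3
print(len(conv1d_same(seven_words, kernel)))  # 7
```

A three-word sentence yields three activations and a seven-word sentence yields seven, so the number of features still depends on the sentence length.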
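Max pooling over time is just a maximum over the activations of one filter. The activation values below are made up for illustration.

```python
def max_over_time(activations):
    """Collapse a variable-length list of activations into a single value."""
    return max(activations)


# Made-up activations: the matching n-gram sits at the start or at the end.
match_at_start = [0.7, 0.1, -0.2, 0.0, 0.3]
match_at_end = [0.0, 0.3, 0.1, -0.2, 0.7]
short_sentence = [0.2, 0.7]

print(max_over_time(match_at_start), max_over_time(match_at_end))  # 0.7 0.7
print(max_over_time(short_sentence))  # 0.7
```

The pooled value is the same wherever the n-gram occurs and however long the sentence is, which gives exactly one feature per filter.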
Let's summarize our steps:
- take an input sequence
- map its words to embeddings
- take a convolution window of size 3 (for tri-grams) by the length of the embedding vector
- convolve with that filter sliding in one direction
- take the maximum activation (output)
Final Model Architecture
The final architecture might look like:
- use filters of size 3, 4 and 5 so that we can capture information about 3-, 4- and 5-grams; for each n-gram size we learn 100 filters, which gives 300 outputs after max pooling
- apply a multilayer perceptron (MLP) on top of those features to get the final output
Let's look at the image above. We have an input sequence and we convolve it using:
- a tri-gram filter (red) that corresponds to some convolutional filter with maximum activation of 0.7
- a bi-gram filter (green) that corresponds to some convolutional filter with maximum activation of -0.7
- we add both values to the output
In this way, using filters of different sizes, we get a vector with 300 outputs. This vector is a kind of embedding of our input sequence: we have found a way to convert an input sequence into a vector of fixed size. The obvious next step is to apply some dense layers on top of those 300 features and train the model for any task.
In this paper, they compare the results of different models, showing the superiority of the N-gram Multichannel Convolutional Neural Network. Remember that convolutions are usually pretty fast operations, so our proposed model may work even faster than bag-of-words.
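The fixed-size property can be verified with a dependency-free sketch: three filter widths, 100 filters each, max pooling over time, then concatenation. The filters and "sentences" are random numbers and the embedding length is shrunk to 8 to keep the example fast; these are assumptions for illustration, and the bias and ReLU of a real layer are left out.

```python
import random


def conv1d_valid(sequence, kernel):
    """Slide `kernel` (n x d) over `sequence` (T x d) without padding."""
    n = len(kernel)
    return [
        sum(w * x
            for krow, xrow in zip(kernel, sequence[start:start + n])
            for w, x in zip(krow, xrow))
        for start in range(len(sequence) - n + 1)
    ]


def encode(sequence, filter_bank):
    """Concatenate the max-over-time activation of every filter."""
    features = []
    for filters in filter_bank.values():
        for kernel in filters:
            features.append(max(conv1d_valid(sequence, kernel)))
    return features


rng = random.Random(0)
EMBEDDING_DIM = 8  # kept small for the sketch; the post uses 300


def random_matrix(rows):
    return [[rng.uniform(-1, 1) for _ in range(EMBEDDING_DIM)] for _ in range(rows)]


# 100 random filters for each of the widths 3, 4 and 5.
filter_bank = {width: [random_matrix(width) for _ in range(100)]
               for width in (3, 4, 5)}

short_features = encode(random_matrix(6), filter_bank)   # 6-word "sentence"
long_features = encode(random_matrix(20), filter_bank)   # 20-word "sentence"
print(len(short_features), len(long_features))  # 300 300
```

Both sentences end up as 300-dimensional vectors, so the same dense layers can be applied to either of them.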
Code Implementation
A model built in Keras following the above philosophy is shown below:
from keras import regularizers
from keras.layers import (Concatenate, Convolution1D, Dense, Dropout, Embedding,
                          GlobalMaxPooling1D, Input, SpatialDropout1D)
from keras.models import Model
from keras.utils import plot_model


def build_model(pr_hyper, m_hyper):
    # Input: a fixed-length sequence of word indices
    model_input = Input(shape=(pr_hyper.sequence_length,), dtype='int32')
    # use a random embedding for the text
    x = Embedding(pr_hyper.max_words, m_hyper.embedding_dim)(model_input)
    x = SpatialDropout1D(m_hyper.dropout_prob)(x)

    conv_kern_reg = regularizers.l2(0.00001)
    conv_bias_reg = regularizers.l2(0.00001)

    # Convolutional block: one channel per filter size (n-gram length)
    conv_blocks = []
    for sz in m_hyper.filter_sizes:
        conv = Convolution1D(filters=m_hyper.num_filters,
                             kernel_size=sz,
                             padding="same",
                             activation="relu",
                             strides=1,
                             kernel_regularizer=conv_kern_reg,
                             bias_regularizer=conv_bias_reg)(x)
        # max pooling over time: one value per filter
        conv = GlobalMaxPooling1D()(conv)
        conv_blocks.append(conv)

    # merge the channels into a single feature vector
    x = Concatenate()(conv_blocks) if len(conv_blocks) > 1 else conv_blocks[0]

    x = Dropout(m_hyper.dropout_prob)(x)
    x = Dense(m_hyper.hidden_dims, activation="relu")(x)
    model_output = Dense(1, activation="sigmoid")(x)

    model = Model(model_input, model_output)
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    model.summary()  # summary() prints the table itself
    # plot_model needs pydot and graphviz to be installed
    plot_model(model, show_shapes=True, to_file='multichannel.png')
    return model
The resulting model has the structure below:
Please note that the model's performance could improve by using some pre-trained word embeddings.
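One possible way to prepare pre-trained vectors for the embedding layer is to build a matrix whose row i holds the vector of the word with index i. The sketch below assumes a hypothetical `word_index` (word to 1-based index, as a tokenizer would typically produce) and a hypothetical `pretrained` dictionary of vectors; words without a pre-trained vector get small random values.

```python
import random


def build_embedding_matrix(word_index, pretrained, dim, seed=0):
    """Row i holds the vector for the word with index i; row 0 is kept for padding."""
    rng = random.Random(seed)
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        vec = pretrained.get(word)
        if vec is None:
            # assumption: unseen words start from small random values
            vec = [rng.uniform(-0.05, 0.05) for _ in range(dim)]
        matrix[idx] = list(vec)
    return matrix


# Hypothetical inputs for illustration only.
word_index = {"cat": 1, "sitting": 2, "zyzzyva": 3}
pretrained = {"cat": [0.9, 0.8, 0.1], "sitting": [0.2, 0.1, 0.7]}

embedding_matrix = build_embedding_matrix(word_index, pretrained, dim=3)
print(len(embedding_matrix), len(embedding_matrix[0]))  # 4 3
```

A matrix like this could then be used as the initial weights of the `Embedding` layer in `build_model` (for example through its `embeddings_initializer` argument) instead of the random initialization; whether to keep those weights frozen or fine-tune them is a design choice to validate on your data.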
A full example applying the N-gram Multichannel Convolutional Neural Network to the IMDB dataset can be found in my GitHub repo.
This brings us to the end of this article. I hope you got a basic understanding of how an N-gram Multichannel Convolutional Neural Network can be used for text classification. Feel free to use the Python code snippet from this article.
Thanks for reading, and I am looking forward to hearing your questions :)
Stay tuned and Happy Machine Learning.