In my previous articles, I used two models to predict whether movie reviews from the IMDB dataset were positive or negative. If you haven't read those articles, I would urge you to read them before continuing.

In this article, I will try to improve on that performance by using a simple Neural Network.


Table of Contents

  • Introduction
  • Prepare the data
  • Simple Neural Network
  • Conclusion
  • References

Introduction

In the previous articles we used a Naive Bayes and a Logistic Regression model, and the basic idea was that we could take a document (e.g. a movie review) and turn it into a bag-of-words representation consisting of the number of times each word appears.

In this article, we will follow a slightly different approach.


Prepare the data

As we have seen, the term-document matrix is a matrix with dimensions:

print(trn_term_doc.shape)

In this representation, the sequence [3, 5] becomes an 800,000-dimensional vector that is all zeros except for indices 3 and 5, which are ones. If we fed this matrix into the first layer of our network, that layer would have to learn a weight for each of the 800,000 n-grams. This approach is very memory intensive.
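To make the memory argument concrete, here is a minimal, hypothetical sketch comparing the two representations of a single review (the 800,000 figure is simply the vocabulary size quoted above):

import numpy as np

vocab_size = 800_000                  # n-gram vocabulary size quoted above
sequence = [3, 5]                     # toy review containing tokens 3 and 5

# bag-of-words style: one slot per vocabulary entry, almost all zeros
one_hot = np.zeros(vocab_size)
one_hot[sequence] = 1
print(one_hot.nbytes)                 # ~6.4 MB for a single review

# integer-index style: just the token ids, ready for an Embedding layer
as_indices = np.array(sequence)
print(as_indices.nbytes)              # a handful of bytes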

To tackle this, we will preprocess the data in a different way. First, we map each word to an integer, and then we pad the arrays so they all have the same length. The end result is an integer tensor of shape num_reviews x max_length, which can easily be handled by an embedding layer as the first layer of our network.

The data preparation involves the following step:

  • Fit a Keras Tokenizer, which vectorizes a text corpus by turning each text into a sequence of integers (each integer being the index of a token in a dictionary).
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split

print(tf.__version__)

tok = keras.preprocessing.text.Tokenizer()
tok.fit_on_texts(trn) 
X_train = tok.texts_to_sequences(trn)
X_val = tok.texts_to_sequences(val)

" ".join(map(str,X_train[0]))

As we have seen, the movie reviews vary in length. For example, one movie review may contain 20 words while another contains 500.

lengths = [len(i) for i in X_train+X_val]
print(f'Max length of sentence: {max(lengths)}')
print(f'Average length of sentence: {np.mean(lengths)}')

sns.distplot(lengths)

However, the NN expects its inputs to be of the same length. That's why we will use the pad_sequences function to standardize the lengths. We pick 256 as max_length, which means that:

  • Sequences that are shorter than 256 are padded with 0 at the end.
  • Sequences longer than 256 are truncated so that they fit the desired length.
X_train = keras.preprocessing.sequence.pad_sequences(X_train,
                                                     padding='post',
                                                     maxlen=256)

X_val = keras.preprocessing.sequence.pad_sequences(X_val,
                                                   padding='post',
                                                   maxlen=256)

X_train[0]

lengths = [len(i) for i in np.concatenate([X_train, X_val])]  # after padding these are NumPy arrays
print(f'Max length of sentence: {max(lengths)}')
print(f'Average length of sentence: {np.mean(lengths)}')

Let's now try to convert this sequence of integers back to text:

# create inversed vocabulary
reverse_word_map = dict(map(reversed, tok.word_index.items()))

' '.join(reverse_word_map[i] for i in X_train[0] if i!=0) # exclude 0 due to padding

Simple Neural Network

A neural network is created by stacking layers, but this is not as simple as it sounds. The two main architectural decisions a machine learning practitioner has to make are the following:

  • How many layers to use in the model?
  • How many hidden units to use for each layer?

In our case, we have to solve a binary classification problem while the input data consists of an array of word-indices. Let's build a model for this problem:

# build model
# the input dimension is the vocabulary size learned by the tokenizer (+1 because index 0 is reserved for padding)
vocab_size = len(tok.word_index)+1

model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dropout(0.1))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))

model.summary()

The layers are stacked sequentially to build the classifier:

  1. The first layer is an Embedding layer. This layer takes the integer-encoded vocabulary and looks up the embedding vector for each word-index. These vectors are learned as the model trains. Please see my article for further details.
  2. Next, a GlobalAveragePooling1D layer returns a fixed-length output vector for each example by averaging over the sequence dimension; a short sketch after this list illustrates the shapes involved.
  3. This fixed-length output vector is piped through a fully-connected (Dense) layer with 16 hidden units.
  4. Next comes a Dropout layer with rate=0.1, which helps prevent overfitting.
  5. The last layer is densely connected with a single output node. Using the sigmoid activation function, the output value is squeezed to a float between 0 and 1, representing a probability.
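If the shape handling of the first two layers feels abstract, the following minimal sketch, using made-up toy dimensions rather than the values of our actual model, shows what the Embedding and GlobalAveragePooling1D layers do to a padded batch:

import numpy as np
from tensorflow import keras

# toy batch: 2 "reviews", each padded to length 4 (0 is the padding index)
toy_batch = np.array([[3, 5, 0, 0],
                      [7, 2, 9, 1]])

emb = keras.layers.Embedding(input_dim=10, output_dim=16)   # toy vocabulary of 10 words
pool = keras.layers.GlobalAveragePooling1D()

embedded = emb(toy_batch)   # shape (2, 4, 16): one 16-dimensional vector per token
pooled = pool(embedded)     # shape (2, 16): averaged over the sequence dimension
print(embedded.shape, pooled.shape)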

Loss function and optimizer

The Neural Network needs a loss function and an optimizer for training. Since this is a binary classification problem, we'll use the binary_crossentropy loss function and adam as the optimizer.

Binary Cross-Entropy

For a true label y and a predicted probability p, the binary cross-entropy loss is -[y*log(p) + (1 - y)*log(1 - p)]. Since y is either 0 or 1, this loss can simply be represented with an if statement as well:
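Here is a minimal NumPy sketch of that if-statement view of the loss (an illustration of the formula, not the implementation Keras uses internally):

import numpy as np

def binary_cross_entropy(y_true, p_pred):
    # per-sample loss written as an explicit if/else
    if y_true == 1:
        return -np.log(p_pred)        # positive review: penalise a low predicted probability
    return -np.log(1 - p_pred)        # negative review: penalise a high predicted probability

print(binary_cross_entropy(1, 0.9))   # small loss: confident and correct
print(binary_cross_entropy(1, 0.1))   # large loss: confident and wrong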

And here is how we compile the model with this loss and the adam optimizer:

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['acc'])
model.summary()

Create a validation set

Good practice for a machine learning practitioner is to always keep a train, a validation, and a test set, so as to correctly assess the model's performance. Here we split the padded training data into a training and a validation part, and keep the original held-out split (X_val, val_y) as our test set:

x_train, x_val, y_train, y_val = train_test_split(
                                    X_train, trn_y, test_size=0.33, random_state=42)

Fit and Evaluate the model

Now we are ready to train the model for 40 epochs in mini-batches of 512 samples, i.e. 40 iterations over all samples in the x_train and y_train tensors. While training, we monitor the model's loss and accuracy on the validation set.

history = model.fit(x_train,y_train,
                    epochs=40,
                    validation_data=(x_val, y_val),
                    verbose=1, # print result every epoch
                    batch_size=512)

Let's evaluate the model's performance:

loss, accuracy = model.evaluate(x_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(x_val, y_val, verbose=False)
print("Validation Accuracy:  {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_val, val_y, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))

Plot the loss and accuracy with respect to the number of epochs:

history_dict = history.history
acc = history_dict['acc']
val_acc = history_dict['val_acc']
loss = history_dict['loss']
val_loss = history_dict['val_loss']

epochs = range(1, len(acc) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()
plt.clf()   # clear figure

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

Let's compare its performance against the other models that we have seen until now:

|     Model         |Accuracy| 
|-------------------|--------|
|Naive Bayes        |   83%  | 
|Logistic Regression|   88%  |
|LR + Naive Bayes   |   92%  |
|Neural Network     |   88%  |

We can see that our more "sophisticated" model performs only as well as the Logistic Regression model. In a separate article we will see how to improve performance further by using a Convolutional Neural Network. At this point I would like to make a statement:

"In a business environment when the time is precious and you want to put a model in the production ASAP it is usually preferable to start with a simple model and then experiment with more sophisticated Deep Learning models as it maight take time to tune it approriately."


Conclusion

This brings us to the end of this article. I hope you now have a basic understanding of how a Neural Network can be used for Sentiment Analysis. Feel free to use the Python code snippets of this article.

The full code of this article can be found in this GitHub Repository.

If you liked this article, please consider subscribing to my blog. That way I get to know that my work is valuable to you, and it also lets me notify you about future articles.
Thanks for reading, and I am looking forward to hearing your questions :)
Stay tuned and Happy Machine Learning.


References