In my previous article, I used a Naive Bayes model to predict whether movie reviews from the IMDB dataset were positive or negative. If you haven't read that article, I would urge you to read it before continuing.

In this article, I will improve the performance of the model by using a combination of the Naive Bayes model and Logistic Regression.


Table of Contents

  • Introduction
  • Logistic Regression
  • Logistic Regression Blended with Naive Bayes
  • Conclusion
  • References

Introduction

In the previous article we used a Naive Bayes model. The basic idea was to take a document (e.g. a movie review) and turn it into a bag-of-words representation consisting of the number of times each word appears.

We then fit the Naive Bayes model and achieved an accuracy of 83%. Similar performance was also achieved when we simply used ones and zeros for the presence or absence of a word (the binarized version).
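For readers who haven't run the previous article's code, here is a minimal sketch of the setup that the snippets below assume: trn and val hold the review texts, trn_y and val_y the 0/1 labels, and tokenize is the tokenizer defined previously.

from sklearn.feature_extraction.text import CountVectorizer

# Bag-of-words setup assumed from the previous article
veczr = CountVectorizer(tokenizer=tokenize)
trn_term_doc = veczr.fit_transform(trn)   # sparse matrix of word counts
val_term_doc = veczr.transform(val)       # reuse the training vocabulary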


Logistic Regression

In theory, Naive Bayes sounds fine, but it is "naive": it assumes that the occurrences of words in a review are independent of each other. To address this limitation, instead of deriving the coefficients r theoretically, let's try to learn them from the data.

To do that we will use logistic regression. It gives us a model with exactly the same functional form, but rather than using a theoretical r and a theoretical b, we learn both of them from the data.

# LogisticRegression comes from scikit-learn; trn_term_doc, val_term_doc,
# trn_y and val_y were built as in the previous article.
from sklearn.linear_model import LogisticRegression

print(trn_term_doc.shape)

x = trn_term_doc
y = trn_y
m = LogisticRegression(C=1e8, dual=True, solver='liblinear')  # C=1e8: effectively no regularization; dual=True requires liblinear
m.fit(x, y)
preds = m.predict(val_term_doc)
(preds == val_y).mean()  # validation accuracy

By fitting a simple Logistic Regression we immediately see that the accuracy increases. We expected that, since a theoretical model is rarely going to be as accurate as a data-driven one. Theoretical models are good when you are dealing with something like physics, where you are sure how the world works. In most cases that certainty is not available, so it is better to learn the coefficients from the data.

What’s this dual=True?

As we saw above, our term-document matrix is much wider than it is tall. There is an almost mathematically equivalent reformulation of logistic regression that happens to be a lot faster in that case. So the short answer is: whenever the matrix is wider than it is tall, use dual=True and it will run faster.

In mathematics there is the concept of the dual version of a problem: an equivalent formulation of the same problem that sometimes works better in certain situations.
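To make the rule of thumb concrete, here is a minimal sketch (the variable name use_dual is just for illustration):

# Rule of thumb as code: more columns (terms) than rows (documents)
# means the dual formulation is the faster choice.
use_dual = trn_term_doc.shape[1] > trn_term_doc.shape[0]
m = LogisticRegression(C=1e8, dual=use_dual, solver='liblinear')  # dual needs liblinear + l2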

Let's also see what the official scikit-learn documentation says: the dual formulation is only implemented for the l2 penalty with the liblinear solver, and dual=False is preferred when n_samples > n_features (which is not our case here).

Binarized version

m = LogisticRegression(C=1e8, dual=True, solver='liblinear')  # C=1e8: very weak regularization
m.fit(trn_term_doc.sign(), y)  # .sign() keeps only presence or absence of words
preds = m.predict(val_term_doc.sign())
(preds == val_y).mean()

Tune Regularization

As we saw previously, the dataset has approximately 75,000 columns, or in other words 75,000 coefficients, one for each term in our vocabulary. Keeping in mind that we only have 25,000 reviews, that is a lot of coefficients, so we suspect that regularization might improve the performance of the model.

To do that, we will tune the parameter C, which controls the amount of regularization. A high value of C means less regularization, while a low value means more. That's why we used 1e8 above: to basically turn off regularization. Let's decrease that value:

m = LogisticRegression(C=0.1, dual=True, solver='liblinear')  # noticeably stronger regularization
m.fit(x, y)
preds = m.predict(val_term_doc)
(preds == val_y).mean()

Binarized version

m = LogisticRegression(C=0.1, dual=True, solver='liblinear')
m.fit(trn_term_doc.sign(), y)  # binarized features
preds = m.predict(val_term_doc.sign())
(preds == val_y).mean()

As we were expecting, increasing the regularization improved the model; it seems that initially the model was overfitting. Note that the parameter C here controls L2 regularization, which penalizes the squares of the weights.

In simple terms, L2 regularization adds a term to the objective function which is the sum of the squares of all coefficients, while L1 adds the sum of their absolute values. It can be shown mathematically that L1 regularization tries to drive as many coefficients as possible to exactly zero, whereas L2 regularization tries to make all of them smaller. For example, if we have two features that are highly correlated, L2 regularization will shrink the weights of both, while L1 will tend to set one of them to zero and keep the other nonzero.
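Written out (a sketch in LaTeX, with L(w) denoting the unregularized cross-entropy loss and a the regularization strength), the two penalized objectives are:

J_{\text{L2}}(w) = L(w) + a \sum_j w_j^2
J_{\text{L1}}(w) = L(w) + a \sum_j |w_j|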

In modern machine learning we don't agonize over which of the two to use: we simply try both and keep the one that ends up with the lower error on the validation set.

The reason we used L2 regularization for this problem is that L2 is the only penalty supported when we set dual=True in scikit-learn's LogisticRegression (and it is the default penalty anyway).
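As a minimal sketch of the "try both" approach, one could compare the two penalties and a few values of C directly on the validation set; note that liblinear supports l1 only with dual=False:

for penalty, dual in (("l2", True), ("l1", False)):
    for C in (0.01, 0.1, 1.0):
        m = LogisticRegression(penalty=penalty, C=C, dual=dual, solver="liblinear")
        m.fit(trn_term_doc.sign(), trn_y)
        acc = (m.predict(val_term_doc.sign()) == val_y).mean()
        print(penalty, C, round(acc, 3))  # keep whichever setting scores best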

Below are two useful YouTube videos which explain L2 and L1 regularization from a mathematical perspective:

Add n-grams

If you remember, we used CountVectorizer to preprocess our features. By default it produces unigrams, i.e. single words, but if we pass ngram_range=(1,3) it will also produce bigrams and trigrams. Let's try that:

veczr = CountVectorizer(ngram_range=(1, 3), tokenizer=tokenize,
                        max_features=800000)  # limit features to 800,000
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)

# as expected, 800,000 feature columns
print(trn_term_doc.shape)

# inspect the vocabulary (get_feature_names_out() in newer scikit-learn)
vocab = veczr.get_feature_names()

vocab[300000:300010]

As we can see, the vocabulary now includes bigrams such as 'for whatever' and trigrams such as 'for what seemed'. This usually works better because the model can now distinguish between 'not good', 'not bad' and 'not terrible'. So with n-gram features we expect both Naive Bayes and logistic regression to improve. Let's validate our hypothesis.
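As a quick sanity check (a minimal sketch using the vectorizer's vocabulary_ mapping), we can verify that such phrases now have their own columns:

# each key of vocabulary_ is an n-gram, each value its column index
for phrase in ("not good", "not bad", "not terrible"):
    print(phrase, phrase in veczr.vocabulary_)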

With max_features=800000 we force the CountVectorizer to keep only the 800,000 most common n-grams. The reason we restrict it is that we only have 25,000 movie reviews, so it wouldn't make sense to have many more weights than that.

However, even if we don't restrict it, it still works. The model would end up with around 70 million coefficients, so training time increases, but with careful tuning of the regularization it achieves similar performance.
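For reference, here is a minimal sketch of the same vectorizer without the cap; it is considerably slower and more memory-hungry:

veczr_full = CountVectorizer(ngram_range=(1, 3), tokenizer=tokenize)  # no max_features
trn_full = veczr_full.fit_transform(trn)
print(trn_full.shape)  # many millions of n-gram columns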

y = trn_y
x = trn_term_doc  # the n-gram term-document matrix built above
m = LogisticRegression(C=0.1, dual=True, solver='liblinear')
m.fit(x, y)
preds = m.predict(val_term_doc)
(preds == val_y).mean()

Binarized version

m = LogisticRegression(C=0.1, dual=True, solver='liblinear')
m.fit(x.sign(), y)  # binarized n-gram features
preds = m.predict(val_term_doc.sign())
(preds == val_y).mean()

Logistic Regression Blended with Naive Bayes

As seen up to now, we used two models:

  • Naive Bayes: we derive the weight vector r mathematically; each weight is simply the ratio of the probability of that feature occurring when the class is 1 to the probability when the class is 0.
  • Logistic Regression: we learn the coefficients. We start out with some random numbers and then use an iterative optimization procedure (such as gradient descent) to find slightly better ones.

Both techniques calculate the same kind of object (a rank-1 vector of coefficients); the only difference is that one is based on theory and the other on data. So, in a way, we could take advantage of the vector r that the Naive Bayes model gives us.
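As a reminder, here is a minimal sketch of how r can be computed, assumed to match the previous article: for each term, the log of the ratio of its (smoothed) probability in positive reviews to that in negative reviews.

import numpy as np

p = (x[y == 1].sum(0) + 1) / ((y == 1).sum() + 1)  # P(term | positive), add-one smoothed
q = (x[y == 0].sum(0) + 1) / ((y == 0).sum() + 1)  # P(term | negative), add-one smoothed
r = np.log(p / q)                                  # one log-ratio per vocabulary term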

One way that seems to improve performance is to multiply the term-document matrix by the vector r, as a broadcasted element-wise multiplication. We then use this new matrix as the independent variables in our logistic regression. Let's check its performance:

x_nb = x.multiply(r)  # scale each column by its Naive Bayes log-ratio
m = LogisticRegression(dual=True, C=0.1, solver='liblinear')
m.fit(x_nb, y)

val_x_nb = val_term_doc.multiply(r)  # same scaling for the validation set
preds = m.predict(val_x_nb)
(preds.T == val_y).mean()

Surprisingly, the model's accuracy improved again. Let's now examine the underlying reason for that boost in performance.

In principle, a fixed rescaling of the independent variables shouldn't have any impact on the model's performance, yet in practice it does. So the real question is: why did it make a difference?

The answer is regularization. Let me elaborate on that by revisiting the logistic regression objective function:
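A sketch of the L2-regularized form, with a denoting the regularization strength (the reciprocal of scikit-learn's C):

J(w) = \text{CrossEntropy}(w) + a \sum_j w_j^2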

With the regularization term present, the weights tend to stay small, because otherwise the penalty a·w² grows and drowns out the cross-entropy term. Ideally we want a good fit with as little penalty as possible: small weights that still fit the underlying function.

Now, the vector r contains a priori information about the terms, and by multiplying it into the term-document matrix we take advantage of that information. The weights can remain small when they agree with the Naive Bayes prior, and only need to grow when there is good reason, in the data, to disagree.

Why multiply only by r, and not by r² or something similar with much higher variance?

Because our prior information comes from a theoretical model, we don't want to rely on it too heavily; we use it with caution.

The idea is that we incorporate our theoretical expectations into the data we give the model. In that way we don't need to regularize as heavily, which for this particular problem we otherwise had to, since we have far more features than data points.

Remember that this technique works well when the scikit-learn parameter C, which is the reciprocal of the regularization strength, is at a reasonable level. If it is set to a really low value, it will hurt the model's performance:

x_nb = x.multiply(r)
m = LogisticRegression(dual=True, C=1e-8, solver='liblinear')  # C=1e-8: extreme regularization
m.fit(x_nb, y)

val_x_nb = val_term_doc.multiply(r)
preds = m.predict(val_x_nb)
(preds.T == val_y).mean()

The reason is that the loss function becomes overwhelmed by the need to shrink the weights rather than to make the model predictive. So this trick works well when the regularization has a reasonable value, and the philosophy is:

  • there is no need to push a weight up if it agrees with the Naive Bayes expectation
  • push a weight up only if there are good reasons to disagree

In a way we end up reaping the benefits of both worlds, and that gives us a very nice result.


Conclusion

This brings us to the end of this article. I hope you got a basic understanding of how Logistic Regression can be used for sentiment analysis. Keep it in your toolbox, as it is a really fast and simple algorithm, and feel free to reuse the Python code snippets from this article.

The full code of this article can be found in this GitHub Repository.

If you liked this article, please consider subscribing to my blog. That way I will know that my work is valuable to you, and I will also be able to notify you about future articles.
Thanks for reading, and I am looking forward to hearing your questions :)
Stay tuned and Happy Machine Learning.


References