NLP refers to any kind of modelling where we are working with natural language text. Sentiment Analysis is a one of the most common NLP task that Data Scientists need to perform.

Interestingly enough, we are going to look at a situation where a linear model's performance is pretty close to the state of the art for solving a particular problem. In today's article, we will build a simple Naive Bayes model using the IMDB dataset.

Table of Contents

  • The problem
  • Import Libraries
  • Data
  • Text data preprocessing
  • Naive Bayes
  • Advantages and Disadvantages of Naive Bayes
  • Conclusion
  • References

The Problem

The dataset contains a collection of 50,000 reviews from IMDB. It contains an even number of positive and negative reviews. Actually, IMDb lets users rate movies on a scale from 1 to 10. To label these reviews the curator of the data, labeled anything with ≤ 4 stars as negative and anything with ≥ 7 stars as positive Reviews with 5 or 6 stars were left out. The dataset is divided into training and test sets.

Our task is to look at these movie reviews and for each one, we are going to predict whether they were positive or negative.

Import Libraries

First, we will need to import the following Python libraries.

%reload_ext autoreload
%autoreload 2
%matplotlib inline

from glob import glob
import numpy as np
import os,re,string
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer


The data can be download it by running the following commands in a Jupyter notebook:

!gunzip aclImdb_v1.tar.gz
!tar -xvf aclImdb_v1.tar

Once the above commands finished you’ll see that you’ve got a train and a test directory and inside your train directory, you’ll see there is a negative and a positive directory. The ones that were strongly positive went in /pos and strongly negative went in /neg. In both directories, you’ll see there is a bunch of text files.

names = ['neg','pos']
!ls {PATH}
!ls {PATH}train
!ls {PATH}train/pos | head

#Similar for the test folder
!ls {PATH}test
!ls {PATH}test/pos | head

That will go through and find all of the files inside the folder (the first argument f'{PATH}train') with these names (the second argument names) and create a labeled dataset. But basically, it’s going to go through each directory, and go through each file in that directory, then stick that into a list of texts, figure out what folder it’s in, and stick that into an array of labels. So that’s how we end up with something where we have a list of the reviews and an array of the labels.

def load_texts_labels_from_folders(path, folders):
    texts,labels = [],[]
    for idx,label in enumerate(folders):
        for fname in glob(os.path.join(path, label, '*.*')):
            texts.append(open(fname, 'r').read())
    # stored as np.int8 to save space 
    return texts, np.array(labels).astype(np.int8)

trn,trn_y = load_texts_labels_from_folders(f'{PATH}train',names)
val,val_y = load_texts_labels_from_folders(f'{PATH}test',names)




The data is split evenly with 25k reviews intended for training and 25k for testing your classifier. Moreover, each set has 12.5k positive and 12.5k negative reviews.

Here is an example of a text file and its label:

print(f"Review's label: {trn_y[0]}")
# 0 represent a negative review

Jupyter Trick

If at some point when coding on Jupyter you forgot the definition of a function, you can run ??<name_of_function> and a pop out a window will appear with its definition:


Text data preprocessing

The next step is to preprocess the movie reviews. To tackle this we will the CountVectorizer API of Sklearn which convert a collection of text documents to a matrix of token counts. Practically, it creates a sparse bag of words matrix with the caveat that throws away all of the interesting stuff about language which is the order in which the words are in.

This is very often not a good idea, but in this particular case, it’s going to turn out to work not too badly. Normally, the order of the words matters a lot. If you’ve got a “not” before something, then that “not” refers to that thing.

But in this case, we are trying to predict whether something is positive or negative. If you see the word “absurd” or “cryptic” appear a lot then maybe that’s a sign that this isn’t very good. So the idea is that we are going to turn it into something called a term document matrix where for each document (i.e. each review), we are just going to create a list of what words are in it, rather than what order they are in.


Before transforming our text into a term document matrix we will need to tokenize it first. In NLP tokenization is the process of transforming your text into a list of words. For example:

"This movie is good" ---> ["This", "movie", "is", "good"]

This looks like a trivial process however it isn't. In the case we have This "movie" isn’t good., how do you deal with that punctuation? A good tokenizer would turn this:

"This 'movie' isn’t good." ---> ["This", "'", "movie", "'", "is", "n't", "good","."]

Every token is either a single piece of punctuation, word or this suffix n't which is considered like a word. That’s how we would probably want to tokenize that piece of text. You wouldn’t want just to split on spaces cause it would have resulted to weird tokens like "good." and "movie".

So this is how we create our term document matrix with a tokenizer:

re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')
def tokenize(s): return re_tok.sub(r' \1 ', s).split()

#create term documetn matrix
veczr = CountVectorizer(tokenizer=tokenize)

fit_transform(trn) finds the vocabulary in the training set. It also transforms the training set into a term-document matrix. Since we have to apply the same transformation to your validation set, the second line uses just the method transform(val). We wouldn’t want the validation set and the training set to have the words in different orders in the matrices. Because then they would have different meanings. So this is here saying use the same vocabulary to create a bag of words for the validation set. trn_term_doc and val_term_doc are sparse matrices. trn_term_doc[i] represents training document i and it contains a count of words for each document for each word in the vocabulary.

trn_term_doc = veczr.fit_transform(trn)
# Important: Use same vocab for validation set
val_term_doc = veczr.transform(val)
What if the validation set has different set of words other than training set?

Most of these vocabulary creating approaches will have a special token for unknown. You can also specify as hyperparameters for the CountVectorizer:

  • max_df = 0.9: any word with frequency above 90% is thrown away
  • min_df=5: similarly any word encounter less than 5 times is thrown away

But otherwise, if you see something you haven’t seen before, call it unknown. So that (i.e. “unknown”) would just become a column in the bag of words.

In addition, a nice features of CountVectorizer is that we can specify to retun not only count of words from a text but also bigrams,trigrams any n-grams in general by coding:

veczr = CountVectorizer(tokenizer=tokenize,ngram_range=(1,3), max_features=80000)

while return word , bigrams and trigrams counts with a limit of 80,000 features.

As seen below when we create this term document matrix, the training set has 25,000 rows because there are 25,000 movie reviews and there are 75,132 columns which is the number of unique words.


75,132 columns that too many columns. Since most of the documents don’t have most of these 75,132 words we don’t want to actually store it as a normal array in memory. So instead, we store it as a sparse matrix.

What is a sparse matrix?

It simply stores as something that says whereabouts the non-zeros are located. For example, for the document number 1, word number 4 appears and it has 4 of them. term number 123 appears once, and so forth.

(1, 4) → 4
(1, 123) → 1

That’s basically how it’s stored and the important thing is that it’s efficient.

trn_term_doc[5] #83 stored elements
w0 = set([o.lower() for o in trn[5].split(' ')]); w0
# length is 78 which is pretty similar to 83. the difference
# can be attributed to the fact that I didn’t use a real tokenizer. 

We grab the sixth review and that gives us 75,132 long sparse row with 83 non-zero stored elements . So in other words, the sixth review contains 83 words.

Sklearn gives us the ability to have a look at vocabulary by saying veczr.get_feature_names . Here is an example of a few of the elements of feature names:

vocab = veczr.get_feature_names()

We simply created a unique list of words and mapped them. We could check by calling veczr.vocabulary_ to find the ID of a particular word. So this is like the reverse map of veczr.get_feature_names which maps integer to word, veczr.vocabulary_ maps word to integer.

# word 'absurb' appears twice in the first document
# word 'arching' does not appear in the first document

As we have already highlighted using this technique we have thrown away the ordering of the words. For the vast majority of NLP work this is definitely not a a good idea. This has been a standard practise for many years because we didn’t really know a better approach. However, nowadays more and more people use recurrent neural networks to tackle this kind of problems. But even now this representation works pretty well in this case.


For spam filtering the Naive Bayes techniqueworks pretty well even though it is a bag of words approach. The reason is that if you are getting a lot of email containing the word Durex and it’s always been a spam and you never get email from your friends talking about Durex, then it’s very likely something that says Durex regardless of the detail of the language is probably from a spammer. So that’s the basic theory about classification using a term document matrix.

🚀 For people who like video courses and want to kick-start a career in data science today, I highly recommend the below video course from Udacity:

Learn to Become a Data Scientist Online | Udacity | Udacity
Gain real-world data science experience with projects from industry experts. Take the first step to becoming a data scientist. Learn online, with Udacity.

📚 While for book lovers:

Naive Bayes

Naive Bayes it's a popular and easy to understand Supervised Probabilistic classification algorithm. The Naive Bayes Algorithm is based on the Bayes Rule which describes the probability of an event, based on prior knowledge of conditions that might be related to the event. For a better understanding pf Bayes Rule please see below video:

We will walk through an example to understand it better. We assume that we have some movie reviews and we transform them to a term document matrix.

Please note that we add a row with of ones for one practical reason. Imagine if you have never seen a particular word in the positive reviews up until now. That doesn’t actually mean that the probability of this word appearing in a positive review is zero. It’s not really zero and that why we added this additional row. Also we would like to avoid situation where the probability of P(f|c=1)=0 and similarly P(f|c=0)=0 but actually we want both of them to positive of every word in the corpus. The reason is that as we will see below we calculate the log ration of these two terms. That way, nothing is ever infinitely unlikely. So I take the average of all of the times that this appears in my positive corpus plus the 1's:

Let's now calculate he probability that you would see the word this given that the class is 1 (i.e. positive review) is just the average of how often do you see this in the positive reviews similarly for the negatives.

P('good'|c=1) = 3/3 =1 and P('good'|c=1) = 1/3 =0.333

The trick now is to basically use Bayes rule to find the probability that given this particular IMDb review, what is the probability that its class is equal to positive. So, we can write:

But actually, what we are interested about is if P(c=1|d) > P(c=0|d). So we can simply take their ration:

If this number is bigger than 1, then it’s more likely to be class 1, if it’s smaller than 1, it’s more likely to be class 0.

Basically what that means is we want to calculate the probability that we would get this particular document given that the class is 1 times the probability that the class is 1 divided by the probability of getting this particular document given the class is 0 times the probability that the class is 0. Note that the probability that the class is 1 is just equal to the average of the labels.

The reason we take the log because if we take the log, then we can add things together rather than multiply them together. And once you multiply enough of these things together, it’s going to get so close to zero that you’ll probably run out of the floating-point. So we take the log of the ratios. Then, as I say, we then multiply that, or with log, we add that to the ratio of the whole class probabilities.

Let's calculate it also for our example now:

Our model is almost finished so given a document which will be a vector with size equal to the number of unique words we will multiply it by the r vector if the result is positive it can be classifies as positive review otherwise as negative.

However, is that completely correct the answer is NO since the choices are independent.

So Naive Bayes says let’s assume that if you have “this movie is bloody stupid I hate it” that the probability of hate is independent of the probability of bloody is independent of the probability of stupid which is definitely not true. So Naive Bayes aren’t actually very good but it often works pretty well and it may be useful foundation.

Let's implemented in Python.

def pr(y_i):
    p = x[y==y_i].sum(0)
    return (p+1) / ((y==y_i).sum()+1) # plus 1 for the row of ones we added


r = np.log((pr(1)/pr(0))
b = np.log((y==1).mean() / (y==0).mean())


pre_preds = val_term_doc @ r.T + b
preds = pre_preds.T>0

For each document we multiply the Bayes’ probabilities by the counts (matrix multiplication). Then to add on the log of the class ratios, you can just use + b. So we end up something that looks similar to a logistic regression. But we are not learning anything (no weight-parameters).

It achieve accuracy of ~82%  and it runs pretty fast. So Naive Bayes is not nothing; it gave us something.

Improved Version

Our input matrix contains the counts of how many times a word appeared i.e “absurd” appeared twice, it turns out at least for this problem and quite often it doesn’t matter whether “absurd” appeared twice or once. All that matter is that if it appeared. So we can modify the term matrix document and go .sign() which replaces anything positive as 1, and anything negative with -1 (we don’t have negative counts obviously), binarizes the matrix. It says I don’t care that you saw “absurd” twice, I just care that you saw it.

r = np.log(pr(1)/pr(0))

pre_preds = val_term_doc.sign() @ r.T + b #sign binarize
preds = pre_preds.T>0

So if we do exactly the same thing with the binarized version, then you get a better accuracy of ~83%.

Advantages and disadvantages of Naive Bayes

Major advantages are

  • Simplicity
  • Requires small training set
  • Computationally fast
  • Scales linearly with the number of features and training examples


  • Strong feature independence assumption which rarely holds true in the real world. Remember, it's naive
  • May provide poor estimates, based on its independence assumption


This brings us to the end of this article. Hope you got a basic understanding of how Naive Bayes can be used on Sentiment Analysis. Please remember to use it as it is a really fast and simple algorithm. Feel free to use the Python code snippet of this article.

The full code of this article can be found in this GitHub Repository.

Thanks for reading and I am looking forward to hearing your questions :)
Stay tuned and Happy Machine Learning.