Topic modeling is a good task for getting started with NLP. Briefly, it is the task of using unsupervised learning to extract the main topics (represented as sets of words) that occur in a collection of documents. In this article, I solve a topic modeling problem using a popular matrix decomposition technique called Singular Value Decomposition (SVD).


Table of Contents

  • The problem
  • Motivation
  • Getting started
  • Import Libraries
  • Load the data
  • Text data preprocessing techniques
  • When to use those techniques?
  • Data Preprocessing
  • Singular Value Decomposition (SVD)
  • Full vs Reduced SVD
  • Topics Visualization
  • Randomized SVD
  • Conclusion
  • References

The problem

We start with a term-document matrix: rows correspond to vocabulary terms, columns to documents, and each entry counts how often a term appears in a document.

This matrix can be decomposed into one tall thin matrix times one wide short matrix (possibly with a diagonal matrix in between).

Notice that this representation does not take into account word order or sentence structure. It's an example of a bag of words approach.
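To make this concrete, here is a minimal sketch (using toy documents of my own, not part of the article's data) that builds a small term-document matrix with scikit-learn's CountVectorizer; notice that the counts are the same regardless of word order:

from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (hypothetical, for illustration only)
docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats chase dogs"]

cv = CountVectorizer()
# CountVectorizer returns a document-term matrix; transposing gives the
# term-document matrix described above (rows = terms, columns = documents)
term_doc = cv.fit_transform(docs).T.toarray()

print(cv.get_feature_names_out())  # vocabulary (get_feature_names() on older scikit-learn)
print(term_doc)                    # raw counts; word order plays no role (bag of words)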


Motivation

Consider the most extreme case - reconstructing the matrix using an outer product of two vectors. Clearly, in most cases, we won't be able to reconstruct the matrix exactly. But if we had one vector with the relative frequency of each vocabulary word out of the total word count, and one with the average number of words per document, then that outer product would be as close as we can get.

Now consider increasing that factorization to two columns and two rows. The optimal decomposition would now be to cluster the documents into two groups whose word distributions are as different from each other as possible, but as similar as possible amongst the documents within each group. We will call those two groups "topics". And we would cluster the words into two groups, based on which words appear most frequently in each of the topics.
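Here is a minimal sketch of this intuition (my own toy example, not the newsgroups data): the truncated SVD gives the best rank-1 and rank-2 approximations of a small count matrix, and the reconstruction error shrinks as we allow more "topics".

import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)
A = rng.poisson(2, size=(6, 4)).astype(float)  # toy 6-term x 4-document count matrix

U, s, Vh = linalg.svd(A, full_matrices=False)

# Best rank-k approximation (Eckart-Young): keep only the k largest singular values
for k in (1, 2):
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vh[:k, :]
    rel_err = linalg.norm(A - A_k) / linalg.norm(A)
    print(f"rank-{k} approximation, relative error: {rel_err:.3f}")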


Getting started

We'll take a dataset of documents from several different categories and find topics (consisting of groups of words) for them. Knowing the actual categories helps us evaluate whether the topics we find make sense.

We will try this using a matrix factorization technique named Singular Value Decomposition (SVD).


Import Libraries

First, let's import the necessary Python Libraries:

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn import decomposition
from scipy import linalg
import matplotlib.pyplot as plt

%matplotlib inline
np.set_printoptions(suppress=True)

Load the data

Sklearn comes with a number of built-in datasets, as well as loading utilities to load several standard external datasets. This is a great resource, and the datasets include Boston housing prices, face images, patches of forest, diabetes, breast cancer, and more. We will be using the newsgroups dataset.

Newsgroups are discussion groups on Usenet, which was popular in the 80s and 90s before the web really took off. This dataset includes around 18,000 newsgroup posts on 20 topics. Running the command below downloads the data, which might take some time. To limit the amount of data, we take a subset and simply keep all the articles belonging to categories=['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space'].

categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
remove = ('headers', 'footers', 'quotes')
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=remove)

newsgroups_train.filenames.shape, newsgroups_train.target.shape

Let's have a look at some of the data and try to guess the category of the message:

print("\n".join(newsgroups_train.data[:1]))
newsgroups_train.data[:1]

The target attribute of this message is:

np.array(newsgroups_train.target_names)[newsgroups_train.target[0]]

The target attribute is the integer index of the category, and every index maps to a topic:

print(newsgroups_train.target[:5])
np.array(newsgroups_train.target_names)[newsgroups_train.target[:5]]

This dataset actually has the posts already grouped into key topics, which you can take a closer look at by running the following command (remember that we filtered the dataset to keep only the text associated with 4 of the 20 topic categories, in order to speed up the process):

print(list(newsgroups_train.target_names))

Text data preprocessing techniques

Please also see my other blog post, which presents text preprocessing techniques in detail.

Dropping common terms: stop words

a   an   and   are   as   at   be   by   ...   on   that   the   to   was   were   will   with

Sometimes, some extremely common words have little to no value in helping select a document's topic. These are called stop words and can be excluded from the vocabulary entirely. The general strategy for determining a stop list is to sort the terms by collection frequency (the total number of times each term appears in the document collection), and then to take the most frequent terms, often hand-filtered for their semantic content relative to the domain of the documents being indexed, as a stop list, the members of which are then discarded during indexing. An example of a stop list is shown above.
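As a rough sketch of that strategy (my own illustration, on a hypothetical mini-corpus rather than the article's data), we can count collection frequencies and take the most frequent terms as stop-word candidates to be hand-filtered:

from collections import Counter

# Hypothetical mini-corpus; in practice this would be the full document collection
corpus = ["the cat sat on the mat",
          "the dog ate the bone",
          "a cat and a dog"]

# Collection frequency = total number of occurrences of each term across all documents
collection_freq = Counter(word for doc in corpus for word in doc.lower().split())

# The most frequent terms are candidate stop words (to be hand-filtered afterwards)
print(collection_freq.most_common(5))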

Using a stop list significantly reduces the number of postings that a system has to store, and a lot of the time not indexing stop words does little harm: keyword searches with terms like the and by don't seem very useful. However, this is not true for phrase searches. The phrase query "President of the United States", which contains two stop words, is more precise than President AND "United States". The meaning of "flights to London" is likely to be lost if the word to is stopped out. Some song titles and well-known pieces of verse consist entirely of words that are commonly on stop lists (To be or not to be, Let It Be, I don't want to be, ...). For that reason, Web search engines generally do not use stop lists.

There is no single universal list of stop words; each library has its own.

Sklearn

# On older scikit-learn versions this set lived in sklearn.feature_extraction.stop_words
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
sorted(list(ENGLISH_STOP_WORDS))[:20]

Spacy

Spacy is a very modern & fast nlp library. Spacy is opinionated, in that it typically offers one highly optimized way to do something (whereas nltk offers a huge variety of ways, although they are usually not as optimized).

You will need to install it.

if you use conda:

conda install -c conda-forge spacy

if you use pip:

pip install -U spacy

You will then need to download the English model:

python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load("en_core_web_sm")
sorted(list(nlp.Defaults.stop_words))[:20]

NLTK

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')  # the stopwords corpus is what we need here, not wordnet
sorted(list(stopwords.words("english")))[:20]

Exercise: What stop words appear in spacy but not in sklearn?

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import spacy
nlp = spacy.load("en_core_web_sm")
sk_stopwords = list(ENGLISH_STOP_WORDS)
spacy_stopwords = list(nlp.Defaults.stop_words)
list(set(spacy_stopwords) - set(sk_stopwords))

Exercise: And what stop words are in sklearn but not spacy?

list(set(sk_stopwords) - set(spacy_stopwords))

Stemming and Lemmatization

Are the below words the same?

  • organize, organizes, and organizing
  • democracy, democratic, and democratization

Stemming and Lemmatization both generate the root form of the words.

  • Lemmatization uses the rules of a language. The resulting tokens are all actual words.
  • "Stemming is the poor-man’s lemmatization." (Noah Smith, 2011) Stemming is a crude heuristic that chops the ends off of words.  The resulting tokens may not be actual words. Stemming is faster.

NLTK

Let's see an example using NLTK library:

import nltk
from nltk import stem

nltk.download('wordnet')  # the WordNetLemmatizer needs the WordNet corpus
wnl = stem.WordNetLemmatizer()
porter = stem.porter.PorterStemmer()
word_list = ['feet', 'foot', 'foots', 'footing', 'fly', 'flies', 'flying',
             'organize', 'organizes', 'organizing', 'universe', 'university']
print([wnl.lemmatize(word) for word in word_list])
print([porter.stem(word) for word in word_list])

Stemming and lemmatization are language-dependent.  Languages with more complex morphologies may show bigger benefits.  

Spacy

Let's see the same example as above using the Spacy library:

Stemming and lemmatization are implementation dependent.

Spacy doesn't offer a stemmer (since lemmatization is considered better; this is an example of being opinionated!).

import spacy

# In spaCy 3+ there is no spacy.lemmatizer module; lemmatization happens in the pipeline
nlp = spacy.load("en_core_web_sm")
[token.lemma_ for token in nlp(" ".join(word_list))]

When to use those techniques?

These were long considered standard techniques, but they can often hurt your performance if using deep learning. Stemming, lemmatization, and removing stop words all involve throwing away information.

However, they can still be useful when working with simpler models and smaller datasets.


Data Preprocessing

This involves the following:

  • Tokenization: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.
  • Words of 3 characters or fewer are removed.
  • All stop words are removed.
  • Words are lemmatized.

To do that, we use scikit-learn's CountVectorizer to preprocess the text. Let's look at its hyperparameters:

  • tokenizer=LemmaTokenizer(): a custom tokenizer is passed in to preprocess the text, lemmatizing and throwing away stop words and any word of 3 characters or fewer. NLTK helps us with this task.
  • max_df=0.9: any word that appears in more than 90% of the documents is thrown away
  • min_df=5: similarly, any word that appears in fewer than 5 documents is thrown away

from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from nltk import word_tokenize, stem
import nltk
import re

nltk.download('punkt')    # word_tokenize needs the punkt tokenizer models
nltk.download('wordnet')  # the WordNetLemmatizer needs the WordNet corpus

NON_ALPHANUMERIC = re.compile('[^A-Za-z0-9]+')

class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = stem.WordNetLemmatizer()
    def __call__(self, doc):
        # Replace any non-alphanumeric character with a space
        doc = NON_ALPHANUMERIC.sub(" ", doc)
        # Keep only lemmatized tokens longer than 3 characters that are not stop words
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)
                if len(t) > 3 and t not in ENGLISH_STOP_WORDS]

vectorizer = CountVectorizer(tokenizer=LemmaTokenizer(),
                             lowercase=True,
                             max_df=0.9,
                             min_df=5)

# We could have used TfidfVectorizer() instead

vectors = vectorizer.fit_transform(newsgroups_train.data).todense()
vectors.shape

Also, we can take a closer look at the words that are kept by running the following code:

vocab = np.array(vectorizer.get_feature_names())  # on newer scikit-learn, use get_feature_names_out()
print(vocab.shape)
vocab[3000:3050]

Singular Value Decomposition (SVD)

As Gilbert Strang has mentioned, "SVD is not nearly as famous as it should be." It is actually a really useful tool.

We would clearly expect that the words that appear most frequently in one topic would appear less frequently in the other - otherwise, that word wouldn't make a good choice to separate out the two topics. Therefore, we expect the topics to be orthogonal.

The SVD algorithm factorizes a matrix into one matrix with orthogonal columns and one with orthogonal rows (along with a diagonal matrix, which contains the relative importance of each factor).

SVD is an exact decomposition, since the matrices it creates are big enough to fully cover the original matrix. SVD is extremely widely used in linear algebra, and specifically in data science.

First, we will run SVD on our count-vectorized matrix as follows:

%time U, s, Vh = linalg.svd(vectors, full_matrices=False)
print(U.shape, s.shape, Vh.shape, vocab.shape)
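As a quick sanity check (my own addition, assuming U, s, Vh and vectors come from the code above), we can verify that the factors really have orthonormal columns/rows and that their product reconstructs the original matrix exactly:

# U has orthonormal columns, Vh has orthonormal rows
print(np.allclose(U.T @ U, np.eye(U.shape[1])))
print(np.allclose(Vh @ Vh.T, np.eye(Vh.shape[0])))

# Scaling the columns of U by s (equivalent to U @ diag(s)) and multiplying by Vh
# gives back the term counts exactly: the reduced SVD is an exact decomposition
print(np.allclose((U * s) @ Vh, vectors))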

Note that we specify full_matrices=False; had we set it to True instead, the output shapes would have been:

%time U, s, Vh = linalg.svd(vectors, full_matrices=True)
print(U.shape, s.shape, Vh.shape, vocab.shape)

Full vs Reduced SVD

We set full_matrices=False to calculate the reduced SVD. For the full SVD, both U and V are square matrices, where the extra columns in U form an orthonormal basis (but zero out when multiplied by extra rows of zeros in S).

Diagrams from Trefethen
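To make the difference concrete, here is a small sketch on a random matrix (my own toy example): in the full SVD the extra columns of U are multiplied by rows of zeros in S, so both forms reconstruct the matrix exactly.

import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)
A = rng.randn(5, 3)  # a tall 5 x 3 matrix

U_f, s_f, Vh_f = linalg.svd(A, full_matrices=True)   # full:    U_f is 5x5
U_r, s_r, Vh_r = linalg.svd(A, full_matrices=False)  # reduced: U_r is 5x3

# The full SVD needs a 5x3 S whose last two rows are all zeros
S_full = np.zeros((5, 3))
np.fill_diagonal(S_full, s_f)

print(np.allclose(U_f @ S_full @ Vh_f, A))        # exact reconstruction
print(np.allclose(U_r @ np.diag(s_r) @ Vh_r, A))  # exact reconstruction as well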

We will now plot the singular values s, which, as we can see, are sorted in descending order and are all non-negative (singular values are never negative):

plt.plot(s);
sum(s<0)

The most important topics are the ones corresponding to the largest singular values s.

We can now get the 10 most important topics by running the following commands, constraining each topic to its 8 most characteristic words:

num_top_words=8

def show_topics(a):
    top_words = lambda t: [vocab[i] for i in np.argsort(t)[:-num_top_words-1:-1]]
    topic_words = ([top_words(t) for t in a])
    return [' '.join(t) for t in topic_words]

show_topics(Vh[:10])

# Actual topics
np.array(newsgroups_train.target_names)

We get topics that match the kinds of clusters we would expect! This is despite the fact that this is an unsupervised algorithm - which is to say, we never actually told the algorithm how our documents are grouped.


Topics Visualization

To find out how distinct our topics are, we can visualize them. Of course, we cannot visualize more than 3 dimensions, so we will use a relatively new technique called UMAP (Uniform Manifold Approximation and Projection), which helps us project high-dimensional data into lower dimensions.

import umap.umap_ as umap
embedding = umap.UMAP(n_neighbors=150, min_dist=0.5, random_state=12).fit_transform(U[:,:10])

plt.figure(figsize=(7,5))
plt.scatter(embedding[:, 0], embedding[:, 1], 
c = newsgroups_train.target,
s = 10 # size
)
print(newsgroups_train.target.shape)
plt.show()

Please note that we assumed that there are distinct topics in the data set. So if the data set is a bunch of random tweets then the model results may not be as interpretable.


Randomized SVD

As shown above, we were interested in finding only the top-10 topics by looking at the first 10 rows of Vh. So is there any way to speed up this process?

YES, it's called Randomized SVD, and it takes advantage of the fact that we are only interested in the vectors corresponding to the largest singular values.
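Before comparing library implementations, here is a rough sketch of the core idea (my own simplified illustration; real implementations such as sklearn's randomized_svd and fbpca add oversampling and power iterations): project the matrix onto a random low-dimensional subspace, then run the exact SVD on the much smaller projected matrix.

import numpy as np
from scipy import linalg

def simple_randomized_svd(M, k, random_state=0):
    # Very simplified randomized SVD sketch (no oversampling, no power iterations)
    rng = np.random.RandomState(random_state)
    M = np.asarray(M)
    # A random projection approximately captures the range of M
    Q, _ = linalg.qr(M @ rng.normal(size=(M.shape[1], k)), mode='economic')
    # Exact SVD of the small k x n matrix Q.T @ M, then map U back to the original space
    U_small, s, Vh = linalg.svd(Q.T @ M, full_matrices=False)
    return Q @ U_small, s, Vh

u_simple, s_simple, v_simple = simple_randomized_svd(vectors, 10)
print(u_simple.shape, s_simple.shape, v_simple.shape)

Now let's do a time comparison between the full SVD and two randomized implementations: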

from sklearn import decomposition
import fbpca #Randomized SVD from Facebook's library fbpca - very fast

%time u_svd, s_svd, v_svd = np.linalg.svd(vectors, full_matrices=False)
print(u_svd.shape,s_svd.shape,v_svd.shape)
print()

%time u_rand, s_rand, v_rand = decomposition.randomized_svd(vectors, 10)
print(u_rand.shape,s_rand.shape,v_rand.shape)
print()

%time u_fbpca, s_fbpca, v_fbpca = fbpca.pca(vectors, 10)
print(u_fbpca.shape,s_fbpca.shape,v_fbpca.shape)
print()

Similarly, for topic visualization:

Randomized_svd

import umap.umap_ as umap
embedding = umap.UMAP(n_neighbors=150, min_dist=0.5, random_state=12).fit_transform(u_rand)

plt.figure(figsize=(7,5))
plt.scatter(embedding[:, 0], embedding[:, 1], 
c = newsgroups_train.target,
s = 10 # size
)
print(newsgroups_train.target.shape)
plt.show()

fbpca

import umap.umap_ as umap
embedding = umap.UMAP(n_neighbors=150, min_dist=0.5, random_state=12).fit_transform(u_fbpca)

plt.figure(figsize=(7,5))
plt.scatter(embedding[:, 0], embedding[:, 1], 
c = newsgroups_train.target,
s = 10 # size
)
print(newsgroups_train.target.shape)
plt.show()

That said, fbpca is considerably faster than the other implementations, making it an ideal candidate when working with a huge number of documents.

For more on randomized SVD, check out this PyBay 2017 talk.


Conclusion

This brings us to the end of this article. I hope you got a basic understanding of how SVD can be used for topic modeling. Keep it in mind, as it is a really powerful unsupervised algorithm. Feel free to use the Python code snippets from this article.

Thanks for reading and I am looking forward to hearing your questions :)
Stay tuned and Happy Machine Learning.


References