Text preprocessing is a severely overlooked topic, and many NLP applications fail badly because they use the wrong kind of text preprocessing.

With that in mind, I decided to write an article about different text preprocessing techniques. After reading this blog post, you’ll know some basic techniques for extracting features from text, so you can use those features as input to machine learning models.


Table of Contents

  • Introduction
  • What is text?
  • What is a word?
  • Tokenization
  • Token normalization
  • Stemming
  • Lemmatization
  • Further normalization
  • StopWords
  • When not to use the above techniques
  • Conclusion
  • References

Introduction

Natural Language Processing, or NLP, is a subfield of computer science and artificial intelligence focused on enabling computers to understand and process human languages. It is used to apply machine learning algorithms to text and speech. In this article we will focus only on text.

A simple example of an NLP problem is text classification: given a text as input, predict its class. In sentiment analysis, for instance, the input is the text of a review and the output is the class of its sentiment.

There could be two classes, such as positive and negative, or a more fine-grained scale such as positive, somewhat positive, neutral, somewhat negative, and negative. Here is an example of a positive review: "The hotel is really beautiful. Very nice and helpful service at the front desk." We read that and understand that it is a positive review.

And here is a negative one: "We had problems to get the Wi-Fi working. The pool area was occupied with young party animals, so the area wasn't fun for us." It is easy for us to read this text and tell whether its sentiment is positive or negative, but for a computer that is much more difficult. And before fitting any model, we first need to preprocess the text.


What is text?

Text can be a sequence of different things. For example:

  • Characters
  • Words
  • Phrases and named entities
  • Sentences
  • Paragraphs
  • ...

A sequence of characters is a very low-level representation of text. However, text can also be seen as a sequence of words, or of higher-level features such as phrases ("I don't really like") or named entities ("the museum of history"). It can also be split into bigger chunks such as sentences or paragraphs.
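To make these levels concrete, here is a minimal sketch that views the same review as characters, words, and sentences. It uses only Python's standard library, and the whitespace and punctuation splits are naive assumptions for illustration, not a robust segmenter.

```python
import re

review = "The hotel is really beautiful. Very nice and helpful service at the front desk."

# Lowest level: the text as a sequence of characters.
characters = list(review)

# Word level: a naive split on whitespace (punctuation stays attached).
words = review.split()

# Sentence level: split after ., ! or ? when followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", review)

print(len(characters), len(words), len(sentences))
```

The same string yields very different sequences depending on the unit we choose, which is why the next sections spend so much time on how to pick and extract that unit.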


What is a word?

It seems natural to think of a text as a sequence of words and a word as a meaningful sequence of characters.

In English, it is usually easy to find word boundaries: split a sentence on spaces and punctuation, and what is left are the words.

This is not true for all languages.

  • For example, German has compound words that are written without any spaces:
    “Rechtsschutzversicherungsgesellschaften” stands for “insurance companies which provide legal protection”.
    This long word is still in use, but when analyzing the text it could be beneficial to split the compound into separate words, because each part is meaningful on its own. The parts are simply written without spaces between them.
  • Japanese is a different story, as it is written with no spaces at all! It is a bit like reading this:
    − Butyoucanstillreaditright?
    That sentence is in English and has no spaces, yet a human reader has no trouble with it.

Tokenization

The process of splitting an input text into meaningful chunks is called tokenization, and each chunk is called a token. Put more formally: given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. A token:

  • Is a useful unit for further semantic processing
  • Can be a word, sentence, paragraph, etc.

Let's consider the following sentence: "This is Andrew's text, isn't it?"

Whitespace Tokenizer

We will use the NLTK Python library, so let's first import it and download WordNet, a lexical database for the English language that was created at Princeton and is distributed as one of the NLTK corpora. The tokenizers below do not need it, but we will use it later for lemmatization.

WordNet can be used alongside the NLTK module to find the meanings of words, synonyms, antonyms, and more.

WhitespaceTokenizer simply splits the input sequence on whitespace, which could be a space or any other invisible character such as a tab or a newline:

import nltk
nltk.download('wordnet')

text = "This is Andrew's text, isn't it?"
tokenizer = nltk.tokenize.WhitespaceTokenizer()
tokenizer.tokenize(text)

This produces:

['This', 'is', "Andrew's", 'text,', "isn't", 'it?']

What is the problem here? The last token, it?, has the same meaning as the token it, and likewise text, has the same meaning as text. But if we compare them, they are different tokens, and that might not be a desirable effect.
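A quick way to see the issue, sketched with plain Python instead of NLTK. It relies on str.split() with no arguments, which also splits on runs of whitespace and is assumed here to be close enough to the tokenizer above for this sentence:

```python
text = "This is Andrew's text, isn't it?"

# str.split() with no arguments splits on any run of whitespace.
tokens = text.split()
print(tokens)

# Punctuation stays glued to the word, so these comparisons fail.
print(tokens[-1] == "it")    # False, the token is "it?"
print(tokens[3] == "text")   # False, the token is "text,"
```

Any model that counts tokens would therefore keep separate counts for it and it?, even though they are the same word.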

Split by Punctuation

Let's try now to also split by punctuation using the WordPunctTokenizer from the NLTK library:

text = "This is Andrew's text, isn't it?"
tokenizer = nltk.tokenize.WordPunctTokenizer()
tokenizer.tokenize(text)

This time we get:

['This', 'is', 'Andrew', "'", 's', 'text', ',', 'isn', "'", 't', 'it', '?']

The problem now is that the apostrophes ' are separate tokens, and so are s, isn, and t. These tokens don't carry much meaning on their own: it doesn't make sense to analyze a single letter t or s. Such a letter only makes sense when it is combined with the apostrophe or the previous word.
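As far as I know, WordPunctTokenizer boils down to a single regular expression that matches either a run of word characters or a run of punctuation. The sketch below reproduces that behaviour with the standard re module. The pattern is my assumption about what the tokenizer does, so treat it as an approximation.

```python
import re

text = "This is Andrew's text, isn't it?"

# Either a run of letters/digits/underscores, or a run of other non-space characters.
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")

tokens = WORD_OR_PUNCT.findall(text)
print(tokens)
```

Because the apostrophe is neither a word character nor whitespace, it always becomes its own token, which is exactly what splits isn't into three pieces.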

Split by set of rules

We can instead come up with a set of rules or heuristics. NLTK implements such a set in its TreebankWordTokenizer, which uses rules based on English grammar to produce a tokenization that makes sense for further analysis.

text = "This is Andrew's text, isn't it?"
tokenizer = nltk.tokenize.TreebankWordTokenizer()
tokenizer.tokenize(text)

This is very close to the perfect tokenization that we want for English:

['This', 'is', 'Andrew', "'s", 'text', ',', 'is', "n't", 'it', '?']

Andrew and text are now separate tokens, 's is kept as its own token, and isn't is split into is and n't, which makes much more sense.
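To give a feel for what "a set of rules" means, here is a toy rule-based tokenizer with just three hand-written rules. It is not the Treebank tokenizer, only a sketch under the assumption that splitting n't, 's, and a few punctuation marks is enough for our example sentence. The function name rule_tokenize is made up for this post.

```python
import re

def rule_tokenize(text):
    """Toy rule-based tokenizer: a few English-specific rules, then a whitespace split."""
    # Rule 1: split the negative contraction n't from its verb ("isn't" -> "is n't").
    text = re.sub(r"(\w+)(n't)\b", r"\1 \2", text)
    # Rule 2: split the clitic 's from its host word ("Andrew's" -> "Andrew 's").
    text = re.sub(r"(\w+)('s)\b", r"\1 \2", text)
    # Rule 3: put spaces around common punctuation marks.
    text = re.sub(r"([,.;:?!])", r" \1 ", text)
    return text.split()

print(rule_tokenize("This is Andrew's text, isn't it?"))
```

A real tokenizer needs many more rules (quotes, abbreviations, numbers such as 3.5, and so on), which is a good reason to reuse an existing one rather than write your own.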


Token Normalization

The next thing you might want to do is token normalization. We may want the same token for different forms of a word, like wolf and wolves, as they refer to the same thing. So it may be beneficial to merge both tokens into a single one, wolf.

  • wolf, wolves -> wolf
  • talk, talks -> talk

There are two different processes for normalizing words:

  • Stemming
  • Lemmatization

Stemming

  • A process of removing and replacing suffixes to get to the root form of the word, which is called the stem.
  • Usually refers to heuristics that chop off suffixes or replace them.

One of the oldest and best-known stemmers for English is the Porter stemmer, which applies five heuristic phases of word reduction sequentially. The rules are pretty simple.

For example, when a word ends in SSES, you replace that suffix with SS; this works for a word like caresses, which is successfully reduced to caress. Another rule replaces IES with I. For ponies the rule does fire, but the result, poni, is not a valid word: it should end in y, not i.

import nltk

text = "This is Andrew's text, isn't it?"
tokenizer = nltk.tokenize.TreebankWordTokenizer()
tokens = tokenizer.tokenize(text)  # keep the tokens so we can stem them

stemmer = nltk.stem.PorterStemmer()
" ".join(stemmer.stem(token) for token in tokens)

So that is a problem. Still, the Porter stemmer works in practice, it is well known, and it is available in NLTK. However, it knows nothing about irregular forms. For wolves, it produces wolv, which is not a valid word, but it can still be useful for analysis.

To sum up, stemming fails on irregular forms and produces non-words, but that may not be much of a problem in practice.
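The suffix rules mentioned above can be sketched in a few lines. This is only the first plural-handling step of the original Porter algorithm as I understand it (SSES to SS, IES to I, SS unchanged, S removed), not the full stemmer, and the function name porter_step_1a is my own.

```python
def porter_step_1a(word):
    """Apply only the plural rules (step 1a) of the original Porter algorithm."""
    if word.endswith("sses"):
        return word[:-2]      # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]      # ponies -> poni
    if word.endswith("ss"):
        return word           # caress -> caress
    if word.endswith("s"):
        return word[:-1]      # cats -> cat
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

The order of the checks matters: the longest suffix is tested first, so caresses hits the SSES rule before the plain S rule gets a chance. The remaining phases of the full algorithm work in the same spirit, with more suffixes and extra conditions on the stem.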


Lemmatization

Lemmatization usually refers to using a vocabulary and morphological analysis to return the base or dictionary form of a word, which is known as the lemma. For that purpose, you can use the WordNet lemmatizer, which looks up lemmas in the WordNet database (it is available in the NLTK library).

This time, the word feet is successfully reduced to its normalized form, foot, because we have that in our database. The database knows the words of the English language and their irregular forms. When we take wolves, it becomes wolf. Cats becomes cat, and talked stays talked, so nothing changes.

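import nltk

text = "This is Andrew's text, isn't it?"
tokenizer = nltk.tokenize.TreebankWordTokenizer()
tokens = tokenizer.tokenize(text)  # keep the tokens so we can lemmatize them

lemmatizer = nltk.stem.WordNetLemmatizer()  # needs the wordnet corpus downloaded earlier
" ".join(lemmatizer.lemmatize(token) for token in tokens)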

The problem is that the lemmatizer does not reduce all forms. By default, the WordNet lemmatizer treats every token as a noun, for which the lemma is simply the singular form. Verbs are a different story: unless you pass the part of speech through the optional pos argument of lemmatize, forms like talked are left untouched. That might prevent you from merging tokens that have the same meaning.
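The dictionary-lookup idea, and the reason the part of speech matters, can be illustrated with a toy lemmatizer. The four-entry dictionary and the function toy_lemmatize are hypothetical and exist only for this sketch; a real lemmatizer relies on a full resource such as WordNet.

```python
# Hypothetical miniature dictionary keyed by (word, part of speech).
LEMMAS = {
    ("feet", "n"): "foot",
    ("wolves", "n"): "wolf",
    ("cats", "n"): "cat",
    ("talked", "v"): "talk",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary form for (word, pos), or the word unchanged if unknown."""
    return LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("feet"))             # foot
print(toy_lemmatize("talked"))           # talked (looked up as a noun, no entry)
print(toy_lemmatize("talked", pos="v"))  # talk
```

With the default noun lookup, talked and talk stay two different tokens; only when we say that talked is a verb do they merge.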

The takeaway is the following. We need to try stemming and lemmatization, and choose what works best for our task.


Further normalization

Next, you can normalize those tokens further. There are a number of different problems to handle.

The first problem is capital letters:

What happens when we have us and US in capital letters? One is a pronoun and the other a country, and we need to distinguish them somehow. The trouble is that in text classification we might be working on a review written entirely in capital letters, where US could simply mean us, the pronoun, and not the country. So this is a very tricky part.

Luckily, for English we can use a few heuristics:

  • Lowercase the first word of a sentence, because we know that every sentence starts with a capital letter.
  • Lowercase words that appear in titles, because English titles are written with every word capitalized, so we can strip that.
  • Leave mid-sentence words as they are, because a word capitalized somewhere inside a sentence is probably a name or a named entity.
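The three heuristics above can be sketched as a small function that works on the tokens of a single sentence or title. This is a rough illustration under simplifying assumptions: a "title" is taken to be any sequence of two or more words that are all capitalized, and the function name normalize_case is made up for this post.

```python
def normalize_case(tokens):
    """Apply the casing heuristics to the tokens of one sentence or title."""
    if not tokens:
        return []
    words = [t for t in tokens if t.isalpha()]
    # Title heuristic: every word is capitalized, so lowercase everything.
    if len(words) > 1 and all(w[0].isupper() for w in words):
        return [t.lower() for t in tokens]
    # Sentence heuristic: lowercase only the first token, keep the rest as is.
    return [tokens[0].lower()] + list(tokens[1:])

print(normalize_case(["The", "pool", "in", "Paris", "was", "busy", "."]))
print(normalize_case(["Text", "Preprocessing", "Techniques"]))
```

The sketch also shows where the heuristics break: a sentence that starts with a name loses its capital letter, and a title containing lowercase function words is not recognized as a title.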

A harder way is to use machine learning to recover the true casing, but that might be a harder problem than the original sentiment analysis task.

Another type of normalization you can apply to your tokens is normalizing acronyms, such as ETA (estimated time of arrival), which may be written as eta, e.t.a., or E.T.A. For this, we can write a set of regular expressions that capture the different representations of the same acronym and map them to a single form. But that is pretty hard to get right, because you must think in advance about all the possible forms and all the acronyms you want to normalize.
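As an illustration of the regular-expression approach, here is a sketch that maps the plain and dotted spellings of one acronym to a single form. The helper names acronym_pattern and normalize_acronym are my own, and the pattern only covers the spellings listed in the comments.

```python
import re

def acronym_pattern(acronym):
    """Build a case-insensitive pattern for the plain and dotted spellings of an acronym."""
    letters = [re.escape(ch) for ch in acronym]
    dotted = r"\.".join(letters) + r"\.?"   # E.T.A or E.T.A.
    plain = "".join(letters)                # ETA
    # The lookarounds keep us from matching inside longer words such as "meta".
    return re.compile(rf"(?<!\w)(?:{dotted}|{plain})(?!\w)", re.IGNORECASE)

def normalize_acronym(text, acronym):
    """Replace every recognized spelling of the acronym with its canonical form."""
    return acronym_pattern(acronym).sub(acronym, text)

print(normalize_acronym("Our e.t.a. is 5 pm, eta confirmed, E.T.A updated", "ETA"))
```

Even this small example has an ambiguity: when a dotted acronym ends a sentence, the final period is swallowed together with the acronym. That is the kind of case you have to think about in advance.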


StopWords

Stop words are words that can be filtered out during the text preprocessing phase. The reason is that they can add a lot of noise when applying machine learning techniques.

NLTK ships with a predefined list of the most common English stop words. The first time you use it, you need to download the list by running nltk.download("stopwords"). Once the download has finished, we can print out the stop words:

from nltk.corpus import stopwords
print(stopwords.words("english"))

The stop words can then be removed with the code snippet below (using a list comprehension):

import nltk
from nltk.corpus import stopwords

# The NLTK stop word list is lowercase, so lowercase each token before the lookup.
stop_words = set(stopwords.words("english"))
text = "This is Andrew's text, isn't it?"
tokenizer = nltk.tokenize.TreebankWordTokenizer()
[word for word in tokenizer.tokenize(text) if word.lower() not in stop_words]

Output:

As you may notice, the list of stop words was converted into a set (an abstract data type that stores unique values in no particular order), because membership tests on a set are much faster than on a list. The difference is almost unnoticeable for small lists, but using a set is highly recommended for big ones.
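If you want to see the difference for yourself, the sketch below times the same filter with a list and with a set. The 10,000 made-up "stop words" and the function remove_stop_words are assumptions for the sake of the experiment; the exact timings depend on your machine, so none are shown here.

```python
import timeit

# Hypothetical vocabulary of 10,000 made-up "stop words", as a list and as a set.
stop_list = [f"word{i}" for i in range(10_000)]
stop_set = set(stop_list)

tokens = ["word9999", "hotel", "pool"]

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in stop_words (a list or a set)."""
    return [t for t in tokens if t.lower() not in stop_words]

# Both containers give the same result...
print(remove_stop_words(tokens, stop_list) == remove_stop_words(tokens, stop_set))

# ...but the list is scanned element by element, while the set uses hashing.
print("list:", timeit.timeit(lambda: remove_stop_words(tokens, stop_list), number=200))
print("set: ", timeit.timeit(lambda: remove_stop_words(tokens, stop_set), number=200))
```

Every token that is not a stop word forces a scan of the whole list, so the cost grows with the size of the stop word list, whereas the set lookup stays roughly constant.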


When not to use the above techniques

Usually, lemmatization and stemming are used when we have limited data and a simple model. When we use a sufficiently complex model, such as a neural network, it is probably a good idea to keep the stop words and to use neither lemmatization nor stemming. The idea is that a simpler model cannot really learn complex things, so it requires some "special" data preprocessing beforehand.


Conclusion

In this article we covered the following:

  • Tokenization is the process of splitting a text into tokens
  • Tokens can be normalized using stemming or lemmatization
  • Further preprocessing can involve normalizing casing and acronyms

The most important takeaway is that the appropriate preprocessing technique depends heavily on the text and the task at hand. There is no silver bullet that works equally well for all tasks.

Thanks for reading and I am looking forward to hearing your questions :)
Stay tuned and Happy Machine Learning.


References