In this article, regex, a useful tool in the NLP toolkit, is presented in detail. Regex is a metalanguage for finding string patterns in blocks of text.

Regular expressions (regex or regexp) are extremely useful for extracting information from text by searching for one or more matches of a specific search pattern (i.e. a specific sequence of ASCII or Unicode characters).


Table of Contents

  • Example: Phone Number
  • Example: Get label from filename
  • Build your own tokenizer
  • What are the uses of Regex?
  • Regex unicode example
  • Regex Errors
  • Best Practice
  • Advantages & Disadvantages
  • Regex Workflow
  • Practice Material
  • Conclusion
  • References‌

Example: Phone Number

Let's try to find all the phone numbers in the data below. As we can see, the phone numbers come in different formats (with hyphens or with spaces), while the other entries are not phone numbers at all.

Phone numbers: 123-456-7890, 123 456 7890

NOT: 101 Howard, 1234, 34 50 98 21 32, (34)(50)()()982132

Let's try to solve this task without regex, by writing some heuristic rules (with a lot of if/else statements):

import string

def check_phone(inp):
    nums = string.digits
    valid_chars = nums + ' -()'
    num_counter = 0
    for char in inp:
        if char not in valid_chars:
            return False
        if char in nums:
            num_counter += 1
    if num_counter==10: # each phone number has only 10 digits
        return True
    else:
        return False

phone1 = "123-456-7890"

phone2 = "123 456 7890"

not_phone1 = "101 Howard"

not_phone2 = "34 50 98 21 32"

not_phone3 = "(34)(50)()()982132"

assert(check_phone(phone1))
assert(check_phone(phone2))
assert(not check_phone(not_phone1))
assert(not check_phone(not_phone2))
assert(not check_phone(not_phone3))

As we see, these complicated if/else rules fail: the assertions for not_phone2 and not_phone3 raise an AssertionError, because both strings contain exactly ten digits and only allowed characters. Counting digits is not enough. Phone numbers come in many formats (they might have hyphens, parentheses or spaces), but there is a general pattern, easy to recognize by eye, of what makes a phone number. Regex gives us a cleaner and more effective way to describe that pattern.

Regex

import re

def check_phone_reg(inp):
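    # fullmatch requires the whole string to fit the pattern
    return bool(re.fullmatch(r'\d{3}[- ]\d{3}[- ]\d{4}', inp))

assert(check_phone_reg(phone1))
assert(check_phone_reg(phone2))
assert(not check_phone_reg(not_phone1))
assert(not check_phone_reg(not_phone2))
assert(not check_phone_reg(not_phone3))

It seems quite simple, but let's break down the regex:

  • \d: Matches any digit character (0-9; for Python 3 strings also other Unicode decimal digits unless re.ASCII is set)
  • \d{3}: Matches a digit sequence of length 3
  • [- ]: Matches exactly one of the characters inside the brackets, here a hyphen or a space. Inside brackets | is a literal character, not "or", so it is left out
  • re.fullmatch: Unlike re.match, which only anchors the start, it rejects strings with extra characters after the number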
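
The original task was to find all the phone numbers inside a block of text rather than to validate one string at a time. A minimal sketch of that, assuming the same three-three-four digit format with [- ] as the separator; the helper name find_phones and the \b word boundaries are additions for illustration, not part of the example above:

```python
import re

# \b keeps the pattern from matching inside a longer run of digits
PHONE_RE = re.compile(r'\b\d{3}[- ]\d{3}[- ]\d{4}\b')

def find_phones(text):
    """Return every substring of text that looks like a 10-digit phone number."""
    return PHONE_RE.findall(text)

data = ("Phone numbers: 123-456-7890, 123 456 7890 "
        "NOT: 101 Howard, 1234, 34 50 98 21 32, (34)(50)()()982132")
print(find_phones(data))  # ['123-456-7890', '123 456 7890']
```

re.findall scans the whole text and returns the non-overlapping matches in order, so no loop over candidate substrings is needed.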

Example: Get label from filename

In this example we try to find the label of the picture which is present in the filename. For example:

  • '/root/data/oxford-iiit-pet/images/Maine_Coon_103.jpg'
  • '/root/data/oxford-iiit-pet/images/Abyssinian_104.jpg'

Out of each path above we need to extract the label part of the file name (here Maine_Coon and Abyssinian). Let's get started:

import re

str1 = '/root/data/oxford-iiit-pet/images12/Maine_Coon_Alg_103.jpg'
str2 = '/root/data/oxford-iiit-pet/images12/Abyssinian_104.jpg1'

# \d matches a single digit, so every digit is returned on its own
print(re.findall(r'\d', str1), re.findall(r'\d', str2))
# \d+ is greedy: it extracts each maximal sequence of digits
print(re.findall(r'\d+', str1), re.findall(r'\d+', str2))
# extract all groups separated by '/'
print(re.findall(r'/([^/]+)', str1), re.findall(r'/([^/]+)', str2))
# extract the group that is followed by an underscore, a sequence of digits and .jpg
# $ anchors the match at the end of the string; \. matches a literal period
# str2 ends in .jpg1, so it yields no match
print(re.findall(r'/([^/]+)_+\d+\.jpg$', str1), re.findall(r'/([^/]+)_+\d+\.jpg$', str2))
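
The same idea can be wrapped in a small reusable function. This is only a sketch: the name get_label is hypothetical, and it assumes every label is followed by an underscore, a number and the .jpg extension, as in the paths shown earlier. It first isolates the file name with pathlib, so the regex only has to describe the name itself:

```python
import re
from pathlib import PurePosixPath

# Label = everything before the trailing "_<number>.jpg" (assumed naming scheme)
LABEL_RE = re.compile(r'^(.+)_\d+\.jpg$')

def get_label(path):
    """Return the label encoded in an image path, or None if the name does not fit."""
    match = LABEL_RE.match(PurePosixPath(path).name)
    return match.group(1) if match else None

print(get_label('/root/data/oxford-iiit-pet/images/Maine_Coon_103.jpg'))     # Maine_Coon
print(get_label('/root/data/oxford-iiit-pet/images/Abyssinian_104.jpg'))     # Abyssinian
print(get_label('/root/data/oxford-iiit-pet/images12/Abyssinian_104.jpg1'))  # None
```

Returning None for names that do not fit makes it easy to spot files that break the naming scheme instead of silently mislabeling them.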

Build your own tokenizer

In this section we will build our own tokenizer using regex in order to get a better understanding of tokenization. Note that there is not just one way to do tokenization: it depends on the implementation and on what you want your data set to look like.

The below example is just an illustration of one approach you could take.

  • add spaces around punctuation
  • substitute n ' t back to n't
  • substitute ' s back to 's
  • replace multiple spaces with just one (whitespace is usually not used as a token)

Note that \1 refers to the first captured group. re_punc has parentheses around the whole character class, which makes it a capture group, so \1 holds whichever punctuation mark was matched, and the substitution adds whitespace before and after it.

import re

# raw string, so the backslashes reach the regex engine unchanged (no invalid-escape warning)
re_punc = re.compile(r"([\"'().,;:/_?!—\-])") # add spaces around punctuation
re_apos = re.compile(r"n ' t ")    # n't
re_bpos = re.compile(r" ' s ")     # 's
re_mult_space = re.compile(r"  *") # replace multiple spaces with just one

def simple_toks(sent):
    sent = re_punc.sub(r" \1 ", sent)
    sent = re_apos.sub(r" n't ", sent)
    sent = re_bpos.sub(r" 's ", sent)
    sent = re_mult_space.sub(' ', sent)
    return sent.lower().split()

text = "I don't know who Kara's new friend is-- is it 'Mr. Toad'?"
' '.join(simple_toks(text))

Let's look at each individual substitute as well:
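
A minimal sketch that applies the four substitutions one at a time and prints the intermediate strings. The patterns are redefined here (as raw strings) so the snippet runs on its own, and the step1 to step4 names are just for illustration:

```python
import re

re_punc = re.compile(r"([\"'().,;:/_?!—\-])")
re_apos = re.compile(r"n ' t ")
re_bpos = re.compile(r" ' s ")
re_mult_space = re.compile(r"  *")

text = "I don't know who Kara's new friend is-- is it 'Mr. Toad'?"

step1 = re_punc.sub(r" \1 ", text)     # spaces around every punctuation mark
step2 = re_apos.sub(r" n't ", step1)   # "don ' t" becomes "do n't"
step3 = re_bpos.sub(r" 's ", step2)    # "Kara ' s" becomes "Kara 's"
step4 = re_mult_space.sub(" ", step3)  # collapse runs of spaces into one

for name, value in [("punctuation", step1), ("n't", step2),
                    ("'s", step3), ("spaces", step4)]:
    print(f"{name:>12}: {value!r}")
```

Printing with !r shows the exact spacing, which makes it easier to see why the later patterns need the spaces that the first step introduces.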

When would you want to create your own tokenizer rather than use a library like spaCy?

I think in most scenarios you would probably want to use spaCy, except in situations where you want to test a specific hypothesis.

sentences = ['All this happened, more or less.',
             'The war parts, anyway, are pretty much true.',
             "One guy I knew really was shot for taking a teapot that wasn't his.",
             'Another guy I knew really did threaten to have his personal enemies killed by hired gunmen after the war.',
             'And so on.',
             "I've changed all their names."]
tokens = list(map(simple_toks, sentences))
tokens

Once we have our tokens, we need to convert them to integer ids.  We will also need to know our vocabulary, and have a way to convert between words and ids.

import collections

PAD = 0; SOS = 1

def toks2ids(sentences):
    voc_cnt = collections.Counter(t for sent in sentences for t in sent)
    vocab = sorted(voc_cnt, key=voc_cnt.get, reverse=True)
    vocab.insert(PAD, "<PAD>")
    vocab.insert(SOS, "<SOS>")
    w2id = {w:i for i,w in enumerate(vocab)}
    ids = [[w2id[t] for t in sent] for sent in sentences]
    return ids, vocab, w2id, voc_cnt
        
ids, vocab, w2id, voc_cnt = toks2ids(tokens)
' '.join(vocab[i] for i in ids[0])

The token-to-ID mapping (w2id) and the ID-to-token mapping (vocab) are the two mappings you will typically want, so that you can convert back and forth.


What are the uses of Regex?

  1. Find / Search
  2. Find & Replace
  3. Cleaning
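
A short sketch of the three uses with Python's re module. The sample text and the <PHONE> placeholder are made up for illustration:

```python
import re

text = "Contact:   Call 123-456-7890   or 123 456 7890 today!!"
phone = r'\d{3}[- ]\d{3}[- ]\d{4}'

# 1. Find / search: locate the first match, or list all of them
first = re.search(phone, text)
print(first.group(), first.span())
print(re.findall(phone, text))

# 2. Find & replace: mask every phone number
masked = re.sub(phone, '<PHONE>', text)
print(masked)

# 3. Cleaning: collapse repeated whitespace and repeated exclamation marks
clean = re.sub(r'\s+', ' ', masked)
clean = re.sub(r'!+', '!', clean)
print(clean)  # Contact: Call <PHONE> or <PHONE> today!
```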

Regex unicode example

message = "😒🎦 🤢🍕"

re_frown = re.compile(r"😒|🤢")
re_frown.sub(r"😊", message)

Regex Errors

False positives (Type I): Matching strings that we should not have matched

False negatives (Type II): Not matching strings that we should have matched

Reducing the error rate for a task often involves two antagonistic efforts:

  1. Minimizing false positives
  2. Minimizing false negatives

Important to have tests for both!

In a perfect world, you would be able to minimize both but in reality you often have to trade one for the other.
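
One way to keep both error types visible is to maintain two lists of test strings, one that must match and one that must not, and report which ones a pattern gets wrong. The sketch below reuses the phone-number strings from the first example; error_report and the two alternative patterns are hypothetical names and choices, shown only to illustrate the trade-off:

```python
import re

should_match = ["123-456-7890", "123 456 7890"]
should_not_match = ["101 Howard", "1234", "34 50 98 21 32", "(34)(50)()()982132"]

def error_report(pattern, positives, negatives):
    """Return (false_positives, false_negatives) for a compiled pattern."""
    false_positives = [s for s in negatives if pattern.fullmatch(s)]
    false_negatives = [s for s in positives if not pattern.fullmatch(s)]
    return false_positives, false_negatives

balanced = re.compile(r'\d{3}[- ]\d{3}[- ]\d{4}')
too_loose = re.compile(r'[\d\- ()]+')           # accepts any mix of digits and separators
too_strict = re.compile(r'\d{3}-\d{3}-\d{4}')   # accepts hyphens only

for name, pat in [("balanced", balanced), ("too loose", too_loose), ("too strict", too_strict)]:
    fp, fn = error_report(pat, should_match, should_not_match)
    print(f"{name:>10}: false positives={fp} false negatives={fn}")
```

The loose pattern has no false negatives but lets non-phone strings through, while the strict one has no false positives but misses the space-separated number.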


Best Practice

If I had to give one piece of advice about regex, it would be: "Be as specific as possible."

A useful thing to remember is that . is a wildcard: it matches any character (except a newline, by default). So how do you indicate that you actually want a period?
In a regular expression, an escape sequence is formed by placing the metacharacter \ (backslash) in front of the metacharacter that we want to use as a literal.

'\.' means match a literal period character (not any character)
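
A small sketch of the difference, using made-up strings. It also shows re.escape, a standard-library helper that adds the backslashes for you when a literal string has to be embedded in a pattern:

```python
import re

candidates = ["3.14", "3514", "3-14"]

unescaped = re.compile(r'3.14')   # . matches any single character
escaped = re.compile(r'3\.14')    # \. matches only a literal period

print([s for s in candidates if unescaped.fullmatch(s)])  # ['3.14', '3514', '3-14']
print([s for s in candidates if escaped.fullmatch(s)])    # ['3.14']

# re.escape produces the escaped form of a literal string
print(re.escape("3.14"))  # 3\.14
```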


Advantages & Disadvantages

Advantages

  1. Concise and powerful pattern matching DSL
  2. Supported by many programming languages, including SQL (it's not just a Python thing)

Disadvantages

  1. Brittle (flip side of being concise)
  2. Hard to write, can get complex to be correct
  3. Hard to read

Regex Workflow

  1. Create pattern in Plain English
  2. Map to regex language
  3. Make sure the results are correct (minimize false positives and false negatives)
  4. Don't over-engineer your regex (your goal is to Get Stuff Done, not to write the best regex in the world)
  5. Filtering before and after is okay (regex is just a tool, so you don't have to use it end to end; use it in conjunction with other things)
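
As an illustration of this workflow, here is a sketch on a hypothetical task: pulling YYYY-MM-DD dates out of text. The plain-English description lives in the comments of a re.VERBOSE pattern, and instead of encoding month and day ranges in the regex, a simple pattern finds candidates and datetime.date filters out the impossible ones afterwards. The names DATE_RE and find_dates are made up for this example:

```python
import re
from datetime import date

# Steps 1 and 2: the plain-English pattern, mapped to regex line by line
DATE_RE = re.compile(r"""
    \b(\d{4})    # a four-digit year
    -(\d{2})     # a hyphen, then a two-digit month
    -(\d{2})\b   # a hyphen, then a two-digit day
""", re.VERBOSE)

def find_dates(text):
    """Return the valid calendar dates written as YYYY-MM-DD in text."""
    found = []
    for year, month, day in DATE_RE.findall(text):
        try:
            # Step 5: filter after matching rather than over-engineering the regex
            found.append(date(int(year), int(month), int(day)))
        except ValueError:
            pass  # not a real date, e.g. month 13
    return found

text = "Released 2021-03-15, patched 2021-13-40, retired 2023-11-30."
print(find_dates(text))
```

Step 3 then amounts to checking the output against strings that should and should not be accepted, here the impossible date 2021-13-40.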

Practice Material

The best way to learn how to use regex is through practice. Otherwise, you feel like you're just reading lists of rules. Some good websites to practice and further develop your Regex skills are:

A useful real-world example is the one below where we try to pick out PDFs (ending in .pdf) given a list of filenames.
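
A minimal sketch of that task. The filenames are invented for illustration, and treating the extension as case-insensitive is an assumption:

```python
import re

# Hypothetical list of filenames
filenames = ["report.pdf", "notes.txt", "slides.PDF",
             "pdf_summary.docx", "archive.pdf.zip", "thesis.pdf"]

# \. is a literal period, $ anchors the match at the end of the name
PDF_RE = re.compile(r'\.pdf$', re.IGNORECASE)

pdfs = [name for name in filenames if PDF_RE.search(name)]
print(pdfs)  # ['report.pdf', 'slides.PDF', 'thesis.pdf']
```

Without the escaped period and the $ anchor, names such as pdf_summary.docx or archive.pdf.zip would be false positives.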


Conclusion

This brings us to the end of this article. I hope you now have a basic understanding of how you can use regex. Feel free to use the Python code snippets from this article.

If you liked this article, please consider subscribing to my blog. That way I know that my work is valuable to you, and I can notify you about future articles.
Thanks for reading. I am looking forward to hearing your questions :)
Stay tuned and Happy Machine Learning.


‌References