# Regex introduction

# What is a regex?

Regex stands for regular expression, and regular expressions are a way of writing patterns that match strings. Usually these patterns can be used to search strings for specific things, or to search and then replace certain things, etc. Regular expressions are great for string manipulation!

# Why do regular expressions matter?

From the first paragraph in this guide you might have guessed it, but regular expressions can be very useful whenever you have to deal with strings, from the basic renaming of a set of similarly named variables in your source code to data preprocessing. Regular expressions usually offer a concise way of expressing whatever type of things you want to find. For example, if you wanted to parse a form and look for the year that someone might have been born in, you could use something like (19|20)[0-9][0-9]. This is an example of a regular expression!

# Prerequisites

This guide does not assume any prior knowledge. Examples will be coded in Python, but mastery of the programming language is neither assumed nor needed. You are welcome to read the guide in your browser, or to download it, run the examples and toy around with them.

# Index

Let's dive right in!

Just a quick word: I tried to include some small exercises whenever I show you something new, so that you can try and test your knowledge. Examples of solutions are provided at the end of the notebook.

# Basic regex

A regex is just a string written in a certain format, that can then be used by specific tools/libraries/programs to perform pattern matching on strings. Throughout this guide we will use this formatting to refer to regular expressions!

The simplest regular expressions that one can create are just composed of regular characters. If you wanted to find all the occurrences of the word "Virgilio" in a text, you could write the regex Virgilio. In this regular expression, no character is doing anything special or different. In fact, this regular expression is just a normal word. That is ok, regular expressions are strings, after all!

If you were given the text "Project Virgilio is great", you could use your Virgilio regex to find the occurrence of the word "Virgilio". However, if the text was "Project virgilio is great", then your regex wouldn't work, because regular expressions are case-sensitive by default and thus should match everything exactly. We say that Virgilio matches the sequence of characters "Virgilio" literally.

# Using Python re

To check if our regular expressions are working well and to give you the opportunity to directly experiment with them, we will be using Python's re module to work with regular expressions. To use the re module we first import it, then define a regular expression and then use the search() function over a string! Pretty simple:

import re

regex = "Virgilio"
str1 = "Project Virgilio is great"
str2 = "Project virgilio is great"

if re.search(regex, str1):
    print("'{}' is in '{}'".format(regex, str1))
else:
    print("'{}' is not in '{}'".format(regex, str1))
    
if re.search(regex, str2):
    print("'{}' is in '{}'".format(regex, str2))
else:
    print("'{}' is not in '{}'".format(regex, str2))
'Virgilio' is in 'Project Virgilio is great'
'Virgilio' is not in 'Project virgilio is great'

The re.search(regex, string) function takes a regex as first argument and then searches for any matches over the string that was given as the second argument. However, the return value of the function is not a boolean, but a match object:

print(re.search(regex, str1))
<re.Match object; span=(8, 16), match='Virgilio'>

Match objects have relevant information about the match(es) encountered: the start and end positions, the string that was matched, and even some other things for more complex regular expressions.

We can see that in this case the match is exactly the same as the regular expression, so it may look like the match information inside the match object is irrelevant... but it becomes relevant as soon as we introduce options or repetitions into our regex.

If no matches are found, then the .search() function returns None:

print(re.search(regex, str2))
None

Whenever the match is not None, we can save the returned match object and use it to extract all the needed information!

m = re.search(regex, str1)
if m is not None:
    print("The match started at pos {} and ended at pos {}".format(m.start(), m.end()))
    print("Or with tuple notation, the match is at {}".format(m.span()))
    print("And btw, the actual string matched was '{}'".format(m.group()))
The match started at pos 8 and ended at pos 16
Or with tuple notation, the match is at (8, 16)
And btw, the actual string matched was 'Virgilio'

Now you should try to get some more matches and some fails with your own literal regular expressions. I provide three examples of my own:

m1 = re.search("regex", "This guide is about regexes")
if m1 is not None:
    print("The match is at {}\n".format(m1.span()))

m2 = re.search("abc", "The alphabet goes 'abdefghij...'")
if m2 is None:
    print("Woops, did I just get the alphabet wrong..?\n")
    
s = "aaaaa aaaaaa a aaa"
m3 = re.search("a", s)
if m3 is not None:
    print("I just matched '{}' inside '{}'".format(m3.group(), s))
The match is at (20, 25)

Woops, did I just get the alphabet wrong..?

I just matched 'a' inside 'aaaaa aaaaaa a aaa'

# $\pi$ lookup

$$\pi = 3.1415\cdots$$

right? Well, what comes after the dots? An infinite sequence of digits, right? Could it be that your date of birth appears in the first million digits of $\pi$? Well, we could use a regex to find that out! Change the regex variable below to look for your date of birth or for any number you want, in the first million digits of $\pi$!

pifile = "regex-bin/pi.txt"
regex = ""  # define your regex to look your favourite number up

with open(pifile, "r") as f:
    pistr = f.read()  # pistr is a string that contains 1M digits of pi
    
## search for your number here

To search for numbers in the first 100 million digits of $\pi$ (or 200 million, I didn't really get it) you can check this website.

# Matching options

We just saw a very simple regular expression that was trying to find the word "Virgilio" in text, but we also saw that we had zero flexibility and we couldn't even handle the fact that someone may have forgotten to capitalize the name properly, spelling it like "virgilio" instead.

To prevent problems like this, regular expressions can be written in a way to handle different possibilities. For our case, we want the first letter to be either "V" or "v", and that should be followed by "irgilio".

In order to handle different possibilities, we use the character |. For instance, V|v matches the letter vee, regardless of its capitalization:

v = "v"
V = "V"
regex = "v|V"
if re.search(regex, v):
    print("small v found")
if re.search(regex, V):
    print("big V found")
small v found
big V found

Now we can concatenate the regex for the first letter and the irgilio regex (for the rest of the name) to get a regex that matches the name of Virgilio, regardless of the capitalization of its first letter:

virgilio = "virgilio"
Virgilio = "Virgilio"
regex = "(V|v)irgilio"
if re.search(regex, virgilio):
    print("virgilio found!")
if re.search(regex, Virgilio):
    print("Virgilio found!")
virgilio found!
Virgilio found!

Notice that we write the regex with parentheses: (V|v)irgilio

If we only wrote V|virgilio, then the regular expression would match either "V" or "virgilio", instead of "Virgilio" or "virgilio":

regex = "V|virgilio"
print(re.search(regex, "This sentence only has a big V"))
<re.Match object; span=(29, 30), match='V'>

So we really need to parenthesize the (V|v) there. If we do, it will work as expected!

regex = "(V|v)irgilio"
print(re.search(regex, "The name of the project is virgilio, but with a big V!"))
print(re.search(regex, "This sentence only has a big V"))
<re.Match object; span=(27, 35), match='virgilio'>
None

Maybe you didn't even notice, but there is something else going on! Notice that we used the characters |, ( and ), and those are not present in the word "virgilio", but nonetheless our regex (V|v)irgilio matched it... that is because these three characters have special meanings in the regex world, and hence are not interpreted literally, contrary to what happens to any letter in irgilio.

# Virgilio or Virgil?

Here are a couple of paragraphs from Wikipedia's article on Virgil:

Publius Vergilius Maro (Classical Latin: [ˈpuː.blɪ.ʊs wɛrˈɡɪ.lɪ.ʊs ˈma.roː]; traditional dates October 15, 70 BC – September 21, 19 BC[1]), usually called Virgil or Vergil (/ˈvɜːrdʒɪl/) in English, was an ancient Roman poet of the Augustan period. He wrote three of the most famous poems in Latin literature: the Eclogues (or Bucolics), the Georgics, and the epic Aeneid. A number of minor poems, collected in the Appendix Vergiliana, are sometimes attributed to him.[2][3]

Virgil is traditionally ranked as one of Rome's greatest poets. His Aeneid has been considered the national epic of ancient Rome since the time of its composition. Modeled after Homer's Iliad and Odyssey, the Aeneid follows the Trojan refugee Aeneas as he struggles to fulfill his destiny and reach Italy, where his descendants Romulus and Remus were to found the city of Rome. Virgil's work has had wide and deep influence on Western literature, most notably Dante's Divine Comedy, in which Virgil appears as Dante's guide through Hell and Purgatory.

"Virgilio" is the Italian form of "Virgil", and in the code below I edited these paragraphs to have the Italian version instead of the English one. I want you to revert this!

You might want to take a look at while loops in Python, string indexing and string concatenation. The point is that you find a match, you break the string into the part before the match and the part after the match, and you glue those two together with Virgil in between.

Notice that string replacement would probably be faster and easier, but that would defeat the purpose of this exercise. After fixing everything, print the final results to be sure that you fixed every occurrence of the name.
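
The cut-and-glue step described above can be sketched on a different, made-up sentence, so that the exercise itself stays unsolved. The sentence, the variable names and the word being replaced are all illustrative assumptions. Only a single occurrence is handled here; repeating the step until re.search returns None is left to you.

```python
import re

# Hypothetical example text; these are not the exercise paragraphs.
text = "My favourite colour is blue"
m = re.search("colour", text)

if m is not None:
    before = text[:m.start()]  # everything up to the match
    after = text[m.end():]     # everything after the match
    text = before + "color" + after

print(text)
```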

paragraphs = \
"""Publius Vergilius Maro (Classical Latin: [ˈpuː.blɪ.ʊs wɛrˈɡɪ.lɪ.ʊs ˈma.roː]; traditional dates October 15, 70 BC – September 21, 19 BC[1]), usually called virgilio or Vergil (/ˈvɜːrdʒɪl/) in English, was an ancient Roman poet of the Augustan period. He wrote three of the most famous poems in Latin literature: the Eclogues (or Bucolics), the Georgics, and the epic Aeneid. A number of minor poems, collected in the Appendix Vergiliana, are sometimes attributed to him.[2][3]

Virgilio is traditionally ranked as one of Rome's greatest poets. His Aeneid has been considered the national epic of ancient Rome since the time of its composition. Modeled after Homer's Iliad and Odyssey, the Aeneid follows the Trojan refugee Aeneas as he struggles to fulfill his destiny and reach Italy, where his descendants Romulus and Remus were to found the city of Rome. virgilio's work has had wide and deep influence on Western literature, most notably Dante's Divine Comedy, in which virgilio appears as Dante's guide through Hell and Purgatory."""

# Matching repetitions

Sometimes we want to find patterns that have bits that will be repeated. For example, people make an "awww" or "owww" sound when they see something cute, like a baby. But the number of "w" I used there was completely arbitrary! If the baby is really really cute, someone might write "awwwwwwwwwww". So how can I write a regex that matches "aww" and "oww", but with an arbitrary number of characters "w"?

I will illustrate several ways of capturing repetitions, by testing regular expressions against the following strings:

  • "awww" (3 letters "w")
  • "awwww" (4 letters "w")
  • "awwwwwww" (7 letters "w")
  • "awwwwwwwwwwwwwwww" (16 letters "w")
  • "aw" (1 letter "w")
  • "a" (0 letters "w")
cute_strings = [
    "awww",
    "awwww",
    "awwwwwww",
    "awwwwwwwwwwwwwwww",
    "aw",
    "a"
]

def match_cute_strings(regex):
    """Takes a regex, prints matches and non-matches"""
    for s in cute_strings:
        m = re.search(regex, s)
        if m:
            print("match: {}".format(s))
        else:
            print("non match: {}".format(s))

# At least once

If I want to match all strings that contain at least one "w", I can use the character +. A + means that we want to find one or more repetitions of whatever was to the left of it. For example, the regex a+ will match any string that has at least one "a".

regex = "aw+"
match_cute_strings(regex)
match: awww
match: awwww
match: awwwwwww
match: awwwwwwwwwwwwwwww
match: aw
non match: a

# Any number of times

If I want to match all strings that contain an arbitrary number of letters "w", I can use the character *. The character * means match any number of repetitions of whatever comes on the left of it, even 0 repetitions! So the regex a* would match the empty string "", because the empty string "" has 0 repetitions of the letter "a".

regex = "aw*"
match_cute_strings(regex)
match: awww
match: awwww
match: awwwwwww
match: awwwwwwwwwwwwwwww
match: aw
match: a

# A specific number of times

If I want to match a string that contains a certain particle a specific number of times, I can use the {n} notation, where n is replaced by the number of repetitions I want. For example, a{3} matches the string "aaa" but not the string "aa".

regex = "aw{3}"
match_cute_strings(regex)
match: awww
match: awwww
match: awwwwwww
match: awwwwwwwwwwwwwwww
non match: aw
non match: a

Wait a minute, why did the pattern aw{3} match the longer expressions of cuteness, like "awwww" or "awwwwwww"? Because regular expressions try to find substrings that match the pattern. Our pattern is awww (if I write the w{3} explicitly) and the string awwww contains that substring, just like the string awwwwwww, or the longer version with 16 letters "w". If we wanted to exclude the strings "awwww", "awwwwwww" and "awwwwwwwwwwwwwwww" we would have to fix our regex.

A better way to see how {n} works is to consider, instead of expressions of cuteness, expressions of amusement like "wow", "woow" and "wooooooooooooow". We define some expressions of amusement:

  • "wow"
  • "woow"
  • "wooow"
  • "woooow"
  • "wooooooooow"

and now we test our {3} pattern.

wow_strings = [
    "wow",
    "woow",
    "wooow",
    "woooow",
    "wooooooooow"
]

def match_wow_strings(regex):
    """Takes a regex, prints matches and non-matches"""
    for s in wow_strings:
        m = re.search(regex, s)
        if m:
            print("match: {}".format(s))
        else:
            print("non match: {}".format(s))
regex = "wo{3}w"
match_wow_strings(regex)
non match: wow
non match: woow
match: wooow
non match: woooow
non match: wooooooooow
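
Coming back to the aw{3} surprise for a moment: if the goal is to accept only strings that are exactly "awww", one option is Python's re.fullmatch, which succeeds only when the pattern covers the whole string. This is a minimal sketch, not the only way to do it, and the helper name matches_exactly is made up for the illustration.

```python
import re

def matches_exactly(regex, s):
    """Return True only if the regex matches the whole string s."""
    return re.fullmatch(regex, s) is not None

for s in ["awww", "awwww", "aw"]:
    print(s, matches_exactly("aw{3}", s))
```

With re.search the string "awwww" was reported as a match; here only "awww" should pass.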

# Between $n$ and $m$ times

Expressing amusement with only three "o" is ok, but people might also use two or four "o". How can we capture a variable number of letters, but within a range? Say I only want to capture versions of "wow" that have between 2 and 4 letters "o". I can do it with {2,4}.

regex = "wo{2,4}w"
match_wow_strings(regex)
non match: wow
match: woow
match: wooow
match: woooow
non match: wooooooooow

# Up to $n$ times or at least $m$ times

Now we are just playing with the type of repetitions we might want, but of course we might say that we want no more than $n$ repetitions, which you would do with {,n}, or that we want at least $m$ repetitions, which you would do with {m,}.

In fact, take a look at these regular expressions:

regex = "wo{,4}w" # should not match strings with more than 4 o's
match_wow_strings(regex)
match: wow
match: woow
match: wooow
match: woooow
non match: wooooooooow
regex = "wo{3,}w" # should not match strings with less than 3 o's
match_wow_strings(regex)
non match: wow
non match: woow
match: wooow
match: woooow
match: wooooooooow

# To be or not to be

Last but not least, sometimes we care about something that might or might not be present. For example, above we dealt with the English and Italian versions of the name Virgilio. If we wanted to write a regular expression to capture both versions, we could write ((V|v)irgil)|((V|v)irgilio), or slightly more compact, (V|v)((irgil)|(irgilio)). But this does not look good at all, right? All we need to say is that the final "io" might or might not be present. We do this with the ? character. So the regex (V|v)irgil(io)? matches the upper and lower case versions of "Virgil" and "Virgilio".

regex = "(V|v)irgil(io)?"
names = ["virgil", "Virgil", "virgilio", "Virgilio"]
for name in names:
    m = re.search(regex, name)
    if m:
        print("The name {} was matched!".format(name))
The name virgil was matched!
The name Virgil was matched!
The name virgilio was matched!
The name Virgilio was matched!
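
As a small recap, the shorthand operators seen so far can all be rewritten with the brace notation: + behaves like {1,}, * behaves like {0,} and ? behaves like {0,1}. The sketch below checks this on a handful of strings; the list samples and the helper same_match are made up for this illustration.

```python
import re

samples = ["", "a", "aaa", "baab"]  # illustrative test strings

def same_match(regex1, regex2):
    """Check that both regexes find the same text in every sample."""
    for s in samples:
        m1 = re.search(regex1, s)
        m2 = re.search(regex2, s)
        found1 = m1.group() if m1 else None
        found2 = m2.group() if m2 else None
        if found1 != found2:
            return False
    return True

print(same_match("a+", "a{1,}"))
print(same_match("a*", "a{0,}"))
print(same_match("a?", "a{0,1}"))
```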

# Greed

The +, ?, * and {,} operators are all greedy. What does this mean? It means that they will try to match as much as possible. By default they keep consuming characters for as long as they can, instead of stopping as soon as the regex is satisfied. To better illustrate what I mean by this, let us look again at the information contained in the match object we have been dealing with:

regex = "a+"
s = "aaa"
m = re.search(regex, s)
print(m)
<re.Match object; span=(0, 3), match='aaa'>

Notice the part of the printed information that says match='aaa'. The method m.group() tells me the actual string that was matched by the regular expression, and in this case it was "aaa". Why does it make sense to have access to this information? Well, the regex I wrote, a+, will match one or more letters "a" in a row. If I use the regex over a string and I get a match, how would I be able to know how many "a"s were matched, if I didn't have access to that type of information?

print(m.group())
aaa

So let us verify that, in fact, the operators I mentioned are all greedy. Again, because they all match as many characters as they can.

Below, we see that given a string of thirty times the letter "a",

  • the pattern a? matches 1 "a", which is as much as it could
  • the pattern a+ matches 30 "a"s, which is as much as it could
  • the pattern a* also matches 30
  • the pattern a{5,10} matches 10 "a"s, which was the limit imposed by us
s = "a"*30
print(re.search("a?", s).group())
print(re.search("a+", s).group())
print(re.search("a*", s).group())
print(re.search("a{5,10}", s).group())
a
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaa

If we don't want our operators to be greedy, we just put an extra ? after them. So the following regular expressions are not greedy:

  • the pattern a?? will match no characters, much like a*?, because now their goal is to match as little as possible. But a match of length 0 is the shortest match possible!
  • the pattern a+? will only match 1 "a"
  • the pattern a{5,10}? will only match 5 "a"s

We can easily confirm what I just said by running the code below. Notice that now I print things differently, because otherwise we wouldn't be able to see the a?? and a*? patterns matching nothing.

s = "a"*30
print("'{}'".format(re.search("a??", s).group()))
print("'{}'".format(re.search("a+?", s).group()))
print("'{}'".format(re.search("a*?", s).group()))
print("'{}'".format(re.search("a{5,10}?", s).group()))
''
'a'
''
'aaaaa'

# Removing excessive spaces

Now that we know about repetitions, I am going to tell you about the sub function and we are going to use that to parse a piece of text and remove all extra spaces that are present. Typing in re.sub(regex, rep, string) will use the given regex on the given string, and whenever it matches, it removes the match and puts the rep in there.

For example, I can use that to replace all English/Italian occurrences of the name Virgilio with a standardized one:

s = "Virgilio has many names, like virgil, virgilio, Virgil, Vergil, or even vergil."
regex = "(V|v)(e|i)rgil(io)?"

print(
    re.sub(regex, "Virgilio", s)
)
Virgilio has many names, like Virgilio, Virgilio, Virgilio, Virgilio, or even Virgilio.


Now   it  is your   turn.  I am     going  to give   you this    sentence as        input, and   your  job    is to      fix the     whitespace         in it. When you    are  done,    save the    result in a  string  named   `s`, and   check    if  `s.count("  ")` is   equal   to    0  or not.
weird_text = 'Now   it  is your   turn.  I am     going  to give   you this    sentence as        input, and   your  job    is to      fix the     whitespace         in it. When you    are  done,    save the    result in a  string  named   `s`, and   check    if  `s.count("  ")` is   equal   to    0  or not.'
regex = ""  # put your regex here

# substitute the extra whitespace here
# save the result in 's'

# this print should be 0
print(s.count("  "))

# Character classes

So far we have been writing some simple regular expressions that match words, names, and things like that. Now we have a different plan. We will write a regular expression that matches US phone numbers, which we will assume are of the form xxx-xxx-xxxx. The first three digits are the area code, but we will not care about whether the area code actually makes sense or not. How do we match this, then?

In fact, how can I match the first digit? It can be any number from 0 to 9, so should I write (0|1|2|3|4|5|6|7|8|9) to match the first digit, and then repeat? Actually, we could do that, yes, to get this regex:

(0|1|2|3|4|5|6|7|8|9){3}-(0|1|2|3|4|5|6|7|8|9){3}-(0|1|2|3|4|5|6|7|8|9){4}

Does this work?

regex = "(0|1|2|3|4|5|6|7|8|9){3}-(0|1|2|3|4|5|6|7|8|9){3}-(0|1|2|3|4|5|6|7|8|9){4}"
numbers = [
    "202-555-0181",
    "202555-0181",
    "202 555 0181",
    "512-555-0191",
    "96-125-3546",
]
for nr in numbers:
    print(re.search(regex, nr))
<re.Match object; span=(0, 12), match='202-555-0181'>
None
None
<re.Match object; span=(0, 12), match='512-555-0191'>
None

It looks like it works, but surely there must be a better way... and there is! Instead of writing out every digit like we did, we can actually write a range of values! In fact, the regex [0-9] matches all digits from 0 to 9. So we can actually shorten our regex to [0-9]{3}-[0-9]{3}-[0-9]{4}:

regex = "[0-9]{3}-[0-9]{3}-[0-9]{4}"
numbers = [
    "202-555-0181",
    "202555-0181",
    "202 555 0181",
    "512-555-0191",
    "96-125-3546",
]
for nr in numbers:
    print(re.search(regex, nr))
<re.Match object; span=(0, 12), match='202-555-0181'>
None
None
<re.Match object; span=(0, 12), match='512-555-0191'>
None

The magic here is being done by the [], which denotes a character class. The way [] works is, the regex will try to match any of the things that are inside, and it just so happens that 0-9 is a shorter way of listing all the digits. Of course you could also do [0123456789]{3}-[0123456789]{3}-[0123456789]{4} which is slightly shorter than our first attempt, but still pretty bad. Similar to 0-9, we have a-z and A-Z, which go through all letters of the alphabet.

You can also start and end in different places, for example c-o can be used to match words that only use letters between the "c" and the "o", like "hello":

regex = "[c-o]+"
print(re.search(regex, "hello"))
print(re.search(regex, "rice"))
<re.Match object; span=(0, 5), match='hello'>
<re.Match object; span=(1, 4), match='ice'>

With these character classes we can actually rewrite our Virgilio regex into something slightly shorter, going from (V|v)(e|i)rgil(io)? to [Vv][ie]rgil(io)?.

s = "Virgilio has many names, like virgil, virgilio, Virgil, Vergil, or even vergil."
regex = "[Vv][ie]rgil(io)?"

print(
    re.sub(regex, "Virgilio", s)
)
Virgilio has many names, like Virgilio, Virgilio, Virgilio, Virgilio, or even Virgilio.

Going back to the [c-o]+ example for a moment: notice that our regular expression matched the ice in rice, because the "r" was not inside the legal range of letters, but ice was.

The character class is the square brackets [] and whatever goes inside it. Also, note that the special characters we have been using lose their meaning inside a character class! So [()?+*{}] will actually look to match any of those characters:

regex = "[()?+*{}]"
print(re.search(regex, "Did I just ask a question?"))
<re.Match object; span=(25, 26), match='?'>
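
One caveat worth checking for yourself: a few characters are still treated specially inside a character class. A - between two characters denotes a range, so to match a literal dash it is safest to place it first or last inside the brackets. A small sketch, with made-up test strings:

```python
import re

# "a-c" inside the brackets is a range: it matches a, b or c
print(re.search("[a-c]", "b"))
print(re.search("[a-c]", "-"))

# with the dash at the end, the class holds three literal characters: a, c and -
print(re.search("[ac-]", "b"))
print(re.search("[ac-]", "-"))
```

The first and last searches should find a match, while the two in the middle should print None.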

A final note on character classes, if they start with ^ then we are actually saying "use everything except what is inside this":

regex = "[^c-o]+"
print(re.search(regex, "hello"))
print(re.search(regex, "rice"))
None
<re.Match object; span=(0, 1), match='r'>

# Phone numbers v1

Now that you know how to use character classes to denote ranges, you need to write a regular expression that matches American phone numbers with the format xxx-xxx-xxxx. Not only that, but you must also cope with the fact that the numbers may or may not be preceded by the country indicator, which you can assume will look like "+1" or "001". The country indicator may be separated from the rest of the number with a space or with a dash.

regex = ""  # write your regex here
matches = [  # you should be able to match those
    "202-555-0181",
    "001 202-555-0181",
    "+1-512-555-0191"
]
non_matches = [  # for now, none of these should be matched
    "202555-0181",
    "96-125-3546",
    "(+1)5125550191"
]
for s in matches:
    print(re.search(regex, s))
for s in non_matches:
    print(re.search(regex, s))

# More re functions

So far we have mostly used the .search() function of the re module, but now I am going to tell you about a couple more functions that can be quite handy when you are dealing with pattern matching. By the time you are done with this small section, you will know the following functions: match(), search(), findall(), sub() and split().

If you are here mostly for the regular expressions, and you don't care much about using them with Python, you can just skim through this section... even though it is still a nice read.

# search() and sub()

You already know these two functions. re.search(regex, string) will try to find your pattern given by regex in the given string and return the information of the match in a match object, or None if there is no match. The function re.sub(regex, rep, string) takes a regex and two strings; it looks for the pattern you specified in string and replaces every match with the other string, rep.

# match()

The function re.match(regex, string) is similar to the function re.search(), except that .match() will only check if your pattern applies to the beginning of the string. That is, if your string does not start with the pattern you provided, the function returns None.

regex = "abc"
string1 = "abcdef"
string2 = "the alphabet starts with abc"
# the .search() function finds the patterns, regardless of position
if re.search(regex, string1):
    print(".search() found {} in {}".format(regex, string1))
if re.search(regex, string2):
    print(".search() found {} in {}".format(regex, string2))
    
# the .match() function only checks if the string STARTS with the pattern
if re.match(regex, string1):
    print(".match() says that {} starts with {}".format(string1, regex))
if re.match(regex, string2):  # this one should NOT print
    print(".match() says that {} starts with {}".format(string2, regex))
.search() found abc in abcdef
.search() found abc in the alphabet starts with abc
.match() says that abcdef starts with abc

# findall()

The re.findall(regex, string) function works like the .search() function, except that it will return all the matches it can find, instead of just the first one. Instead of returning match objects, it returns a list with the strings that matched (at least for regexes without parentheses, like the ones below).

regex = "wow"
string = "wow wow wow!"

print(re.search(regex, string))

print(re.findall(regex, string))
<re.Match object; span=(0, 3), match='wow'>
['wow', 'wow', 'wow']
regex = "ab[0-9]"
string = "ab1 ab2 ab3"

print(re.search(regex, string))

print(re.findall(regex, string))
<re.Match object; span=(0, 3), match='ab1'>
['ab1', 'ab2', 'ab3']
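
One thing to watch out for: when the regex contains parentheses, findall() returns what each pair of parentheses matched rather than the whole match. The sketch below shows this with the Virgilio regex from before, and then with an alternative written without parentheses. The sentence and the second regex are made up for this example.

```python
import re

s = "virgil and Virgilio"

# two pairs of parentheses: one tuple per match, one entry per pair
print(re.findall("(V|v)irgil(io)?", s))

# no parentheses at all: the full matched strings come back
print(re.findall("[Vv]irgilio|[Vv]irgil", s))
```

If you need the full match and the regex has parentheses, going through match objects (as with .search()) avoids the surprise.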

It is important to note that the findall() function only returns non-overlapping matches. That is, one could argue that wow appears twice in "wowow": once at the beginning (the first three characters) and once at the end (the last three characters). Nonetheless, findall() only returns one match because the second match overlaps with the first:

regex = "wow"
string = "wowow"
print(re.findall(regex, string))
['wow']

With this information it now makes a bit more sense to consider the greediness of the operators we showed before, like ? and +. Imagine we are dealing with the regex a+ and we have a string "aaaaaaaaa". If we use the greedy version of +, then we get a single match which is the whole string. If we use the non-greedy version of the operator +, perhaps because we want as many matches as possible, we will get a bunch of "a" matches!

regex_greedy = "a+"
regex_nongreedy = "a+?"
string = "aaaaaaaaa"

print(re.findall(regex_greedy, string))

print(re.findall(regex_nongreedy, string))
['aaaaaaaaa']
['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a']

# split()

The re.split(regex, string) splits the given string into bits wherever it is able to find the pattern you specified. Say we are interested in finding all the sequences of consecutive consonants in a sentence (I don't know why you would want that...). Then we can use the vowels and the space " " to break up the sentence:

regex = "[aeiou ]+" # this will eliminate all vowels/spaces that appear consecutively
string = "This is just a regular sentence"

print(re.split(regex, string))
['Th', 's', 's', 'j', 'st', 'r', 'g', 'l', 'r', 's', 'nt', 'nc', '']

# search with match

Recall that the match() function only checks if your pattern is at the beginning of the string. What I want you to do is define your own search function that takes a regex and a string, and returns True if the pattern is inside the string, and False otherwise. Can you do it?

def my_search(regex, string):
    pass  # write your code here

regex = "[0-9]{2,4}"

# your function should be able to match in all these strings
string1 = "1984 was already some years ago."
string2 = "There is also a book whose title is '1984', but the story isn't set in the year of 1984."
string3 = "Sometimes people write '84 for short."

# your function should also match with this regex and this string
regex = "a*"
string = ""

# Count matches with findall

Now I want you to define the count_matches function, which takes a regex and a string, and returns the number of non-overlapping matches that exist in the given string. Can you do it?

def count_matches(regex, string):
    pass  # your code goes here

regex = "wow"

string1 = "wow wow wow" # this should be 3
string2 = "wowow" # this should be 1
string3 = "wowowow" # this should be 2

# Special characters

It is time to ramp things up a bit! We have seen some characters that have special meanings, and now I am going to introduce a couple more of those! I will start by listing them, and then I'll explain them in more detail:

  • . is used to match any character, except for a newline
  • ^ is used to match at the beginning of the string
  • $ is used to match at the end of the string
  • \d is used to match any digit
  • \w is used to match any alphanumeric character (or the underscore _)
  • \s is used to match any type of whitespace
  • \ is used to remove the special meaning of the characters

# Dot .

The . can be used in a regular expression to capture any character that might have been used there, as long as we are still in the same line. That is, the only place where . doesn't work is where the text changes lines. Imagine the pattern was d.ck. Then the pattern would match

"duck"

but it would not match

"d
ck"

because we changed lines in the middle of the string.
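
A short sketch of the dot in action, using made-up strings. It also shows how the dot interacts with the greedy and non-greedy repetitions from the Greed section: .+ grabs as much of the line as it can, while .+? stops as early as possible.

```python
import re

print(re.search("d.ck", "duck"))    # the dot stands for the "u"
print(re.search("d.ck", "d\nck"))   # "\n" is a newline, so no match

s = "<b>bold</b>"
print(re.search("<.+>", s).group())   # greedy: runs to the last ">"
print(re.search("<.+?>", s).group())  # non-greedy: stops at the first ">"
```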

# Caret ^

If we use a ^ in the beginning of the regular expression, then we only care about matches in the beginning of the string. That is, ^wow would only match if the string started with "wow":

regex = "^wow"

print(re.search(regex, "wow, this is awesome"))
print(re.search(regex, "this is awesome, wow"))
<re.Match object; span=(0, 3), match='wow'>
None

Recall that ^ as the first character inside a character class means "anything but whatever is in this class", so the regular expression [^d]uck matches any single character other than "d" followed by uck: it finds a match in "truck" or "luck", but not in "duck". If the caret ^ appears inside a character class [] but it is not the first character, then it has no special meaning and it just stands for the character itself. This means that the regex [()^{}] is looking to match any of the characters listed:

regex = "[()^{}]"
print(re.search(regex, "^"))
print(re.search(regex, "("))
print(re.search(regex, "}"))
<re.Match object; span=(0, 1), match='^'>
<re.Match object; span=(0, 1), match='('>
<re.Match object; span=(0, 1), match='}'>
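The negated class [^d]uck mentioned above can be checked in the same way. This is only a quick sketch with strings I picked for the occasion; note that [^d] still has to consume one character, so "uck" on its own does not match:

```python
import re

regex = "[^d]uck"

truck = re.search(regex, "truck")  # "r" is not "d", so "ruck" matches
duck = re.search(regex, "duck")    # the only character before "uck" is "d"
bare = re.search(regex, "uck")     # there is no character at all before "uck"

print(truck.group())  # ruck
print(duck)           # None
print(bare)           # None
```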

# Dollar sign $

Unlike the caret ^, which anchors a match to the beginning of the string, the dollar sign $ only matches at the end of the string!

regex = "wow$"

print(re.search(regex, "wow, this is awesome"))
print(re.search(regex, "this is awesome, wow"))
None
<re.Match object; span=(17, 20), match='wow'>

Combining the ^ with the $ means we are looking to match the whole string with our pattern. For example ^[a-zA-Z ]*$ checks if our string only contains letters and spaces and nothing else:

regex = "^[a-zA-Z ]*$"

s1 = "this is a sentence with only letters and spaces"
s2 = "this sentence has 1 number"
s3 = "this one has punctuation..."

print(re.search(regex, s1))
print(re.search(regex, s2))
print(re.search(regex, s3))
<re.Match object; span=(0, 47), match='this is a sentence with only letters and spaces'>
None
None
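To see why both anchors are needed, it may help to compare the anchored regex with the same pattern without ^ and $. In this sketch (the names unanchored and anchored are mine) the unanchored version happily matches the part of the string that comes before the digit:

```python
import re

s2 = "this sentence has 1 number"

unanchored = re.search("[a-zA-Z ]*", s2)   # free to stop wherever it wants
anchored = re.search("^[a-zA-Z ]*$", s2)   # must cover the whole string

print(repr(unanchored.group()))  # 'this sentence has '
print(anchored)                  # None
```

Without the anchors the regex only tells you that some piece of the string is made of letters and spaces, which is true of almost any string.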

# Character groups \d, \w and \s

Whenever you see a backslash followed by a letter, that probably means that something special is going on. These three special "characters" are shorthand notation for some character classes []. For example, the \d is the same as [0-9]. The \w represents any alphanumeric character (like letters, numbers and _), and \s represents any whitespace character (like the space " ", the tab, the newline, etc).

All three of these special characters can be capitalized. If they are, then they mean the exact opposite! So \D means "anything except a digit", \W means "anything except an alphanumeric character" and \S means "anything except a whitespace character".

regex = r"\D+"  # raw string, so Python leaves the backslash alone
s = "these are some words"
print(re.findall(regex, s))
['these are some words']

On top of that, these special characters can be used inside a character class, so for instance [abc\d] would match any digit and the letters "a", "b" and "c". If the caret character ^ comes first in the class, then we are excluding whatever the special character refers to. As an example, since [\d] matches any digit, [^\d] matches anything that is not a digit.
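Here is a short sketch of those character classes in action. The sample string is made up, and the patterns are written as raw strings (the r prefix) so that Python passes the backslashes to the re module untouched:

```python
import re

s = "room 42b, floor 3"

# any digit, or one of the letters "a", "b", "c"
digits_or_abc = re.findall(r"[abc\d]", s)
# runs of characters that are neither digits nor whitespace
no_digits_no_spaces = re.findall(r"[^\d\s]+", s)

print(digits_or_abc)        # ['4', '2', 'b', '3']
print(no_digits_no_spaces)  # ['room', 'b,', 'floor']
```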

# The backslash \

We already saw the backslash being used before letters to give them some special meaning... Well, the backslash before a special character does the opposite and strips it of its special meaning! So, if you wanted to match a backslash, you could use \\. If you want to match any of the other special characters we already saw, you could put a \ before them, like \+ to match a plus sign. The next regular expression can be used to match an addition expression like "16 + 6":

regex = r"[\d]+ ?\+ ?[\d]+"
add1 = "16 + 6"
add2 = "4325+2"
add3 = "4+ 564"
mult1 = "56 * 2"

print(re.search(regex, add1))
print(re.search(regex, add2))
print(re.search(regex, add3))
print(re.search(regex, mult1))
<re.Match object; span=(0, 6), match='16 + 6'>
<re.Match object; span=(0, 6), match='4325+2'>
<re.Match object; span=(0, 6), match='4+ 564'>
None
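One thing to keep in mind when matching a literal backslash from Python: the backslash is also special inside ordinary Python string literals, so the two-character regex \\ has to be typed either with four backslashes or as a raw string. The sketch below uses a made-up Windows-style path to show that both spellings give the same regex:

```python
import re

path = "C:\\temp"        # the actual text is C:\temp, with a single backslash

regex_escaped = "\\\\"   # Python turns this into the two-character regex \\
regex_raw = r"\\"        # raw string: the same two-character regex

print(len(path))                    # 7
print(regex_escaped == regex_raw)   # True
m = re.search(regex_raw, path)
print(m.span())                     # (2, 3)
```

Raw strings tend to be the more readable choice whenever a regex contains backslashes.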

# Phone numbers v2

Now I invite you to take a look at Phone numbers v1 and rewrite your regular expression to include some new special characters that you didn't know before!

regex = ""  # write your regex here
matches = [  # you should be able to match those
    "202-555-0181",
    "001 202-555-0181",
    "+1-512-555-0191"
]
non_matches = [  # for now, none of these should be matched
    "202555-0181",
    "96-125-3546",
    "(+1)5125550191"
]
for s in matches:
    print(re.search(regex, s))
for s in non_matches:
    print(re.search(regex, s))

# Groups

So far, when we used a regex to match a string, we could retrieve the full text of the match by calling the .group() function on the match object:

regex = "my name? is"

m = re.search(regex, "my nam is Virgilio")
if m is not None:
    print(m.group())
my nam is

Say we are dealing with phone numbers again, and we want to look for phone numbers in a big text. But after that, we also want to extract the country the number is from. How could we do it..? Well, we can use a regex to match the phone numbers, and then use a second regex to extract the country code, right? (Let us just assume that phone numbers are written with the digits all in a sequence, with no spaces or "-" separating them.)

regex_number = r"((00|[+])\d{1,3}[ -])\d{8,12}"
regex_code = r"((00|[+])\d{1,3})"
matches = [  # you should be able to match those
    "+351 2025550181",
    "001 2025550181",
    "+1-5125550191",
    "0048 123456789"
]

for s in matches:
    m = re.search(regex_number, s)  # match the phone number
    if m is not None:
        phone_number = m.group()    # extract the phone number
        code = re.search(regex_code, phone_number)  # match the country code
        print("The country code is: {}".format(code.group()))
The country code is: +351
The country code is: 001
The country code is: +1
The country code is: 0048

Not only is this repetitive, because I just copied the beginning of regex_number into regex_code, but it also becomes very cumbersome if I am trying to retrieve several different parts of my match. This is why regular expressions support grouping. By grouping parts of the regular expression, you can do things like apply the repetition operators to them and retrieve their contents later on.

To do grouping, one only needs to use parentheses (). For example, the regex (ab)+ looks for matches of the form "ab", "abab", "ababab", and so on.

We also used the grouping in the beginning to create a regex that matched "Virgilio" and "virgilio", by writing (V|v)irgilio.
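As a quick sketch of what the parentheses change, compare (ab)+ with ab+ on the same made-up string. With the group, the + repeats the whole "ab"; without it, the + only applies to the "b":

```python
import re

grouped = re.search("(ab)+", "xxababab!")    # the whole group repeats
ungrouped = re.search("ab+", "xxababab!")    # only the "b" repeats
name = re.search("(V|v)irgilio", "my guide is virgilio")

print(grouped.group())    # ababab
print(ungrouped.group())  # ab
print(name.group())       # virgilio
```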

Now off to the part that really matters! We can use grouping to retrieve portions of the matches, and we do that with the .group() function! Any set of () defines a group, and then we can use the .group(i) function to retrieve group i. Just note that the 0th group is always the whole match, and the remaining groups are numbered from 1 in the order in which their opening parentheses appear, from left to right!

regex_with_grouping = "(abc) (de(fg)hi)"
m = re.search(regex_with_grouping, "abc defghi jklm n opq")
print(m.group())
print(m.group(0))
print(m.group(1))
print(m.group(2))
print(m.group(3))
print(m.groups())
abc defghi
abc defghi
abc
defghi
fg
('abc', 'defghi', 'fg')

Notice that match.group() and match.group(0) are the same thing. Also note that the function match.groups() returns all the groups in a tuple!

# Phone numbers v3

Using what you learned so far, write a regex that matches phone numbers with different country codes. Assume the following:

  • The country code starts with either 00 or +, followed by one to three digits
  • The phone number has between 8 and 12 digits
  • The phone number and country code are separated by a space " " or by a hyphen "-"

Have your code look for phone numbers in the string I will provide next, and have it print the different country codes it finds.

You might want to read what the exact behaviour of re.findall() is when the regex has groups in it. You can do that by checking the documentation of the re module.
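In case it helps while you read the documentation, here is a small sketch of how re.findall() changes its return value depending on the number of groups in the regex. The text and the patterns are made up for the illustration and have nothing to do with phone numbers:

```python
import re

text = "x=1, y=22, z=333"

no_groups = re.findall(r"[a-z]=\d+", text)       # whole matches
one_group = re.findall(r"[a-z]=(\d+)", text)     # only the group's contents
two_groups = re.findall(r"([a-z])=(\d+)", text)  # one tuple per match

print(no_groups)   # ['x=1', 'y=22', 'z=333']
print(one_group)   # ['1', '22', '333']
print(two_groups)  # [('x', '1'), ('y', '22'), ('z', '333')]
```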

paragraph = """Hello, I am Virgilio and I am from Italy.
If phones were a thing when I was alive, my number would've probably been 0039 3123456789.
I would also love to get a house with 3 floors and something like +1 000 square meters.
Now that we are at it, I can also tell you that the number 0039 3135313531 would have suited Leo da Vinci very well...
And come to think of it, someone told me that Socrates had dibs on +30-2111112222"""
# you should find 3 phone numbers
# and you should not be fooled by the other numbers that show up in the text

# Toy project about regex

For the toy project, which is far from trivial, your task is to mimic what I did here. If you follow that link, you will find a piece of code that takes a regular expression and then prints all the strings that the given regex would match.

I'll just give you a couple of examples on how this works:

import sys
sys.path.append("./regex-bin")
import regexPrinter

def get_iter(regex):
    return regexPrinter.printRegex(regex).print()

def printall(regex):
    for poss_match in get_iter(regex):
        print(poss_match)

regex = "V|virgilio"
printall(regex)
print("-"*30)
regex = "wo+w"
printall(regex)
print("-"*30)
# notice that for some reason, dumb me used {n:m} instead of {n,m}
# also note that I only implemented {n,m}, and not {n,} nor {,m} nor {n}
# also note that this supports neither \d nor [0-9]
regex = "((00|[+])1[ -])?[0123456789]{3:3}"
printall(regex)

Note that the code is protected against infinite patterns, which are signaled with "...":

printall("this is infinite!+")
this is infinite!
this is infinite!!
this is infinite!...!

If you are completely new to this sort of thing, then this will look completely impossible... but it is not, because I am a normal person and I was able to do it! So if you really want to, you can do it too! The link lists all the functionality I decided to include, which excludes \d, for example.

I was only able to do this in the way I did because I had gone through some (not all) of the blog posts in this amazing series.

Maybe you can implement a smaller subset of the features without too much trouble? The point of this is that you could only print the strings matched by a regex if you know how regular expressions work. Try starting with only implementing literal matching and the | and ? operators. Can you now include grouping () so that (ab)? would work as expected? Can you add []? What about + and *? Or maybe start with {n,m} and write ?, + and * as {0,1}, {1,} and {0,} respectively.
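If you go down the {n,m} route, you can convince yourself that the rewriting is sound before you build on it. The sketch below is not part of my printer; it just uses Python's re module to check that each shorthand operator and its brace form agree on a handful of made-up strings. The helper matches_whole is a name I invented for this check:

```python
import re

samples = ["w", "wo", "ww", "wow", "woow", "wooooow"]

# each shorthand operator next to its {n,m} spelling
pairs = [
    ("wo?w", "wo{0,1}w"),
    ("wo+w", "wo{1,}w"),
    ("wo*w", "wo{0,}w"),
]

def matches_whole(regex, string):
    # hypothetical helper: does the regex match the entire string?
    return re.search("^" + regex + "$", string) is not None

results = []
for short, braces in pairs:
    agree = all(matches_whole(short, s) == matches_whole(braces, s) for s in samples)
    results.append((short, braces, agree))
    print(short, braces, agree)
```

A check like this only covers the strings you feed it, so it is a sanity test rather than a proof.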

You can also postpone this project for a bit, and dig deeper into the world of regex. The next section contains some additional references and some websites with exercises to practice your new knowledge!

# Further reading

For regular expressions in Python, you can take a look at the documentation of the re module, as well as this regex HOWTO.

Some nice topics to follow up on this would include, but are not limited to:

  • Non capturing groups (and named groups for Python)
  • Lookaheads (positive, negative, ...)
  • Regex compilation and flags (for Python)
  • Recursive regular expressions

This interesting website (and this one as well) provides an interface for you to type regular expressions and see what they match in a text. The tool also gives you an explanation of what your regular expression is doing.


I found some interesting websites with exercises on regular expressions. This one has more "basic" exercises, each one of them preceded by an explanation of whatever you will need to complete the exercise. I suggest you go through them. Hackerrank and regexplay also have some interesting exercises, but those require you to log in in some way.


If you enjoyed this guide and/or it was useful, consider leaving a star in the Virgilio repository and sharing it with your friends!

This was brought to you by the editor of the Mathspp Blog, RojerGS.

# Suggested solutions

# $\pi$ lookup (solved)

pifile = "regex-bin/pi.txt"
regex = "9876"  # define your regex to look your favourite number up

with open(pifile, "r") as f:
    pistr = f.read()  # pistr is a string that contains 1M digits of pi
    
## search for your number here
m = re.search(regex, pistr)
if m:
    print("Found the number '{}' at positions {}".format(regex, m.span()))
else:
    print("Sorry, the first million digits of pi can't help you with that...")
Found the number '9876' at positions (4087, 4091)

# Virgilio or Virgil? (solved)

paragraphs = \
"""Publius Vergilius Maro (Classical Latin: [ˈpuː.blɪ.ʊs wɛrˈɡɪ.lɪ.ʊs ˈma.roː]; traditional dates October 15, 70 BC – September 21, 19 BC[1]), usually called virgilio or Vergil (/ˈvɜːrdʒɪl/) in English, was an ancient Roman poet of the Augustan period. He wrote three of the most famous poems in Latin literature: the Eclogues (or Bucolics), the Georgics, and the epic Aeneid. A number of minor poems, collected in the Appendix Vergiliana, are sometimes attributed to him.[2][3]

Virgilio is traditionally ranked as one of Rome's greatest poets. His Aeneid has been considered the national epic of ancient Rome since the time of its composition. Modeled after Homer's Iliad and Odyssey, the Aeneid follows the Trojan refugee Aeneas as he struggles to fulfill his destiny and reach Italy, where his descendants Romulus and Remus were to found the city of Rome. virgilio's work has had wide and deep influence on Western literature, most notably Dante's Divine Comedy, in which virgilio appears as Dante's guide through Hell and Purgatory."""

regex = "(V|v)irgilio"
parsed_str = paragraphs
m = re.search(regex, parsed_str)
while m is not None:
    parsed_str = parsed_str[:m.start()] + "Virgil" + parsed_str[m.end():]
    m = re.search(regex, parsed_str)

print(parsed_str)
Publius Vergilius Maro (Classical Latin: [ˈpuː.blɪ.ʊs wɛrˈɡɪ.lɪ.ʊs ˈma.roː]; traditional dates October 15, 70 BC – September 21, 19 BC[1]), usually called Virgil or Vergil (/ˈvɜːrdʒɪl/) in English, was an ancient Roman poet of the Augustan period. He wrote three of the most famous poems in Latin literature: the Eclogues (or Bucolics), the Georgics, and the epic Aeneid. A number of minor poems, collected in the Appendix Vergiliana, are sometimes attributed to him.[2][3]

Virgil is traditionally ranked as one of Rome's greatest poets. His Aeneid has been considered the national epic of ancient Rome since the time of its composition. Modeled after Homer's Iliad and Odyssey, the Aeneid follows the Trojan refugee Aeneas as he struggles to fulfill his destiny and reach Italy, where his descendants Romulus and Remus were to found the city of Rome. Virgil's work has had wide and deep influence on Western literature, most notably Dante's Divine Comedy, in which Virgil appears as Dante's guide through Hell and Purgatory.

# Removing excessive spaces (solved)

weird_text = "Now   it  is your   turn.  I am     going  to give   you this    sentence as        input, and   your  job    is to      fix the     whitespace         in it. When you    are  done,    save the    result in a  string  named   `s`, and   check    if  `s.count("  ")` is   equal   to    0  or not."
regex = " +"  # put your regex here
# there are several possible solutions, I chose this one

# substitute the extra whitespace here
s = re.sub(regex, " ", weird_text)

# this print should be 0
print(s.count("  "))
print(s)
0
Now it is your turn. I am going to give you this sentence as input, and your job is to fix the whitespace in it. When you are done, save the result in a string named `s`, and check if `s.count()` is equal to 0 or not.

# Phone numbers v1 (solved)

regex = "((00|[+])1[ -])?[0-9]{3}-[0-9]{3}-[0-9]{4}"  # write your regex here
matches = [  # you should be able to match those
    "202-555-0181",
    "001 202-555-0181",
    "+1-512-555-0191"
]
non_matches = [  # for now, none of these should be matched
    "202555-0181",
    "96-125-3546",
    "(+1)5125550191"
]
for s in matches:
    print(re.search(regex, s))
for s in non_matches:
    print(re.search(regex, s))
<re.Match object; span=(0, 12), match='202-555-0181'>
<re.Match object; span=(0, 16), match='001 202-555-0181'>
<re.Match object; span=(0, 15), match='+1-512-555-0191'>
None
None
None

# search with matched (solved)

def my_search(regex, string):
    # try to match at the start of the string, then drop one character and retry
    while string:
        if re.match(regex, string):
            return True
        string = string[1:]
    # the string is now empty: check if the pattern matches the empty string
    return re.match(regex, string) is not None

regex = "[0-9]{2,4}"

# your function should be able to match in all these strings
string1 = "1984 was already some years ago."
print(my_search(regex, string1))
string2 = "There is also a book whose title is '1984', but the story isn't set in the year of 1984."
print(my_search(regex, string2))
string3 = "Sometimes people write '84 for short."
print(my_search(regex, string3))

# your function should also match with this regex and this string
regex = "a*"
string = ""
print(my_search(regex, string))
True
True
True
True

# Count matches with findall (solved)

def count_matches(regex, string):
    return len(re.findall(regex, string))

regex = "wow"

string1 = "wow wow wow" # this should be 3
print(count_matches(regex, string1))
string2 = "wowow" # this should be 1
print(count_matches(regex, string2))
string3 = "wowowow" # this should be 2
print(count_matches(regex, string3))
3
1
2

# Phone numbers v2 (solved)

regex = r"((00|[+])1[ -])?\d{3}-\d{3}-\d{4}"  # write your regex here
matches = [  # you should be able to match those
    "202-555-0181",
    "001 202-555-0181",
    "+1-512-555-0191"
]
non_matches = [  # for now, none of these should be matched
    "202555-0181",
    "96-125-3546",
    "(+1)5125550191"
]
for s in matches:
    print(re.search(regex, s))
for s in non_matches:
    print(re.search(regex, s))
<re.Match object; span=(0, 12), match='202-555-0181'>
<re.Match object; span=(0, 16), match='001 202-555-0181'>
<re.Match object; span=(0, 15), match='+1-512-555-0191'>
None
None
None

# Phone numbers v3 (solved)

For this "problem", one thinks of using the re.findall() function to look for all matches. When we do that, we don't get a list of match objects, but instead a list of tuples, where each tuple holds the groups of our regex for one match. This is the behaviour that is documented for the re.findall() function.

This is fine, because we really only cared about the country code, and we can print it easily. If we wanted the match objects, then the alternative would be to use the re.finditer() function.

paragraph = """Hello, I am Virgilio and I am from Italy.
If phones were a thing when I was alive, my number would've probably been 0039 3123456789.
I would also love to get a house with 3 floors and something like +1 000 square meters.
Now that we are at it, I can also tell you that the number 0039 3135313531 would have suited Leo da Vinci very well...
And come to think of it, someone told me that Socrates had dibs on +30-2111112222"""
# you should find 3 phone numbers
# and you should not be fooled by the other numbers that show up in the text

regex = r"((00|[+])\d{1,3})[ -]\d{8,12}"
ns = re.findall(regex, paragraph)  # find numbers
for n in ns:
    # n is a tuple with the two groups our regex has
    print(n)
    
for n in re.finditer(regex, paragraph):
    print("The number '{}' has country code: {}".format(n.group(), n.group(1)))
('0039', '00')
('0039', '00')
('+30', '+')
The number '0039 3123456789' has country code: 0039
The number '0039 3135313531' has country code: 0039
The number '+30-2111112222' has country code: +30