Lesson 17

Python and Regex (Part 01)

Use regular expressions to extract structured data — like dates — from messy strings.

The first major library we’ll use in this course is re, Python’s module for regular expressions (regex for short). The string methods from Lesson 03 (split, replace, find, etc.) handle exact matches well, but they break down the moment your data has consistent variation: dates with sometimes-zero-padded months, names sometimes followed by titles, numbers sometimes with commas. Regex is the tool for that case.

A regular expression is a small pattern language for describing strings. The core syntax carries over, with minor dialect differences, to Python, your editor’s find-and-replace, grep, JavaScript, databases, and basically every modern programming environment. Learning the syntax once pays dividends across all of them.

The motivating problem

Consider a dataset where dates are inconsistently formatted: 5/3/1989, 11/12/95, 2/14/2001. With string methods alone, handling all of those variants takes a tower of if/elif checking string lengths and contents. With regex, it’s one line.

Importing

Python’s regex library ships with the language. Bring it in at the top of your script:

import re

For most needs re is enough. For Unicode-heavy or very advanced patterns, the third-party regex library (uv add regex) is designed to be backwards-compatible with re while adding extra features, so you can usually swap one for the other later with little or no change to your code.

A first pattern — find all the digits

import re

description = "I have 1 cat, 2 dogs, and 3 birds. There are 5 of each. So 5 + 5 + 5 = 15."

print(re.findall(r"\d", description))
# ['1', '2', '3', '5', '5', '5', '5', '1', '5']

Three important details in that one call:

  • re.findall(pattern, string) returns every match as a list.
  • The r prefix turns the pattern into a raw string, which means Python doesn’t pre-process the backslashes before passing the pattern to regex. Always use raw strings for regex patterns. ("\d" happens to work here because Python doesn’t recognize \d as a special escape, though recent Python versions warn about it, whereas "\b" would silently become a backspace character. Just use r"..." every time.)
  • \d is regex for “any single digit.” It matches 1, 2, etc., one digit per match — which is why 15 came back as two separate items.
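The raw-string point is easy to check for yourself. Here is a minimal sketch using only the standard library; the variable names and the sample text are made up for illustration:

```python
import re

# Without the r prefix, Python turns \b into a single backspace character
# before the regex engine ever sees the pattern.
cooked = "\bcat\b"
raw = r"\bcat\b"

print(len(cooked))  # 5: backspace, c, a, t, backspace
print(len(raw))     # 7: backslash, b, c, a, t, backslash, b

text = "cat concat cats"
cooked_hits = re.findall(cooked, text)  # searches for literal backspaces
raw_hits = re.findall(raw, text)        # searches for the whole word "cat"
print(cooked_hits)  # []
print(raw_hits)     # ['cat']
```

The non-raw version raises no error. It simply finds nothing, which is why this mistake is so easy to miss.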

Quantifiers — match more than one

{m,n} means “between m and n of the preceding thing”:

import re

description = "I have 1 cat, 2 dogs, and 3 birds. There are 5 of each. So 5 + 5 + 5 = 15."

print(re.findall(r"[0-9]{1,2}", description))
# ['1', '2', '3', '5', '5', '5', '5', '15']

[0-9] is a character class — match any character in this set. {1,2} says match it once or twice. Now 15 comes through as a single two-digit number.

The most useful quantifiers:

Quantifier  Meaning
?           zero or one
*           zero or more
+           one or more
{m,n}       between m and n
{m,}        at least m
{m}         exactly m
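To see how these quantifiers differ, it can help to run several of them against the same string. A small sketch, where the sample string and variable names are invented for the demonstration:

```python
import re

sample = "a ab abb"

# Each pattern is the letter a followed by a quantified b.
optional_b = re.findall(r"ab?", sample)         # zero or one b
any_b = re.findall(r"ab*", sample)              # zero or more b
some_b = re.findall(r"ab+", sample)             # one or more b
exactly_two_b = re.findall(r"ab{2}", sample)    # exactly two b
at_least_one_b = re.findall(r"ab{1,}", sample)  # same meaning as +

print(optional_b)      # ['a', 'ab', 'ab']
print(any_b)           # ['a', 'ab', 'abb']
print(some_b)          # ['ab', 'abb']
print(exactly_two_b)   # ['abb']
print(at_least_one_b)  # ['ab', 'abb']
```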

Quantifiers are greedy by default — they match as much as they can. Add a ? after the quantifier (+?, *?, {m,n}?) to make it lazy (match as little as possible). Greedy vs lazy matters whenever you’re matching across delimiters; if a pattern is “eating” too much, that’s the first thing to check.
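Quoted text is a typical place where greedy matching eats too much. The sketch below uses a made-up sentence and compares the two behaviors:

```python
import re

text = 'He said "yes" and then "no".'

# Greedy: .+ runs to the LAST closing quote it can find.
greedy = re.findall(r'".+"', text)

# Lazy: .+? stops at the FIRST closing quote.
lazy = re.findall(r'".+?"', text)

print(greedy)  # ['"yes" and then "no"']
print(lazy)    # ['"yes"', '"no"']
```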

A pattern for dates

Now let’s tackle the motivating example:

import re

text = "The event happened on 5/3/1989 and again on 11/12/95."
pattern = r"[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}"
print(re.findall(pattern, text))
# ['5/3/1989', '11/12/95']

Read it left to right:

  • [0-9]{1,2} — a one- or two-digit number (month)
  • / — a literal slash
  • [0-9]{1,2} — another one- or two-digit number (day)
  • / — another slash
  • [0-9]{2,4} — a two- to four-digit number (year)

Both date shapes match cleanly. That tolerance for variation is exactly what regex is good for.
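That tolerance cuts both ways: the same pattern will also accept a three-digit year or pick a date-like piece out of a longer number. One possible way to tighten it, shown here as a sketch rather than the definitive date pattern, is to add word boundaries (\b) and allow only two- or four-digit years. The sample text and the name strict are invented for this example, and (?:...|...) is a grouping-with-alternatives syntax this lesson has not covered yet:

```python
import re

text = "Born 2/14/2001, moved 11/12/95; invoice 123/45/67890; typo 5/3/198."

loose = r"[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}"
# \b forbids a match that starts or ends in the middle of a run of digits;
# (?:[0-9]{4}|[0-9]{2}) accepts a four-digit or a two-digit year, nothing else.
strict = r"\b[0-9]{1,2}/[0-9]{1,2}/(?:[0-9]{4}|[0-9]{2})\b"

loose_hits = re.findall(loose, text)
strict_hits = re.findall(strict, text)

print(loose_hits)   # ['2/14/2001', '11/12/95', '23/45/6789', '5/3/198']
print(strict_hits)  # ['2/14/2001', '11/12/95']
```

Neither version checks that the month is 1 to 12 or that the day exists. That kind of validation is usually easier to do in ordinary Python after the match.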

The character classes you’ll use most

Class    Matches                            Equivalent
\d       a digit                            [0-9]
\D       a non-digit                        [^0-9]
\w       a word character                   [A-Za-z0-9_]
\W       a non-word character
\s       whitespace (space, tab, newline)
\S       non-whitespace
.        any character except newline
[abc]    a, b, or c
[^abc]   anything except a, b, or c
[A-Z]    any uppercase letter
A note on Unicode: \d and \w in Python’s re do match Unicode digits and word characters by default (so \d matches Arabic-Indic numerals, \w matches accented letters). That’s the right default for humanities work, but it can surprise you. If you need ASCII-only matching, pass flags=re.ASCII.
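The difference is easy to see side by side. In this sketch the sample string is invented, and the non-ASCII characters are written as escapes and printed through ascii() so the output looks the same on any terminal:

```python
import re

# "caf\u00e9" ends in an accented e; "\u0663" is the Arabic-Indic digit three.
text = "caf\u00e9 \u0663 3"

unicode_digits = re.findall(r"\d", text)
ascii_digits = re.findall(r"\d", text, flags=re.ASCII)
unicode_words = re.findall(r"\w+", text)
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(ascii(unicode_digits))  # ['\u0663', '3']
print(ascii(ascii_digits))    # ['3']
print(ascii(unicode_words))   # ['caf\xe9', '\u0663', '3']
print(ascii(ascii_words))     # ['caf', '3']
```

Note how re.ASCII cuts the accented word short instead of skipping it, so use it only when you are sure your data really is ASCII.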

Anchors and groups — a quick taste

Two more pieces show up in nearly every real pattern.

Anchors pin a match to a position in the string:

Anchor  Meaning
^       start of the string (or of a line, with re.MULTILINE)
$       end of the string (or of a line)
\b      word boundary

import re

print(re.findall(r"^\w+", "Once upon a time"))    # ['Once']
print(re.findall(r"\bthe\b", "the other theory")) # ['the']  — not 'theory'
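The re.MULTILINE behavior of ^ and $ is worth seeing once. A short sketch with an invented three-line string:

```python
import re

poem = "Once upon a time\nThere was a pattern\nThat matched every line"

# By default ^ only matches at the very start of the string.
first_only = re.findall(r"^\w+", poem)

# With re.MULTILINE, ^ and $ also match at the start and end of each line.
each_line_start = re.findall(r"^\w+", poem, flags=re.MULTILINE)
each_line_end = re.findall(r"\w+$", poem, flags=re.MULTILINE)

print(first_only)       # ['Once']
print(each_line_start)  # ['Once', 'There', 'That']
print(each_line_end)    # ['time', 'pattern', 'line']
```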

Groups wrap part of a pattern in parentheses to capture it:

import re

text = "5/3/1989 and 11/12/95"
print(re.findall(r"([0-9]{1,2})/([0-9]{1,2})/([0-9]{2,4})", text))
# [('5', '3', '1989'), ('11', '12', '95')]

When the pattern has two or more groups, findall returns tuples of the captured pieces (month, day, year) instead of the whole match; with exactly one group, it returns a plain list of that group’s strings. That’s how you parse data, not just extract it.
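Here is a short sketch of what that parsing step might look like. It shows the one-group case alongside the several-group case, then turns the captured strings into integers; the variable names are illustrative:

```python
import re

text = "5/3/1989 and 11/12/95"

# One group: findall returns plain strings holding just that group.
years = re.findall(r"[0-9]{1,2}/[0-9]{1,2}/([0-9]{2,4})", text)
print(years)  # ['1989', '95']

# Several groups: findall returns tuples, which unpack neatly.
pattern = r"([0-9]{1,2})/([0-9]{1,2})/([0-9]{2,4})"
parsed = [(int(m), int(d), int(y)) for m, d, y in re.findall(pattern, text)]
print(parsed)  # [(5, 3, 1989), (11, 12, 95)]
```

Deciding whether 95 means 1995 is left to you; the regex only hands back what was written.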

Functions in re you’ll actually use

Function            Returns                                Use it when
re.findall(p, s)    list of all matches                    you want every hit
re.search(p, s)     the first Match (or None)              you want one and want detail
re.match(p, s)      a Match only if s starts with p        rare; mostly use search
re.fullmatch(p, s)  a Match only if p matches the whole s  validation
re.sub(p, repl, s)  a new string with replacements         find-and-replace
re.split(p, s)      a list, splitting on each match        smarter than str.split
re.compile(p)       a reusable Pattern object              when you’ll use p often
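A compact tour of several of these functions, reusing the date pattern from earlier in the lesson. This is a sketch; the variable names and the fruit string are made up for illustration:

```python
import re

# compile once, then call the same methods on the Pattern object
date = re.compile(r"[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}")
text = "The event happened on 5/3/1989 and again on 11/12/95."

first = date.search(text)            # first match anywhere in the string
print(first.group(), first.span())   # 5/3/1989 (22, 30)

at_start = date.match(text)          # None: the text does not START with a date
print(at_start)                      # None

valid = date.fullmatch("5/3/1989")   # the whole string must be a date
invalid = date.fullmatch("5/3/1989!")
print(valid is not None, invalid is not None)  # True False

# split on a comma or semicolon plus any spaces that follow it
fruits = re.split(r"[,;]\s*", "apples, pears;plums,  figs")
print(fruits)  # ['apples', 'pears', 'plums', 'figs']
```

Because search, match, and fullmatch return None when nothing matches, check the result before calling .group() on it.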

re.sub is especially useful for cleanup. To collapse runs of whitespace into single spaces:

import re

print(re.sub(r"\s+", " ", "foo    bar\n\nbaz").strip())
# 'foo bar baz'

To redact dates while keeping the surrounding text intact:

import re

text = "The event happened on 5/3/1989 and again on 11/12/95."
print(re.sub(r"[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}", "[date]", text))
# 'The event happened on [date] and again on [date].'
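The replacement passed to re.sub does not have to be a fixed string. It can refer back to captured groups (\1, \2, ...) or be a function that receives each Match and returns the new text. A sketch of both, where the helper name pad is hypothetical and chosen just for this example:

```python
import re

text = "The event happened on 5/3/1989 and again on 11/12/95."
pattern = r"([0-9]{1,2})/([0-9]{1,2})/([0-9]{2,4})"

# Backreferences: \3, \1, \2 insert the year, month, and day groups.
reordered = re.sub(pattern, r"\3-\1-\2", text)
print(reordered)
# The event happened on 1989-5-3 and again on 95-11-12.

def pad(match):
    """Zero-pad the month and day; leave the year as written."""
    month, day, year = match.groups()
    return f"{int(month):02d}/{int(day):02d}/{year}"

# A function as the replacement is called once per match.
padded = re.sub(pattern, pad, text)
print(padded)
# The event happened on 05/03/1989 and again on 11/12/95.
```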

Where to go to learn more

Regex is easy to start with and hard to master. Two resources that pay off:

  • regex101.com — paste a string, type a pattern, see matches highlight in real time. Set the flavor to “Python” in the left sidebar. This is the single best way to learn.
  • The re module documentation — the language reference for everything Python supports.

Once you’re comfortable with the basics, continue to Lesson 18: Python and Regex (Part 02), where we apply regex to a real text file.

Running the code

re ships with Python, so there’s nothing to install. Save any snippet from this lesson to a file — say try.py — and run it from your project folder:

uv run try.py

uv run uses the project’s Python automatically; no virtualenv to activate. If you haven’t set the project up yet, Lesson 01 walks through it.