Lesson 17

Python and Regex (Part 01)

Use regular expressions to extract structured data — like dates — from messy strings.

The first major library we’ll use in this course is re, Python’s built-in module for regular expressions (regex for short). The string methods from Lesson 03 (split, replace, find, etc.) handle exact matches well, but they break down the moment your data has consistent variation: dates with sometimes-zero-padded months, names sometimes followed by titles, numbers sometimes with commas. Regex is the tool for that case.

A regular expression is a small pattern language for describing strings. The same core syntax works, with minor differences between flavors, in Python, in your editor’s find-and-replace, in grep, in JavaScript, in databases, and in nearly every modern programming environment. Learning the syntax once pays dividends across all of them.

The motivating problem

Consider a dataset where dates are inconsistently formatted: 5/3/1989, 11/12/95, 2/14/2001. With string methods alone, handling all of those variants takes a tower of if/elif checking string lengths and contents. With regex, it’s one line.

Importing

Python’s regex library ships with the language. Bring it in at the top of your script:

import re

For most needs re is enough. For Unicode-heavy or very advanced patterns, the third-party regex library (uv add regex) is designed as a backwards-compatible replacement for re with extra features; the core API is the same, so you can usually swap one for the other later.

A first pattern — find all the digits

import re

description = "I have 1 cat, 2 dogs, and 3 birds. There are 5 of each. So 5 + 5 + 5 = 15."

print(re.findall(r"\d", description))
# ['1', '2', '3', '5', '5', '5', '5', '1', '5']

Three important details in that one call:

re.findall(pattern, string) returns every match as a list.
The r prefix turns the pattern into a raw string, which means Python doesn’t pre-process the backslashes before passing the pattern to regex. Always use raw strings for regex patterns. ("\d" happens to work here because Python doesn’t recognize \d as a special escape, although recent Python versions emit a warning about it; "\b", by contrast, would silently become a backspace character. Just use r"..." every time.)
\d is regex for “any single digit.” It matches 1, 2, etc., one digit per match — which is why 15 came back as two separate items.
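The raw-string point above can be checked directly. The following is a minimal sketch; the sample sentence and variable names are made up for illustration.

```python
import re

sentence = "the cat sat near the catalog"  # hypothetical sample text

# Without the r prefix, Python converts each \b into a backspace character.
plain = "\bcat\b"
# With the r prefix, the backslash and the b both reach the regex engine,
# which reads \b as a word boundary.
raw = r"\bcat\b"

lengths = (len(plain), len(raw))
plain_hits = re.findall(plain, sentence)
raw_hits = re.findall(raw, sentence)

print(lengths)     # (5, 7)
print(plain_hits)  # []
print(raw_hits)    # ['cat']
```

The non-raw version fails quietly: it is a valid pattern, it simply searches for backspace characters that are not in the text.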

Quantifiers — match more than one

{m,n} means “between m and n of the preceding thing”:

import re

description = "I have 1 cat, 2 dogs, and 3 birds. There are 5 of each. So 5 + 5 + 5 = 15."

print(re.findall(r"[0-9]{1,2}", description))
# ['1', '2', '3', '5', '5', '5', '5', '15']

[0-9] is a character class — match any character in this set. {1,2} says match it once or twice. Now 15 comes through as a single two-digit number.

The most useful quantifiers:

Quantifier	Meaning
`?`	zero or one
`*`	zero or more
`+`	one or more
`{m,n}`	between m and n
`{m,}`	at least m
`{m}`	exactly m
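As a rough illustration of the table, here is one small pattern per quantifier. The sample strings are invented for this sketch.

```python
import re

sample = "color colour 7 42 2024"  # hypothetical sample text

optional_u = re.findall(r"colou?r", sample)      # ?    : the u appears zero or one time
one_or_more = re.findall(r"[0-9]+", sample)      # +    : runs of one or more digits
exactly_four = re.findall(r"[0-9]{4}", sample)   # {4}  : exactly four digits
at_least_two = re.findall(r"[0-9]{2,}", sample)  # {2,} : two or more digits
zero_or_more = re.findall(r"ba*", "b ba baaa")   # *    : b followed by zero or more a's

print(optional_u)    # ['color', 'colour']
print(one_or_more)   # ['7', '42', '2024']
print(exactly_four)  # ['2024']
print(at_least_two)  # ['42', '2024']
print(zero_or_more)  # ['b', 'ba', 'baaa']
```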

Quantifiers are greedy by default — they match as much as they can. Add a ? after the quantifier (+?, *?, {m,n}?) to make it lazy (match as little as possible). Greedy vs lazy matters whenever you’re matching across delimiters; if a pattern is “eating” too much, that’s the first thing to check.
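A small sketch of the greedy/lazy difference, using a made-up snippet of markup as the delimiter case. The dot means "any character except newline".

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # hypothetical sample text

# Greedy: .+ grabs as much as possible, so the match runs from the
# first "<" all the way to the last ">".
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the first ">" it can, so each tag matches separately.
lazy = re.findall(r"<.+?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```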

A pattern for dates

Now let’s tackle the motivating example:

import re

text = "The event happened on 5/3/1989 and again on 11/12/95."
pattern = r"[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}"
print(re.findall(pattern, text))
# ['5/3/1989', '11/12/95']

Read it left to right:

[0-9]{1,2} — a one- or two-digit number (month)
/ — a literal slash
[0-9]{1,2} — another one- or two-digit number (day)
/ — another slash
[0-9]{2,4} — a two- to four-digit number (year)

Both date shapes match cleanly. That tolerance for variation is exactly what regex is good for.
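One caveat worth seeing in code: the pattern checks the shape of a date, not whether the date exists. The sketch below runs it over the dates from the motivating problem plus two invented strings.

```python
import re

pattern = r"[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}"

# The first three come from the motivating problem; the last two are
# made-up test cases for this sketch.
samples = ["5/3/1989", "11/12/95", "2/14/2001", "99/99/9999", "March 5, 1989"]

matched = {s: re.findall(pattern, s) for s in samples}
for s, hits in matched.items():
    print(s, "->", hits)
# 5/3/1989 -> ['5/3/1989']
# 11/12/95 -> ['11/12/95']
# 2/14/2001 -> ['2/14/2001']
# 99/99/9999 -> ['99/99/9999']
# March 5, 1989 -> []
```

The impossible date 99/99/9999 matches because it has the right shape, and the written-out date is missed because it does not. Checking that a match is a real calendar date is a separate step.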

The character classes you’ll use most

Class	Matches	ASCII equivalent
`\d`	a digit	`[0-9]`
`\D`	a non-digit	`[^0-9]`
`\w`	a word character	`[A-Za-z0-9_]`
`\W`	a non-word character
`\s`	whitespace (space, tab, newline)
`\S`	non-whitespace
`.`	any character except newline
`[abc]`	`a`, `b`, or `c`
`[^abc]`	anything except `a`, `b`, or `c`
`[A-Z]`	any uppercase letter
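To see a few of these classes at work, here is a short sketch on an invented line of text.

```python
import re

line = "Order #42: 3 red_apples, 10 Pears!"  # hypothetical sample text

digits = re.findall(r"\d+", line)            # runs of digits
words = re.findall(r"\w+", line)             # runs of word characters (note the underscore)
capitalized = re.findall(r"[A-Z]\w*", line)  # an uppercase letter, then any word characters
punctuation = re.findall(r"[^\w\s]", line)   # anything that is neither word nor whitespace

print(digits)       # ['42', '3', '10']
print(words)        # ['Order', '42', '3', 'red_apples', '10', 'Pears']
print(capitalized)  # ['Order', 'Pears']
print(punctuation)  # ['#', ':', ',', '!']
```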

A note on Unicode: \d and \w in Python’s re do match Unicode digits and word characters by default (so \d matches Arabic-Indic numerals, \w matches accented letters). That’s the right default for humanities work, but it can surprise you. If you need ASCII-only matching, pass flags=re.ASCII.
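A quick way to see the Unicode default and the re.ASCII switch side by side. The sample text is made up and written with escape sequences so it survives copy and paste.

```python
import re

# "caf\u00e9" is "café"; "\u0663\u0664" is 34 written in Arabic-Indic digits.
text = "caf\u00e9 \u0663\u0664 and 12"

unicode_digits = re.findall(r"\d+", text)
ascii_digits = re.findall(r"\d+", text, flags=re.ASCII)
unicode_words = re.findall(r"\w+", text)
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_digits)  # two matches: the Arabic-Indic number and '12'
print(ascii_digits)    # ['12']
print(unicode_words)   # four matches, including the full word "café"
print(ascii_words)     # ['caf', 'and', '12']
```

With re.ASCII the accented letter is no longer a word character, so "café" is cut off at "caf".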

Anchors and groups — a quick taste

Two more pieces show up in nearly every real pattern.

Anchors pin a match to a position in the string:

Anchor	Meaning
`^`	start of the string (or of a line, with `re.MULTILINE`)
`$`	end of the string (or of a line, with `re.MULTILINE`)
`\b`	word boundary

import re

print(re.findall(r"^\w+", "Once upon a time"))    # ['Once']
print(re.findall(r"\bthe\b", "the other theory")) # ['the']  — not 'theory'
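The re.MULTILINE flag mentioned in the table changes what ^ and $ mean. A minimal sketch on an invented three-line string:

```python
import re

poem = "Roses are red\nViolets are blue\nSugar is sweet"  # hypothetical sample text

# By default, ^ matches only at the very start of the whole string.
first_only = re.findall(r"^\w+", poem)

# With re.MULTILINE, ^ and $ also match at the start and end of every line.
each_line_start = re.findall(r"^\w+", poem, flags=re.MULTILINE)
each_line_end = re.findall(r"\w+$", poem, flags=re.MULTILINE)

print(first_only)       # ['Roses']
print(each_line_start)  # ['Roses', 'Violets', 'Sugar']
print(each_line_end)    # ['red', 'blue', 'sweet']
```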

Groups wrap part of a pattern in parentheses to capture it:

import re

text = "5/3/1989 and 11/12/95"
print(re.findall(r"([0-9]{1,2})/([0-9]{1,2})/([0-9]{2,4})", text))
# [('5', '3', '1989'), ('11', '12', '95')]

When the pattern has groups, findall returns tuples of the captured pieces — month, day, year — instead of the whole match. That’s how you parse data, not just extract it.
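As a sketch of what parsing can look like, the captured tuples can be unpacked into named fields and converted to integers. The dictionary layout is just one possible choice, and two-digit years are left as they are.

```python
import re

text = "5/3/1989 and 11/12/95"
pattern = r"([0-9]{1,2})/([0-9]{1,2})/([0-9]{2,4})"

records = []
for month, day, year in re.findall(pattern, text):
    # Each captured piece is a string; convert it to int for later use.
    records.append({"month": int(month), "day": int(day), "year": int(year)})

print(records)
# [{'month': 5, 'day': 3, 'year': 1989}, {'month': 11, 'day': 12, 'year': 95}]
```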

Functions in `re` you’ll actually use

Function	Returns	Use it when
`re.findall(p, s)`	list of all matches	you want every hit
`re.search(p, s)`	the first `Match` (or `None`)	you want one and want detail
`re.match(p, s)`	a `Match` only if `p` matches at the start of `s`	rare; mostly use `search`
`re.fullmatch(p, s)`	a `Match` only if `p` matches the whole `s`	validation
`re.sub(p, repl, s)`	a new string with replacements	find-and-replace
`re.split(p, s)`	a list, splitting on each match	smarter than `str.split`
`re.compile(p)`	a reusable `Pattern` object	when you’ll use `p` often
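The sketch below exercises several functions from the table on invented sample data (the phone numbers and the comma/semicolon list are hypothetical).

```python
import re

text = "Call 555-1234 or 555-9876 today"  # hypothetical sample text
phone = re.compile(r"[0-9]{3}-[0-9]{4}")  # compile once, reuse below

# search: the first match anywhere, with detail available on the Match object.
first = phone.search(text)
first_text, first_span = first.group(), first.span()

# match: succeeds only at the start of the string, so this is None.
at_start = phone.match(text)

# fullmatch: the whole string must fit the pattern, which suits validation.
valid = phone.fullmatch("555-1234") is not None
invalid = phone.fullmatch("555-1234 ext 9") is not None

# split: break on a comma or semicolon plus any whitespace that follows.
parts = re.split(r"[,;]\s*", "a, b;c,  d")

print(first_text, first_span)  # 555-1234 (5, 13)
print(at_start)                # None
print(valid, invalid)          # True False
print(parts)                   # ['a', 'b', 'c', 'd']
```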

re.sub is especially useful for cleanup. To collapse runs of whitespace into single spaces:

import re

print(re.sub(r"\s+", " ", "foo    bar\n\nbaz").strip())
# foo bar baz

To redact dates while keeping the surrounding text intact:

import re

text = "The event happened on 5/3/1989 and again on 11/12/95."
print(re.sub(r"[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}", "[date]", text))
# The event happened on [date] and again on [date].
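re.sub can also reuse captured groups in the replacement string, written as \1, \2, and so on. As a sketch, here is one way to reorder each date as year-month-day; the target format is an arbitrary choice for illustration.

```python
import re

text = "The event happened on 5/3/1989 and again on 11/12/95."
pattern = r"([0-9]{1,2})/([0-9]{1,2})/([0-9]{2,4})"

# \3, \1 and \2 refer to the third, first and second captured groups.
reordered = re.sub(pattern, r"\3-\1-\2", text)

print(reordered)
# The event happened on 1989-5-3 and again on 95-11-12.
```

The replacement string is a raw string for the same reason the pattern is: without the r prefix, Python would interpret \1 before re ever saw it.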

Where to go to learn more

Regex is easy to start with and hard to master. Two resources that pay off:

regex101.com — paste a string, type a pattern, see matches highlight in real time. Set the flavor to “Python” in the left sidebar. This is the single best way to learn.
The re module documentation — the language reference for everything Python supports.

Once you’re comfortable with the basics, continue to Lesson 18: Python and Regex (Part 02), where we apply regex to a real text file.

Running the code

re ships with Python, so there’s nothing to install. Save any snippet from this lesson to a file — say try.py — and run it from your project folder:

uv run try.py

uv run uses the project’s Python automatically; no virtualenv to activate. If you haven’t set the project up yet, Lesson 01 walks through it.