Lesson 17
Python and Regex (Part 01)
Use regular expressions to extract structured data — like dates — from messy strings.
The first major library we’ll use in this course is re, Python’s module for regular expressions (regex). The string methods from Lesson 03 (split, replace, find, etc.) handle exact matches well, but they break down the moment your data has consistent variation: dates with sometimes-zero-padded months, names sometimes followed by titles, numbers sometimes with commas. Regex is the tool for that case.
A regular expression is a small pattern language for describing strings. The core syntax works, with minor dialect differences, in Python, in your editor’s find-and-replace, in grep, in JavaScript, in databases, and in basically every modern programming environment. Learning the syntax once pays dividends across all of them.
The motivating problem
Consider a dataset where dates are inconsistently formatted: 5/3/1989, 11/12/95, 2/14/2001. With string methods alone, handling all of those variants takes a tower of if/elif checking string lengths and contents. With regex, it’s one line.
Importing
Python’s regex library ships with the language. Bring it in at the top of your script:
import re
For most needs re is enough. For Unicode-heavy or very advanced patterns, the third-party regex library (uv add regex) is designed as a largely backwards-compatible replacement with extra features, so most code written for re keeps working after you swap the import.
A first pattern — find all the digits
import re
description = "I have 1 cat, 2 dogs, and 3 birds. There are 5 of each. So 5 + 5 + 5 = 15."
print(re.findall(r"\d", description))
# ['1', '2', '3', '5', '5', '5', '1', '5']
Three details in that one call are worth noting:
- re.findall(pattern, string) returns every match as a list.
- The r prefix turns the pattern into a raw string, which means Python doesn’t pre-process the backslashes before passing the pattern to regex. Always use raw strings for regex patterns. ("\d" happens to work here because Python doesn’t recognize \d as a special escape, although newer Python versions warn about it, but "\b" would silently become a backspace character. Just use r"..." every time.)
- \d is regex for “any single digit.” It matches 1, 2, etc., one digit per match, which is why 15 came back as two separate items.
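To see the raw-string point in action, here is a small sketch comparing the two spellings of a word-boundary pattern. The variable names and the sample text are made up for illustration.

```python
import re

plain = "\bcat\b"   # Python converts each \b into a backspace character (\x08)
raw = r"\bcat\b"    # backslashes survive, so regex sees two word boundaries

text = "cat concat cat"

plain_hits = re.findall(plain, text)
raw_hits = re.findall(raw, text)

print(len(plain), len(raw))  # 5 7
print(plain_hits)            # []
print(raw_hits)              # ['cat', 'cat']
```

The non-raw version fails silently: it is a valid pattern, it just looks for backspace characters that the text doesn’t contain.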
Quantifiers — match more than one
{m,n} means “between m and n of the preceding thing”:
import re
description = "I have 1 cat, 2 dogs, and 3 birds. There are 5 of each. So 5 + 5 + 5 = 15."
print(re.findall(r"[0-9]{1,2}", description))
# ['1', '2', '3', '5', '5', '5', '15']
[0-9] is a character class — match any character in this set. {1,2} says match it once or twice. Now 15 comes through as a single two-digit number.
The most useful quantifiers:
| Quantifier | Meaning |
|---|---|
| ? | zero or one |
| * | zero or more |
| + | one or more |
| {m,n} | between m and n |
| {m,} | at least m |
| {m} | exactly m |
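A quick way to get a feel for the table is to run several quantifiers against the same text. This is only a sketch; the sample string (including the deliberately misspelled colouur) is invented for the demonstration.

```python
import re

samples = "color colour colouur"

optional = re.findall(r"colou?r", samples)       # ? allows zero or one u
any_count = re.findall(r"colou*r", samples)      # * allows zero or more u
at_least_one = re.findall(r"colou+r", samples)   # + requires one or more u
exactly_two = re.findall(r"colou{2}r", samples)  # {2} requires exactly two u

print(optional)      # ['color', 'colour']
print(any_count)     # ['color', 'colour', 'colouur']
print(at_least_one)  # ['colour', 'colouur']
print(exactly_two)   # ['colouur']
```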
Quantifiers are greedy by default — they match as much as they can. Add a ? after the quantifier (+?, *?, {m,n}?) to make it lazy (match as little as possible). Greedy vs lazy matters whenever you’re matching across delimiters; if a pattern is “eating” too much, that’s the first thing to check.
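The difference is easiest to see on text with repeated delimiters. A minimal sketch, using a made-up snippet of HTML-like text (regex is not a general HTML parser; this only illustrates greediness):

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the end of the string, then backs up to the last >
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the first > it can reach
lazy = re.findall(r"<.+?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```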
A pattern for dates
Now let’s tackle the motivating example:
import re
text = "The event happened on 5/3/1989 and again on 11/12/95."
pattern = r"[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}"
print(re.findall(pattern, text))
# ['5/3/1989', '11/12/95']
Read it left to right:
- [0-9]{1,2} matches a one- or two-digit number (month)
- / matches a literal slash
- [0-9]{1,2} matches another one- or two-digit number (day)
- / matches another slash
- [0-9]{2,4} matches a two- to four-digit number (year)
Both date shapes match cleanly. That tolerance for variation is exactly what regex is good for.
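It helps to remember that this pattern checks the shape of the text, not whether the date exists on a calendar. The sketch below runs the same pattern over a few sample strings; the list of samples is invented for illustration.

```python
import re

pattern = r"[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}"

samples = ["5/3/1989", "11/12/95", "2/14/2001", "99/99/999", "1989-05-03"]

# Map each sample to whatever the pattern finds in it
results = {s: re.findall(pattern, s) for s in samples}

for sample, found in results.items():
    print(sample, "->", found)
# 5/3/1989 -> ['5/3/1989']
# 11/12/95 -> ['11/12/95']
# 2/14/2001 -> ['2/14/2001']
# 99/99/999 -> ['99/99/999']
# 1989-05-03 -> []
```

The impossible date 99/99/999 matches because it has the right shape, while a real date written with hyphens does not. If you need calendar validity, check the extracted pieces in a separate step.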
The character classes you’ll use most
| Class | Matches | ASCII equivalent |
|---|---|---|
| \d | a digit | [0-9] |
| \D | a non-digit | [^0-9] |
| \w | a word character | [A-Za-z0-9_] |
| \W | a non-word character | [^A-Za-z0-9_] |
| \s | whitespace (space, tab, newline, and similar) | [ \t\n\r\f\v] |
| \S | non-whitespace | [^ \t\n\r\f\v] |
| . | any character except newline | |
| [abc] | a, b, or c | |
| [^abc] | anything except a, b, or c | |
| [A-Z] | any uppercase ASCII letter | |
A note on Unicode: \d and \w in Python’s re do match Unicode digits and word characters by default (so \d matches Arabic-Indic numerals, \w matches accented letters). That’s the right default for humanities work, but it can surprise you. If you need ASCII-only matching, pass flags=re.ASCII.
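A short sketch of the difference the flag makes. The sample strings are made up; the Arabic-Indic digits and the accented letter are written as escapes so the snippet survives copy and paste.

```python
import re

# "Room 42 / " followed by the Arabic-Indic digits four and two
text = "Room 42 / \u0664\u0662"
word = "caf\u00e9"  # café, with a precomposed e-acute

unicode_digits = re.findall(r"\d+", text)
ascii_digits = re.findall(r"\d+", text, flags=re.ASCII)

unicode_words = re.findall(r"\w+", word)
ascii_words = re.findall(r"\w+", word, flags=re.ASCII)

print(unicode_digits)  # ['42', '٤٢']
print(ascii_digits)    # ['42']
print(unicode_words)   # ['café']
print(ascii_words)     # ['caf']
```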
Anchors and groups — a quick taste
Two more pieces show up in nearly every real pattern.
Anchors pin a match to a position in the string:
| Anchor | Meaning |
|---|---|
| ^ | start of the string (or of a line, with re.MULTILINE) |
| $ | end of the string (or of a line, with re.MULTILINE) |
| \b | word boundary |
import re
print(re.findall(r"^\w+", "Once upon a time")) # ['Once']
print(re.findall(r"\bthe\b", "the other theory")) # ['the'] — not 'theory'
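The re.MULTILINE flag mentioned in the table changes what ^ and $ mean. A minimal sketch on a made-up three-line string:

```python
import re

poem = "Once upon a time\nThere was a regex\nIt matched"

# Without the flag, ^ only matches at the very start of the string
first_only = re.findall(r"^\w+", poem)

# With the flag, ^ and $ also match at the start and end of each line
every_line = re.findall(r"^\w+", poem, flags=re.MULTILINE)
line_ends = re.findall(r"\w+$", poem, flags=re.MULTILINE)

print(first_only)  # ['Once']
print(every_line)  # ['Once', 'There', 'It']
print(line_ends)   # ['time', 'regex', 'matched']
```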
Groups wrap part of a pattern in parentheses to capture it:
import re
text = "5/3/1989 and 11/12/95"
print(re.findall(r"([0-9]{1,2})/([0-9]{1,2})/([0-9]{2,4})", text))
# [('5', '3', '1989'), ('11', '12', '95')]
When the pattern has more than one group, findall returns tuples of the captured pieces (month, day, year) instead of the whole match; with a single group, it returns just that group’s text. That’s how you parse data, not just extract it.
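Once the pieces arrive as tuples, turning them into structured records takes only a short loop. This is one possible sketch; the dictionary keys and the parsed variable are illustrative choices, not a required layout.

```python
import re

text = "5/3/1989 and 11/12/95"
pattern = r"([0-9]{1,2})/([0-9]{1,2})/([0-9]{2,4})"

parsed = []
for month, day, year in re.findall(pattern, text):
    # Captured groups are always strings, so convert them explicitly
    parsed.append({"month": int(month), "day": int(day), "year": int(year)})

print(parsed)
# [{'month': 5, 'day': 3, 'year': 1989}, {'month': 11, 'day': 12, 'year': 95}]
```

Note that the two-digit year stays 95. Deciding which century it belongs to is a judgment about your data, not something the pattern can answer.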
Functions in re you’ll actually use
| Function | Returns | Use it when |
|---|---|---|
| re.findall(p, s) | list of all matches | you want every hit |
| re.search(p, s) | the first Match (or None) | you want one and want detail |
| re.match(p, s) | a Match only if s starts with p | rare; mostly use search |
| re.fullmatch(p, s) | a Match only if p matches the whole s | validation |
| re.sub(p, repl, s) | a new string with replacements | find-and-replace |
| re.split(p, s) | a list, splitting on each match | smarter than str.split |
| re.compile(p) | a reusable Pattern object | when you’ll use p often |
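Here is a brief sketch that exercises several of these functions on the date pattern from earlier. The sample strings and variable names are made up for illustration.

```python
import re

# Compile once, then reuse the Pattern object's own methods
date = re.compile(r"[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}")
text = "Filed 5/3/1989; amended 11/12/95."

first = date.search(text)        # first match anywhere in the text
first_text = first.group()       # the matched text
first_span = first.span()        # (start, end) positions of the match

at_start = date.match(text)      # None: the text starts with "Filed", not a date

whole = date.fullmatch("11/12/95") is not None         # entire string is a date
not_whole = date.fullmatch("on 11/12/95") is not None  # extra text, so no match

parts = re.split(r"[;,]\s*", "a, b;c,  d")  # split on commas or semicolons

print(first_text, first_span)  # 5/3/1989 (6, 14)
print(at_start)                # None
print(whole, not_whole)        # True False
print(parts)                   # ['a', 'b', 'c', 'd']
```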
re.sub is especially useful for cleanup. To collapse runs of whitespace into single spaces:
import re
print(re.sub(r"\s+", " ", "foo   bar\n\nbaz").strip())
# foo bar baz
To redact dates while keeping the surrounding text intact:
import re
text = "The event happened on 5/3/1989 and again on 11/12/95."
print(re.sub(r"[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}", "[date]", text))
# 'The event happened on [date] and again on [date].'
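re.sub can also reuse captured groups in the replacement, either through backreferences like \1 or by passing a function that receives each Match. The sketch below rearranges the dates into year-month-day order; the pad_date helper is a hypothetical name introduced here for illustration.

```python
import re

text = "The event happened on 5/3/1989 and again on 11/12/95."
pattern = r"([0-9]{1,2})/([0-9]{1,2})/([0-9]{2,4})"

# \1, \2, \3 refer to the first, second, and third captured groups
reordered = re.sub(pattern, r"\3-\1-\2", text)


def pad_date(match):
    """Rebuild one matched date as year-month-day with zero-padded parts."""
    month, day, year = match.groups()
    return f"{year}-{int(month):02d}-{int(day):02d}"


# A function as the replacement is called once per match
padded = re.sub(pattern, pad_date, text)

print(reordered)  # The event happened on 1989-5-3 and again on 95-11-12.
print(padded)     # The event happened on 1989-05-03 and again on 95-11-12.
```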
Where to go to learn more
Regex is easy to start with and hard to master. Two resources that pay off:
- regex101.com — paste a string, type a pattern, see matches highlight in real time. Set the flavor to “Python” in the left sidebar. This is the single best way to learn.
- The re module documentation: the language reference for everything Python supports.
Once you’re comfortable with the basics, continue to Lesson 18: Python and Regex (Part 02), where we apply regex to a real text file.
Running the code
re ships with Python, so there’s nothing to install. Save any snippet from this lesson to a file — say try.py — and run it from your project folder:
uv run try.py
uv run uses the project’s Python automatically; no virtualenv to activate. If you haven’t set the project up yet, Lesson 01 walks through it.