Lesson 26
Python and BeautifulSoup
Parse HTML with BeautifulSoup and pull out the tags, text, and attributes you actually want.
requests fetches HTML; BeautifulSoup parses it. Together they’re the standard toolkit for scraping.
A BeautifulSoup object is a navigable tree built from HTML. You query it the way you’d query the page in devtools — by tag, by class, by id, by CSS selector. You get back tag objects you can read text from or drill further into.
Installing
uv add beautifulsoup4 lxml
The Python module is named bs4. The lxml package is an optional faster parser; once installed, you can pass "lxml" instead of "html.parser" and BeautifulSoup will use it. For small scripts the difference doesn’t matter; for thousands of pages, lxml is several times faster.
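A minimal sketch of the swap, on a throwaway HTML string (the string and variable names are just illustration):
from bs4 import BeautifulSoup
html = "<p>Hello <em>world</em></p>"
soup_std = BeautifulSoup(html, "html.parser")  # stdlib parser, no extra install
soup_fast = BeautifulSoup(html, "lxml")        # same call, lxml does the parsing
print(soup_std.p.text)   # Hello world
print(soup_fast.p.text)  # Hello world; identical tree for clean HTML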
Building a soup
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Theuderic_IV"
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")
print(soup.title)
soup now represents the parsed page. Everything else in this lesson is methods on soup (or on tags inside it).
Finding tags — find and find_all
The two methods that do most of the work:
- soup.find(tag, attrs) — returns the first matching tag, or None.
- soup.find_all(tag, attrs) — returns a list of every matching tag.
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Theuderic_IV"
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1")
print(title.text)
for link in soup.find_all("a"):
    print(link.get("href"))
Two safe defaults you should adopt:
- tag.text gives you the visible text inside the tag (and its descendants), with HTML stripped. Use tag.get_text(strip=True) to also strip leading/trailing whitespace.
- tag.get("href") reads an attribute, returning None if it's missing. Safer than tag["href"], which raises KeyError.
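A quick illustration of the second point, on a throwaway tag with no href (hypothetical markup, not from the Wikipedia page):
from bs4 import BeautifulSoup
link = BeautifulSoup('<a>no href here</a>', "html.parser").find("a")
print(link.get("href"))  # None: the attribute is simply missing
# link["href"] would raise KeyError on the same tag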
Filtering by class, id, or attribute
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Theuderic_IV"
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")
print(soup.find("div", class_="biography"))
print(soup.find("div", id="main-content"))
print(soup.find_all("p", class_="lede"))
# more general — any attribute
print(soup.find_all("a", attrs={"data-track": "biography-link"}))
Note the trailing underscore on class_. class is a Python keyword, so BeautifulSoup uses class_ to avoid the conflict.
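If the underscore reads oddly to you, the attrs form from the last line expresses the same query without it. A sketch, reusing this lesson's hypothetical biography class:
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Theuderic_IV"
soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
# equivalent to soup.find_all("div", class_="biography")
print(soup.find_all("div", attrs={"class": "biography"}))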
CSS selectors — often the cleanest option
soup.select(selector) accepts the same selectors you’d use in CSS or in devtools’ “Copy selector”:
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Theuderic_IV"
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")
print(soup.select("div.biography h2"))
print(soup.select("a[href^='/king/']"))
print(soup.select_one("div#main-content > p"))
For nested queries (one element inside another), CSS selectors are usually shorter than chained find calls. Pick the form that reads better.
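For instance, these two queries target the same element. A sketch reusing the hypothetical div.biography markup from above, not Wikipedia's real structure:
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Theuderic_IV"
soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
# chained find calls: two steps, plus a None guard
box = soup.find("div", class_="biography")
h2 = box.find("h2") if box else None
# the same query as one CSS selector
h2_css = soup.select_one("div.biography h2")
print(h2, h2_css)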
Walking from a parent — the typical scraping shape
The standard scraping pattern: locate a container, then drill into it for fields.
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Theuderic_IV"
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")
records = []
for box in soup.select("div.biography"):
    name = box.select_one(".biography-name").get_text(strip=True)
    dates = box.select_one(".biography-dates").get_text(strip=True)
    link = box.select_one("a.biography-link")
    href = link.get("href") if link else None
    records.append({"name": name, "dates": dates, "href": href})
print(records)
A few habits in that snippet worth copying:
- Loop over the containers, then select_one inside each. This guarantees every field belongs to the same record. If you instead select all names and all dates separately and zip them, one missing element shifts the whole rest of your data by one.
- Guard against missing optional fields with if link else None. Real HTML is never as clean as the example.
- Build a list of dictionaries. Easy to dump to JSON (see the sketch below), write to a spreadsheet, or pass to pd.DataFrame() for analysis.
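Continuing from the records list built above, the JSON dump needs only the standard library (the filename is arbitrary):
import json
with open("biographies.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)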
Getting just the text — get_text vs .text
These look interchangeable but get_text has options that frequently save you a second pass:
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Theuderic_IV"
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("h1")
print(tag.get_text()) # default: just like .text
print(tag.get_text(strip=True)) # strip leading/trailing whitespace
print(tag.get_text(separator=" ", strip=True)) # join inner tags with a space
The third form is what you almost always want for paragraphs that contain inline tags like <em> or <a>. Without separator=" ", the words on either side of those inline tags get jammed together.
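A minimal demonstration, on a deliberately whitespace-free snippet made up for illustration:
from bs4 import BeautifulSoup
p = BeautifulSoup("<p>died<em>in</em>737</p>", "html.parser").p
print(p.get_text())                           # diedin737 (words jammed)
print(p.get_text(separator=" ", strip=True))  # died in 737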
Navigating — children, siblings, parents
You can also walk the tree relative to a tag you already have:
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Theuderic_IV"
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")
header = soup.find("h2", class_="biography-name")
print(header.parent) # the enclosing element
print(header.find_next("p")) # the next <p> after this header
print(header.find_previous_sibling("h2"))  # the previous <h2> at the same nesting level
print(header.find_all_next("p", limit=3))  # up to the next three <p> tags after it
These are useful when classes are unhelpful and the only structure is positional (“the paragraph right after this heading”).
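A sketch of that positional pattern: pair each <h2> on the page with the first paragraph that follows it, no classes needed (the 60-character truncation is arbitrary):
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Theuderic_IV"
soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
for h2 in soup.find_all("h2"):
    para = h2.find_next("p")  # first <p> after this heading in document order
    if para:
        print(h2.get_text(strip=True), "->", para.get_text(" ", strip=True)[:60])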
A complete tiny scraper
import time
import requests
from bs4 import BeautifulSoup
URLS = ["https://en.wikipedia.org/wiki/Theuderic_IV"]
session = requests.Session()
session.headers.update({"User-Agent": "DH-research-bot/1.0"})
records = []
for url in URLS:
    html = session.get(url, timeout=10).text
    soup = BeautifulSoup(html, "lxml")
    title = soup.select_one("h1").get_text(strip=True)
    paragraphs = [p.get_text(" ", strip=True) for p in soup.select("p")]
    records.append({
        "url": url,
        "title": title,
        "intro": paragraphs[0] if paragraphs else "",
    })
    time.sleep(1)
print(f"scraped {len(records)} pages")
That’s the whole shape: session, fetch, parse, extract, append, sleep. Once it works for one URL, it works for thousands.
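One defensive tweak worth making before scaling up (a suggestion, not part of the original loop): check the status code so a 404 page doesn't silently become a record:
for url in URLS:
    resp = session.get(url, timeout=10)
    resp.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    html = resp.text
    # ... parse, extract, append, sleep as above ...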
What BeautifulSoup can’t do
- Run JavaScript. If the page is rendered client-side, BeautifulSoup sees the empty shell. Either find the JSON API the page is calling, or use Playwright/Selenium to render first.
- Fix invalid HTML perfectly. Most pages parse cleanly with html.parser or lxml. Wildly broken pages may require trying multiple parsers and picking the cleanest tree.
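A sketch of that try-several-parsers comparison; html5lib is a third optional parser (uv add html5lib), the slowest but most forgiving:
from bs4 import BeautifulSoup
html = "<p>badly <b>nested <i>markup</b></i>"  # deliberately broken
for parser in ("html.parser", "lxml", "html5lib"):
    soup = BeautifulSoup(html, parser)
    print(parser, soup.prettify())  # eyeball which tree recovered the most sense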
Once you have data in hand, you’ll want to save it. We start that in Lesson 28: Storing Data in Text Files.
Running the code
This lesson uses requests, beautifulsoup4, and lxml. Add them to your project once:
uv add requests beautifulsoup4 lxml
Then save the snippet to a file — say scrape.py — and run it from your project folder:
uv run scrape.py
uv run uses the project’s Python and dependencies automatically; no virtualenv to activate. If you haven’t set the project up yet, Lesson 01 walks through it.