Lesson 26
Python and BeautifulSoup
Parse HTML with BeautifulSoup and pull out the tags, text, and attributes you actually want.
requests fetches HTML; BeautifulSoup parses it. Together they’re the standard toolkit for scraping.
A BeautifulSoup object is a navigable tree built from HTML. You query it the way you’d query the page in devtools — by tag, by class, by id, by CSS selector. You get back tag objects you can read text from or drill further into.
Installing
uv add beautifulsoup4 lxml
The Python module is named bs4. The lxml package is an optional faster parser; once installed, you can pass "lxml" instead of "html.parser" and BeautifulSoup will use it. For small scripts the difference doesn’t matter; for thousands of pages, lxml is several times faster.
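A minimal sketch of the swap, on a throwaway HTML string (the string and variable names are just illustration):
from bs4 import BeautifulSoup
html = "<p>Hello <em>world</em></p>"
soup_std = BeautifulSoup(html, "html.parser")  # stdlib parser, no extra install
soup_fast = BeautifulSoup(html, "lxml")        # same call, lxml does the parsing
print(soup_std.p.text)   # Hello world
print(soup_fast.p.text)  # Hello world; identical tree for clean HTML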
Building a soup
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Theuderic_IV"
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")
print(soup.title)
soup now represents the parsed page. Everything else in this lesson is methods on soup (or on tags inside it).
Finding tags — find and find_all
The two methods that do most of the work:
- soup.find(tag, attrs) — returns the first matching tag, or None.
- soup.find_all(tag, attrs) — returns a list of every matching tag.
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Theuderic_IV"
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1")
print(title.text)
for link in soup.find_all("a"):
    print(link.get("href"))
Two safe defaults you should adopt:
- tag.text gives you the visible text inside the tag (and its descendants), with HTML stripped. Use tag.get_text(strip=True) to also strip leading/trailing whitespace.
- tag.get("href") reads an attribute, returning None if it's missing. Safer than tag["href"], which raises KeyError.
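A quick illustration of the second point, on a throwaway tag with no href (hypothetical markup, not from the Wikipedia page):
from bs4 import BeautifulSoup
link = BeautifulSoup('<a>no href here</a>', "html.parser").find("a")
print(link.get("href"))  # None: the attribute is simply missing
# link["href"] would raise KeyError on the same tag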
Filtering by class, id, or attribute
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Theuderic_IV"
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")
print(soup.find("div", class_="biography"))
print(soup.find("div", id="main-content"))
print(soup.find_all("p", class_="lede"))
# more general — any attribute
print(soup.find_all("a", attrs={"data-track": "biography-link"}))
Note the trailing underscore on class_. class is a Python keyword, so BeautifulSoup uses class_ to avoid the conflict.
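If the underscore reads oddly to you, the attrs form from the last line expresses the same query without it. A sketch, reusing this lesson's hypothetical biography class:
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Theuderic_IV"
soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
# equivalent to soup.find_all("div", class_="biography")
print(soup.find_all("div", attrs={"class": "biography"}))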
CSS selectors — often the cleanest option
soup.select(selector) accepts the same selectors you’d use in CSS or in devtools’ “Copy selector”:
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Theuderic_IV"
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")
print(soup.select("div.biography h2"))
print(soup.select("a[href^='/king/']"))
print(soup.select_one("div#main-content > p"))
For nested queries (one element inside another), CSS selectors are usually shorter than chained find calls. Pick the form that reads better.
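For instance, these two queries target the same element. A sketch reusing the hypothetical div.biography markup from above, not Wikipedia's real structure:
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Theuderic_IV"
soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
# chained find calls: two steps, plus a None guard
box = soup.find("div", class_="biography")
h2 = box.find("h2") if box else None
# the same query as one CSS selector
h2_css = soup.select_one("div.biography h2")
print(h2, h2_css)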
Walking from a parent — the typical scraping shape
The standard scraping pattern: locate a container, then drill into it for fields.
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Theuderic_IV"
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")
records = []
for box in soup.select("div.biography"):
    name = box.select_one(".biography-name").get_text(strip=True)
    dates = box.select_one(".biography-dates").get_text(strip=True)
    link = box.select_one("a.biography-link")
    href = link.get("href") if link else None
    records.append({"name": name, "dates": dates, "href": href})
print(records)
A few habits in that snippet worth copying:
- Loop over the containers, then select_one inside each. This guarantees every field belongs to the same record. If you instead select all names and all dates separately and zip them, one missing element shifts the whole rest of your data by one.
- Guard against missing optional fields with if link else None. Real HTML is never as clean as the example.
- Build a list of dictionaries. Easy to dump to JSON (see the sketch below), write to a spreadsheet, or pass to pd.DataFrame() for analysis.
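Continuing from the records list built above, the JSON dump needs only the standard library (the filename is arbitrary):
import json
with open("biographies.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)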
Getting just the text — get_text vs .text
These look interchangeable but get_text has options that frequently save you a second pass:
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Theuderic_IV"
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("h1")
print(tag.get_text()) # default: just like .text
print(tag.get_text(strip=True)) # strip leading/trailing whitespace
print(tag.get_text(separator=" ", strip=True)) # join inner tags with a space
The third form is what you almost always want for paragraphs that contain inline tags like <em> or <a>. Without separator=" ", the words on either side of those inline tags get jammed together.
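A minimal demonstration, on a deliberately whitespace-free snippet made up for illustration:
from bs4 import BeautifulSoup
p = BeautifulSoup("<p>died<em>in</em>737</p>", "html.parser").p
print(p.get_text())                           # diedin737 (words jammed)
print(p.get_text(separator=" ", strip=True))  # died in 737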
Navigating — children, siblings, parents
You can also walk the tree relative to a tag you already have:
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Theuderic_IV"
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")
header = soup.find("h2", class_="biography-name")
print(header.parent) # the enclosing element
print(header.find_next("p")) # the next <p> after this header
print(header.find_previous_sibling("h2"))  # the previous <h2> at the same nesting level
print(header.find_all_next("p", limit=3))  # up to the next three <p> tags after it
These are useful when classes are unhelpful and the only structure is positional (“the paragraph right after this heading”).
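A sketch of that positional pattern: pair each <h2> on the page with the first paragraph that follows it, no classes needed (the 60-character truncation is arbitrary):
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Theuderic_IV"
soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
for h2 in soup.find_all("h2"):
    para = h2.find_next("p")  # first <p> after this heading in document order
    if para:
        print(h2.get_text(strip=True), "->", para.get_text(" ", strip=True)[:60])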
A complete tiny scraper
import time
import requests
from bs4 import BeautifulSoup
URLS = ["https://en.wikipedia.org/wiki/Theuderic_IV"]
session = requests.Session()
session.headers.update({"User-Agent": "DH-research-bot/1.0"})
records = []
for url in URLS:
    html = session.get(url, timeout=10).text
    soup = BeautifulSoup(html, "lxml")
    title = soup.select_one("h1").get_text(strip=True)
    paragraphs = [p.get_text(" ", strip=True) for p in soup.select("p")]
    records.append({
        "url": url,
        "title": title,
        "intro": paragraphs[0] if paragraphs else "",
    })
    time.sleep(1)
print(f"scraped {len(records)} pages")
That’s the whole shape: session, fetch, parse, extract, append, sleep. Once it works for one URL, it works for thousands.
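One defensive tweak worth making before scaling up (a suggestion, not part of the original loop): check the status code so a 404 page doesn't silently become a record:
for url in URLS:
    resp = session.get(url, timeout=10)
    resp.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    html = resp.text
    # ... parse, extract, append, sleep as above ...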
What BeautifulSoup can’t do
- Run JavaScript. If the page is rendered client-side, BeautifulSoup sees the empty shell. Either find the JSON API the page is calling, or use Playwright/Selenium to render first.
- Fix invalid HTML perfectly. Most pages parse cleanly with html.parser or lxml. Wildly broken pages may require trying multiple parsers and picking the cleanest tree.
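A sketch of that try-several-parsers comparison; html5lib is a third optional parser (uv add html5lib), the slowest but most forgiving:
from bs4 import BeautifulSoup
html = "<p>badly <b>nested <i>markup</b></i>"  # deliberately broken
for parser in ("html.parser", "lxml", "html5lib"):
    soup = BeautifulSoup(html, parser)
    print(parser, soup.prettify())  # eyeball which tree recovered the most sense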
Once you have data in hand, you’ll want to save it. We start that in Lesson 28: Storing Data in Text Files.
Running the code
This lesson uses requests, beautifulsoup4, and lxml. Add them to your project once:
uv add requests beautifulsoup4 lxml
Then save the snippet to a file — say scrape.py — and run it from your project folder:
uv run scrape.py
uv run uses the project’s Python and dependencies automatically; no virtualenv to activate. If you haven’t set the project up yet, Lesson 01 walks through it.