Lesson 25

Python and the Requests Module

Use the requests library to fetch raw HTML from the web — and learn the small set of habits that make scraping reliable and polite.

This lesson introduces requests — the canonical Python library for fetching things from the web. Once you can fetch, the next lesson parses; together they’re the foundation of every scraping project.

requests is friendly. The basic case is three lines. The interesting parts are the small set of habits that turn a fragile script into a reliable one: error handling, identifying yourself, retries, and rate limiting.

(There’s also httpx, a modern alternative with the same API plus async support. For this course requests is fine; if you ever need to hit thousands of URLs concurrently, httpx is what to look at.)

Installing

uv add requests

Basic usage

import requests

url = "https://en.wikipedia.org/wiki/Theuderic_IV"
response = requests.get(url)
html = response.text
print(html)

What each step does:

  • import requests loads the library.
  • requests.get(url) sends an HTTP GET request and waits for the response.
  • response.text pulls the response body out as a string. This is the same HTML you’d see with View Source in your browser.

The response object has more than just .text:

  • response.text: the body as a decoded string
  • response.content: the body as raw bytes (use this for images, PDFs, binary files)
  • response.json(): the body parsed as JSON, if the response is JSON
  • response.status_code: the HTTP status (200 OK, 404 Not Found, etc.)
  • response.headers: a dictionary of response headers
  • response.url: the final URL after any redirects
  • response.encoding: the encoding requests inferred for .text
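
A quick way to get a feel for these is to print a few of them for the same page. A minimal sketch; the values in the comments are what Wikipedia typically returns, not guarantees:

import requests

url = "https://en.wikipedia.org/wiki/Theuderic_IV"
response = requests.get(url)

print(response.status_code)              # 200 if the page was found
print(response.url)                      # final URL after any redirects
print(response.encoding)                 # e.g. "UTF-8", inferred from the Content-Type header
print(response.headers["Content-Type"])  # e.g. "text/html; charset=UTF-8"

# response.json() only makes sense when the body really is JSON (an API
# endpoint, say); calling it on an HTML page like this raises an exception.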

Always check that the request worked

A failed request still returns a Response object — requests.get(...) doesn’t raise an exception when the page is missing. You have to check.

The terse way:

import requests

url = "https://en.wikipedia.org/wiki/Theuderic_IV"
response = requests.get(url)
response.raise_for_status()
html = response.text
print(html)

raise_for_status() raises an HTTPError for any 4xx or 5xx status. If the request succeeded, it does nothing. This is the line of defense that turns “the script silently scraped 404 pages for two hours” into a clean failure.
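
If one missing page shouldn’t stop a whole run, you can catch the HTTPError that raise_for_status() raises and move on. A minimal sketch, with a deliberately made-up page title so the request comes back as an error:

import requests

url = "https://en.wikipedia.org/wiki/This_article_should_not_exist_12345"   # made-up title
try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    print(len(response.text))
except requests.HTTPError as e:
    print(f"skipping {url}: {e}")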

The explicit way, when you want to handle errors yourself:

import requests

url = "https://en.wikipedia.org/wiki/Theuderic_IV"
response = requests.get(url)
if response.status_code != 200:
    print(f"Got {response.status_code} for {url}")
else:
    html = response.text
    print(html)

Identifying yourself with a User-Agent

By default requests sends a User-Agent like python-requests/2.32.0. Many servers either block that or treat it differently. Set a polite, descriptive header:

import requests

url = "https://en.wikipedia.org/wiki/Theuderic_IV"
headers = {
    "User-Agent": "DH-research-bot/1.0 (your-email@example.org)",
}
response = requests.get(url, headers=headers)
print(response.status_code)

The format isn’t strict. The norm is: name of your tool, version, and a contact (email or URL) so a server admin can reach you if there’s a problem.

Sending parameters

To hit a URL like https://example.org/search?q=voltaire&year=1750, don’t build the query string by hand. Pass params:

import requests

response = requests.get(
    "https://example.org/search",
    params={"q": "voltaire", "year": 1750},
)
print(response.url)

requests URL-encodes everything correctly, including spaces and Unicode.
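
For instance, spaces and accented characters in a parameter come back percent-encoded in the final URL. A small sketch (example.org is a placeholder here, as above, and the comment shows roughly what to expect):

import requests

response = requests.get(
    "https://example.org/search",
    params={"q": "émilie du châtelet", "year": 1750},
)
print(response.url)
# roughly: https://example.org/search?q=%C3%A9milie+du+ch%C3%A2telet&year=1750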

Timeouts — always set one

By default requests will wait forever for a server to respond. In a long scraping loop, one slow server can hang the whole script. Always set a timeout:

import requests

url = "https://en.wikipedia.org/wiki/Theuderic_IV"
response = requests.get(url, timeout=10)   # 10 seconds
print(response.status_code)

You’ll catch a requests.exceptions.Timeout if the server doesn’t answer in time, which is a clean signal to skip and move on.
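
A minimal sketch of that skip-and-move-on pattern, assuming a list of URLs to work through:

import requests

urls = ["https://en.wikipedia.org/wiki/Theuderic_IV"]   # stand-in list

for url in urls:
    try:
        response = requests.get(url, timeout=10)
    except requests.exceptions.Timeout:
        print(f"timed out after 10s, skipping: {url}")
        continue
    print(url, response.status_code)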

Polite scraping — sleep between requests

When you scrape many pages from the same host, slow yourself down:

import time
import requests

urls = ["https://en.wikipedia.org/wiki/Theuderic_IV"]
headers = {"User-Agent": "DH-research-bot/1.0"}

def save(text):
    print(len(text))

for url in urls:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    save(response.text)
    time.sleep(1)            # one request per second

A second per request is a sensible default for most sites. For a slow or important server (a small institutional archive, say), make it five seconds. Some sites publish a recommended rate; respect it if so.

Retries with exponential backoff

For a long scrape, network errors are normal. Retry the failed request a few times, with increasing wait between attempts:

import time
import requests

def fetch(url: str, headers: dict, attempts: int = 4) -> str:
    for i in range(attempts):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException as e:
            wait = 2 ** i
            print(f"  attempt {i + 1} failed ({e}); sleeping {wait}s")
            time.sleep(wait)
    raise RuntimeError(f"gave up on {url}")

html = fetch("https://en.wikipedia.org/wiki/Theuderic_IV", {"User-Agent": "DH-research-bot/1.0"})
print(len(html))

2 ** i doubles the wait each time: 1, 2, 4, 8 seconds. After four tries the script gives up rather than hanging forever.

Sessions — when you’ll fetch many URLs from one host

For a long scrape, create a Session once and reuse it. It keeps the underlying TCP connection open and reapplies your headers automatically:

import time
import requests

urls = ["https://en.wikipedia.org/wiki/Theuderic_IV"]

def save(text):
    print(len(text))

session = requests.Session()
session.headers.update({"User-Agent": "DH-research-bot/1.0"})

for url in urls:
    response = session.get(url, timeout=10)
    response.raise_for_status()
    save(response.text)
    time.sleep(1)

A session is faster (no reconnecting between requests) and tidier (headers in one place).

Saving the HTML

For any scrape that takes more than a minute, save the raw HTML to disk before parsing. Then if your parser changes, you don’t have to refetch:

import requests
from pathlib import Path

session = requests.Session()
session.headers.update({"User-Agent": "DH-research-bot/1.0"})

cache = Path("html-cache")
cache.mkdir(exist_ok=True)

def fetch_cached(url: str, name: str) -> str:
    f = cache / f"{name}.html"
    if f.exists():
        return f.read_text(encoding="utf-8")
    response = session.get(url, timeout=10)
    response.raise_for_status()
    f.write_text(response.text, encoding="utf-8")
    return response.text

html = fetch_cached("https://en.wikipedia.org/wiki/Theuderic_IV", "theuderic")
print(len(html))

You hit the network once per page; every later run reads the saved file from disk, so re-running your parser is effectively instant. This is the single biggest quality-of-life upgrade for any scraper.

What requests can’t do

requests retrieves bytes. It doesn’t parse HTML, doesn’t run JavaScript, doesn’t navigate forms. For parsing, that’s BeautifulSoup, which is the next lesson. For JavaScript-rendered pages, it’s Playwright or Selenium — out of scope for this course but worth knowing about.

Try grabbing HTML from a few sites yourself. Look at response.status_code, response.encoding, and the first hundred characters of response.text. When that feels routine, continue to Lesson 26: Python and BeautifulSoup.
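
A starting point for that experiment, using the same page as before (swap in any URL you’re curious about):

import requests

url = "https://en.wikipedia.org/wiki/Theuderic_IV"
headers = {"User-Agent": "DH-research-bot/1.0 (your-email@example.org)"}

response = requests.get(url, headers=headers, timeout=10)
print(response.status_code)
print(response.encoding)
print(response.text[:100])   # the first hundred characters of the HTML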

Running the code

This lesson uses requests. Add it to your project once:

uv add requests

Then save the snippet to a file — say fetch.py — and run it from your project folder:

uv run fetch.py

uv run uses the project’s Python and dependencies automatically; no virtualenv to activate. If you haven’t set the project up yet, Lesson 01 walks through it.