Lesson 23
Capstone — Analyzing a Real CSV Dataset
End-to-end pandas workflow on a small but realistic dataset of Enlightenment correspondence — load, clean, filter, group, summarize, and export.
You’ve met pandas in three lessons: a DataFrame is a table; you can filter, group, and clean it; you can write the result back out. This capstone does all of that in a single, end-to-end workflow on a realistic-looking dataset of letters from Enlightenment writers.
The pattern you walk through here is the basic shape of every quantitative DH project. You’ll do it dozens of times in your career; the value of a capstone is to feel the whole thing in one sitting.
The dataset
A small but realistic letters dataset — the kind a digital humanist might assemble from an archive. Real datasets are bigger; the moves are identical.
import pandas as pd
raw = """sender,recipient,year,place,word_count
Voltaire,Frederick II,1750,Berlin,1240
Voltaire ,Madame du Deffand,1759,Geneva,890
Voltaire,Madame du Deffand,1760,Ferney,
voltaire,Frederick II,1751,Berlin,1500
Émilie du Châtelet,Voltaire,1740,Cirey,420
Émilie du Châtelet,Pierre-Louis Maupertuis,1741,Cirey,610
Diderot,Voltaire,1762,Paris,n/a
Diderot,Sophie Volland,1759,Paris,2150
Rousseau,Voltaire,1760,Geneva,3100
Rousseau,Madame d'Épinay,1757,Paris,?
Hume,Adam Smith,1759,London,720
Hume,Rousseau,1766,London,1400
"""
with open("letters.csv", "w", encoding="utf-8") as f:
    f.write(raw)
print("wrote letters.csv")
Even at 12 rows it has every problem real data has: stray spaces, inconsistent capitalisation, mixed missing-value markers, a column that came in as text because of those markers.
Step 1 — Load and look
import pandas as pd
raw = """sender,recipient,year,place,word_count
Voltaire,Frederick II,1750,Berlin,1240
Voltaire ,Madame du Deffand,1759,Geneva,890
Voltaire,Madame du Deffand,1760,Ferney,
voltaire,Frederick II,1751,Berlin,1500
Émilie du Châtelet,Voltaire,1740,Cirey,420
Émilie du Châtelet,Pierre-Louis Maupertuis,1741,Cirey,610
Diderot,Voltaire,1762,Paris,n/a
Diderot,Sophie Volland,1759,Paris,2150
Rousseau,Voltaire,1760,Geneva,3100
Rousseau,Madame d'Épinay,1757,Paris,?
Hume,Adam Smith,1759,London,720
Hume,Rousseau,1766,London,1400
"""
with open("letters.csv", "w", encoding="utf-8") as f:
    f.write(raw)
df = pd.read_csv("letters.csv", na_values=["n/a", "?"])
print(df.shape)
print(df.head())
print(df.dtypes)
print(df.isna().sum())
That’s the four-line health check from Lesson 22. Already we can see: 12 rows, 5 columns, word_count is now numeric (float, because NaN needs a float), and three word counts are missing.
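If you forget na_values and the column arrives as text anyway, pd.to_numeric with errors="coerce" is the standard repair. A minimal sketch, on invented example values rather than the full dataset:

```python
import pandas as pd

# word_count read as text because of mixed markers like "n/a" and "?"
df = pd.DataFrame({"word_count": ["1240", "n/a", "?", "890"]})
# errors="coerce" turns anything that won't parse as a number into NaN
df["word_count"] = pd.to_numeric(df["word_count"], errors="coerce")
print(df["word_count"])
```

Either route gets you the same place: a numeric column with honest NaNs instead of a text column hiding them.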
Step 2 — Clean the senders
The sender column has whitespace and capitalisation issues. .str.strip() and .str.title() handle both:
import pandas as pd
raw = """sender,recipient,year,place,word_count
Voltaire,Frederick II,1750,Berlin,1240
Voltaire ,Madame du Deffand,1759,Geneva,890
voltaire,Frederick II,1751,Berlin,1500
Rousseau,Voltaire,1760,Geneva,3100
Rousseau,Madame d'Épinay,1757,Paris,720
"""
with open("letters.csv", "w", encoding="utf-8") as f:
    f.write(raw)
df = pd.read_csv("letters.csv")
df["sender"] = df["sender"].str.strip().str.title()
print(df["sender"].value_counts())
Now Voltaire, Voltaire (trailing space), and voltaire all collapse to a single value. The first time you do this on real data, you’ll watch your row counts change and feel a real “ohhh, that’s why my numbers were off” moment.
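One caveat worth knowing: .str.title() capitalises every word, so a name like "Émilie du Châtelet" becomes "Émilie Du Châtelet". A sketch of the problem, on invented example values, with an explicit replacement map as the safer fix for names with lowercase particles:

```python
import pandas as pd

senders = pd.Series(["Voltaire ", "voltaire", "Émilie du Châtelet"])
# .title() capitalises every word, including particles like "du"
print(senders.str.strip().str.title().tolist())
# An explicit mapping only touches the variants you've actually seen:
fixes = {"Voltaire ": "Voltaire", "voltaire": "Voltaire"}
print(senders.replace(fixes).tolist())
```

The mapping approach is more work but never surprises you; for this dataset, where only Voltaire's name varies, it is arguably the better choice.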
Step 3 — Drop unusable rows
Word counts of NaN are unhelpful for the analysis we want. Drop them:
import pandas as pd
raw = """sender,recipient,year,place,word_count
Voltaire,Frederick II,1750,Berlin,1240
Voltaire,Madame du Deffand,1759,Geneva,890
Voltaire,Madame du Deffand,1760,Ferney,
Diderot,Voltaire,1762,Paris,n/a
Hume,Adam Smith,1759,London,720
"""
with open("letters.csv", "w", encoding="utf-8") as f:
    f.write(raw)
df = pd.read_csv("letters.csv", na_values=["n/a", "?"])
before = len(df)
df = df.dropna(subset=["word_count"])
print(f"dropped {before - len(df)} row(s); {len(df)} remaining")
Be honest in your write-up that you did this. “We dropped three letters with missing word counts” is exactly the kind of methodological detail that makes a paper trustworthy.
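It also helps to look at what you are about to drop before dropping it, so the write-up can say exactly which letters were excluded. A minimal sketch, on a toy frame with one missing word count:

```python
import pandas as pd

df = pd.DataFrame({
    "sender": ["Voltaire", "Diderot", "Hume"],
    "word_count": [1240.0, None, 720.0],
})
# Inspect the rows you're about to lose before calling dropna
dropped = df[df["word_count"].isna()]
print(dropped)
df = df.dropna(subset=["word_count"])
```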
Step 4 — Quick filtering
Boolean masks (Lesson 21) for “Voltaire’s letters from the 1750s”:
import pandas as pd
df = pd.DataFrame({
    "sender": ["Voltaire", "Voltaire", "Voltaire", "Diderot"],
    "year": [1750, 1759, 1762, 1759],
    "word_count": [1240, 890, 1500, 2150],
})
voltaire_50s = df[(df["sender"] == "Voltaire") & (df["year"].between(1750, 1759))]
print(voltaire_50s)
Two conditions, joined with &, each in parentheses. The pattern you’ll use a thousand times.
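The same filter can also be written with .query(), which some readers find easier to scan once conditions pile up. A sketch on the same toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "sender": ["Voltaire", "Voltaire", "Voltaire", "Diderot"],
    "year": [1750, 1759, 1762, 1759],
    "word_count": [1240, 890, 1500, 2150],
})
# query takes the conditions as a string; chained comparisons are allowed
voltaire_50s = df.query("sender == 'Voltaire' and 1750 <= year <= 1759")
print(voltaire_50s)
```

Both forms return the same rows; the boolean-mask version is the one you will see in most pandas code, so learn it first.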
Step 5 — Group by sender
The “letters per sender” question is one line:
import pandas as pd
df = pd.DataFrame({
    "sender": ["Voltaire", "Voltaire", "Voltaire", "Diderot", "Diderot", "Hume"],
    "word_count": [1240, 890, 1500, 2150, 1100, 720],
})
per_sender = df.groupby("sender").agg(
    n_letters=("word_count", "count"),
    total_words=("word_count", "sum"),
    avg_words=("word_count", "mean"),
).sort_values("n_letters", ascending=False)
print(per_sender)
Three statistics per sender, in a single chained expression. This is the pandas idiom worth memorising — it’s the answer to most “summarise by category” questions.
Step 6 — Group by place and year
You can group by more than one column:
import pandas as pd
df = pd.DataFrame({
    "place": ["Paris", "Paris", "Geneva", "Geneva", "London"],
    "year": [1759, 1759, 1760, 1760, 1759],
    "word_count": [890, 2150, 3100, 1500, 720],
})
per_place_year = df.groupby(["place", "year"])["word_count"].sum().reset_index()
print(per_place_year)
The reset_index() flattens the result back into a regular DataFrame instead of leaving it indexed by the group keys. That’s almost always what you want when you’re going to keep working with the result.
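A related move worth knowing: .unstack() pivots the inner group key into columns, turning the long table into a place-by-year grid. A sketch on the same toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "place": ["Paris", "Paris", "Geneva", "Geneva", "London"],
    "year": [1759, 1759, 1760, 1760, 1759],
    "word_count": [890, 2150, 3100, 1500, 720],
})
# unstack moves the inner key (year) to columns; fill_value plugs the
# (place, year) combinations that have no letters
wide = df.groupby(["place", "year"])["word_count"].sum().unstack(fill_value=0)
print(wide)
```

The long reset_index form is better for further processing; the wide form is better for reading at a glance.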
Step 7 — A small “report”
Pulling it all together, a short script that produces a one-page summary:
import pandas as pd
raw = """sender,recipient,year,place,word_count
Voltaire,Frederick II,1750,Berlin,1240
Voltaire ,Madame du Deffand,1759,Geneva,890
Voltaire,Madame du Deffand,1760,Ferney,
voltaire,Frederick II,1751,Berlin,1500
Émilie du Châtelet,Voltaire,1740,Cirey,420
Émilie du Châtelet,Pierre-Louis Maupertuis,1741,Cirey,610
Diderot,Voltaire,1762,Paris,n/a
Diderot,Sophie Volland,1759,Paris,2150
Rousseau,Voltaire,1760,Geneva,3100
Rousseau,Madame d'Épinay,1757,Paris,?
Hume,Adam Smith,1759,London,720
Hume,Rousseau,1766,London,1400
"""
with open("letters.csv", "w", encoding="utf-8") as f:
    f.write(raw)
# 1. Load and clean.
df = pd.read_csv("letters.csv", na_values=["n/a", "?"])
df["sender"] = df["sender"].str.strip().str.title()
df = df.dropna(subset=["word_count"])
# 2. Headline numbers.
print(f"{len(df)} letters")
print(f"Mean word count: {df['word_count'].mean():.0f}")
print(f"Total words: {df['word_count'].sum():.0f}")
# 3. Top senders.
print("\nLetters per sender:")
print(df.groupby("sender").size().sort_values(ascending=False))
# 4. Letters per decade.
df["decade"] = (df["year"] // 10) * 10
print("\nLetters per decade:")
print(df.groupby("decade").size())
# 5. Save the cleaned data.
df.to_csv("letters_clean.csv", index=False)
print("\nwrote letters_clean.csv")
Five numbered steps, under twenty lines of working code. That’s a respectable first cut at a quantitative analysis. The final to_csv is the export step from Lesson 22 — if anyone asks “what data did you use?”, you have a clean file you can hand over.
What this is the start of
Everything more sophisticated you’ll do with tabular DH data builds on the same shape:
- Network analysis. Group by `(sender, recipient)` and you have an edge list ready for NetworkX.
- Time-series. A column of years and a column of counts is exactly what you need to plot frequencies over time.
- Comparative studies. Filter to two authors, run the same summary on each, compare.
The cleaning + grouping + exporting loop you just walked is the loop. New libraries replace the analysis step with something fancier; the rest stays.
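The edge list for network analysis, for instance, really is one groupby away. A sketch on invented rows:

```python
import pandas as pd

df = pd.DataFrame({
    "sender": ["Voltaire", "Voltaire", "Voltaire", "Hume"],
    "recipient": ["Frederick II", "Frederick II", "Madame du Deffand", "Rousseau"],
})
# One row per (sender, recipient) pair, with a weight column counting
# letters - the shape NetworkX's from_pandas_edgelist expects
edges = df.groupby(["sender", "recipient"]).size().reset_index(name="weight")
print(edges)
```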
Try it yourself
- Plot letters per decade with `df.groupby("decade").size().plot.bar()`. (Pandas has plotting built in via matplotlib.)
- Build a `correspondents` DataFrame from `df["sender"].unique()` and `df["recipient"].unique()` joined together — the union of every name that appears.
- Try a pivot table: `pd.pivot_table(df, index="sender", columns="decade", values="word_count", aggfunc="sum")`. What does it tell you that the grouped views didn’t?
Where to next
Part 6 takes you onto the open web. You’ll learn how to read HTML, fetch pages with requests, and parse them with BeautifulSoup — the trio behind most of the DH datasets you’ll ever build yourself.
Continue to Lesson 24: Finding HTML Code.
Running the code
This lesson uses pandas. Add it to your project once:
uv add pandas
Save any snippet to a file — say letters.py — and run it from your project folder:
uv run letters.py
uv run uses the project’s Python and dependencies automatically; no virtualenv to activate. If you haven’t set the project up yet, Lesson 01 walks through it.