Lesson 21
Filtering and Querying with Pandas
Pull the rows you want out of a DataFrame — Boolean masks, .query, sorting, and the GroupBy that does most analytics work.
A DataFrame you can’t filter is just a CSV. The whole reason to load a table into pandas is to ask questions of it: which rows match this condition? What’s the count by category? What’s the top ten? This lesson covers the four moves that answer most of those questions: Boolean masks, .query, sort_values, and groupby.
We’ll work with a slightly richer sample throughout the lesson:
import pandas as pd
sample = """name,born,died,letters,country
Voltaire,1694,1778,21000,France
Émilie du Châtelet,1706,1749,430,France
Diderot,1713,1784,3500,France
d'Alembert,1717,1783,1200,France
Hume,1711,1776,800,Scotland
Smith,1723,1790,360,Scotland
Wollstonecraft,1759,1797,500,England
Kant,1724,1804,1100,Germany
"""
with open("writers.csv", "w", encoding="utf-8") as f:
    f.write(sample)
df = pd.read_csv("writers.csv")
print(df)
Every snippet below assumes that file exists. Each one re-creates it so you can run any of them in isolation.
Boolean masks
The most fundamental filter in pandas is a Boolean mask — a Series of True/False values, one per row, that you use to pick which rows survive.
A comparison on a column produces a mask:
import pandas as pd
sample = """name,born,died,letters,country
Voltaire,1694,1778,21000,France
Émilie du Châtelet,1706,1749,430,France
Diderot,1713,1784,3500,France
d'Alembert,1717,1783,1200,France
Hume,1711,1776,800,Scotland
Smith,1723,1790,360,Scotland
Wollstonecraft,1759,1797,500,England
Kant,1724,1804,1100,Germany
"""
with open("writers.csv", "w", encoding="utf-8") as f:
    f.write(sample)
df = pd.read_csv("writers.csv")
mask = df["letters"] > 1000
print(mask)
That mask, dropped back into df[...], returns only the rows where it was True:
import pandas as pd
sample = """name,born,died,letters,country
Voltaire,1694,1778,21000,France
Émilie du Châtelet,1706,1749,430,France
Diderot,1713,1784,3500,France
d'Alembert,1717,1783,1200,France
Hume,1711,1776,800,Scotland
Smith,1723,1790,360,Scotland
Wollstonecraft,1759,1797,500,England
Kant,1724,1804,1100,Germany
"""
with open("writers.csv", "w", encoding="utf-8") as f:
    f.write(sample)
df = pd.read_csv("writers.csv")
prolific = df[df["letters"] > 1000]
print(prolific)
Combine masks with & (and), | (or), ~ (not). Each comparison must be wrapped in parentheses — Python’s operator precedence will bite you otherwise:
import pandas as pd
sample = """name,born,died,letters,country
Voltaire,1694,1778,21000,France
Émilie du Châtelet,1706,1749,430,France
Diderot,1713,1784,3500,France
d'Alembert,1717,1783,1200,France
Hume,1711,1776,800,Scotland
Smith,1723,1790,360,Scotland
Wollstonecraft,1759,1797,500,England
Kant,1724,1804,1100,Germany
"""
with open("writers.csv", "w", encoding="utf-8") as f:
    f.write(sample)
df = pd.read_csv("writers.csv")
# French writers born after 1700
print(df[(df["country"] == "France") & (df["born"] > 1700)])
# Anyone NOT in France
print(df[~(df["country"] == "France")])
# Anyone with at least 1000 letters or born after 1750
print(df[(df["letters"] >= 1000) | (df["born"] > 1750)])
The Python keywords and and or don’t work on Series — use & and | instead. This is the single most common newcomer mistake; if you see ValueError: The truth value of a Series is ambiguous, that’s what happened.
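A minimal sketch of the pitfall, using a throwaway Series rather than the lesson’s CSV:

```python
import pandas as pd

s = pd.Series([1694, 1706, 1759])

# Wrong: `and` asks the whole Series for a single True/False,
# which it can't provide -- pandas raises ValueError.
try:
    (s > 1700) and (s < 1750)
except ValueError as e:
    print("raises:", e)

# Right: element-wise `&`, with each comparison parenthesized.
good = (s > 1700) & (s < 1750)
print(good.tolist())  # [False, True, False]
```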
Membership filtering — .isin
Filtering “rows where country is France or Germany” with masks works, but .isin is shorter and reads better:
import pandas as pd
sample = """name,born,died,letters,country
Voltaire,1694,1778,21000,France
Émilie du Châtelet,1706,1749,430,France
Diderot,1713,1784,3500,France
d'Alembert,1717,1783,1200,France
Hume,1711,1776,800,Scotland
Smith,1723,1790,360,Scotland
Wollstonecraft,1759,1797,500,England
Kant,1724,1804,1100,Germany
"""
with open("writers.csv", "w", encoding="utf-8") as f:
    f.write(sample)
df = pd.read_csv("writers.csv")
continental = df[df["country"].isin(["France", "Germany"])]
print(continental)
Negate it with ~:
import pandas as pd
sample = """name,born,died,letters,country
Voltaire,1694,1778,21000,France
Émilie du Châtelet,1706,1749,430,France
Diderot,1713,1784,3500,France
d'Alembert,1717,1783,1200,France
Hume,1711,1776,800,Scotland
Smith,1723,1790,360,Scotland
Wollstonecraft,1759,1797,500,England
Kant,1724,1804,1100,Germany
"""
with open("writers.csv", "w", encoding="utf-8") as f:
    f.write(sample)
df = pd.read_csv("writers.csv")
print(df[~df["country"].isin(["France", "Germany"])])
Filtering by string content — .str
Text columns expose a .str accessor with all the string methods you already know:
import pandas as pd
sample = """name,born,died,letters,country
Voltaire,1694,1778,21000,France
Émilie du Châtelet,1706,1749,430,France
Diderot,1713,1784,3500,France
d'Alembert,1717,1783,1200,France
Hume,1711,1776,800,Scotland
Smith,1723,1790,360,Scotland
Wollstonecraft,1759,1797,500,England
Kant,1724,1804,1100,Germany
"""
with open("writers.csv", "w", encoding="utf-8") as f:
    f.write(sample)
df = pd.read_csv("writers.csv")
# Names containing a lowercase d (case=True is the default)
print(df[df["name"].str.contains("d", case=True, na=False)])
# Names starting with an unaccented vowel (É won't match this class)
print(df[df["name"].str.match(r"^[AEIOUaeiou]")])
# Country in lowercase
print(df["country"].str.lower())
na=False says “treat any missing values as non-matching.” Without it, missing values produce NaN in the mask, and indexing with a mask that contains NaN raises an error.
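To see what na=False buys you, here’s a throwaway frame with a missing name (illustrative data, not the lesson’s CSV):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Voltaire", None, "Diderot"]})

# Without na=False the missing value propagates into the mask as NaN
print(df["name"].str.contains("d"))        # second entry is NaN, not False

# With na=False the missing value simply doesn't match
mask = df["name"].str.contains("d", na=False)
print(df[mask])                            # only Diderot survives
```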
.query — readable filters as strings
For longer conditions, .query reads better than nested masks:
import pandas as pd
sample = """name,born,died,letters,country
Voltaire,1694,1778,21000,France
Émilie du Châtelet,1706,1749,430,France
Diderot,1713,1784,3500,France
d'Alembert,1717,1783,1200,France
Hume,1711,1776,800,Scotland
Smith,1723,1790,360,Scotland
Wollstonecraft,1759,1797,500,England
Kant,1724,1804,1100,Germany
"""
with open("writers.csv", "w", encoding="utf-8") as f:
    f.write(sample)
df = pd.read_csv("writers.csv")
print(df.query("country == 'France' and born > 1700"))
print(df.query("letters > 1000 and country != 'France'"))
Inside .query, you write and/or/not (not &/|/~). Reference outside variables with @:
import pandas as pd
sample = """name,born,died,letters,country
Voltaire,1694,1778,21000,France
Émilie du Châtelet,1706,1749,430,France
Diderot,1713,1784,3500,France
d'Alembert,1717,1783,1200,France
Hume,1711,1776,800,Scotland
Smith,1723,1790,360,Scotland
Wollstonecraft,1759,1797,500,England
Kant,1724,1804,1100,Germany
"""
with open("writers.csv", "w", encoding="utf-8") as f:
    f.write(sample)
df = pd.read_csv("writers.csv")
cutoff = 1000
print(df.query("letters >= @cutoff"))
For one-condition filters, masks are still cleaner. For three or more conditions, .query wins on readability.
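One more .query convenience worth knowing: column names that aren’t valid Python identifiers (spaces, punctuation) can be wrapped in backticks. A quick sketch with an invented column name:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Voltaire", "Hume", "Smith"],
    "letter count": [21000, 800, 360],  # note the space in the column name
})

# Backticks let .query reference the awkward column
print(df.query("`letter count` > 500"))
```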
Sorting
sort_values reorders rows by one or more columns:
import pandas as pd
sample = """name,born,died,letters,country
Voltaire,1694,1778,21000,France
Émilie du Châtelet,1706,1749,430,France
Diderot,1713,1784,3500,France
d'Alembert,1717,1783,1200,France
Hume,1711,1776,800,Scotland
Smith,1723,1790,360,Scotland
Wollstonecraft,1759,1797,500,England
Kant,1724,1804,1100,Germany
"""
with open("writers.csv", "w", encoding="utf-8") as f:
    f.write(sample)
df = pd.read_csv("writers.csv")
print(df.sort_values("letters", ascending=False))
print(df.sort_values(["country", "born"])) # multi-key
print(df.sort_values(["country", "letters"], ascending=[True, False]))
sort_values returns a new DataFrame — it does not modify df in place unless you pass inplace=True. For most analysis flows, you want the new copy.
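A quick check makes the copy behavior concrete — this sketch uses a cut-down frame rather than the full CSV:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Hume", "Smith", "Kant"],
                   "born": [1711, 1723, 1724]})

# The sorted result is a new DataFrame...
by_born = df.sort_values("born", ascending=False)
print(by_born["name"].tolist())  # ['Kant', 'Smith', 'Hume']

# ...and the original row order is untouched
print(df["name"].tolist())       # ['Hume', 'Smith', 'Kant']

# The usual idiom: reassign when you want to keep the sorted order
df = df.sort_values("born", ascending=False)
```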
The n-largest and n-smallest are common enough to have their own shortcut:
import pandas as pd
sample = """name,born,died,letters,country
Voltaire,1694,1778,21000,France
Émilie du Châtelet,1706,1749,430,France
Diderot,1713,1784,3500,France
d'Alembert,1717,1783,1200,France
Hume,1711,1776,800,Scotland
Smith,1723,1790,360,Scotland
Wollstonecraft,1759,1797,500,England
Kant,1724,1804,1100,Germany
"""
with open("writers.csv", "w", encoding="utf-8") as f:
    f.write(sample)
df = pd.read_csv("writers.csv")
print(df.nlargest(3, "letters")) # top 3 most prolific
print(df.nsmallest(3, "born")) # 3 earliest-born
Counting categories — value_counts
When you want “how many rows of each value in this column?” — value_counts is the move:
import pandas as pd
sample = """name,born,died,letters,country
Voltaire,1694,1778,21000,France
Émilie du Châtelet,1706,1749,430,France
Diderot,1713,1784,3500,France
d'Alembert,1717,1783,1200,France
Hume,1711,1776,800,Scotland
Smith,1723,1790,360,Scotland
Wollstonecraft,1759,1797,500,England
Kant,1724,1804,1100,Germany
"""
with open("writers.csv", "w", encoding="utf-8") as f:
    f.write(sample)
df = pd.read_csv("writers.csv")
print(df["country"].value_counts())
That’s a one-line histogram. For quick categorical summaries it’s one of the most useful methods in pandas.
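value_counts also accepts normalize=True, which returns proportions instead of raw counts — handy when you want a percentage breakdown. A sketch on a bare Series:

```python
import pandas as pd

countries = pd.Series(["France", "France", "France", "France",
                       "Scotland", "Scotland", "England", "Germany"])

# Fractions of the whole rather than counts
shares = countries.value_counts(normalize=True)
print(shares)  # France 0.500, Scotland 0.250, England 0.125, Germany 0.125
```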
GroupBy — the workhorse
groupby is the move that does most “real” analytics work. The recipe:
- Split the rows into groups by a column.
- Apply an aggregation to each group.
- Combine the results back into a DataFrame or Series.
The shortest demonstration: total letters per country.
import pandas as pd
sample = """name,born,died,letters,country
Voltaire,1694,1778,21000,France
Émilie du Châtelet,1706,1749,430,France
Diderot,1713,1784,3500,France
d'Alembert,1717,1783,1200,France
Hume,1711,1776,800,Scotland
Smith,1723,1790,360,Scotland
Wollstonecraft,1759,1797,500,England
Kant,1724,1804,1100,Germany
"""
with open("writers.csv", "w", encoding="utf-8") as f:
    f.write(sample)
df = pd.read_csv("writers.csv")
print(df.groupby("country")["letters"].sum())
Read it left to right: group by country, take the letters column, sum each group. The other one-shot aggregations work the same:
import pandas as pd
sample = """name,born,died,letters,country
Voltaire,1694,1778,21000,France
Émilie du Châtelet,1706,1749,430,France
Diderot,1713,1784,3500,France
d'Alembert,1717,1783,1200,France
Hume,1711,1776,800,Scotland
Smith,1723,1790,360,Scotland
Wollstonecraft,1759,1797,500,England
Kant,1724,1804,1100,Germany
"""
with open("writers.csv", "w", encoding="utf-8") as f:
    f.write(sample)
df = pd.read_csv("writers.csv")
print(df.groupby("country")["letters"].mean())
print(df.groupby("country")["letters"].count())
print(df.groupby("country")["born"].min())
print(df.groupby("country").size()) # rows per group
For multiple aggregates at once, agg:
import pandas as pd
sample = """name,born,died,letters,country
Voltaire,1694,1778,21000,France
Émilie du Châtelet,1706,1749,430,France
Diderot,1713,1784,3500,France
d'Alembert,1717,1783,1200,France
Hume,1711,1776,800,Scotland
Smith,1723,1790,360,Scotland
Wollstonecraft,1759,1797,500,England
Kant,1724,1804,1100,Germany
"""
with open("writers.csv", "w", encoding="utf-8") as f:
    f.write(sample)
df = pd.read_csv("writers.csv")
result = df.groupby("country").agg(
    n_writers=("name", "count"),
    total_letters=("letters", "sum"),
    earliest_birth=("born", "min"),
)
print(result)
("name", "count") reads as: take the name column, apply count. The keyword on the left becomes the new column name. This is the form that scales — for a real research dataset you’ll often have a half-dozen aggregates in one call.
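A common follow-up is sorting the aggregate table by one of its new columns; here’s a sketch on a cut-down frame:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Voltaire", "Diderot", "Hume", "Kant"],
    "letters": [21000, 3500, 800, 1100],
    "country": ["France", "France", "Scotland", "Germany"],
})

# Aggregate per country, then rank countries by total letters
result = (
    df.groupby("country")
    .agg(n_writers=("name", "count"), total_letters=("letters", "sum"))
    .sort_values("total_letters", ascending=False)
)
print(result)  # France first (24500), then Germany (1100), then Scotland (800)
```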
Chaining
Operations return new DataFrames, so they chain. The result is something close to readable English:
import pandas as pd
sample = """name,born,died,letters,country
Voltaire,1694,1778,21000,France
Émilie du Châtelet,1706,1749,430,France
Diderot,1713,1784,3500,France
d'Alembert,1717,1783,1200,France
Hume,1711,1776,800,Scotland
Smith,1723,1790,360,Scotland
Wollstonecraft,1759,1797,500,England
Kant,1724,1804,1100,Germany
"""
with open("writers.csv", "w", encoding="utf-8") as f:
    f.write(sample)
df = pd.read_csv("writers.csv")
result = (
    df[df["letters"] >= 500]
    .sort_values("letters", ascending=False)
    .head(5)
    [["name", "country", "letters"]]
)
print(result)
Read aloud: take rows with at least 500 letters, sort by letters descending, keep the top five, show name/country/letters. Once chaining clicks, you stop writing intermediate variables and the analysis fits on one screen.
Where to next
You can now read a table, pick the rows you want, sort them, and summarize them by category. That’s enough for most “what does this dataset look like?” workflows. The last lesson in this part — Lesson 22 — Cleaning and Exporting — handles missing values, text cleanup, and writing the result back out to CSV or Excel for sharing.
Running the code
This lesson uses pandas. Add it to your project once:
uv add pandas openpyxl
Save any snippet to a file — say try.py — and run it from your project folder:
uv run try.py
uv run uses the project’s Python and dependencies automatically; no virtualenv to activate. If you haven’t set the project up yet, Lesson 01 walks through it.