Lesson 28
Storing Data — Text, CSV, and JSON
Write your data out to plain text, CSV, and JSON — and pick the right format for the shape you have.
Once you’ve gathered or computed some data, you usually want to keep it. Plain .txt files are the simplest option, but they’re rarely the right one. The right format depends on the shape of the data:
- Plain text — for unstructured prose. Transcripts, notes, scraped article bodies.
- CSV — for tabular data. Anything that looks like a spreadsheet.
- JSON — for nested or hierarchical data. The output of a scraper or API.
This lesson covers all three. They’re the formats you’ll write 95% of the time.
Plain text
The pattern is the same one we used to read text files in Lesson 15, with the file opened in write mode ("w"):
```python
with open("output.txt", "w", encoding="utf-8") as f:
    f.write("Hello, file!\n")

with open("output.txt", "r", encoding="utf-8") as f:
    print(f.read())
```
A few points worth pinning down:
- The `with` block automatically closes the file when you leave it.
- `"w"` overwrites any existing file with the same name. Use `"a"` (append) to add without erasing.
- `f.write` does not add a newline. Note the explicit `\n`.
- Always set `encoding="utf-8"`. The default differs across operating systems; UTF-8 is the modern standard everywhere.
For a list of strings, write them one per line:
```python
names = ["Theuderic IV", "Pippin the Short", "Charlemagne"]

with open("names.txt", "w", encoding="utf-8") as f:
    for name in names:
        f.write(name + "\n")

with open("names.txt", "r", encoding="utf-8") as f:
    print(f.read())
```
`pathlib` has a one-liner equivalent that's even cleaner:

```python
from pathlib import Path

names = ["Theuderic IV", "Pippin the Short", "Charlemagne"]
Path("names.txt").write_text("\n".join(names) + "\n", encoding="utf-8")
print(Path("names.txt").read_text(encoding="utf-8"))
```
Reading it back is symmetrical:
```python
from pathlib import Path

names = ["Theuderic IV", "Pippin the Short", "Charlemagne"]
Path("names.txt").write_text("\n".join(names) + "\n", encoding="utf-8")

text = Path("names.txt").read_text(encoding="utf-8")
lines = text.splitlines()
print(lines)
```
`splitlines` handles `\n`, `\r\n`, and `\r` uniformly and strips the line-break characters from the result, which is almost always what you want.
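Since line-ending behavior is easy to misremember, here's a quick check with deliberately mixed endings:

```python
# splitlines treats \n, \r\n, and \r all as line breaks
# and does not keep them in the result.
text = "alpha\nbeta\r\ngamma\rdelta"
print(text.splitlines())  # ['alpha', 'beta', 'gamma', 'delta']
```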
CSV — for tabular data
When your data is rows and columns, don't join with commas yourself. Quoting, escaping, and embedded commas in fields are a swamp; the standard library's `csv` module handles all of that for you.
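To see the swamp in miniature: a field containing a comma would corrupt a hand-joined line, but `csv.writer` quotes it automatically. (A small sketch using an in-memory buffer; the sample row is made up.)

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)

# "Arouet, François-Marie" contains a comma, so the writer wraps
# the field in quotes instead of letting it split the row.
writer.writerow(["Arouet, François-Marie", 1694])
print(repr(buf.getvalue()))  # '"Arouet, François-Marie",1694\r\n'
```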
The cleanest writer takes a list of dictionaries:
```python
import csv

records = [
    {"name": "Voltaire", "born": 1694, "letters": 21000},
    {"name": "Émilie", "born": 1706, "letters": 430},
    {"name": "Diderot", "born": 1713, "letters": 3500},
]

with open("writers.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)

with open("writers.csv", "r", encoding="utf-8") as f:
    print(f.read())
```
Two non-obvious arguments to `open` you need every time you work with CSV:

- `newline=""`. Without it, the `csv` module and the OS will both add line endings on Windows, and you'll get blank rows between every record.
- `encoding="utf-8"`. Same reason as before.
Reading it back with the same shape:
```python
import csv

records = [
    {"name": "Voltaire", "born": 1694, "letters": 21000},
    {"name": "Émilie", "born": 1706, "letters": 430},
    {"name": "Diderot", "born": 1713, "letters": 3500},
]

with open("writers.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)

with open("writers.csv", "r", encoding="utf-8", newline="") as f:
    reader = csv.DictReader(f)
    records = [row for row in reader]

print(records[0])
# {'name': 'Voltaire', 'born': '1694', 'letters': '21000'}
```
Notice that everything came back as a string — CSV has no type information. If you need numbers, convert them yourself: `int(row["born"])`. (`pandas.read_csv` and `polars.read_csv` infer types automatically; for serious tabular work, those are usually the right tool.)
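A sketch of that conversion step, using an in-memory CSV so it runs on its own (the column names mirror the `writers.csv` example above):

```python
import csv
import io

csv_text = "name,born,letters\nVoltaire,1694,21000\n"
reader = csv.DictReader(io.StringIO(csv_text))

# Convert the numeric columns back from strings as each row is read.
records = [
    {**row, "born": int(row["born"]), "letters": int(row["letters"])}
    for row in reader
]
print(records[0])  # {'name': 'Voltaire', 'born': 1694, 'letters': 21000}
```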
JSON — for nested data
When your records have nested fields (lists, sub-dictionaries), CSV stops working cleanly. Use JSON. Python's `json` module reads and writes it directly.
```python
import json

records = [
    {
        "name": "Voltaire",
        "born": 1694,
        "places": ["Paris", "Cirey", "Ferney"],
        "correspondents": {"Émilie": 192, "Diderot": 8},
    },
]

with open("writers.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2, ensure_ascii=False)

with open("writers.json", "r", encoding="utf-8") as f:
    print(f.read())
```
Three useful arguments to `json.dump`:

- `indent=2` writes a pretty-printed, human-readable file. Without it the whole record is on one line. For a file you'll ever look at, always indent.
- `ensure_ascii=False` keeps non-ASCII characters as themselves (`Émilie`) instead of escaping them (`\u00c9milie`). Almost always what you want.
- `default=str` is a fallback for objects `json` doesn't know how to serialize natively — most usefully `datetime` objects.
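For example, `datetime.date` objects aren't serializable on their own, but `default=str` converts them on the way out (a minimal sketch with a made-up record):

```python
import json
from datetime import date

record = {"name": "Voltaire", "died": date(1778, 5, 30)}

# Without default=str this raises TypeError;
# with it, the date is written as its str() form.
print(json.dumps(record, default=str))
# {"name": "Voltaire", "died": "1778-05-30"}
```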
Reading is even simpler:
```python
import json

records = [
    {
        "name": "Voltaire",
        "born": 1694,
        "places": ["Paris", "Cirey", "Ferney"],
        "correspondents": {"Émilie": 192, "Diderot": 8},
    },
]

with open("writers.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2, ensure_ascii=False)

with open("writers.json", "r", encoding="utf-8") as f:
    records = json.load(f)

print(records)
```
You get back the same list of dictionaries you wrote.
JSONL — one JSON object per line
For long-running scrapers and large datasets, JSONL (also called NDJSON) is the format to know. Each line of the file is one valid JSON object. You can append to it as you go without rewriting the whole file:
```python
import json

records = [
    {"name": "Voltaire", "born": 1694},
    {"name": "Émilie", "born": 1706},
]

with open("scraped.jsonl", "a", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

print("wrote", len(records), "records")
```
Reading it back is one `json.loads` per line:
```python
import json

records = [
    {"name": "Voltaire", "born": 1694},
    {"name": "Émilie", "born": 1706},
]

with open("scraped.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

with open("scraped.jsonl", "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(records)
```
JSONL is the format modern data pipelines typically use for logs and scrape outputs. Tools like `jq`, `polars`, and `pandas` all read it natively. Once a scrape gets past a few hundred records, JSONL is a much better choice than a single growing JSON list.
Picking a format
A short rule of thumb:
| Your data is… | Use |
|---|---|
| free-form text (transcripts, articles) | plain .txt |
| flat rows with consistent columns | CSV |
| nested or variable-shape records | JSON |
| a long, growing log of records | JSONL |
| something you’ll edit by hand later | the most readable of the above |
| something a colleague will open in Excel | CSV |
When you’re comfortable, Lesson 29 covers the last format you’ll meet often: XML.
Running the code
`csv`, `json`, and `pathlib` all ship with Python — nothing to install. Save any snippet from this lesson to a file — say `try.py` — and run it from your project folder:

```shell
uv run try.py
```

`uv run` uses the project's Python automatically; no virtualenv to activate. If you haven't set the project up yet, Lesson 01 walks through it.