
Lesson 28

Storing Data — Text, CSV, and JSON

Write your data out to plain text, CSV, and JSON — and pick the right format for the shape you have.

Once you’ve gathered or computed some data, you usually want to keep it. Plain .txt files are the simplest option, but they’re rarely the right one. The right format depends on the shape of the data:

  • Plain text — for unstructured prose. Transcripts, notes, scraped article bodies.
  • CSV — for tabular data. Anything that looks like a spreadsheet.
  • JSON — for nested or hierarchical data. The output of a scraper or API.

This lesson covers all three. They’re the formats you’ll write 95% of the time.

Plain text

The pattern is the same one we used to read text files in Lesson 15, with the file opened in write mode ("w"):

with open("output.txt", "w", encoding="utf-8") as f:
    f.write("Hello, file!\n")

with open("output.txt", "r", encoding="utf-8") as f:
    print(f.read())

A few points worth pinning down:

  • The with block automatically closes the file when you leave it.
  • "w" overwrites any existing file with the same name. Use "a" (append) to add without erasing.
  • f.write does not add a newline. Note the explicit \n.
  • Always set encoding="utf-8". The default differs across operating systems; UTF-8 is the modern standard everywhere.
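The overwrite-versus-append distinction bites people, so it's worth one small demonstration (the filename modes.txt is just for this example):

```python
# "w" truncates; "a" appends. After the three writes below, "first" is
# gone, because the second open in "w" mode wiped the file.
with open("modes.txt", "w", encoding="utf-8") as f:
    f.write("first\n")

with open("modes.txt", "w", encoding="utf-8") as f:  # "w" again: erases "first"
    f.write("second\n")

with open("modes.txt", "a", encoding="utf-8") as f:  # "a": keeps "second"
    f.write("third\n")

with open("modes.txt", "r", encoding="utf-8") as f:
    print(f.read())
# second
# third
```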

For a list of strings, write them one per line:

names = ["Theuderic IV", "Pippin the Short", "Charlemagne"]

with open("names.txt", "w", encoding="utf-8") as f:
    for name in names:
        f.write(name + "\n")

with open("names.txt", "r", encoding="utf-8") as f:
    print(f.read())

pathlib has a one-liner equivalent that’s even cleaner:

from pathlib import Path

names = ["Theuderic IV", "Pippin the Short", "Charlemagne"]
Path("names.txt").write_text("\n".join(names) + "\n", encoding="utf-8")
print(Path("names.txt").read_text(encoding="utf-8"))

Reading it back is symmetrical:

from pathlib import Path

names = ["Theuderic IV", "Pippin the Short", "Charlemagne"]
Path("names.txt").write_text("\n".join(names) + "\n", encoding="utf-8")

text = Path("names.txt").read_text(encoding="utf-8")
lines = text.splitlines()
print(lines)

splitlines handles \n, \r\n, and \r uniformly and drops the line endings themselves, which is almost always what you want.
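A quick illustration of that uniformity, with all three line-ending conventions in one string:

```python
# splitlines treats \n, \r\n, and \r as equivalent separators
# and does not include them in the result.
text = "alpha\nbeta\r\ngamma\rdelta"
print(text.splitlines())
# ['alpha', 'beta', 'gamma', 'delta']
```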

CSV — for tabular data

When your data is rows and columns, don’t join with commas yourself. Quoting, escaping, and embedded commas in fields are a swamp; the standard library’s csv module handles all of that for you.
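You can watch the module do that work by writing a field that itself contains a comma and quotes — io.StringIO stands in for a file here:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
# The second field contains both a comma and double quotes — exactly
# the cases that break naive ",".join() output. The csv module quotes
# the field and doubles the embedded quotes automatically.
writer.writerow(["Rousseau", 'author of "Émile", among others'])
print(buf.getvalue())
# Rousseau,"author of ""Émile"", among others"
```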

The cleanest writer takes a list of dictionaries:

import csv

records = [
    {"name": "Voltaire", "born": 1694, "letters": 21000},
    {"name": "Émilie",   "born": 1706, "letters":   430},
    {"name": "Diderot",  "born": 1713, "letters":  3500},
]

with open("writers.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)

with open("writers.csv", "r", encoding="utf-8") as f:
    print(f.read())

Two non-obvious arguments to open you need every time you work with CSV:

  • newline="". Without it, the csv module writes its own \r\n line endings and Windows' text mode translates each \n again, so you get a blank row between every record.
  • encoding="utf-8". Same reason as before.

Reading it back with the same shape:

import csv

records = [
    {"name": "Voltaire", "born": 1694, "letters": 21000},
    {"name": "Émilie",   "born": 1706, "letters":   430},
    {"name": "Diderot",  "born": 1713, "letters":  3500},
]

with open("writers.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)

with open("writers.csv", "r", encoding="utf-8", newline="") as f:
    reader = csv.DictReader(f)
    records = [row for row in reader]

print(records[0])
# {'name': 'Voltaire', 'born': '1694', 'letters': '21000'}

Notice that everything came back as a string — CSV has no type information. If you need numbers, convert them yourself: int(row["born"]). (pandas.read_csv and polars.read_csv infer types automatically; for serious tabular work, those are usually the right tool.)
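Converting on the way in is a one-liner per row:

```python
# CSV gives you strings; rebuild the numeric fields explicitly.
row = {"name": "Voltaire", "born": "1694", "letters": "21000"}
typed = {**row, "born": int(row["born"]), "letters": int(row["letters"])}
print(typed)
# {'name': 'Voltaire', 'born': 1694, 'letters': 21000}
```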

JSON — for nested data

When your records have nested fields (lists, sub-dictionaries), CSV stops working cleanly. Use JSON. Python’s json module reads and writes it directly.

import json

records = [
    {
        "name": "Voltaire",
        "born": 1694,
        "places": ["Paris", "Cirey", "Ferney"],
        "correspondents": {"Émilie": 192, "Diderot": 8},
    },
]

with open("writers.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2, ensure_ascii=False)

with open("writers.json", "r", encoding="utf-8") as f:
    print(f.read())

Three useful arguments to json.dump:

  • indent=2 writes a pretty-printed, human-readable file. Without it the whole record goes on one line. For any file you'll ever look at by hand, always indent.
  • ensure_ascii=False keeps non-ASCII characters as themselves (Émilie) instead of escaping them (\u00c9milie). Almost always what you want.
  • default=str is a fallback for objects json doesn’t know how to serialize natively — most usefully datetime objects.
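A sketch of default=str in action — without it, the dump below raises a TypeError, because json has no native encoding for date objects:

```python
import json
from datetime import date

record = {"name": "Voltaire", "died": date(1778, 5, 30)}
# default=str is called for any object json can't serialize natively —
# here it turns the date into its ISO string form.
print(json.dumps(record, default=str))
# {"name": "Voltaire", "died": "1778-05-30"}
```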

Reading is even simpler:

import json

records = [
    {
        "name": "Voltaire",
        "born": 1694,
        "places": ["Paris", "Cirey", "Ferney"],
        "correspondents": {"Émilie": 192, "Diderot": 8},
    },
]

with open("writers.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2, ensure_ascii=False)

with open("writers.json", "r", encoding="utf-8") as f:
    records = json.load(f)
print(records)

You get back the same list of dictionaries you wrote.

JSONL — one JSON object per line

For long-running scrapers and large datasets, JSONL (also called NDJSON) is the format to know. Each line of the file is one valid JSON object. You can append to it as you go without rewriting the whole file:

import json

records = [
    {"name": "Voltaire", "born": 1694},
    {"name": "Émilie",   "born": 1706},
]

with open("scraped.jsonl", "a", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
print("wrote", len(records), "records")

Reading streams the file:

import json

records = [
    {"name": "Voltaire", "born": 1694},
    {"name": "Émilie",   "born": 1706},
]

with open("scraped.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

with open("scraped.jsonl", "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
print(records)

This is the format every modern data pipeline uses for logs and scrape outputs. Tools like jq, polars, and pandas all read it natively. Once a scrape gets past a few hundred records, JSONL is a much better choice than a single growing JSON list.
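Because each line stands alone, you can also filter a JSONL file one record at a time, without ever holding the whole dataset in memory — a small sketch (the filename scraped.jsonl is just for this example):

```python
import json

# Write a small JSONL file, then stream it back, keeping only the
# records that match a condition. Only one line is in memory at a time.
with open("scraped.jsonl", "w", encoding="utf-8") as f:
    for r in [{"name": "Voltaire", "born": 1694}, {"name": "Émilie", "born": 1706}]:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")

later = []
with open("scraped.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        if record["born"] > 1700:
            later.append(record)

print(later)
# [{'name': 'Émilie', 'born': 1706}]
```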

Picking a format

A short rule of thumb:

  • Free-form text (transcripts, articles) → plain .txt
  • Flat rows with consistent columns → CSV
  • Nested or variable-shape records → JSON
  • A long, growing log of records → JSONL
  • Something you’ll edit by hand later → the most readable of the above
  • Something a colleague will open in Excel → CSV

When you’re comfortable, Lesson 29 covers the last format you’ll meet often: XML.

Running the code

csv, json, and pathlib all ship with Python — nothing to install. Save any snippet from this lesson to a file — say try.py — and run it from your project folder:

uv run try.py

uv run uses the project’s Python automatically; no virtualenv to activate. If you haven’t set the project up yet, Lesson 01 walks through it.