Lesson 29

Storing Data in XML Files

Write and read structured data in XML using Python's standard library — and know when XML is actually the right format.

XML still matters in the humanities. The TEI Guidelines, EAD finding aids, EpiDoc inscriptions, METS/MODS, plenty of museum and library standards — they’re all XML. If your data has to interoperate with any of those traditions, you need to know how to read and write XML cleanly from Python. This lesson covers the basics with Python’s standard xml.etree.ElementTree module, then notes when to reach for the more powerful third-party lxml.

For brand-new data with no XML obligations, JSON is usually a better fit (Lesson 28). XML earns its keep when you need attributes, mixed content (text and tags interleaved), namespaces, or compatibility with an existing schema.

The library

Python ships with xml.etree.ElementTree, conventionally imported as ET:

import xml.etree.ElementTree as ET

The model is straightforward: build a tree of Element objects, then serialize it. Reading reverses the process — parse the tree, then walk it.

Writing — building a tree

Every XML document has a single root element. Create it, then add children:

import xml.etree.ElementTree as ET

root = ET.Element("biographies")

person = ET.SubElement(root, "person", attrib={"id": "theuderic-iv"})
ET.SubElement(person, "name").text = "Theuderic IV"
ET.SubElement(person, "reign", attrib={"start": "721", "end": "737"})

print(ET.tostring(root, encoding="unicode"))

Two important moves there:

.text sets the textual content of an element (<name>Theuderic IV</name>).
attrib={...} sets attributes (<reign start="721" end="737"/>).

You can also set attributes after creation: person.set("id", "theuderic-iv").
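As a quick sketch of that second route: .set() adds an attribute or overwrites an existing one, and .get() reads it back. The dynasty attribute here is purely illustrative and not part of the lesson's data model.

```python
import xml.etree.ElementTree as ET

person = ET.Element("person")

# Add attributes after the element already exists.
person.set("id", "theuderic-iv")
person.set("dynasty", "Merovingian")  # illustrative attribute, not used elsewhere in the lesson

# Calling .set() again with the same name overwrites the old value.
person.set("dynasty", "merovingian")

print(person.get("id"))       # theuderic-iv
print(person.get("dynasty"))  # merovingian
print(person.attrib)          # all attributes as a plain dict
```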

Building from a list of dictionaries

Most often, you’ve already got Python data — a list of dictionaries from a scrape or a CSV — and you want to serialize it:

import xml.etree.ElementTree as ET

people = [
    {"id": "theuderic-iv",  "name": "Theuderic IV",  "start": "721", "end": "737"},
    {"id": "childeric-iii", "name": "Childeric III", "start": "743", "end": "751"},
]

root = ET.Element("biographies")

for p in people:
    person = ET.SubElement(root, "person", attrib={"id": p["id"]})
    ET.SubElement(person, "name").text = p["name"]
    ET.SubElement(person, "reign", attrib={"start": p["start"], "end": p["end"]})

print(ET.tostring(root, encoding="unicode"))

Saving to disk — pretty-printed

Wrap the root in an ElementTree, indent it (Python 3.9+), and write:

import xml.etree.ElementTree as ET

people = [
    {"id": "theuderic-iv",  "name": "Theuderic IV",  "start": "721", "end": "737"},
    {"id": "childeric-iii", "name": "Childeric III", "start": "743", "end": "751"},
]

root = ET.Element("biographies")
for p in people:
    person = ET.SubElement(root, "person", attrib={"id": p["id"]})
    ET.SubElement(person, "name").text = p["name"]
    ET.SubElement(person, "reign", attrib={"start": p["start"], "end": p["end"]})

tree = ET.ElementTree(root)
ET.indent(tree, space="  ")
tree.write("biographies.xml", encoding="utf-8", xml_declaration=True)

# Read the file back to check what was written.
with open("biographies.xml", encoding="utf-8") as f:
    print(f.read())

The result:

<?xml version='1.0' encoding='utf-8'?>
<biographies>
  <person id="theuderic-iv">
    <name>Theuderic IV</name>
    <reign start="721" end="737" />
  </person>
  <person id="childeric-iii">
    <name>Childeric III</name>
    <reign start="743" end="751" />
  </person>
</biographies>

xml_declaration=True writes the <?xml version='1.0' encoding='utf-8'?> header on the first line, with the encoding taken from the encoding argument. ET.indent is what turns the default single-line blob into something a human can read.
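If you want the same pretty-printed output in memory rather than on disk (to send it over a network or inspect it in a test, say), a minimal sketch looks like this. It assumes Python 3.9+ for ET.indent; the xml_declaration argument to ET.tostring arrived in Python 3.8.

```python
import xml.etree.ElementTree as ET

root = ET.Element("biographies")
person = ET.SubElement(root, "person", attrib={"id": "theuderic-iv"})
ET.SubElement(person, "name").text = "Theuderic IV"

# ET.indent accepts a bare Element as well as an ElementTree.
ET.indent(root, space="  ")

# With a real encoding (not "unicode"), tostring returns bytes.
xml_bytes = ET.tostring(root, encoding="utf-8", xml_declaration=True)

print(xml_bytes.decode("utf-8"))
```

Because the result is bytes, decode it before printing or comparing it with ordinary strings.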

Reading XML

Reverse direction — parse a file, then walk the tree:

import xml.etree.ElementTree as ET

# Rebuild biographies.xml first so this example runs on its own.
people = [
    {"id": "theuderic-iv",  "name": "Theuderic IV",  "start": "721", "end": "737"},
    {"id": "childeric-iii", "name": "Childeric III", "start": "743", "end": "751"},
]

root = ET.Element("biographies")
for p in people:
    person = ET.SubElement(root, "person", attrib={"id": p["id"]})
    ET.SubElement(person, "name").text = p["name"]
    ET.SubElement(person, "reign", attrib={"start": p["start"], "end": p["end"]})

tree = ET.ElementTree(root)
ET.indent(tree, space="  ")
tree.write("biographies.xml", encoding="utf-8", xml_declaration=True)

tree = ET.parse("biographies.xml")
root = tree.getroot()

for person in root.findall("person"):
    pid = person.get("id")
    name = person.findtext("name")
    reign = person.find("reign")
    start, end = reign.get("start"), reign.get("end")
    print(pid, name, start, end)
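Often you want the parsed data back as ordinary Python structures rather than printed. Here is one possible sketch that rebuilds the list of dictionaries from the XML. It parses from a string with ET.fromstring so it runs without any file on disk, and it guards against a missing reign element, since find returns None when nothing matches.

```python
import xml.etree.ElementTree as ET

xml_text = """<biographies>
  <person id="theuderic-iv">
    <name>Theuderic IV</name>
    <reign start="721" end="737" />
  </person>
  <person id="childeric-iii">
    <name>Childeric III</name>
    <reign start="743" end="751" />
  </person>
</biographies>"""

# fromstring parses a string and returns the root Element directly.
root = ET.fromstring(xml_text)

people = []
for person in root.findall("person"):
    reign = person.find("reign")
    people.append({
        "id": person.get("id"),
        "name": person.findtext("name"),
        # find() returns None if there is no <reign>, so check before calling .get().
        "start": reign.get("start") if reign is not None else None,
        "end": reign.get("end") if reign is not None else None,
    })

print(people)
```

Note that attribute values always come back as strings; convert with int() yourself if you need numbers.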

The four most useful methods on an Element:

Method	Returns
`el.find(path)`	the first element matching `path` (or `None`)
`el.findall(path)`	a list of all elements matching `path` (empty if none)
`el.findtext(path)`	the `.text` of the first match (or `None`)
`el.get(name)`	the value of an attribute (or `None`)

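path can be a tag name or a more elaborate XPath-style expression such as "person/name" or "person[@id='theuderic-iv']". A bare tag name matches direct children only; prefix it with .// to search all descendants. ElementTree supports a useful subset of XPath; for full XPath 1.0 you’ll want lxml.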
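To make those path expressions concrete, here is a short sketch run against the same biographies data, parsed from a string so it stands alone. The "no-such-id" value and the death element are deliberately absent from the data, to show what a failed match returns.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""<biographies>
  <person id="theuderic-iv">
    <name>Theuderic IV</name>
    <reign start="721" end="737" />
  </person>
  <person id="childeric-iii">
    <name>Childeric III</name>
    <reign start="743" end="751" />
  </person>
</biographies>""")

# A path with a slash steps down through children.
names = [n.text for n in root.findall("person/name")]
print(names)  # ['Theuderic IV', 'Childeric III']

# A predicate in square brackets filters on an attribute value.
match = root.find("person[@id='childeric-iii']")
print(match.findtext("name"))  # Childeric III

# No match: find() gives None, findall() gives an empty list.
missing = root.find("person[@id='no-such-id']")
print(missing)  # None

# findtext() takes a default to return when nothing matches.
death = root.findtext("person/death", default="unknown")
print(death)  # unknown

# .// searches all descendants, not just direct children.
starts = [r.get("start") for r in root.findall(".//reign")]
print(starts)  # ['721', '743']
```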

Namespaces — the gotcha

Real-world XML usually has namespaces, and that changes how you query it. A TEI document looks like:

<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>...</teiHeader>
</TEI>

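In ElementTree, every tag in a namespaced document is stored as {namespace-URI}tag, so your queries have to be qualified too. The usual way is to pass a dictionary that maps a prefix of your choosing to the namespace URI: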

import xml.etree.ElementTree as ET

# Write a tiny TEI sample so this example runs on its own.
with open("tei-doc.xml", "w", encoding="utf-8") as f:
    f.write('<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader/><text><body><div type="chapter"/><div type="section"/></body></text></TEI>')

ns = {"tei": "http://www.tei-c.org/ns/1.0"}

tree = ET.parse("tei-doc.xml")
for div in tree.findall(".//tei:div", ns):
    print(div.get("type"))

Forgetting the namespace is the most common reason a query returns nothing. If findall is mysteriously empty, check whether the document declares a namespace.
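A quick way to run that check is to print the root's tag: if it starts with a URI in curly braces, the document is namespaced. The sketch below uses the same small TEI sample, parsed from a string, and compares an unqualified query with three qualified ones. The {*} wildcard form assumes Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader/><text><body>'
    '<div type="chapter"/><div type="section"/></body></text></TEI>'
)

# The namespace shows up in the tag itself, wrapped in curly braces.
print(doc.tag)  # {http://www.tei-c.org/ns/1.0}TEI

# An unqualified query silently finds nothing.
unqualified = doc.findall(".//div")
print(unqualified)  # []

# Option 1: spell out the URI in curly braces.
braced = doc.findall(".//{http://www.tei-c.org/ns/1.0}div")

# Option 2: a prefix map, as in the lesson's example.
ns = {"tei": "http://www.tei-c.org/ns/1.0"}
prefixed = doc.findall(".//tei:div", ns)

# Option 3 (Python 3.8+): match the tag in any namespace.
wildcard = doc.findall(".//{*}div")

for result in (braced, prefixed, wildcard):
    print([d.get("type") for d in result])  # ['chapter', 'section']
```

The prefix in the map is yours to choose; it does not have to match whatever prefix the document itself uses.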

When to use `lxml` instead

xml.etree.ElementTree is good enough for most jobs. Reach for the third-party lxml package when you need:

Full XPath 1.0 (lxml does not implement XPath 2.0, though it does support EXSLT extension functions such as regular expressions).
XSLT transformations.
Schema validation (XSD, DTD, RELAX NG).
Faster parsing of very large documents.
Round-tripping with comments and processing instructions preserved (ElementTree's default parser discards both).

The lxml API is mostly compatible with ElementTree (most ElementTree code works under from lxml import etree as ET), so it’s a small switch when the time comes.

uv add lxml
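One way to see that compatibility in practice is an import fallback: use lxml if it is installed, otherwise the standard library. This is only a sketch, and it sticks to calls both libraries share (fromstring, findall, findtext); the backend variable is just a label added for this example.

```python
try:
    from lxml import etree as ET  # third-party; used only if installed
    backend = "lxml"
except ImportError:
    import xml.etree.ElementTree as ET  # standard-library fallback
    backend = "ElementTree"

root = ET.fromstring(
    '<biographies><person id="theuderic-iv">'
    "<name>Theuderic IV</name></person></biographies>"
)

# These calls behave the same under either backend.
names = [p.findtext("name") for p in root.findall("person")]
ids = [p.get("id") for p in root.findall("person")]

print(backend, names, ids)
```

Code that relies on lxml-only features, such as its xpath() method, will not run under the fallback, so keep to the shared subset if you want both to work.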

Running the code

xml.etree.ElementTree ships with Python, so a basic XML script needs nothing extra:

uv run try.py

If you’ve followed the lxml note above, add it to your project first:

uv add lxml

uv run uses the project’s Python and dependencies automatically; no virtualenv to activate. If you haven’t set the project up yet, Lesson 01 walks through it.

Where to next

This is the last numbered lesson in the introductory course. From here, two natural next steps:

The free textbooks — deeper guides on Python for DH, named entity recognition, spaCy, and BookNLP. Each builds on the foundations from this course.
A real project. A small, finishable one. Pick something you’d otherwise do by hand — count names in a corpus, scrape a finding aid, transform a folder of texts — and write the script. Everything you learn next, you’ll learn faster than what you’ve learned so far.