Data Formats
Session 3
Parsing and extracting data from XML documents using XPath and Python libraries
2026 WayUp
eXtensible Markup Language
XML is a hierarchical data format designed for structured information exchange between systems
Data organized as a tree of nodes with parent-child relationships
Elements (tags), text content, and attributes
Ideal for representing nested, detailed data with metadata
Visualizing hierarchical relationships
Every XML document forms a tree with a single root element and nested children
| Root | breakfast_menu |
| Child | food |
| Grandchild | name, price, calories |
| Text | "Belgian Waffles" |
Understanding the syntax
<breakfast_menu>
<food>
<name>Belgian Waffles</name>
<price>$5.95</price>
<description>Two of our famous Belgian Waffles with...</description>
<calories>650</calories>
</food>
<food>
<name>French Toast</name>
<price>$4.50</price>
<description>Thick slices made from our homemade bread...</description>
<calories>450</calories>
</food>
</breakfast_menu>
<breakfast_menu> - Root element
<food> - Child element (repeatable)
Belgian Waffles - Text node
$5.95 - Text node
Query language for XML navigation
XPath is a query language for selecting nodes from an XML document using path expressions
XPath traces a path from the root element to the target node
Each level in the hierarchy is separated by /
Use [...] to filter results with conditions
| XPath Expression | Meaning |
|---|---|
/breakfast_menu/food | All food elements from root |
./food/name | Relative path to name elements |
//name | All name elements anywhere |
/food[price>5] | Food items with price over $5 |
Using xml.etree.ElementTree
Python includes xml.etree.ElementTree for basic XML parsing without external dependencies
import xml.etree.ElementTree as xmlReader
# Parse XML from file
tree = xmlReader.parse('menu.xml')
root = tree.getroot()
# Iterate through elements
for food in root.findall('food'):
name = food.find('name').text
price = food.find('price').text
print(f"{name}: {price}")
lxml for advanced queries.
Enhanced XPath support
The lxml library provides extensive XPath 1.0 support, making complex queries much easier
from lxml import etree
# Parse XML document
tree = etree.parse('menu.xml')
root = tree.getroot()
# Use powerful XPath queries
expensive_items = root.xpath('//food[number(substring-after(price, "$")) > 5]')
for item in expensive_items:
name = item.find('name').text
price = item.find('price').text
print(f"{name}: {price}")
pip install lxml
Building a data list from XML
Extract all food items into a list of [name, price] pairs for analysis
Use ./ for relative paths from current element
from lxml import etree
tree = etree.parse('menu.xml')
root = tree.getroot()
# Get all food elements
elems = root.findall('./food')
# Extract name and price from each
data = [
[elem.find("./name").text, elem.find("./price").text]
for elem in elems
]
print(data)
# Output: [['Belgian Waffles', '$5.95'], ['French Toast', '$4.50'], ...]
Key syntax patterns
/breakfast_menu/food
Starts with / → query from root
./food/name
Starts with ./ → query from current node
//name
Starts with // → find anywhere in document
//food[starts-with(./name/text(), 'Be')]
Use [...] to filter results
Converting XML to Orange-compatible format
Orange can work with "raw" Python data through its Table.from_list API
from Orange.data import *
from lxml import etree
# Parse XML
tree = etree.parse('menu.xml')
root = tree.getroot()
# Extract data
data = []
for food in root.findall('./food'):
name = food.find('./name').text
price = float(food.find('./price').text.replace('$', ''))
calories = int(food.find('./calories').text)
data.append([name, price, calories])
# Define Orange domain
name_var = StringVariable('name')
price_var = ContinuousVariable('price')
calories_var = ContinuousVariable('calories')
domain = Domain([name_var, price_var, calories_var])
# Create Orange table
table = Table.from_list(domain, data)
Practice with the Python Script widget
Import the menu.xml file through Orange's Python Script widget and convert it to a data table
From the Data category in Orange
Use lxml or xml.etree to read the file
Store result in out_data variable
Connect to Data Table widget to view
out_data variable is automatically passed to connected widgets.
Processing multiple XML files
Download this archive containing ML publications data in XML format
Use Orange visualization widgets to explore publication trends over time
glob module to find all XML files: glob.glob('*.xml')
1
XML is hierarchical. Think in trees with parent-child relationships, not flat tables.
2
XPath is powerful. Learn XPath syntax to efficiently navigate complex XML structures.
3
lxml over native. Use lxml for production work—it has better XPath support and performance.
4
Orange integration is straightforward. Convert XML to lists, then use Table.from_list().