Working with XML

Data Formats

Session 3

Parsing and extracting data from XML documents using XPath and Python libraries

What is XML?

eXtensible Markup Language

XML is a hierarchical data format designed for structured information exchange between systems

Hierarchical Structure

Data organized as a tree of nodes with parent-child relationships

Three Node Types

Elements (tags), text content, and attributes

Complex Structures

Ideal for representing nested, detailed data with metadata

Use cases: Configuration files, API responses, data exchange between systems, document storage.

XML as a Tree Structure

Visualizing hierarchical relationships

Every XML document forms a tree with a single root element and nested children

Example Structure

Rootbreakfast_menu
Childfood
Grandchildname, price, calories
Text"Belgian Waffles"

Tree Visualization

breakfast_menu (root)
├── food
│ ├── name: "Belgian Waffles"
│ ├── price: "$5.95"
│ └── calories: 650
└── food
├── name: "French Toast"
├── price: "$4.50"
└── calories: 450

XML Document Example

Understanding the syntax

<breakfast_menu>
    <food>
        <name>Belgian Waffles</name>
        <price>$5.95</price>
        <description>Two of our famous Belgian Waffles with...</description>
        <calories>650</calories>
    </food>
    <food>
        <name>French Toast</name>
        <price>$4.50</price>
        <description>Thick slices made from our homemade bread...</description>
        <calories>450</calories>
    </food>
</breakfast_menu>

Elements

<breakfast_menu> - Root element

<food> - Child element (repeatable)

Text Content

Belgian Waffles - Text node

$5.95 - Text node

Targeting Elements with XPath

Query language for XML navigation

What is XPath?

XPath is a query language for selecting nodes from an XML document using path expressions

1 Path from Root

XPath traces a path from the root element to the target node

2 Levels with /

Each level in the hierarchy is separated by /

3 Predicates

Use [...] to filter results with conditions

XPath ExpressionMeaning
/breakfast_menu/foodAll food elements from root
./food/nameRelative path to name elements
//nameAll name elements anywhere
/food[price>5]Food items with price over $5

Reading XML with Python (Native)

Using xml.etree.ElementTree

Built-in Library

Python includes xml.etree.ElementTree for basic XML parsing without external dependencies

import xml.etree.ElementTree as xmlReader

# Parse XML from file
tree = xmlReader.parse('menu.xml')
root = tree.getroot()

# Iterate through elements
for food in root.findall('food'):
    name = food.find('name').text
    price = food.find('price').text
    print(f"{name}: {price}")
Limitation: The native library has limited XPath support. Use lxml for advanced queries.

Reading XML with lxml

Enhanced XPath support

Why lxml?

The lxml library provides extensive XPath 1.0 support, making complex queries much easier

from lxml import etree

# Parse XML document
tree = etree.parse('menu.xml')
root = tree.getroot()

# Use powerful XPath queries
expensive_items = root.xpath('//food[number(substring-after(price, "$")) > 5]')

for item in expensive_items:
    name = item.find('name').text
    price = item.find('price').text
    print(f"{name}: {price}")
Installation: pip install lxml

Extracting Data with XPath

Building a data list from XML

The Goal

Extract all food items into a list of [name, price] pairs for analysis

XPath Approach

Use ./ for relative paths from current element

from lxml import etree

tree = etree.parse('menu.xml')
root = tree.getroot()

# Get all food elements
elems = root.findall('./food')

# Extract name and price from each
data = [
    [elem.find("./name").text, elem.find("./price").text]
    for elem in elems
]

print(data)
# Output: [['Belgian Waffles', '$5.95'], ['French Toast', '$4.50'], ...]

XPath Essentials

Key syntax patterns

Absolute Path

/breakfast_menu/food

Starts with / → query from root

Relative Path

./food/name

Starts with ./ → query from current node

Anywhere Search

//name

Starts with // → find anywhere in document

Predicates

//food[starts-with(./name/text(), 'Be')]

Use [...] to filter results

Exercise: Load this XML file in Python, then try the same in Orange using the Python Script widget.

Creating Orange Data Tables from XML

Converting XML to Orange-compatible format

Orange can work with "raw" Python data through its Table.from_list API

from Orange.data import *
from lxml import etree

# Parse XML
tree = etree.parse('menu.xml')
root = tree.getroot()

# Extract data
data = []
for food in root.findall('./food'):
    name = food.find('./name').text
    price = float(food.find('./price').text.replace('$', ''))
    calories = int(food.find('./calories').text)
    data.append([name, price, calories])

# Define Orange domain
name_var = StringVariable('name')
price_var = ContinuousVariable('price')
calories_var = ContinuousVariable('calories')

domain = Domain([name_var, price_var, calories_var])

# Create Orange table
table = Table.from_list(domain, data)

Exercise 1: XML to Orange Table

Practice with the Python Script widget

Your Task

Import the menu.xml file through Orange's Python Script widget and convert it to a data table

1 Add Python Script Widget

From the Data category in Orange

2 Parse XML

Use lxml or xml.etree to read the file

3 Convert to Table

Store result in out_data variable

4 Visualize

Connect to Data Table widget to view

Hint: The out_data variable is automatically passed to connected widgets.

Exercise 2: Machine Learning Publications

Processing multiple XML files

Challenge: Multi-File Processing

Download this archive containing ML publications data in XML format

Step 1: Single File

  • Extract one XML file
  • Parse year, month, and title
  • Add a "topic" column with value "ML"

Step 2: All Files

  • Iterate over all files in folder
  • Combine data into single dataset
  • Load into Orange via Python Script

Step 3: Analysis

Use Orange visualization widgets to explore publication trends over time

Tip: Use Python's glob module to find all XML files: glob.glob('*.xml')

Key Takeaways

1

XML is hierarchical. Think in trees with parent-child relationships, not flat tables.

2

XPath is powerful. Learn XPath syntax to efficiently navigate complex XML structures.

3

lxml over native. Use lxml for production work—it has better XPath support and performance.

4

Orange integration is straightforward. Convert XML to lists, then use Table.from_list().

Next: Continue to REST APIs and web scraping to expand your data acquisition toolkit.

Slide Overview