Practical Work 3

XML Data Processing

Parse and transform XML documents into structured data using Python and XPath

Duration: 2-3 hours
Difficulty: Intermediate
Session: XML Data Formats

Objectives

By the end of this practical work, you will be able to:

  • Parse XML documents using Python's lxml library
  • Navigate XML tree structures using XPath expressions
  • Extract data from complex nested XML elements
  • Transform XML data into pandas DataFrames
  • Integrate XML data with Orange Data Mining

Prerequisites

  • Python 3.8+ installed
  • Basic understanding of XML structure
  • Orange Data Mining (optional, for visualization)

Install required packages:

pip install lxml pandas requests

Step 7 also needs Orange itself (pip install orange3) if it is not already installed.

Dataset

We'll work with two XML datasets:

  • menu.xml - A simple restaurant menu (warm-up exercise)
  • ml.zip - Machine Learning article metadata (main exercise)

Note: Download the files before starting the exercises.
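
If the files are hosted online, you can fetch them with requests (already in the install list above). A minimal sketch — the URLs below are placeholders; substitute the actual links from the course page:

import requests

# Hypothetical URLs -- replace with the actual download links from the course page
files = {
    'menu.xml': 'https://example.com/data/menu.xml',
    'ml.zip': 'https://example.com/data/ml.zip',
}

for filename, url in files.items():
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Fail loudly on HTTP errors
    with open(filename, 'wb') as f:
        f.write(response.content)
    print(f"Downloaded {filename} ({len(response.content)} bytes)")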

Instructions

Step 1: Parse a Simple XML File

Start with the restaurant menu XML file:

from lxml import etree

# Parse the XML file
tree = etree.parse('menu.xml')
root = tree.getroot()

# Print basic info
print(f"Root element: {root.tag}")
print(f"Number of children: {len(root)}")

# List all child element tags
for child in root:
    print(f"  - {child.tag}")

Expected Output: You should see the root element's tag name and the tag of each child element (for a menu file, typically a series of food elements).

Step 2: Navigate with XPath

Use XPath to extract specific elements:

# Get all food names
food_names = root.xpath('//food/name/text()')
print("Menu items:")
for name in food_names:
    print(f"  - {name}")

# Get all prices
prices = root.xpath('//food/price/text()')
print("\nPrices:", prices)

# Get food items with calories > 500
high_cal_foods = root.xpath('//food[calories > 500]/name/text()')
print("\nHigh calorie items (>500):", high_cal_foods)

Step 3: Extract Data to a Dictionary

Create a structured data extraction function:

def extract_menu_data(xml_path):
    """Extract all menu items as a list of dictionaries."""
    tree = etree.parse(xml_path)
    root = tree.getroot()

    menu_items = []
    for food in root.xpath('//food'):
        item = {
            'name': food.findtext('name'),
            'price': food.findtext('price'),
            'description': food.findtext('description'),
            'calories': int(food.findtext('calories', default='0'))
        }
        menu_items.append(item)

    return menu_items

# Test the function
menu = extract_menu_data('menu.xml')
for item in menu:
    print(f"{item['name']}: {item['price']} ({item['calories']} cal)")

Step 4: Convert to DataFrame

Transform the extracted data into a pandas DataFrame:

import pandas as pd

# Convert menu data to DataFrame
df = pd.DataFrame(extract_menu_data('menu.xml'))

# Clean the price column (strip the literal '$' and convert to float;
# regex=False keeps '$' from being read as a regex anchor)
df['price_numeric'] = df['price'].str.replace('$', '', regex=False).astype(float)

# Display the DataFrame
print(df)

# Basic analysis
print(f"\nMost expensive item: {df.loc[df['price_numeric'].idxmax(), 'name']}")
print(f"Average calories: {df['calories'].mean():.0f}")
print(f"Total items: {len(df)}")

Step 5: Process Multiple XML Files

Now work with the ML articles dataset. First, extract the zip file, then process multiple XML files:

import zipfile
from pathlib import Path

# Extract the zip file
with zipfile.ZipFile('ml.zip', 'r') as zip_ref:
    zip_ref.extractall('ml_articles')

# Find all XML files
xml_files = list(Path('ml_articles').glob('**/*.xml'))
print(f"Found {len(xml_files)} XML files")

# Function to extract article metadata
def extract_article_data(xml_path):
    """Extract metadata from a single article XML."""
    try:
        tree = etree.parse(str(xml_path))
        root = tree.getroot()

        # Adjust XPath based on actual XML structure
        # Common patterns for article metadata:
        return {
            'title': root.findtext('.//title', default='Unknown'),
            'year': root.findtext('.//year', default='Unknown'),
            'month': root.findtext('.//month', default='Unknown'),
            'source_file': xml_path.name,
            'topic': 'ML'  # We know all articles are ML-related
        }
    except Exception as e:
        print(f"Error parsing {xml_path}: {e}")
        return None

# Process all files
articles = []
for xml_file in xml_files[:10]:  # Start with first 10
    data = extract_article_data(xml_file)
    if data:
        articles.append(data)

# Create DataFrame
articles_df = pd.DataFrame(articles)
print(articles_df)

Warning: XML structures vary. You may need to inspect the actual XML files and adjust the XPath expressions accordingly.

Step 6: Inspect XML Structure

When dealing with unknown XML structures, use this helper function:

def describe_xml_structure(xml_path, max_depth=3):
    """Print the structure of an XML file."""
    tree = etree.parse(str(xml_path))
    root = tree.getroot()

    def print_element(elem, depth=0):
        if depth > max_depth:
            return
        indent = "  " * depth
        attrs = ' '.join(f'{k}="{v}"' for k, v in elem.attrib.items())
        text = elem.text.strip() if elem.text and elem.text.strip() else ''

        # Render attributes inside the tag; only add '...' when truncating
        print(f"{indent}<{elem.tag}{' ' + attrs if attrs else ''}>")
        if text:
            print(f"{indent}  TEXT: {text[:50]}{'...' if len(text) > 50 else ''}")

        for child in elem:
            print_element(child, depth + 1)

    print_element(root)

# Inspect the first XML file
if xml_files:
    print("Structure of first XML file:")
    describe_xml_structure(xml_files[0])
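
For a quick raw view, lxml can also serialize the whole document as indented XML — useful as a sanity check alongside the structural summary above (indentation depends on how the file was originally written):

# Dump the first file as pretty-printed XML, sliced to keep output short
if xml_files:
    first = etree.parse(str(xml_files[0]))
    print(etree.tostring(first, pretty_print=True, encoding='unicode')[:2000])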

Step 7: Integrate with Orange

Create an Orange Data Table from the extracted XML data:

from Orange.data import Table, Domain, StringVariable, ContinuousVariable, DiscreteVariable

# Define the domain based on our data
domain = Domain(
    [],  # No continuous features in this example
    [DiscreteVariable("topic", values=["ML"])],  # Target
    [StringVariable("title"),
     StringVariable("year"),
     StringVariable("month"),
     StringVariable("source_file")]  # Meta variables
)

# Convert DataFrame rows to lists in domain order: attributes and class
# variables first (here just 'topic'), then the meta variables
data = articles_df[['topic', 'title', 'year', 'month', 'source_file']].values.tolist()

# Create Orange table
out_data = Table.from_list(domain, data)
print(f"Created Orange table with {len(out_data)} rows")

Step 8: Analyze Publication Trends

Once you have all articles processed, analyze publication trends:

# Process ALL XML files
all_articles = []
for xml_file in xml_files:
    data = extract_article_data(xml_file)
    if data:
        all_articles.append(data)

df = pd.DataFrame(all_articles)

# Convert year to numeric
df['year_num'] = pd.to_numeric(df['year'], errors='coerce')

# Group by year (rows with unparseable years are dropped; cast to int
# so the chart labels aren't floats)
yearly_counts = df.dropna(subset=['year_num']).astype({'year_num': int}).groupby('year_num').size()
print("Publications by year:")
print(yearly_counts.sort_index())

# Visualize (if matplotlib is available)
try:
    import matplotlib.pyplot as plt
    yearly_counts.sort_index().plot(kind='bar', figsize=(10, 5))
    plt.title('ML Publications by Year')
    plt.xlabel('Year')
    plt.ylabel('Number of Articles')
    plt.tight_layout()
    plt.savefig('ml_publications_trend.png')
    print("Chart saved to ml_publications_trend.png")
except ImportError:
    print("Install matplotlib for visualization: pip install matplotlib")

Expected Output

After completing this practical work, you should have:

  • A working XML parser for the menu dataset
  • A DataFrame containing all ML article metadata
  • An Orange Data Table ready for visualization
  • A chart showing publication trends over time

Deliverables

  • Python Script: Complete XML processing script (.py file)
  • CSV Export: Extracted article data as CSV (see the one-liner after this list)
  • Visualization: Chart showing publication trends
  • Report: Answer these questions:
    1. How many articles are in the dataset?
    2. What is the date range of publications?
    3. Which year had the most publications?
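
For the CSV deliverable, pandas handles the export in one line (the filename is a suggestion):

# Export the full articles DataFrame built in Step 8
df.to_csv('ml_articles.csv', index=False)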

Bonus Challenges

  • Challenge 1: Extract additional fields like authors or abstract from the XML files
  • Challenge 2: Handle namespaces if the XML declares them (common in scientific data; see the sketch after this list)
  • Challenge 3: Create a function that can handle multiple XML schemas automatically
  • Challenge 4: Parse an RSS feed (which is XML) from a news site
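
A starting point for Challenge 2: lxml's xpath() accepts a namespaces mapping, and every namespaced element needs a prefix. The URI below is a placeholder — copy the real one from the xmlns declaration in your files:

# Hypothetical namespace URI -- replace with the xmlns value from your document
ns = {'a': 'http://example.com/article-schema'}

# All namespaced elements must carry the prefix, even inside predicates
titles = root.xpath('//a:article/a:title/text()', namespaces=ns)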

XPath Quick Reference

Expression                   Description
/root/child                  Select child elements of root
//element                    Select all matching elements anywhere in the document
./child                      Select a child relative to the current element
//element[@attr='value']     Select elements with a specific attribute value
//element/text()             Get the text content of elements
//element[position() < 5]    Select the first four matching elements
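
These all work against the menu document from the warm-up, for example:

# Position predicate: names of the first two menu items
first_two = root.xpath('//food[position() < 3]/name/text()')
print("First two items:", first_two)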
