XML Data Processing
Parse and transform XML documents into structured data using Python and XPath
Objectives
By the end of this practical work, you will be able to:
- Parse XML documents using Python's lxml library
- Navigate XML tree structures using XPath expressions
- Extract data from complex nested XML elements
- Transform XML data into pandas DataFrames
- Integrate XML data with Orange Data Mining
Prerequisites
- Python 3.8+ installed
- Basic understanding of XML structure
- Orange Data Mining (optional, for visualization)
Install required packages:
pip install lxml pandas requests
Datasets
We'll work with two XML datasets:
- menu.xml - A simple restaurant menu (warm-up exercise)
- ml.zip - Machine Learning article metadata (main exercise)
Note: Download the files before starting the exercises.
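If the files are hosted online, a minimal download sketch with requests might look like this (the URLs are placeholders; substitute the locations provided with the course):
import requests

# Placeholder URLs -- replace with the actual download locations
urls = ['https://example.org/menu.xml', 'https://example.org/ml.zip']
for url in urls:
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Fail loudly if the download did not succeed
    filename = url.rsplit('/', 1)[-1]
    with open(filename, 'wb') as f:
        f.write(response.content)
    print(f"Saved {filename} ({len(response.content)} bytes)")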
Instructions
Step 1: Parse a Simple XML File
Start with the restaurant menu XML file:
from lxml import etree
# Parse the XML file
tree = etree.parse('menu.xml')
root = tree.getroot()
# Print basic info
print(f"Root element: {root.tag}")
print(f"Number of children: {len(root)}")
# List all child element tags
for child in root:
    print(f" - {child.tag}")
Expected Output: You should see the root element's name, the number of children, and the tag of each child element.
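The XPath expressions in the following steps assume a layout like the one below (adapted from the well-known W3Schools food menu example; if your copy of menu.xml differs, adjust the element names accordingly):
<breakfast_menu>
    <food>
        <name>Belgian Waffles</name>
        <price>$5.95</price>
        <description>Two of our famous Belgian Waffles with real maple syrup</description>
        <calories>650</calories>
    </food>
    <!-- ... more <food> elements ... -->
</breakfast_menu>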
Step 2: Navigate with XPath
Use XPath to extract specific elements:
# Get all food names
food_names = root.xpath('//food/name/text()')
print("Menu items:")
for name in food_names:
    print(f" - {name}")
# Get all prices
prices = root.xpath('//food/price/text()')
print("\nPrices:", prices)
# Get food items with calories > 500
high_cal_foods = root.xpath('//food[calories > 500]/name/text()')
print("\nHigh calorie items (>500):", high_cal_foods)
Step 3: Extract Data to a Dictionary
Create a structured data extraction function:
def extract_menu_data(xml_path):
    """Extract all menu items as a list of dictionaries."""
    tree = etree.parse(xml_path)
    root = tree.getroot()
    menu_items = []
    for food in root.xpath('//food'):
        item = {
            'name': food.findtext('name'),
            'price': food.findtext('price'),
            'description': food.findtext('description'),
            'calories': int(food.findtext('calories', default='0'))
        }
        menu_items.append(item)
    return menu_items
# Test the function
menu = extract_menu_data('menu.xml')
for item in menu:
    print(f"{item['name']}: {item['price']} ({item['calories']} cal)")
Step 4: Convert to DataFrame
Transform the extracted data into a pandas DataFrame:
import pandas as pd
# Convert menu data to DataFrame
df = pd.DataFrame(extract_menu_data('menu.xml'))
# Clean the price column (remove the literal $ and convert to float)
# regex=False is needed because $ is a regex metacharacter
df['price_numeric'] = df['price'].str.replace('$', '', regex=False).astype(float)
# Display the DataFrame
print(df)
# Basic analysis
print(f"\nMost expensive item: {df.loc[df['price_numeric'].idxmax(), 'name']}")
print(f"Average calories: {df['calories'].mean():.0f}")
print(f"Total items: {len(df)}")
Step 5: Process Multiple XML Files
Now work with the ML articles dataset. First, extract the zip file, then process multiple XML files:
import zipfile
from pathlib import Path
# Extract the zip file
with zipfile.ZipFile('ml.zip', 'r') as zip_ref:
    zip_ref.extractall('ml_articles')
# Find all XML files
xml_files = list(Path('ml_articles').glob('**/*.xml'))
print(f"Found {len(xml_files)} XML files")
# Function to extract article metadata
def extract_article_data(xml_path):
    """Extract metadata from a single article XML."""
    try:
        tree = etree.parse(str(xml_path))
        root = tree.getroot()
        # Adjust these XPaths based on the actual XML structure;
        # .//title etc. are common patterns for article metadata
        return {
            'title': root.findtext('.//title', default='Unknown'),
            'year': root.findtext('.//year', default='Unknown'),
            'month': root.findtext('.//month', default='Unknown'),
            'source_file': xml_path.name,
            'topic': 'ML'  # We know all articles are ML-related
        }
    except Exception as e:
        print(f"Error parsing {xml_path}: {e}")
        return None
# Process all files
articles = []
for xml_file in xml_files[:10]:  # Start with the first 10
    data = extract_article_data(xml_file)
    if data:
        articles.append(data)
# Create DataFrame
articles_df = pd.DataFrame(articles)
print(articles_df)
Warning: XML structures vary. You may need to inspect the actual XML files and adjust the XPath expressions accordingly.
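Once the extraction looks right, this is also a natural point to write the CSV required in the deliverables (re-run after processing all files in Step 8):
# Export the extracted metadata as CSV (a deliverable)
articles_df.to_csv('ml_articles.csv', index=False)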
Step 6: Inspect XML Structure
When dealing with unknown XML structures, use this helper function:
def describe_xml_structure(xml_path, max_depth=3):
    """Print the structure of an XML file."""
    tree = etree.parse(str(xml_path))
    root = tree.getroot()

    def print_element(elem, depth=0):
        if depth > max_depth:
            return
        indent = " " * depth
        attrs = ' '.join(f'{k}="{v}"' for k, v in elem.attrib.items())
        text = elem.text.strip()[:50] if elem.text and elem.text.strip() else ''
        print(f"{indent}<{elem.tag}> {attrs}")
        if text:
            print(f"{indent} TEXT: {text}...")
        for child in elem:
            print_element(child, depth + 1)

    print_element(root)
# Inspect the first XML file
if xml_files:
    print("Structure of first XML file:")
    describe_xml_structure(xml_files[0])
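As a quicker, unstructured alternative, lxml can serialize an element back to indented text:
# One-off look at the raw XML, nicely indented (first 1000 characters)
if xml_files:
    first = etree.parse(str(xml_files[0]))
    print(etree.tostring(first.getroot(), pretty_print=True, encoding='unicode')[:1000])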
Step 7: Integrate with Orange
Create an Orange Data Table from the extracted XML data:
from Orange.data import Table, Domain, StringVariable, ContinuousVariable, DiscreteVariable
# Define the domain based on our data
domain = Domain(
    [],  # No continuous features in this example
    [DiscreteVariable("topic", values=["ML"])],  # Target (class variable)
    [StringVariable("title"),
     StringVariable("year"),
     StringVariable("month"),
     StringVariable("source_file")]  # Meta variables
)
# Convert the DataFrame to a list of rows; the column order must match
# the domain order: attributes, then class variables, then metas
data = articles_df[['topic', 'title', 'year', 'month', 'source_file']].values.tolist()
# Create Orange table
out_data = Table.from_list(domain, data)
print(f"Created Orange table with {len(out_data)} rows")
Step 8: Analyze Publication Trends
Once you have all articles processed, analyze publication trends:
# Process ALL XML files
all_articles = []
for xml_file in xml_files:
    data = extract_article_data(xml_file)
    if data:
        all_articles.append(data)
df = pd.DataFrame(all_articles)
# Convert year to numeric
df['year_num'] = pd.to_numeric(df['year'], errors='coerce')
# Group by year
yearly_counts = df.groupby('year_num').size()
print("Publications by year:")
print(yearly_counts.sort_index())
# Visualize (if matplotlib is available)
try:
    import matplotlib.pyplot as plt
    yearly_counts.sort_index().plot(kind='bar', figsize=(10, 5))
    plt.title('ML Publications by Year')
    plt.xlabel('Year')
    plt.ylabel('Number of Articles')
    plt.tight_layout()
    plt.savefig('ml_publications_trend.png')
    print("Chart saved to ml_publications_trend.png")
except ImportError:
    print("Install matplotlib for visualization: pip install matplotlib")
Expected Output
After completing this practical work, you should have:
- A working XML parser for the menu dataset
- A DataFrame containing all ML article metadata
- An Orange Data Table ready for visualization
- A chart showing publication trends over time
Deliverables
- Python Script: Complete XML processing script (.py file)
- CSV Export: Extracted article data as CSV
- Visualization: Chart showing publication trends
- Report: Answer these questions:
  - How many articles are in the dataset?
  - What is the date range of publications?
  - Which year had the most publications?
Bonus Challenges
- Challenge 1: Extract additional fields like authors or abstract from the XML files
- Challenge 2: Use namespaces if the XML uses them (common in scientific data); see the sketch after this list
- Challenge 3: Create a function that can handle multiple XML schemas automatically
- Challenge 4: Parse an RSS feed (which is XML) from a news site
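For Challenge 2, note that once a document declares a default namespace, bare tag names in XPath no longer match; you must bind a prefix of your choice to the namespace URI (the URI here is a placeholder):
# Hypothetical namespaced document -- the URI is just an example
doc = etree.fromstring(
    '<article xmlns="http://example.org/articles">'
    '<title>Namespaced title</title>'
    '</article>'
)
print(doc.xpath('//title/text()'))  # [] -- bare name does not match
ns = {'a': 'http://example.org/articles'}
print(doc.xpath('//a:title/text()', namespaces=ns))  # ['Namespaced title']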
XPath Quick Reference
| Expression | Description |
|---|---|
| `/root/child` | Select child elements of root |
| `//element` | Select all matching elements anywhere in the document |
| `./child` | Select child relative to the current element |
| `//element[@attr='value']` | Select elements with a specific attribute value |
| `//element/text()` | Get the text content of elements |
| `//element[position() < 5]` | Select the first four matching elements |