XML Data Processing
Parse and transform XML documents into structured data using Python and XPath
Objectives
By the end of this practical work, you will be able to:
- Parse XML documents using Python's lxml library
- Navigate XML tree structures using XPath expressions
- Extract data from complex nested XML elements
- Transform XML data into pandas DataFrames
- Integrate XML data with Orange Data Mining
Prerequisites
- Python 3.8+ installed
- Basic understanding of XML structure
- Orange Data Mining (optional, for visualization)
Install required packages:
pip install lxml pandas requests
Datasets
We'll work with two XML datasets:
- menu.xml - A simple restaurant menu (warm-up exercise)
- ml.zip - Machine Learning article metadata (main exercise)
Note: Download the files before starting the exercises.
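If the files are hosted online, a minimal download sketch with requests might look like this (the URLs are placeholders; substitute the locations provided with the course):
import requests

# Placeholder URLs -- replace with the actual download locations
urls = ['https://example.org/menu.xml', 'https://example.org/ml.zip']
for url in urls:
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Fail loudly if the download did not succeed
    filename = url.rsplit('/', 1)[-1]
    with open(filename, 'wb') as f:
        f.write(response.content)
    print(f"Saved {filename} ({len(response.content)} bytes)")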
Instructions
Step 1: Parse a Simple XML File
Start with the restaurant menu XML file:
from lxml import etree
# Parse the XML file
tree = etree.parse('menu.xml')
root = tree.getroot()
# Print basic info
print(f"Root element: {root.tag}")
print(f"Number of children: {len(root)}")
# List all child element tags
for child in root:
    print(f" - {child.tag}")
Expected Output: You should see the root element's name, the number of children, and the tag of each child element.
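The XPath expressions in the following steps assume a layout like the one below (adapted from the well-known W3Schools food menu example; if your copy of menu.xml differs, adjust the element names accordingly):
<breakfast_menu>
    <food>
        <name>Belgian Waffles</name>
        <price>$5.95</price>
        <description>Two of our famous Belgian Waffles with real maple syrup</description>
        <calories>650</calories>
    </food>
    <!-- ... more <food> elements ... -->
</breakfast_menu>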
Step 2: Navigate with XPath
Use XPath to extract specific elements:
# Get all food names
food_names = root.xpath('//food/name/text()')
print("Menu items:")
for name in food_names:
    print(f" - {name}")
# Get all prices
prices = root.xpath('//food/price/text()')
print("\nPrices:", prices)
# Get food items with calories > 500
high_cal_foods = root.xpath('//food[calories > 500]/name/text()')
print("\nHigh calorie items (>500):", high_cal_foods)
Step 3: Extract Data to a Dictionary
Create a structured data extraction function:
def extract_menu_data(xml_path):
    """Extract all menu items as a list of dictionaries."""
    tree = etree.parse(xml_path)
    root = tree.getroot()
    menu_items = []
    for food in root.xpath('//food'):
        item = {
            'name': food.findtext('name'),
            'price': food.findtext('price'),
            'description': food.findtext('description'),
            'calories': int(food.findtext('calories', default='0'))
        }
        menu_items.append(item)
    return menu_items
# Test the function
menu = extract_menu_data('menu.xml')
for item in menu:
    print(f"{item['name']}: {item['price']} ({item['calories']} cal)")
Step 4: Convert to DataFrame
Transform the extracted data into a pandas DataFrame:
import pandas as pd
# Convert menu data to DataFrame
df = pd.DataFrame(extract_menu_data('menu.xml'))
# Clean the price column (remove the literal $ and convert to float)
# regex=False is needed because $ is a regex metacharacter
df['price_numeric'] = df['price'].str.replace('$', '', regex=False).astype(float)
# Display the DataFrame
print(df)
# Basic analysis
print(f"\nMost expensive item: {df.loc[df['price_numeric'].idxmax(), 'name']}")
print(f"Average calories: {df['calories'].mean():.0f}")
print(f"Total items: {len(df)}")
Step 5: Process Multiple XML Files
Now work with the ML articles dataset. First, extract the zip file, then process multiple XML files:
import zipfile
from pathlib import Path
# Extract the zip file
with zipfile.ZipFile('ml.zip', 'r') as zip_ref:
    zip_ref.extractall('ml_articles')
# Find all XML files
xml_files = list(Path('ml_articles').glob('**/*.xml'))
print(f"Found {len(xml_files)} XML files")
# Function to extract article metadata
def extract_article_data(xml_path):
    """Extract metadata from a single article XML."""
    try:
        tree = etree.parse(str(xml_path))
        root = tree.getroot()
        # Adjust these XPaths based on the actual XML structure;
        # .//title etc. are common patterns for article metadata
        return {
            'title': root.findtext('.//title', default='Unknown'),
            'year': root.findtext('.//year', default='Unknown'),
            'month': root.findtext('.//month', default='Unknown'),
            'source_file': xml_path.name,
            'topic': 'ML'  # We know all articles are ML-related
        }
    except Exception as e:
        print(f"Error parsing {xml_path}: {e}")
        return None
# Process all files
articles = []
for xml_file in xml_files[:10]:  # Start with the first 10
    data = extract_article_data(xml_file)
    if data:
        articles.append(data)
# Create DataFrame
articles_df = pd.DataFrame(articles)
print(articles_df)
Warning: XML structures vary. You may need to inspect the actual XML files and adjust the XPath expressions accordingly.
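Once the extraction looks right, this is also a natural point to write the CSV required in the deliverables (re-run after processing all files in Step 8):
# Export the extracted metadata as CSV (a deliverable)
articles_df.to_csv('ml_articles.csv', index=False)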
Step 6: Inspect XML Structure
When dealing with unknown XML structures, use this helper function:
def describe_xml_structure(xml_path, max_depth=3):
    """Print the structure of an XML file."""
    tree = etree.parse(str(xml_path))
    root = tree.getroot()

    def print_element(elem, depth=0):
        if depth > max_depth:
            return
        indent = " " * depth
        attrs = ' '.join(f'{k}="{v}"' for k, v in elem.attrib.items())
        text = elem.text.strip()[:50] if elem.text and elem.text.strip() else ''
        print(f"{indent}<{elem.tag}> {attrs}")
        if text:
            print(f"{indent} TEXT: {text}...")
        for child in elem:
            print_element(child, depth + 1)

    print_element(root)
# Inspect the first XML file
if xml_files:
    print("Structure of first XML file:")
    describe_xml_structure(xml_files[0])
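As a quicker, unstructured alternative, lxml can serialize an element back to indented text:
# One-off look at the raw XML, nicely indented (first 1000 characters)
if xml_files:
    first = etree.parse(str(xml_files[0]))
    print(etree.tostring(first.getroot(), pretty_print=True, encoding='unicode')[:1000])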
Step 7: Integrate with Orange
Create an Orange Data Table from the extracted XML data:
from Orange.data import Table, Domain, StringVariable, ContinuousVariable, DiscreteVariable
# Define the domain based on our data
domain = Domain(
    [],  # No continuous features in this example
    [DiscreteVariable("topic", values=["ML"])],  # Target (class variable)
    [StringVariable("title"),
     StringVariable("year"),
     StringVariable("month"),
     StringVariable("source_file")]  # Meta variables
)
# Convert the DataFrame to a list of rows; the column order must match
# the domain order: attributes, then class variables, then metas
data = articles_df[['topic', 'title', 'year', 'month', 'source_file']].values.tolist()
# Create Orange table
out_data = Table.from_list(domain, data)
print(f"Created Orange table with {len(out_data)} rows")
Step 8: Analyze Publication Trends
Once you have all articles processed, analyze publication trends:
# Process ALL XML files
all_articles = []
for xml_file in xml_files:
    data = extract_article_data(xml_file)
    if data:
        all_articles.append(data)
df = pd.DataFrame(all_articles)
# Convert year to numeric
df['year_num'] = pd.to_numeric(df['year'], errors='coerce')
# Group by year
yearly_counts = df.groupby('year_num').size()
print("Publications by year:")
print(yearly_counts.sort_index())
# Visualize (if matplotlib is available)
try:
    import matplotlib.pyplot as plt
    yearly_counts.sort_index().plot(kind='bar', figsize=(10, 5))
    plt.title('ML Publications by Year')
    plt.xlabel('Year')
    plt.ylabel('Number of Articles')
    plt.tight_layout()
    plt.savefig('ml_publications_trend.png')
    print("Chart saved to ml_publications_trend.png")
except ImportError:
    print("Install matplotlib for visualization: pip install matplotlib")
Expected Output
After completing this practical work, you should have:
- A working XML parser for the menu dataset
- A DataFrame containing all ML article metadata
- An Orange Data Table ready for visualization
- A chart showing publication trends over time
Deliverables
- Python Script: Complete XML processing script (.py file)
- CSV Export: Extracted article data as CSV
- Visualization: Chart showing publication trends
- Report: Answer these questions:
  - How many articles are in the dataset?
  - What is the date range of publications?
  - Which year had the most publications?
Bonus Challenges
- Challenge 1: Extract additional fields like authors or abstract from the XML files
- Challenge 2: Use namespaces if the XML uses them (common in scientific data); see the sketch after this list
- Challenge 3: Create a function that can handle multiple XML schemas automatically
- Challenge 4: Parse an RSS feed (which is XML) from a news site
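For Challenge 2, note that once a document declares a default namespace, bare tag names in XPath no longer match; you must bind a prefix of your choice to the namespace URI (the URI here is a placeholder):
# Hypothetical namespaced document -- the URI is just an example
doc = etree.fromstring(
    '<article xmlns="http://example.org/articles">'
    '<title>Namespaced title</title>'
    '</article>'
)
print(doc.xpath('//title/text()'))  # [] -- bare name does not match
ns = {'a': 'http://example.org/articles'}
print(doc.xpath('//a:title/text()', namespaces=ns))  # ['Namespaced title']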
XPath Quick Reference
| Expression | Description |
|---|---|
| `/root/child` | Select child elements of root |
| `//element` | Select all matching elements anywhere in the document |
| `./child` | Select child relative to the current element |
| `//element[@attr='value']` | Select elements with a specific attribute value |
| `//element/text()` | Get the text content of elements |
| `//element[position() < 5]` | Select the first four matching elements |