Web Scraping & HTML Parsing
Session 5
Extracting structured data from web pages using Python, BeautifulSoup, and XPath
2026 WayUp
Extracting data from regular web pages
Web Scraping is a technique for fetching and extracting data from web pages designed for human viewing, not machine consumption
Unlike APIs, web pages lack a consistent structure for data representation
Pages can change layout at any time, breaking your scraper
Use scraping only when no API is available
From URL to DataFrame
Each step requires error handling—pages may be unavailable, structure may change, or elements may be missing
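A defensive version of the fetch-and-parse step, as a sketch (the timeout value and the fallback behaviour are illustrative choices, not requirements):
import requests
from bs4 import BeautifulSoup

try:
    response = requests.get("http://books.toscrape.com", timeout=10)
    response.raise_for_status()            # unavailable page -> HTTPError
except requests.RequestException as e:
    raise SystemExit(f"Fetch failed: {e}")

soup = BeautifulSoup(response.content, "html.parser")
element = soup.select_one(".price_color")  # structure may have changed
if element is None:
    print("Selector no longer matches; the page layout may have changed")
else:
    print(element.text)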
HyperText Markup Language
Browsers are highly tolerant of errors in HTML
Without this tolerance, most websites wouldn't display properly
This makes HTML easier to write but harder to parse programmatically
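A small demonstration of that tolerance: HTML parsers repair malformed markup instead of rejecting it (the broken snippet below is made up for illustration):
from bs4 import BeautifulSoup

# Unclosed <b> and <p> tags: a browser would still render this happily
broken = "<p>First <b>bold <p>Second paragraph"
soup = BeautifulSoup(broken, "html.parser")
print(soup.prettify())  # the parser closes tags and yields a well-formed tree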
Opening and closing tags
<article class="product_pod">
  <h3>
    <a href="catalogue/book_1.html" title="A Light in the Attic">
      A Light in...
    </a>
  </h3>
  <p class="price_color">£51.77</p>
  <p class="star-rating Three">
    <i class="icon-star"></i>
  </p>
</article>
Tag       | Meaning
<article> | Container element
<h3>      | Heading level 3
<p>       | Paragraph
The same element can be selected with the CSS selector .price_color or the XPath //p[@class='price_color']
What you see is not what you get
What you see in browser DevTools is the rendered DOM (Document Object Model), not the raw HTML
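A quick sanity check before writing a scraper: search the raw HTML for a value you can see in DevTools; if it is absent, the content is rendered by JavaScript (the URL and search string here are illustrative):
import requests

raw = requests.get("http://books.toscrape.com").text
# Visible in DevTools AND present in the raw HTML -> server-rendered
print("price_color" in raw)  # True here; False would point to JS rendering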
A safe scraping sandbox
books.toscrape.com - A fictional bookstore designed specifically for scraping practice
HTML structure won't change unexpectedly
All content loads in initial HTML
1000 books with titles, prices, ratings
No robots.txt restrictions
Python's most popular HTML parsing library
User-friendly library for parsing HTML and navigating the document tree
Installation: pip install beautifulsoup4 requests
import requests
from bs4 import BeautifulSoup
# Fetch the web page
url = "http://books.toscrape.com"
response = requests.get(url)
# Parse HTML with BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")
# Find all book elements using CSS selectors
books = soup.select("article.product_pod")
# Extract data from each book
for book in books[:5]:  # First 5 books
    title = book.select_one("h3 a")["title"]
    price = book.select_one(".price_color").text
    print(f"{title}: {price}")
More powerful but steeper learning curve
from lxml import html
import requests
response = requests.get("http://books.toscrape.com")
tree = html.fromstring(response.content)
# XPath to select all book titles and prices
titles = tree.xpath("//article[@class='product_pod']//h3/a/@title")
prices = tree.xpath("//p[@class='price_color']/text()")
# Combine results
for title, price in zip(titles[:5], prices[:5]):
    print(f"{title}: {price}")
Sometimes manual is faster than automated
Extract the Human Development Index table from Wikipedia as fast as possible
URL: Wikipedia HDI Page
Open the Wikipedia page
Find the HDI table in the page
Use ANY technique: copy-paste, scraping, browser tools
Produce a CSV file
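One fast route for table extraction, sketched with pandas (pandas needs lxml or html5lib installed for this, and which table index holds the HDI ranking is an assumption; inspect the returned list to find it):
import pandas as pd

# read_html parses every <table> on the page into a DataFrame
url = "https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index"
tables = pd.read_html(url)

# The HDI ranking is one of these tables; index 1 is a guess, so inspect first
hdi = tables[1]
hdi.to_csv("hdi.csv", index=False)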
When simple requests aren't enough
Many modern websites load content dynamically using JavaScript after the initial page load
Static content works with: requests + BeautifulSoup/lxml
JavaScript-rendered content requires: a headless browser (Selenium, Playwright)
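A minimal headless-browser sketch using Playwright's sync API (example.com is a placeholder; pip install playwright and playwright install chromium are assumed):
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")         # placeholder URL
    page.wait_for_load_state("networkidle")  # let JavaScript finish loading
    html = page.content()                    # the rendered DOM, not raw HTML
    browser.close()

# From here, parse the rendered HTML exactly as before
soup = BeautifulSoup(html, "html.parser")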
Cost-benefit analysis
When you cannot do otherwise: scraping should be your last resort, not your first choice
Always look for official data access first
Is automation worth the development time?
Small datasets might be faster to copy-paste
Budget time for fixing broken scrapers
The classic XKCD truth about automation
From web pages to data tables
Use the Python Script widget to scrape and convert to Orange tables
import requests
from bs4 import BeautifulSoup
from Orange.data import Table, Domain, StringVariable, ContinuousVariable
# Scrape book data
response = requests.get("http://books.toscrape.com")
soup = BeautifulSoup(response.content, "html.parser")
# Extract data from each book
data = []
for book in soup.select("article.product_pod"):
    title = book.select_one("h3 a")["title"]
    # Remove £ symbol and convert to float
    price = float(book.select_one(".price_color").text[1:])
    data.append([price, title])
# Create Orange table: StringVariable is not primitive, so it goes in metas
domain = Domain([ContinuousVariable("price")], metas=[StringVariable("title")])
out_data = Table.from_list(domain, data)
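In the Python Script widget, whatever is assigned to out_data becomes the widget's output, so connected widgets (Data Table, Distributions, ...) receive the scraped books as a regular Orange table.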
Multi-page scraping challenge
Scrape books from multiple pages and analyze patterns
Scrape first 3 pages of books.toscrape.com
Title, price, star rating (1-5)
Create table with all extracted data
Price distribution, rating vs price
Handle pagination automatically by following "next" links programmatically
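A minimal pagination sketch (the selectors match books.toscrape.com's markup; the word-to-number rating map and the 3-page cap mirror the challenge above):
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

RATINGS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

url = "http://books.toscrape.com/catalogue/page-1.html"
rows = []
for _ in range(3):  # first 3 pages
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for book in soup.select("article.product_pod"):
        title = book.select_one("h3 a")["title"]
        price = float(book.select_one(".price_color").text[1:])
        # class list looks like ["star-rating", "Three"]; map word to number
        rating = RATINGS[book.select_one("p.star-rating")["class"][1]]
        rows.append((title, price, rating))
    next_link = soup.select_one("li.next a")  # the "next" pagination link
    if next_link is None:                     # last page reached
        break
    url = urljoin(url, next_link["href"])     # resolve relative href

print(len(rows), "books scraped")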
Structured data sources
REST APIs and XML files are almost always better than web scraping when available
1. Last resort only. Web scraping is fragile and high-maintenance. Always check for APIs or data exports first.
2. DOM ≠ HTML source. JavaScript-rendered content requires headless browsing, not simple HTTP requests.
3. Cost-benefit matters. Small datasets might be faster to extract manually than to automate.
4. BeautifulSoup or lxml. Choose based on familiarity: CSS selectors (BS4) or XPath (lxml).