Web Scraping with Python
Learn to extract data from websites using BeautifulSoup and lxml
Objectives
By the end of this practical work, you will be able to:
- Fetch web pages using the requests library
- Parse HTML content using BeautifulSoup
- Extract data using CSS selectors and XPath
- Handle pagination to scrape multiple pages
- Store scraped data in a structured format (CSV, Orange Table)
Prerequisites
- Python 3.8+ installed
- Basic understanding of HTML structure
- A code editor (VS Code, PyCharm, or Jupyter Notebook)
Install required packages:
pip install requests beautifulsoup4 lxml pandas
Instructions
Step 1: Fetch a Web Page
Start by fetching the books.toscrape.com homepage:
import requests

url = "http://books.toscrape.com"
response = requests.get(url)

# Check if request was successful
if response.status_code == 200:
    print("Success! Page fetched.")
    print(f"Content length: {len(response.content)} bytes")
else:
    print(f"Error: {response.status_code}")
Tip: Always check the status code before processing the response!
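A slightly more robust variant, sketched below, sets a timeout so a hung server cannot stall the script and uses raise_for_status() to turn HTTP error codes into exceptions (both are standard parts of the requests API):

import requests

url = "http://books.toscrape.com"
try:
    response = requests.get(url, timeout=10)  # fail if the server takes longer than 10 s
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    print("Success! Page fetched.")
except requests.RequestException as exc:
    print(f"Request failed: {exc}")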
Step 2: Parse HTML with BeautifulSoup
Create a BeautifulSoup object to parse the HTML:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")
# Print the page title
print(f"Page title: {soup.title.string}")
# Find all h3 elements (book titles are in h3 tags)
h3_tags = soup.find_all("h3")
print(f"Found {len(h3_tags)} h3 elements")
Step 3: Extract Book Data
Extract title and price for each book:
# Find all book containers
books = soup.select("article.product_pod")
print(f"Found {len(books)} books on this page")
# Extract data from each book
for book in books[:5]:  # First 5 books
    # Title is in the 'title' attribute of the anchor tag
    title = book.select_one("h3 a")["title"]
    # Price is in a paragraph with class 'price_color'
    price = book.select_one(".price_color").text
    # Rating is encoded in the class name
    rating_class = book.select_one(".star-rating")["class"][1]
    print(f"{title} - {price} - {rating_class} stars")
Step 4: Create a Data Collection Function
Organize your code into a reusable function:
def scrape_books(url):
    """Scrape all books from a single page."""
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    books_data = []
    # Convert rating word to number
    rating_map = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
    for book in soup.select("article.product_pod"):
        rating_class = book.select_one(".star-rating")["class"][1]
        books_data.append({
            "title": book.select_one("h3 a")["title"],
            "price": float(book.select_one(".price_color").text[1:]),  # Remove currency
            "rating": rating_map.get(rating_class, 0),
            "in_stock": "In stock" in book.select_one(".availability").text,
        })
    return books_data
# Test the function
books = scrape_books("http://books.toscrape.com")
print(f"Scraped {len(books)} books")
for book in books[:3]:
    print(book)
Step 5: Handle Pagination
Scrape multiple pages by following "next" links:
from urllib.parse import urljoin

def scrape_all_books(base_url, max_pages=5):
    """Scrape books from multiple pages."""
    all_books = []
    current_url = base_url
    for page in range(max_pages):
        print(f"Scraping page {page + 1}...")
        # Scrape books on the current page
        all_books.extend(scrape_books(current_url))
        # Fetch the page once more to locate the "next" link
        response = requests.get(current_url)
        soup = BeautifulSoup(response.content, "html.parser")
        next_link = soup.select_one("li.next a")
        if next_link:
            # Resolve the (usually relative) href against the current page's URL
            current_url = urljoin(current_url, next_link["href"])
        else:
            print("No more pages.")
            break
    return all_books
# Scrape first 3 pages
all_books = scrape_all_books("http://books.toscrape.com/catalogue/page-1.html", max_pages=3)
print(f"Total books scraped: {len(all_books)}")
Step 6: Save to CSV
Export the scraped data to a CSV file:
import pandas as pd
# Convert to DataFrame
df = pd.DataFrame(all_books)
# Save to CSV
df.to_csv("books_scraped.csv", index=False)
print(f"Saved {len(df)} books to books_scraped.csv")
# Display summary statistics
print("\nSummary:")
print(f"Average price: ${df['price'].mean():.2f}")
print(f"Average rating: {df['rating'].mean():.1f} stars")
print(f"Books in stock: {df['in_stock'].sum()} / {len(df)}")
Step 7: Integrate with Orange (Optional)
Load the scraped data into Orange for visualization:
from Orange.data import Table, Domain, StringVariable, ContinuousVariable, DiscreteVariable
# Define domain
domain = Domain(
    [ContinuousVariable("price"), ContinuousVariable("rating")],
    [DiscreteVariable("in_stock", values=["False", "True"])],
    [StringVariable("title")]
)
# Create Orange table
data = [[b["price"], b["rating"], str(b["in_stock"]), b["title"]] for b in all_books]
out_data = Table.from_list(domain, data)
Success! You can now use this data in Orange widgets like Scatter Plot, Distributions, etc.
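To keep the table around for later sessions, it can be written to Orange's native .tab format (a minimal sketch, assuming Orange's Table.save method; the filename is arbitrary):

out_data.save("books_scraped.tab")  # assumption: Table.save writes Orange's tab-separated format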
Expected Output
After completing this practical work, you should have:
- A working Python script that scrapes book data
- A CSV file with 60+ books (from 3 pages)
- Understanding of CSS selectors for data extraction
- Experience handling pagination in web scraping
Deliverables
- Python Script: Your complete scraping script (.py file)
- CSV File: The scraped book data
- Screenshot: Terminal output showing the scraping process
- Analysis: Answer these questions (a pandas sketch follows this list):
- What is the most expensive book?
- What is the average price of 5-star books?
- How many books are out of stock?
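A minimal pandas sketch for the three questions, reading back the CSV produced in Step 6 (column names match the scraping script above):

import pandas as pd

df = pd.read_csv("books_scraped.csv")

# 1. Most expensive book
print(df.loc[df["price"].idxmax(), ["title", "price"]])

# 2. Average price of 5-star books
print(df.loc[df["rating"] == 5, "price"].mean())

# 3. Books out of stock (in_stock was saved as True/False)
print((~df["in_stock"]).sum())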
Bonus Challenges
- Challenge 1: Scrape all 50 pages (1000 books) and analyze price distribution by rating
- Challenge 2: Use XPath with lxml instead of CSS selectors (a starter sketch follows this list)
- Challenge 3: Add error handling for network failures (retries, timeouts); a sketch also follows below
- Challenge 4: Scrape book detail pages to get the full description
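For Challenge 2, a starter sketch using lxml's XPath interface; the expressions below mirror the CSS selectors used earlier:

import requests
from lxml import html

response = requests.get("http://books.toscrape.com")
tree = html.fromstring(response.content)

# XPath equivalents of the CSS selectors used in Step 3
titles = tree.xpath('//article[@class="product_pod"]/h3/a/@title')
prices = tree.xpath('//article[@class="product_pod"]//p[@class="price_color"]/text()')

for title, price in zip(titles, prices):
    print(f"{title} - {price}")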
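For Challenge 3, a minimal sketch of a fetch helper with a timeout and exponential backoff (the helper name and retry counts are arbitrary choices, not part of any library):

import time
import requests

def fetch_with_retries(url, retries=3, timeout=10):
    """Fetch a URL, retrying on network errors with exponential backoff."""
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response
        except requests.RequestException as exc:
            print(f"Attempt {attempt + 1}/{retries} failed: {exc}")
            time.sleep(2 ** attempt)  # wait 1 s, 2 s, 4 s, ...
    raise RuntimeError(f"Could not fetch {url} after {retries} attempts")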