Practical Work 2

Web Scraping with Python

Learn to extract data from websites using BeautifulSoup and lxml

Duration: 2-3 hours
Difficulty: Beginner
Session: Web Scraping

Objectives

By the end of this practical work, you will be able to:

  • Fetch web pages using the requests library
  • Parse HTML content using BeautifulSoup
  • Extract data using CSS selectors and XPath
  • Handle pagination to scrape multiple pages
  • Store scraped data in a structured format (CSV, Orange Table)

Prerequisites

  • Python 3.8+ installed
  • Basic understanding of HTML structure
  • A code editor (VS Code, PyCharm, or Jupyter Notebook)

Install required packages:

pip install requests beautifulsoup4 lxml pandas
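
To confirm the installation worked, you can run a quick sanity check from Python (optional; the versions printed will differ on your machine):

# Import each package and print its version to confirm the install
import requests
import bs4
import pandas
from lxml import etree

print("requests:", requests.__version__)
print("beautifulsoup4:", bs4.__version__)
print("lxml:", etree.__version__)
print("pandas:", pandas.__version__)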

Instructions

Step 1: Fetch a Web Page

Start by fetching the books.toscrape.com homepage:

import requests

url = "http://books.toscrape.com"
response = requests.get(url)

# Check if request was successful
if response.status_code == 200:
    print("Success! Page fetched.")
    print(f"Content length: {len(response.content)} bytes")
else:
    print(f"Error: {response.status_code}")

Tip: Always check the status code before processing the response!
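
As an alternative to checking the code manually, requests can raise an exception for error responses. The variation below (optional, not required for the exercise) also passes a timeout so the request cannot hang indefinitely:

# Alternative: let requests raise an HTTPError for 4xx/5xx responses,
# and give up if the server does not answer within 10 seconds
response = requests.get(url, timeout=10)
response.raise_for_status()  # raises requests.HTTPError on a bad status code
print("Success! Page fetched.")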

Step 2: Parse HTML with BeautifulSoup

Create a BeautifulSoup object to parse the HTML:

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.content, "html.parser")

# Print the page title
print(f"Page title: {soup.title.string}")

# Find all h3 elements (book titles are in h3 tags)
h3_tags = soup.find_all("h3")
print(f"Found {len(h3_tags)} h3 elements")

Step 3: Extract Book Data

Extract title and price for each book:

# Find all book containers
books = soup.select("article.product_pod")
print(f"Found {len(books)} books on this page")

# Extract data from each book
for book in books[:5]:  # First 5 books
    # Title is in the 'title' attribute of the anchor tag
    title = book.select_one("h3 a")["title"]

    # Price is in a paragraph with class 'price_color'
    price = book.select_one(".price_color").text

    # Rating is encoded in the class name
    rating_class = book.select_one(".star-rating")["class"][1]

    print(f"{title} - {price} - {rating_class} stars")

Step 4: Create a Data Collection Function

Organize your code into a reusable function:

def scrape_books(url):
    """Scrape all books from a single page."""
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    books_data = []
    for book in soup.select("article.product_pod"):
        # Convert rating word to number
        rating_map = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
        rating_class = book.select_one(".star-rating")["class"][1]

        books_data.append({
            "title": book.select_one("h3 a")["title"],
            "price": float(book.select_one(".price_color").text[1:]),  # Remove currency
            "rating": rating_map.get(rating_class, 0),
            "in_stock": "In stock" in book.select_one(".availability").text
        })

    return books_data

# Test the function
books = scrape_books("http://books.toscrape.com")
print(f"Scraped {len(books)} books")
for book in books[:3]:
    print(book)

Step 5: Handle Pagination

Scrape multiple pages by following "next" links:

from urllib.parse import urljoin

def scrape_all_books(base_url, max_pages=5):
    """Scrape books from multiple pages."""
    all_books = []
    current_url = base_url

    for page in range(max_pages):
        print(f"Scraping page {page + 1}...")
        response = requests.get(current_url)
        soup = BeautifulSoup(response.content, "html.parser")

        # Scrape books on the current page (note: scrape_books fetches the
        # page a second time; fine for this exercise, but you could refactor
        # it to accept an already-parsed soup instead of a URL)
        all_books.extend(scrape_books(current_url))

        # Find the next page link and resolve it against the current page's URL
        next_link = soup.select_one("li.next a")
        if next_link:
            # urljoin handles relative links like "page-2.html" correctly
            current_url = urljoin(current_url, next_link["href"])
        else:
            print("No more pages.")
            break

    return all_books

# Scrape first 3 pages
all_books = scrape_all_books("http://books.toscrape.com/catalogue/page-1.html", max_pages=3)
print(f"Total books scraped: {len(all_books)}")

Step 6: Save to CSV

Export the scraped data to a CSV file:

import pandas as pd

# Convert to DataFrame
df = pd.DataFrame(all_books)

# Save to CSV
df.to_csv("books_scraped.csv", index=False)
print(f"Saved {len(df)} books to books_scraped.csv")

# Display summary statistics
print("\nSummary:")
print(f"Average price: ${df['price'].mean():.2f}")
print(f"Average rating: {df['rating'].mean():.1f} stars")
print(f"Books in stock: {df['in_stock'].sum()} / {len(df)}")

Step 7: Integrate with Orange (Optional)

Load the scraped data into Orange for visualization:

from Orange.data import Table, Domain, StringVariable, ContinuousVariable, DiscreteVariable

# Define domain
domain = Domain(
    [ContinuousVariable("price"), ContinuousVariable("rating")],
    [DiscreteVariable("in_stock", values=["False", "True"])],
    [StringVariable("title")]
)

# Create Orange table
data = [[b["price"], b["rating"], str(b["in_stock"]), b["title"]] for b in all_books]
out_data = Table.from_list(domain, data)

Success! You can now use this data in Orange widgets like Scatter Plot, Distributions, etc.
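
If you would rather explore the data in the Orange GUI, one option is to write the table to disk and open it with the File widget (a minimal sketch; the filename is only an example, and it assumes your Orange version provides Table.save):

# Save the table in Orange's native tab-separated format
out_data.save("books_scraped.tab")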

Expected Output

After completing this practical work, you should have:

  • A working Python script that scrapes book data
  • A CSV file with 60 books (20 per page × 3 pages)
  • Understanding of CSS selectors for data extraction
  • Experience handling pagination in web scraping

Deliverables

  • Python Script: Your complete scraping script (.py file)
  • CSV File: The scraped book data
  • Screenshot: Terminal output showing the scraping process
  • Analysis: Answer these questions (a starter sketch follows this list):
    1. What is the most expensive book?
    2. What is the average price of 5-star books?
    3. How many books are out of stock?
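
One possible starting point for these questions, reusing the df DataFrame from Step 6 (a sketch, not a model answer):

# 1. Most expensive book
print(df.loc[df["price"].idxmax(), ["title", "price"]])

# 2. Average price of books rated 5 stars
print(df.loc[df["rating"] == 5, "price"].mean())

# 3. Number of books that are out of stock
print((~df["in_stock"]).sum())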

Bonus Challenges

  • Challenge 1: Scrape all 50 pages (1000 books) and analyze price distribution by rating
  • Challenge 2: Use XPath with lxml instead of CSS selectors (a starter sketch follows this list)
  • Challenge 3: Add error handling for network failures (retries, timeouts); see the sketch after this list
  • Challenge 4: Scrape book detail pages to get the full description
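
A possible starting point for Challenge 2, using lxml's XPath support (a sketch; the XPath expressions mirror the CSS selectors from Step 3):

import requests
from lxml import html

response = requests.get("http://books.toscrape.com", timeout=10)
tree = html.fromstring(response.content)

# XPath equivalents of the CSS selectors used in Step 3
titles = tree.xpath('//article[@class="product_pod"]/h3/a/@title')
prices = tree.xpath('//article[@class="product_pod"]//p[@class="price_color"]/text()')

for title, price in zip(titles[:5], prices[:5]):
    print(f"{title} - {price}")

And one minimal shape for Challenge 3 (illustrative only; the retry count and delay are arbitrary choices):

import time

def fetch_with_retries(url, retries=3, timeout=10):
    """Fetch a URL, retrying a few times on network errors."""
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response
        except requests.RequestException as exc:
            print(f"Attempt {attempt + 1} failed: {exc}")
            time.sleep(2)  # wait a little before retrying
    raise RuntimeError(f"Could not fetch {url} after {retries} attempts")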

Resources