Dataset Analysis & Preparation
Learn to analyze, clean, and prepare image datasets for computer vision tasks
Objectives
By the end of this practical work, you will be able to:
- Load and explore a real-world image dataset
- Generate statistical visualizations to understand data distribution
- Identify and address data quality issues
- Create stratified train/validation/test splits
Prerequisites
- A Kaggle account (free registration)
- Python 3.8+ installed
- Jupyter Notebook or JupyterLab
Install required packages:
pip install pandas numpy seaborn scikit-learn Pillow imagehash matplotlib
The blur-detection bonus challenge additionally requires opencv-python.
Instructions
Step 1: Download the Dogs vs Cats Dataset
Go to the Dogs vs Cats competition page on Kaggle and download the dataset. Extract the files to a local directory.
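If you downloaded the archive manually, the standard library is enough to unpack it. Here is a minimal sketch, assuming the competition archive is named dogs-vs-cats.zip and contains a nested train.zip (the exact file names on your machine may differ):
import zipfile
from pathlib import Path

ARCHIVE = Path("./dogs-vs-cats.zip")   # hypothetical download location
EXTRACT_TO = Path("./dogs-vs-cats")

# Unpack the outer competition archive
with zipfile.ZipFile(ARCHIVE) as zf:
    zf.extractall(EXTRACT_TO)

# The training images usually ship as a nested train.zip inside the archive
inner = EXTRACT_TO / "train.zip"
if inner.exists():
    with zipfile.ZipFile(inner) as zf:
        zf.extractall(EXTRACT_TO)
With the images extracted, set up the paths and verify the folder: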
import os
from pathlib import Path
# Define paths # (#1:Set up dataset directory)
DATASET_DIR = Path("./dogs-vs-cats")
TRAIN_DIR = DATASET_DIR / "train"
# Verify dataset exists
print(f"Dataset directory exists: {TRAIN_DIR.exists()}")
print(f"Number of images: {len(list(TRAIN_DIR.glob('*.jpg')))}")
Note: The Dogs vs Cats dataset contains 25,000 labeled images (12,500 dogs and 12,500 cats). You may need to accept the competition rules on Kaggle before downloading.
Step 2: Load and Explore Dataset Structure
Create a DataFrame to organize the dataset metadata:
import pandas as pd
from PIL import Image
def load_dataset_info(train_dir): # (#1:Function to extract image metadata)
    """Scan a directory of JPEGs and collect per-image metadata."""
    data = []
    for img_path in train_dir.glob("*.jpg"):
        filename = img_path.name
        label = filename.split(".")[0] # (#2:Extract label from filename)
        try:
            with Image.open(img_path) as img:
                width, height = img.size
                mode = img.mode
            data.append({
                "filename": filename,
                "path": str(img_path),
                "label": label,
                "width": width,
                "height": height,
                "aspect_ratio": width / height, # (#3:Calculate aspect ratio)
                "mode": mode
            })
        except Exception as e:
            print(f"Error loading {filename}: {e}")
    return pd.DataFrame(data)
df = load_dataset_info(TRAIN_DIR)
print(df.head())
print(f"\nDataset shape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")
Step 3: Analyze Class Distribution
Create a bar chart to visualize the class distribution:
import matplotlib.pyplot as plt
import seaborn as sns
# Set style
sns.set_style("whitegrid")
plt.figure(figsize=(8, 6))
# Create bar chart # (#1:Visualize class balance)
class_counts = df["label"].value_counts()
ax = sns.barplot(x=class_counts.index, y=class_counts.values, palette="viridis")
# Add value labels on bars # (#2:Annotate bars with counts)
for i, v in enumerate(class_counts.values):
    ax.text(i, v + 100, str(v), ha='center', fontweight='bold')
plt.title("Class Distribution in Dogs vs Cats Dataset", fontsize=14)
plt.xlabel("Class")
plt.ylabel("Number of Images")
plt.tight_layout()
plt.savefig("class_distribution.png", dpi=150)
plt.show()
# Print balance ratio # (#3:Check for class imbalance)
print(f"\nClass balance ratio: {class_counts.min() / class_counts.max():.2%}")
Step 4: Plot Dimension Histograms
Analyze the distribution of image dimensions:
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Width distribution # (#1:Histogram for image widths)
axes[0].hist(df["width"], bins=50, color="steelblue", edgecolor="black", alpha=0.7)
axes[0].set_title("Width Distribution")
axes[0].set_xlabel("Width (pixels)")
axes[0].set_ylabel("Frequency")
axes[0].axvline(df["width"].median(), color="red", linestyle="--", label=f'Median: {df["width"].median():.0f}')
axes[0].legend()
# Height distribution # (#2:Histogram for image heights)
axes[1].hist(df["height"], bins=50, color="seagreen", edgecolor="black", alpha=0.7)
axes[1].set_title("Height Distribution")
axes[1].set_xlabel("Height (pixels)")
axes[1].set_ylabel("Frequency")
axes[1].axvline(df["height"].median(), color="red", linestyle="--", label=f'Median: {df["height"].median():.0f}')
axes[1].legend()
# Aspect ratio distribution # (#3:Histogram for aspect ratios)
axes[2].hist(df["aspect_ratio"], bins=50, color="coral", edgecolor="black", alpha=0.7)
axes[2].set_title("Aspect Ratio Distribution")
axes[2].set_xlabel("Aspect Ratio (W/H)")
axes[2].set_ylabel("Frequency")
axes[2].axvline(1.0, color="red", linestyle="--", label="Square (1:1)")
axes[2].legend()
plt.tight_layout()
plt.savefig("dimension_histograms.png", dpi=150)
plt.show()
# Print statistics # (#4:Summary statistics)
print("\nDimension Statistics:")
print(df[["width", "height", "aspect_ratio"]].describe())
Step 5: Detect and Remove Corrupted Images
Use try/except to identify images that cannot be loaded properly:
from PIL import Image
import shutil
def check_image_integrity(image_path): # (#1:Function to validate image files)
    """Check if an image can be fully loaded and decoded."""
    try:
        with Image.open(image_path) as img:
            img.verify() # (#2:Verify image integrity)
        # Re-open to actually load the data # (#3:Load pixels to check for truncation)
        with Image.open(image_path) as img:
            img.load()
        return True
    except Exception:
        return False

# Check all images
corrupted_images = []
for idx, row in df.iterrows():
    if not check_image_integrity(row["path"]):
        corrupted_images.append(row["filename"])
        print(f"Corrupted: {row['filename']}")
print(f"\nTotal corrupted images: {len(corrupted_images)}")
# Remove corrupted images from DataFrame # (#4:Filter out bad images)
df_clean = df[~df["filename"].isin(corrupted_images)].copy()
print(f"Clean dataset size: {len(df_clean)}")
Warning: Always keep a backup of your original data before removing any files. In production, move corrupted files to a quarantine folder rather than deleting them.
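A minimal sketch of that quarantine step, assuming a ./quarantine folder next to your notebook (the folder name is just a suggestion):
# Move corrupted files into a quarantine folder instead of deleting them
QUARANTINE_DIR = Path("./quarantine")
QUARANTINE_DIR.mkdir(exist_ok=True)

for filename in corrupted_images:
    src = TRAIN_DIR / filename
    if src.exists():
        shutil.move(str(src), str(QUARANTINE_DIR / filename))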
Step 6: Calculate Color Statistics per Class
Analyze color channel distributions for each class:
import numpy as np
def calculate_color_stats(image_path): # (#1:Extract RGB statistics from image)
    """Calculate mean and std of color channels."""
    with Image.open(image_path) as img:
        if img.mode != "RGB":
            img = img.convert("RGB")
        pixels = np.array(img)
    return {
        "r_mean": pixels[:,:,0].mean(),
        "g_mean": pixels[:,:,1].mean(),
        "b_mean": pixels[:,:,2].mean(),
        "r_std": pixels[:,:,0].std(),
        "g_std": pixels[:,:,1].std(),
        "b_std": pixels[:,:,2].std()
    }

# Sample images for faster processing # (#2:Use sampling for large datasets)
sample_df = df_clean.groupby("label").apply(lambda x: x.sample(min(500, len(x)))).reset_index(drop=True)

# Calculate stats
color_stats = []
for idx, row in sample_df.iterrows():
    stats = calculate_color_stats(row["path"])
    stats["label"] = row["label"]
    color_stats.append(stats)
color_df = pd.DataFrame(color_stats)
# Visualize color statistics per class # (#3:Compare classes by color)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Mean values
color_means = color_df.groupby("label")[["r_mean", "g_mean", "b_mean"]].mean()
color_means.plot(kind="bar", ax=axes[0], color=["red", "green", "blue"], alpha=0.7)
axes[0].set_title("Mean Color Values by Class")
axes[0].set_ylabel("Mean Pixel Value")
axes[0].set_xticklabels(axes[0].get_xticklabels(), rotation=0)
axes[0].legend(["Red", "Green", "Blue"])
# Standard deviation
color_stds = color_df.groupby("label")[["r_std", "g_std", "b_std"]].mean()
color_stds.plot(kind="bar", ax=axes[1], color=["red", "green", "blue"], alpha=0.7)
axes[1].set_title("Color Standard Deviation by Class")
axes[1].set_ylabel("Std Pixel Value")
axes[1].set_xticklabels(axes[1].get_xticklabels(), rotation=0)
axes[1].legend(["Red", "Green", "Blue"])
plt.tight_layout()
plt.savefig("color_statistics.png", dpi=150)
plt.show()
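The same per-channel statistics are often reused as normalization constants at training time. A small sketch that rescales the sampled values to the [0, 1] range (the exact numbers will vary with your sample, and averaging per-image standard deviations is only an approximation of the true dataset std):
# Dataset-level channel statistics, rescaled from [0, 255] to [0, 1]
channel_mean = color_df[["r_mean", "g_mean", "b_mean"]].mean() / 255.0
channel_std = color_df[["r_std", "g_std", "b_std"]].mean() / 255.0
print("Per-channel mean:", channel_mean.round(3).tolist())
print("Per-channel std:", channel_std.round(3).tolist())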
Step 7: Detect Potential Duplicates
Use image hashing to find near-duplicate images:
import imagehash
def compute_image_hash(image_path, hash_size=8): # (#1:Compute perceptual hash)
    """Compute average hash of an image."""
    with Image.open(image_path) as img:
        return str(imagehash.average_hash(img, hash_size=hash_size))
# Compute hashes for all images # (#2:Hash all images in dataset)
print("Computing image hashes...")
df_clean["hash"] = df_clean["path"].apply(compute_image_hash)
# Find duplicates # (#3:Group by hash to find duplicates)
hash_counts = df_clean["hash"].value_counts()
duplicate_hashes = hash_counts[hash_counts > 1]
print(f"\nFound {len(duplicate_hashes)} hash values with duplicates")
print(f"Total duplicate images: {duplicate_hashes.sum() - len(duplicate_hashes)}")
# Show some examples # (#4:Display duplicate pairs)
if len(duplicate_hashes) > 0:
    example_hash = duplicate_hashes.index[0]
    duplicates = df_clean[df_clean["hash"] == example_hash]
    print(f"\nExample duplicates (hash={example_hash}):")
    print(duplicates[["filename", "label", "width", "height"]])

    # Visualize duplicates
    fig, axes = plt.subplots(1, min(len(duplicates), 4), figsize=(12, 3))
    for i, (idx, row) in enumerate(duplicates.head(4).iterrows()):
        img = Image.open(row["path"])
        axes[i].imshow(img)
        axes[i].set_title(f"{row['filename']}\n{row['width']}x{row['height']}")
        axes[i].axis("off")
    plt.suptitle("Potential Duplicates")
    plt.tight_layout()
    plt.show()
# Keep only first occurrence of duplicates # (#5:Remove duplicate images)
df_dedup = df_clean.drop_duplicates(subset=["hash"], keep="first")
print(f"\nDataset after deduplication: {len(df_dedup)} images")
Step 8: Create Stratified Train/Val/Test Splits
Split the dataset while maintaining class proportions:
from sklearn.model_selection import train_test_split
# First split: 70% train, 30% temp # (#1:First stratified split)
train_df, temp_df = train_test_split(
    df_dedup,
    test_size=0.30,
    stratify=df_dedup["label"], # (#2:Maintain class balance)
    random_state=42
)
# Second split: 50% of temp = 15% val, 15% test # (#3:Split remaining into val/test)
val_df, test_df = train_test_split(
    temp_df,
    test_size=0.50,
    stratify=temp_df["label"],
    random_state=42
)
# Verify splits # (#4:Print split statistics)
print("Split Statistics:")
print(f"Train: {len(train_df)} images ({len(train_df)/len(df_dedup)*100:.1f}%)")
print(f" - Dogs: {len(train_df[train_df['label']=='dog'])}")
print(f" - Cats: {len(train_df[train_df['label']=='cat'])}")
print(f"\nValidation: {len(val_df)} images ({len(val_df)/len(df_dedup)*100:.1f}%)")
print(f" - Dogs: {len(val_df[val_df['label']=='dog'])}")
print(f" - Cats: {len(val_df[val_df['label']=='cat'])}")
print(f"\nTest: {len(test_df)} images ({len(test_df)/len(df_dedup)*100:.1f}%)")
print(f" - Dogs: {len(test_df[test_df['label']=='dog'])}")
print(f" - Cats: {len(test_df[test_df['label']=='cat'])}")
Step 9: Organize Files into Split Directories
Create a proper directory structure for your splits:
import shutil
def organize_dataset(df, split_name, base_dir): # (#1:Function to organize files)
    """Copy files into organized directory structure."""
    split_dir = base_dir / split_name
    for label in df["label"].unique():
        label_dir = split_dir / label
        label_dir.mkdir(parents=True, exist_ok=True) # (#2:Create class subdirectories)
    for idx, row in df.iterrows():
        src = Path(row["path"])
        dst = split_dir / row["label"] / row["filename"]
        shutil.copy2(src, dst) # (#3:Copy files to new location)
    print(f"Organized {len(df)} files into {split_dir}")
# Create organized dataset # (#4:Organize all splits)
OUTPUT_DIR = Path("./dogs-vs-cats-prepared")
OUTPUT_DIR.mkdir(exist_ok=True)
organize_dataset(train_df, "train", OUTPUT_DIR)
organize_dataset(val_df, "val", OUTPUT_DIR)
organize_dataset(test_df, "test", OUTPUT_DIR)
# Verify final structure # (#5:Display final directory tree)
print("\nFinal directory structure:")
for split in ["train", "val", "test"]:
    split_dir = OUTPUT_DIR / split
    for label_dir in split_dir.iterdir():
        count = len(list(label_dir.glob("*.jpg")))
        print(f"  {split}/{label_dir.name}: {count} images")
Success: Your dataset is now properly organized and ready for training!
Expected Output
After completing this practical work, you should have:
- A DataFrame with metadata for all images (dimensions, aspect ratios, labels)
- A bar chart showing balanced class distribution (approximately 50/50)
- Histogram plots revealing the variety of image dimensions in the dataset
- A list of any corrupted images that were removed
- Color statistics comparison between dog and cat images
- Identification of potential duplicate images
- An organized directory structure:
  - dogs-vs-cats-prepared/
    - train/
      - cat/ (~8,750 images)
      - dog/ (~8,750 images)
    - val/
      - cat/ (~1,875 images)
      - dog/ (~1,875 images)
    - test/
      - cat/ (~1,875 images)
      - dog/ (~1,875 images)
Deliverables
- Jupyter Notebook: Complete notebook with all code cells executed and outputs visible
- Cleaned Dataset: Organized directory structure with train/val/test splits
- Distribution Plots: Saved visualization files:
  - class_distribution.png
  - dimension_histograms.png
  - color_statistics.png
Bonus Challenges
- Perceptual Hashing: Implement multiple hashing algorithms (pHash, dHash, wHash) and compare their effectiveness at finding near-duplicates:
  # Compare different hash algorithms
  phash = imagehash.phash(img)  # Perceptual hash
  dhash = imagehash.dhash(img)  # Difference hash
  whash = imagehash.whash(img)  # Wavelet hash
  # Calculate hash distance (Hamming distance)
  distance = hash1 - hash2

- Blur Detection: Add a quality metric to detect blurry images using the Laplacian variance method:

  import cv2

  def calculate_blur_score(image_path):
      """Higher score = sharper image."""
      img = cv2.imread(str(image_path))
      gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
      return cv2.Laplacian(gray, cv2.CV_64F).var()

  # Flag images below threshold as blurry
  BLUR_THRESHOLD = 100
  df["blur_score"] = df["path"].apply(calculate_blur_score)
  df["is_blurry"] = df["blur_score"] < BLUR_THRESHOLD

- Outlier Detection: Identify images with unusual dimensions or aspect ratios using z-scores (see the sketch after this list)
- Data Augmentation Preview: Create a visualization showing the effect of common augmentations (rotation, flip, color jitter)
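For the outlier-detection challenge, a minimal z-score sketch over the dimension columns already in your DataFrame (the threshold of 3 is a common but arbitrary choice):
# Flag images whose width, height, or aspect ratio is more than 3 standard
# deviations from the mean (z-score outliers)
cols = ["width", "height", "aspect_ratio"]
z_scores = (df[cols] - df[cols].mean()) / df[cols].std()
df["is_outlier"] = (z_scores.abs() > 3).any(axis=1)
print(f"Outlier images: {df['is_outlier'].sum()}")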