Practical Work 2

Dataset Analysis & Preparation

Learn to analyze, clean, and prepare image datasets for computer vision tasks

Duration: 1.5 hours
Difficulty: Intermediate
Session: 2 - Data Preparation

Objectives

By the end of this practical work, you will be able to:

  • Load and explore a real-world image dataset
  • Generate statistical visualizations to understand data distribution
  • Identify and address data quality issues
  • Create stratified train/validation/test splits

Prerequisites

  • A Kaggle account (free registration)
  • Python 3.8+ installed
  • Jupyter Notebook or JupyterLab

Install the required packages (the blur-detection bonus challenge additionally needs opencv-python):

pip install pandas seaborn scikit-learn Pillow imagehash matplotlib

Instructions

Step 1: Download the Dogs vs Cats Dataset

Go to the Dogs vs Cats competition page on Kaggle and download the dataset. Extract the files to a local directory.
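
If you prefer the command line, the files can also be fetched with the Kaggle CLI. A minimal sketch, assuming the kaggle package is installed and your API token is configured; archive names and layout may differ, so extract so that the training images end up under ./dogs-vs-cats/train:

kaggle competitions download -c dogs-vs-cats
unzip dogs-vs-cats.zip -d dogs-vs-cats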

import os
from pathlib import Path

# Set up paths to the dataset directory
DATASET_DIR = Path("./dogs-vs-cats")
TRAIN_DIR = DATASET_DIR / "train"

# Verify dataset exists
print(f"Dataset directory exists: {TRAIN_DIR.exists()}")
print(f"Number of images: {len(list(TRAIN_DIR.glob('*.jpg')))}")

Note: The Dogs vs Cats dataset contains 25,000 labeled images (12,500 dogs and 12,500 cats). You may need to accept the competition rules on Kaggle before downloading.

Dataset Sample Images: the gallery shows example files such as cat.0.jpg, dog.0.jpg, cat.1.jpg, and dog.1.jpg. Images are named with a class prefix (cat.N.jpg or dog.N.jpg).

Step 2: Load and Explore Dataset Structure

Create a DataFrame to organize the dataset metadata:

import pandas as pd
from PIL import Image

def load_dataset_info(train_dir):
    """Extract per-image metadata into a DataFrame."""
    data = []
    for img_path in train_dir.glob("*.jpg"):
        filename = img_path.name
        label = filename.split(".")[0]  # Extract label from the filename prefix (cat/dog)

        try:
            with Image.open(img_path) as img:
                width, height = img.size
                mode = img.mode
                data.append({
                    "filename": filename,
                    "path": str(img_path),
                    "label": label,
                    "width": width,
                    "height": height,
                    "aspect_ratio": width / height,  # (#3:Calculate aspect ratio)
                    "mode": mode
                })
        except Exception as e:
            print(f"Error loading {filename}: {e}")

    return pd.DataFrame(data)

df = load_dataset_info(TRAIN_DIR)
print(df.head())
print(f"\nDataset shape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")

Step 3: Analyze Class Distribution

Create a bar chart to visualize the class distribution:

import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set_style("whitegrid")
plt.figure(figsize=(8, 6))

# Create a bar chart to visualize class balance
class_counts = df["label"].value_counts()
ax = sns.barplot(x=class_counts.index, y=class_counts.values, palette="viridis")

# Annotate each bar with its count
for i, v in enumerate(class_counts.values):
    ax.text(i, v + 100, str(v), ha='center', fontweight='bold')

plt.title("Class Distribution in Dogs vs Cats Dataset", fontsize=14)
plt.xlabel("Class")
plt.ylabel("Number of Images")
plt.tight_layout()
plt.savefig("class_distribution.png", dpi=150)
plt.show()

# Print the balance ratio to check for class imbalance
print(f"\nClass balance ratio: {class_counts.min() / class_counts.max():.2%}")
Expected Output: Class Distribution
A bar chart with 12,500 cat and 12,500 dog images (perfectly balanced, class ratio 100%). Your counts may differ slightly if duplicates are removed.
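
The Dogs vs Cats dataset is balanced, but if your own dataset is not, a common remedy is class weighting during training. A minimal sketch using scikit-learn (already in the prerequisites); for a balanced dataset the weights come out close to 1.0:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Weights inversely proportional to class frequency
labels = df["label"].values
weights = compute_class_weight(class_weight="balanced", classes=np.unique(labels), y=labels)
print(dict(zip(np.unique(labels), weights)))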

Step 4: Plot Dimension Histograms

Analyze the distribution of image dimensions:

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram of image widths
axes[0].hist(df["width"], bins=50, color="steelblue", edgecolor="black", alpha=0.7)
axes[0].set_title("Width Distribution")
axes[0].set_xlabel("Width (pixels)")
axes[0].set_ylabel("Frequency")
axes[0].axvline(df["width"].median(), color="red", linestyle="--", label=f'Median: {df["width"].median():.0f}')
axes[0].legend()

# Histogram of image heights
axes[1].hist(df["height"], bins=50, color="seagreen", edgecolor="black", alpha=0.7)
axes[1].set_title("Height Distribution")
axes[1].set_xlabel("Height (pixels)")
axes[1].set_ylabel("Frequency")
axes[1].axvline(df["height"].median(), color="red", linestyle="--", label=f'Median: {df["height"].median():.0f}')
axes[1].legend()

# Histogram of aspect ratios
axes[2].hist(df["aspect_ratio"], bins=50, color="coral", edgecolor="black", alpha=0.7)
axes[2].set_title("Aspect Ratio Distribution")
axes[2].set_xlabel("Aspect Ratio (W/H)")
axes[2].set_ylabel("Frequency")
axes[2].axvline(1.0, color="red", linestyle="--", label="Square (1:1)")
axes[2].legend()

plt.tight_layout()
plt.savefig("dimension_histograms.png", dpi=150)
plt.show()

# Print summary statistics
print("\nDimension Statistics:")
print(df[["width", "height", "aspect_ratio"]].describe())
Expected Output: Dimension Distributions
Histograms of width (median around 350 px, roughly 50-500 px), height (median around 280 px, roughly 50-450 px), and aspect ratio (roughly 0.5-2.0, centered near 1.0). Red markers indicate median/reference values; your distributions may vary.
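
A common follow-up to this analysis is choosing a fixed input size for training. A minimal sketch resizing a single image with Pillow; the 224x224 target is an assumption (a typical CNN input size), not something dictated by the dataset:

from PIL import Image

TARGET_SIZE = (224, 224)  # assumed target size; choose based on your model

with Image.open(df["path"].iloc[0]) as img:
    resized = img.resize(TARGET_SIZE)  # uses Pillow's default resampling filter
print(resized.size)  # (224, 224)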

Step 5: Detect and Remove Corrupted Images

Use try/except to identify images that cannot be loaded properly:

from PIL import Image
import shutil

def check_image_integrity(image_path):
    """Check if an image can be fully loaded and decoded."""
    try:
        with Image.open(image_path) as img:
            img.verify()  # Verify file structure without decoding pixel data

        # Re-open and load the pixel data to catch truncated files
        with Image.open(image_path) as img:
            img.load()
        return True
    except Exception:
        return False

# Check all images
corrupted_images = []
for idx, row in df.iterrows():
    if not check_image_integrity(row["path"]):
        corrupted_images.append(row["filename"])
        print(f"Corrupted: {row['filename']}")

print(f"\nTotal corrupted images: {len(corrupted_images)}")

# Filter corrupted images out of the DataFrame
df_clean = df[~df["filename"].isin(corrupted_images)].copy()
print(f"Clean dataset size: {len(df_clean)}")

Warning: Always keep a backup of your original data before removing any files. In production, move corrupted files to a quarantine folder rather than deleting them.
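
A possible quarantine workflow, following the warning above; this sketch moves suspect files aside instead of deleting them (the ./quarantine directory name is just an example):

import shutil
from pathlib import Path

# Move corrupted files out of the training directory instead of deleting them
QUARANTINE_DIR = Path("./quarantine")
QUARANTINE_DIR.mkdir(exist_ok=True)

for filename in corrupted_images:
    src = TRAIN_DIR / filename
    if src.exists():
        shutil.move(str(src), str(QUARANTINE_DIR / filename))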

Step 6: Calculate Color Statistics per Class

Analyze color channel distributions for each class:

import numpy as np

def calculate_color_stats(image_path):
    """Calculate mean and std of color channels."""
    with Image.open(image_path) as img:
        if img.mode != "RGB":
            img = img.convert("RGB")
        pixels = np.array(img)

        return {
            "r_mean": pixels[:,:,0].mean(),
            "g_mean": pixels[:,:,1].mean(),
            "b_mean": pixels[:,:,2].mean(),
            "r_std": pixels[:,:,0].std(),
            "g_std": pixels[:,:,1].std(),
            "b_std": pixels[:,:,2].std()
        }

# Sample up to 500 images per class for faster processing on large datasets
sample_df = df_clean.groupby("label").apply(lambda x: x.sample(min(500, len(x)))).reset_index(drop=True)

# Calculate stats
color_stats = []
for idx, row in sample_df.iterrows():
    stats = calculate_color_stats(row["path"])
    stats["label"] = row["label"]
    color_stats.append(stats)

color_df = pd.DataFrame(color_stats)

# Visualize color statistics per class to compare the two classes
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Mean values
color_means = color_df.groupby("label")[["r_mean", "g_mean", "b_mean"]].mean()
color_means.plot(kind="bar", ax=axes[0], color=["red", "green", "blue"], alpha=0.7)
axes[0].set_title("Mean Color Values by Class")
axes[0].set_ylabel("Mean Pixel Value")
axes[0].set_xticklabels(axes[0].get_xticklabels(), rotation=0)
axes[0].legend(["Red", "Green", "Blue"])

# Standard deviation
color_stds = color_df.groupby("label")[["r_std", "g_std", "b_std"]].mean()
color_stds.plot(kind="bar", ax=axes[1], color=["red", "green", "blue"], alpha=0.7)
axes[1].set_title("Color Standard Deviation by Class")
axes[1].set_ylabel("Std Pixel Value")
axes[1].set_xticklabels(axes[1].get_xticklabels(), rotation=0)
axes[1].legend(["Red", "Green", "Blue"])

plt.tight_layout()
plt.savefig("color_statistics.png", dpi=150)
plt.show()
Expected Output: Color Statistics by Class
Mean RGB of approximately (118, 112, 105) for cats and (122, 110, 101) for dogs. Similar RGB distributions between classes indicate that color alone is not a strong distinguishing feature. Your values may differ.
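
One common use of these statistics is channel-wise normalization during training. A rough sketch that derives normalization constants from the sampled values; note that averaging per-image standard deviations only approximates the dataset-level standard deviation, so treat the results as estimates:

# Scale to the [0, 1] range to match typical preprocessing pipelines
mean_rgb = (color_df[["r_mean", "g_mean", "b_mean"]].mean() / 255.0).round(3)
std_rgb = (color_df[["r_std", "g_std", "b_std"]].mean() / 255.0).round(3)
print("Approximate normalization mean (RGB):", mean_rgb.tolist())
print("Approximate normalization std (RGB):", std_rgb.tolist())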

Step 7: Detect Potential Duplicates

Use image hashing to find near-duplicate images:

import imagehash

def compute_image_hash(image_path, hash_size=8):
    """Compute average hash of an image."""
    with Image.open(image_path) as img:
        return str(imagehash.average_hash(img, hash_size=hash_size))

# Compute a hash for every image in the dataset
print("Computing image hashes...")
df_clean["hash"] = df_clean["path"].apply(compute_image_hash)

# Group by hash value to find duplicates
hash_counts = df_clean["hash"].value_counts()
duplicate_hashes = hash_counts[hash_counts > 1]

print(f"\nFound {len(duplicate_hashes)} hash values with duplicates")
print(f"Total duplicate images: {duplicate_hashes.sum() - len(duplicate_hashes)}")

# Show an example group of duplicates
if len(duplicate_hashes) > 0:
    example_hash = duplicate_hashes.index[0]
    duplicates = df_clean[df_clean["hash"] == example_hash]
    print(f"\nExample duplicates (hash={example_hash}):")
    print(duplicates[["filename", "label", "width", "height"]])

    # Visualize duplicates
    fig, axes = plt.subplots(1, min(len(duplicates), 4), figsize=(12, 3))
    for i, (idx, row) in enumerate(duplicates.head(4).iterrows()):
        img = Image.open(row["path"])
        axes[i].imshow(img)
        axes[i].set_title(f"{row['filename']}\n{row['width']}x{row['height']}")
        axes[i].axis("off")
    plt.suptitle("Potential Duplicates")
    plt.tight_layout()
    plt.show()

# Keep only the first occurrence of each hash value
df_dedup = df_clean.drop_duplicates(subset=["hash"], keep="first")
print(f"\nDataset after deduplication: {len(df_dedup)} images")

Step 8: Create Stratified Train/Val/Test Splits

Split the dataset while maintaining class proportions:

from sklearn.model_selection import train_test_split

# First stratified split: 70% train, 30% temporary pool
train_df, temp_df = train_test_split(
    df_dedup,
    test_size=0.30,
    stratify=df_dedup["label"],  # Maintain class proportions across splits
    random_state=42
)

# Second split: half of the pool each, i.e. 15% validation and 15% test
val_df, test_df = train_test_split(
    temp_df,
    test_size=0.50,
    stratify=temp_df["label"],
    random_state=42
)

# Verify the splits by printing their statistics
print("Split Statistics:")
print(f"Train: {len(train_df)} images ({len(train_df)/len(df_dedup)*100:.1f}%)")
print(f"  - Dogs: {len(train_df[train_df['label']=='dog'])}")
print(f"  - Cats: {len(train_df[train_df['label']=='cat'])}")
print(f"\nValidation: {len(val_df)} images ({len(val_df)/len(df_dedup)*100:.1f}%)")
print(f"  - Dogs: {len(val_df[val_df['label']=='dog'])}")
print(f"  - Cats: {len(val_df[val_df['label']=='cat'])}")
print(f"\nTest: {len(test_df)} images ({len(test_df)/len(df_dedup)*100:.1f}%)")
print(f"  - Dogs: {len(test_df[test_df['label']=='dog'])}")
print(f"  - Cats: {len(test_df[test_df['label']=='cat'])}")
Expected Output: Stratified Data Split (70/15/15)
Training: ~17,500 images, Validation: ~3,750 images, Test: ~3,750 images. Each split maintains the 50/50 cat/dog ratio (stratified sampling). Your counts may vary based on deduplication.
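
It can also help to persist the split assignment so the exact partition can be recreated later. A suggested addition (the CSV filename is arbitrary):

# Record which file went into which split, for reproducibility
splits = pd.concat([
    train_df.assign(split="train"),
    val_df.assign(split="val"),
    test_df.assign(split="test"),
])
splits[["filename", "label", "split"]].to_csv("dataset_splits.csv", index=False)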

Step 9: Organize Files into Split Directories

Create a proper directory structure for your splits:

import shutil

def organize_dataset(df, split_name, base_dir):
    """Copy files into organized directory structure."""
    split_dir = base_dir / split_name

    for label in df["label"].unique():
        label_dir = split_dir / label
        label_dir.mkdir(parents=True, exist_ok=True)  # Create one subdirectory per class

    for idx, row in df.iterrows():
        src = Path(row["path"])
        dst = split_dir / row["label"] / row["filename"]
        shutil.copy2(src, dst)  # Copy the file, preserving metadata

    print(f"Organized {len(df)} files into {split_dir}")

# Organize all three splits into the output directory
OUTPUT_DIR = Path("./dogs-vs-cats-prepared")
OUTPUT_DIR.mkdir(exist_ok=True)

organize_dataset(train_df, "train", OUTPUT_DIR)
organize_dataset(val_df, "val", OUTPUT_DIR)
organize_dataset(test_df, "test", OUTPUT_DIR)

# Display the final directory structure
print("\nFinal directory structure:")
for split in ["train", "val", "test"]:
    split_dir = OUTPUT_DIR / split
    for label_dir in split_dir.iterdir():
        count = len(list(label_dir.glob("*.jpg")))
        print(f"  {split}/{label_dir.name}: {count} images")

Success: Your dataset is now properly organized and ready for training!

Expected Output

After completing this practical work, you should have:

  • A DataFrame with metadata for all images (dimensions, aspect ratios, labels)
  • A bar chart showing balanced class distribution (approximately 50/50)
  • Histogram plots revealing the variety of image dimensions in the dataset
  • A list of any corrupted images that were removed
  • Color statistics comparison between dog and cat images
  • Identification of potential duplicate images
  • An organized directory structure:
  • dogs-vs-cats-prepared/
    • train/
      • cat/ (~8,750 images)
      • dog/ (~8,750 images)
    • val/
      • cat/ (~1,875 images)
      • dog/ (~1,875 images)
    • test/
      • cat/ (~1,875 images)
      • dog/ (~1,875 images)

Deliverables

  • Jupyter Notebook: Complete notebook with all code cells executed and outputs visible
  • Cleaned Dataset: Organized directory structure with train/val/test splits
  • Distribution Plots: Saved visualization files:
    • class_distribution.png
    • dimension_histograms.png
    • color_statistics.png

Bonus Challenges

  • Perceptual Hashing: Implement multiple hashing algorithms (pHash, dHash, wHash) and compare their effectiveness at finding near-duplicates:
    # Compare different hash algorithms (img is an opened PIL image)
    phash = imagehash.phash(img)      # Perceptual hash
    dhash = imagehash.dhash(img)      # Difference hash
    whash = imagehash.whash(img)      # Wavelet hash

    # Distance between the hashes of two images (Hamming distance); lower = more similar
    distance = hash1 - hash2
  • Blur Detection: Add a quality metric to detect blurry images using the Laplacian variance method:
    import cv2  # Requires: pip install opencv-python (not in the base package list above)

    def calculate_blur_score(image_path):
        """Laplacian variance of the grayscale image: higher score = sharper image."""
        img = cv2.imread(str(image_path))
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        return cv2.Laplacian(gray, cv2.CV_64F).var()

    # Flag images below the threshold as blurry (the threshold is dataset-dependent)
    BLUR_THRESHOLD = 100
    df["blur_score"] = df["path"].apply(calculate_blur_score)
    df["is_blurry"] = df["blur_score"] < BLUR_THRESHOLD
  • Outlier Detection: Identify images with unusual dimensions or aspect ratios using z-scores (see the first sketch below)
  • Data Augmentation Preview: Create a visualization showing the effect of common augmentations such as rotation, flip, and color jitter (see the second sketch below)
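
Two starting points for the last two challenges, both minimal sketches. First, flagging dimension outliers with z-scores (the threshold of 3 is a common default, not a rule):

Z_THRESHOLD = 3  # assumed cutoff; |z| above this is flagged as an outlier
for col in ["width", "height", "aspect_ratio"]:
    z = (df_clean[col] - df_clean[col].mean()) / df_clean[col].std()
    df_clean[f"{col}_outlier"] = z.abs() > Z_THRESHOLD

outlier_cols = [c for c in df_clean.columns if c.endswith("_outlier")]
print(f"Images with unusual dimensions or aspect ratios: {df_clean[outlier_cols].any(axis=1).sum()}")

Second, a Pillow-only augmentation preview on a single image; the chosen transforms and parameters are illustrative:

from PIL import Image, ImageEnhance, ImageOps

with Image.open(df_clean["path"].iloc[0]) as img:
    variants = {
        "original": img.copy(),
        "rotate 15 deg": img.rotate(15, expand=True),
        "horizontal flip": ImageOps.mirror(img),
        "color jitter": ImageEnhance.Color(img).enhance(1.6),
    }

fig, axes = plt.subplots(1, len(variants), figsize=(14, 4))
for ax, (name, im) in zip(axes, variants.items()):
    ax.imshow(im)
    ax.set_title(name)
    ax.axis("off")
plt.tight_layout()
plt.show()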

Resources