Dataset Analysis & Preparation
Learn to analyze, clean, and prepare image datasets for computer vision tasks
Objectives
By the end of this practical work, you will be able to:
- Load and explore a real-world image dataset
- Generate statistical visualizations to understand data distribution
- Identify and address data quality issues
- Create stratified train/validation/test splits
Prerequisites
- A Kaggle account (free registration)
- Python 3.8+ installed
- Jupyter Notebook or JupyterLab
Install required packages:
pip install pandas numpy seaborn scikit-learn Pillow imagehash matplotlib
The blur-detection bonus challenge additionally requires opencv-python.
Instructions
Step 1: Download the Dogs vs Cats Dataset
Go to the Dogs vs Cats competition page on Kaggle and download the dataset. Extract the files to a local directory.
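If you downloaded the archive manually, the standard library is enough to unpack it. Here is a minimal sketch, assuming the competition archive is named dogs-vs-cats.zip and contains a nested train.zip (the exact file names on your machine may differ):
import zipfile
from pathlib import Path

ARCHIVE = Path("./dogs-vs-cats.zip")   # hypothetical download location
EXTRACT_TO = Path("./dogs-vs-cats")

# Unpack the outer competition archive
with zipfile.ZipFile(ARCHIVE) as zf:
    zf.extractall(EXTRACT_TO)

# The training images usually ship as a nested train.zip inside the archive
inner = EXTRACT_TO / "train.zip"
if inner.exists():
    with zipfile.ZipFile(inner) as zf:
        zf.extractall(EXTRACT_TO)
With the images extracted, set up the paths and verify the folder: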
import os
from pathlib import Path
# Define paths # (#1:Set up dataset directory)
DATASET_DIR = Path("./dogs-vs-cats")
TRAIN_DIR = DATASET_DIR / "train"
# Verify dataset exists
print(f"Dataset directory exists: {TRAIN_DIR.exists()}")
print(f"Number of images: {len(list(TRAIN_DIR.glob('*.jpg')))}")
Note: The Dogs vs Cats dataset contains 25,000 labeled images (12,500 dogs and 12,500 cats). You may need to accept the competition rules on Kaggle before downloading.
Step 2: Load and Explore Dataset Structure
Create a DataFrame to organize the dataset metadata:
import pandas as pd
from PIL import Image
def load_dataset_info(train_dir): # (#1:Function to extract image metadata)
    """Scan a directory of JPEGs and collect per-image metadata."""
    data = []
    for img_path in train_dir.glob("*.jpg"):
        filename = img_path.name
        label = filename.split(".")[0] # (#2:Extract label from filename)
        try:
            with Image.open(img_path) as img:
                width, height = img.size
                mode = img.mode
            data.append({
                "filename": filename,
                "path": str(img_path),
                "label": label,
                "width": width,
                "height": height,
                "aspect_ratio": width / height, # (#3:Calculate aspect ratio)
                "mode": mode
            })
        except Exception as e:
            print(f"Error loading {filename}: {e}")
    return pd.DataFrame(data)
df = load_dataset_info(TRAIN_DIR)
print(df.head())
print(f"\nDataset shape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")
Step 3: Analyze Class Distribution
Create a bar chart to visualize the class distribution:
import matplotlib.pyplot as plt
import seaborn as sns
# Set style
sns.set_style("whitegrid")
plt.figure(figsize=(8, 6))
# Create bar chart # (#1:Visualize class balance)
class_counts = df["label"].value_counts()
ax = sns.barplot(x=class_counts.index, y=class_counts.values, palette="viridis")
# Add value labels on bars # (#2:Annotate bars with counts)
for i, v in enumerate(class_counts.values):
    ax.text(i, v + 100, str(v), ha='center', fontweight='bold')
plt.title("Class Distribution in Dogs vs Cats Dataset", fontsize=14)
plt.xlabel("Class")
plt.ylabel("Number of Images")
plt.tight_layout()
plt.savefig("class_distribution.png", dpi=150)
plt.show()
# Print balance ratio # (#3:Check for class imbalance)
print(f"\nClass balance ratio: {class_counts.min() / class_counts.max():.2%}")
Step 4: Plot Dimension Histograms
Analyze the distribution of image dimensions:
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Width distribution # (#1:Histogram for image widths)
axes[0].hist(df["width"], bins=50, color="steelblue", edgecolor="black", alpha=0.7)
axes[0].set_title("Width Distribution")
axes[0].set_xlabel("Width (pixels)")
axes[0].set_ylabel("Frequency")
axes[0].axvline(df["width"].median(), color="red", linestyle="--", label=f'Median: {df["width"].median():.0f}')
axes[0].legend()
# Height distribution # (#2:Histogram for image heights)
axes[1].hist(df["height"], bins=50, color="seagreen", edgecolor="black", alpha=0.7)
axes[1].set_title("Height Distribution")
axes[1].set_xlabel("Height (pixels)")
axes[1].set_ylabel("Frequency")
axes[1].axvline(df["height"].median(), color="red", linestyle="--", label=f'Median: {df["height"].median():.0f}')
axes[1].legend()
# Aspect ratio distribution # (#3:Histogram for aspect ratios)
axes[2].hist(df["aspect_ratio"], bins=50, color="coral", edgecolor="black", alpha=0.7)
axes[2].set_title("Aspect Ratio Distribution")
axes[2].set_xlabel("Aspect Ratio (W/H)")
axes[2].set_ylabel("Frequency")
axes[2].axvline(1.0, color="red", linestyle="--", label="Square (1:1)")
axes[2].legend()
plt.tight_layout()
plt.savefig("dimension_histograms.png", dpi=150)
plt.show()
# Print statistics # (#4:Summary statistics)
print("\nDimension Statistics:")
print(df[["width", "height", "aspect_ratio"]].describe())
Step 5: Detect and Remove Corrupted Images
Use try/except to identify images that cannot be loaded properly:
from PIL import Image
import shutil
def check_image_integrity(image_path): # (#1:Function to validate image files)
    """Check if an image can be fully loaded and decoded."""
    try:
        with Image.open(image_path) as img:
            img.verify() # (#2:Verify image integrity)
        # Re-open to actually load the data # (#3:Load pixels to check for truncation)
        with Image.open(image_path) as img:
            img.load()
        return True
    except Exception:
        return False

# Check all images
corrupted_images = []
for idx, row in df.iterrows():
    if not check_image_integrity(row["path"]):
        corrupted_images.append(row["filename"])
        print(f"Corrupted: {row['filename']}")
print(f"\nTotal corrupted images: {len(corrupted_images)}")
# Remove corrupted images from DataFrame # (#4:Filter out bad images)
df_clean = df[~df["filename"].isin(corrupted_images)].copy()
print(f"Clean dataset size: {len(df_clean)}")
Warning: Always keep a backup of your original data before removing any files. In production, move corrupted files to a quarantine folder rather than deleting them.
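A minimal sketch of that quarantine step, assuming a ./quarantine folder next to your notebook (the folder name is just a suggestion):
# Move corrupted files into a quarantine folder instead of deleting them
QUARANTINE_DIR = Path("./quarantine")
QUARANTINE_DIR.mkdir(exist_ok=True)

for filename in corrupted_images:
    src = TRAIN_DIR / filename
    if src.exists():
        shutil.move(str(src), str(QUARANTINE_DIR / filename))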
Step 6: Calculate Color Statistics per Class
Analyze color channel distributions for each class:
import numpy as np
def calculate_color_stats(image_path): # (#1:Extract RGB statistics from image)
    """Calculate mean and std of color channels."""
    with Image.open(image_path) as img:
        if img.mode != "RGB":
            img = img.convert("RGB")
        pixels = np.array(img)
    return {
        "r_mean": pixels[:,:,0].mean(),
        "g_mean": pixels[:,:,1].mean(),
        "b_mean": pixels[:,:,2].mean(),
        "r_std": pixels[:,:,0].std(),
        "g_std": pixels[:,:,1].std(),
        "b_std": pixels[:,:,2].std()
    }

# Sample images for faster processing # (#2:Use sampling for large datasets)
sample_df = df_clean.groupby("label").apply(lambda x: x.sample(min(500, len(x)))).reset_index(drop=True)

# Calculate stats
color_stats = []
for idx, row in sample_df.iterrows():
    stats = calculate_color_stats(row["path"])
    stats["label"] = row["label"]
    color_stats.append(stats)
color_df = pd.DataFrame(color_stats)
# Visualize color statistics per class # (#3:Compare classes by color)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Mean values
color_means = color_df.groupby("label")[["r_mean", "g_mean", "b_mean"]].mean()
color_means.plot(kind="bar", ax=axes[0], color=["red", "green", "blue"], alpha=0.7)
axes[0].set_title("Mean Color Values by Class")
axes[0].set_ylabel("Mean Pixel Value")
axes[0].set_xticklabels(axes[0].get_xticklabels(), rotation=0)
axes[0].legend(["Red", "Green", "Blue"])
# Standard deviation
color_stds = color_df.groupby("label")[["r_std", "g_std", "b_std"]].mean()
color_stds.plot(kind="bar", ax=axes[1], color=["red", "green", "blue"], alpha=0.7)
axes[1].set_title("Color Standard Deviation by Class")
axes[1].set_ylabel("Std Pixel Value")
axes[1].set_xticklabels(axes[1].get_xticklabels(), rotation=0)
axes[1].legend(["Red", "Green", "Blue"])
plt.tight_layout()
plt.savefig("color_statistics.png", dpi=150)
plt.show()
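The same per-channel statistics are often reused as normalization constants at training time. A small sketch that rescales the sampled values to the [0, 1] range (the exact numbers will vary with your sample, and averaging per-image standard deviations is only an approximation of the true dataset std):
# Dataset-level channel statistics, rescaled from [0, 255] to [0, 1]
channel_mean = color_df[["r_mean", "g_mean", "b_mean"]].mean() / 255.0
channel_std = color_df[["r_std", "g_std", "b_std"]].mean() / 255.0
print("Per-channel mean:", channel_mean.round(3).tolist())
print("Per-channel std:", channel_std.round(3).tolist())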
Step 7: Detect Potential Duplicates
Use image hashing to find near-duplicate images:
import imagehash
def compute_image_hash(image_path, hash_size=8): # (#1:Compute perceptual hash)
    """Compute average hash of an image."""
    with Image.open(image_path) as img:
        return str(imagehash.average_hash(img, hash_size=hash_size))
# Compute hashes for all images # (#2:Hash all images in dataset)
print("Computing image hashes...")
df_clean["hash"] = df_clean["path"].apply(compute_image_hash)
# Find duplicates # (#3:Group by hash to find duplicates)
hash_counts = df_clean["hash"].value_counts()
duplicate_hashes = hash_counts[hash_counts > 1]
print(f"\nFound {len(duplicate_hashes)} hash values with duplicates")
print(f"Total duplicate images: {duplicate_hashes.sum() - len(duplicate_hashes)}")
# Show some examples # (#4:Display duplicate pairs)
if len(duplicate_hashes) > 0:
    example_hash = duplicate_hashes.index[0]
    duplicates = df_clean[df_clean["hash"] == example_hash]
    print(f"\nExample duplicates (hash={example_hash}):")
    print(duplicates[["filename", "label", "width", "height"]])

    # Visualize duplicates
    fig, axes = plt.subplots(1, min(len(duplicates), 4), figsize=(12, 3))
    for i, (idx, row) in enumerate(duplicates.head(4).iterrows()):
        img = Image.open(row["path"])
        axes[i].imshow(img)
        axes[i].set_title(f"{row['filename']}\n{row['width']}x{row['height']}")
        axes[i].axis("off")
    plt.suptitle("Potential Duplicates")
    plt.tight_layout()
    plt.show()
# Keep only first occurrence of duplicates # (#5:Remove duplicate images)
df_dedup = df_clean.drop_duplicates(subset=["hash"], keep="first")
print(f"\nDataset after deduplication: {len(df_dedup)} images")
Step 8: Create Stratified Train/Val/Test Splits
Split the dataset while maintaining class proportions:
from sklearn.model_selection import train_test_split
# First split: 70% train, 30% temp # (#1:First stratified split)
train_df, temp_df = train_test_split(
    df_dedup,
    test_size=0.30,
    stratify=df_dedup["label"], # (#2:Maintain class balance)
    random_state=42
)
# Second split: 50% of temp = 15% val, 15% test # (#3:Split remaining into val/test)
val_df, test_df = train_test_split(
    temp_df,
    test_size=0.50,
    stratify=temp_df["label"],
    random_state=42
)
# Verify splits # (#4:Print split statistics)
print("Split Statistics:")
print(f"Train: {len(train_df)} images ({len(train_df)/len(df_dedup)*100:.1f}%)")
print(f" - Dogs: {len(train_df[train_df['label']=='dog'])}")
print(f" - Cats: {len(train_df[train_df['label']=='cat'])}")
print(f"\nValidation: {len(val_df)} images ({len(val_df)/len(df_dedup)*100:.1f}%)")
print(f" - Dogs: {len(val_df[val_df['label']=='dog'])}")
print(f" - Cats: {len(val_df[val_df['label']=='cat'])}")
print(f"\nTest: {len(test_df)} images ({len(test_df)/len(df_dedup)*100:.1f}%)")
print(f" - Dogs: {len(test_df[test_df['label']=='dog'])}")
print(f" - Cats: {len(test_df[test_df['label']=='cat'])}")
Step 9: Organize Files into Split Directories
Create a proper directory structure for your splits:
import shutil
def organize_dataset(df, split_name, base_dir): # (#1:Function to organize files)
    """Copy files into organized directory structure."""
    split_dir = base_dir / split_name
    for label in df["label"].unique():
        label_dir = split_dir / label
        label_dir.mkdir(parents=True, exist_ok=True) # (#2:Create class subdirectories)
    for idx, row in df.iterrows():
        src = Path(row["path"])
        dst = split_dir / row["label"] / row["filename"]
        shutil.copy2(src, dst) # (#3:Copy files to new location)
    print(f"Organized {len(df)} files into {split_dir}")
# Create organized dataset # (#4:Organize all splits)
OUTPUT_DIR = Path("./dogs-vs-cats-prepared")
OUTPUT_DIR.mkdir(exist_ok=True)
organize_dataset(train_df, "train", OUTPUT_DIR)
organize_dataset(val_df, "val", OUTPUT_DIR)
organize_dataset(test_df, "test", OUTPUT_DIR)
# Verify final structure # (#5:Display final directory tree)
print("\nFinal directory structure:")
for split in ["train", "val", "test"]:
    split_dir = OUTPUT_DIR / split
    for label_dir in split_dir.iterdir():
        count = len(list(label_dir.glob("*.jpg")))
        print(f"  {split}/{label_dir.name}: {count} images")
Success: Your dataset is now properly organized and ready for training!
Expected Output
After completing this practical work, you should have:
- A DataFrame with metadata for all images (dimensions, aspect ratios, labels)
- A bar chart showing balanced class distribution (approximately 50/50)
- Histogram plots revealing the variety of image dimensions in the dataset
- A list of any corrupted images that were removed
- Color statistics comparison between dog and cat images
- Identification of potential duplicate images
- An organized directory structure:
  - dogs-vs-cats-prepared/
    - train/
      - cat/ (~8,750 images)
      - dog/ (~8,750 images)
    - val/
      - cat/ (~1,875 images)
      - dog/ (~1,875 images)
    - test/
      - cat/ (~1,875 images)
      - dog/ (~1,875 images)
Deliverables
- Jupyter Notebook: Complete notebook with all code cells executed and outputs visible
- Cleaned Dataset: Organized directory structure with train/val/test splits
- Distribution Plots: Saved visualization files:
  - class_distribution.png
  - dimension_histograms.png
  - color_statistics.png
Bonus Challenges
- Perceptual Hashing: Implement multiple hashing algorithms (pHash, dHash, wHash) and compare their effectiveness at finding near-duplicates:
  # Compare different hash algorithms
  phash = imagehash.phash(img)  # Perceptual hash
  dhash = imagehash.dhash(img)  # Difference hash
  whash = imagehash.whash(img)  # Wavelet hash
  # Calculate hash distance (Hamming distance)
  distance = hash1 - hash2

- Blur Detection: Add a quality metric to detect blurry images using the Laplacian variance method:

  import cv2

  def calculate_blur_score(image_path):
      """Higher score = sharper image."""
      img = cv2.imread(str(image_path))
      gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
      return cv2.Laplacian(gray, cv2.CV_64F).var()

  # Flag images below threshold as blurry
  BLUR_THRESHOLD = 100
  df["blur_score"] = df["path"].apply(calculate_blur_score)
  df["is_blurry"] = df["blur_score"] < BLUR_THRESHOLD

- Outlier Detection: Identify images with unusual dimensions or aspect ratios using z-scores (see the sketch after this list)
- Data Augmentation Preview: Create a visualization showing the effect of common augmentations (rotation, flip, color jitter)
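For the outlier-detection challenge, a minimal z-score sketch over the dimension columns already in your DataFrame (the threshold of 3 is a common but arbitrary choice):
# Flag images whose width, height, or aspect ratio is more than 3 standard
# deviations from the mean (z-score outliers)
cols = ["width", "height", "aspect_ratio"]
z_scores = (df[cols] - df[cols].mean()) / df[cols].std()
df["is_outlier"] = (z_scores.abs() > 3).any(axis=1)
print(f"Outlier images: {df['is_outlier'].sum()}")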