Computer Vision

Session 2 - Data Preparation & Exploration

Collecting, annotating, and organizing datasets for CV projects

Today's Agenda

Data Preparation Pipeline

Pipeline: Collect (1,000+ images) → Annotate (labels/boxes) → Quality check → Split (70/15/15) → Version (DVC/Git) → Train model. Images that fail the quality check go back to annotation.

Where to Find Data?

Public Datasets

Ready-to-use, curated datasets for common tasks

Custom Collection

Domain-specific data you collect yourself

Synthetic Data

Computer-generated data for rare scenarios

Key insight: Data quality matters more than quantity. A smaller, well-annotated dataset often outperforms a large, noisy one.

Public Dataset Platforms

Platform | Strengths | Best For
Kaggle | Competitions, community notebooks | Learning, benchmarking
Google Dataset Search | Broad coverage, research focus | Academic datasets
Roboflow Universe | Pre-annotated, multiple formats | Object detection, quick start
Hugging Face Datasets | Easy loading, standardized | Integration with transformers
Papers With Code | Linked to research papers | State-of-the-art benchmarks

Loading Public Datasets

# Kaggle API
import kaggle  # (#1:Install: pip install kaggle)
kaggle.api.dataset_download_files('username/dataset-name', unzip=True)

# Hugging Face Datasets
from datasets import load_dataset  # (#2:pip install datasets)
dataset = load_dataset("cifar10")

# Roboflow
from roboflow import Roboflow  # (#3:pip install roboflow)
rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace("workspace").project("project")
dataset = project.version(1).download("yolov8")  # (#4:Multiple format options)

Creating Custom Datasets

Collection Methods

  • Manual capture: Photos, screenshots
  • Web scraping: Automated collection
  • Video extraction: Frame sampling (see the sketch after this list)
  • Crowdsourcing: Amazon MTurk, Scale AI

Best Practices

  • Capture diverse conditions
  • Include edge cases
  • Document collection method
  • Maintain consistent quality
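
For video extraction, a minimal frame-sampling sketch with OpenCV (the every_n interval is an assumption; tune it to your footage):

import cv2
import os

def extract_frames(video_path, save_dir, every_n=30):
    """Save every n-th frame of a video as a JPEG"""  # (#1:Frame sampling)
    os.makedirs(save_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx, saved = 0, 0
    while True:
        ret, frame = cap.read()  # (#2:Read next frame)
        if not ret:
            break
        if idx % every_n == 0:
            cv2.imwrite(f"{save_dir}/frame_{saved:05d}.jpg", frame)  # (#3:Save sampled frame)
            saved += 1
        idx += 1
    cap.release()
    return saved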

Web Scraping for Images

import requests
from bs4 import BeautifulSoup  # (#1:pip install beautifulsoup4)
import os

def download_images(url, save_dir, limit=100):
    os.makedirs(save_dir, exist_ok=True)  # (#2:Create directory)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    images = soup.find_all('img')  # (#3:Find all img tags)

    for i, img in enumerate(images[:limit]):
        img_url = img.get('src')
        if img_url and img_url.startswith('http'):
            try:
                img_data = requests.get(img_url).content
                with open(f"{save_dir}/image_{i}.jpg", 'wb') as f:
                    f.write(img_data)  # (#4:Save image)
            except Exception as e:
                print(f"Error: {e}")

Warning: Always check website terms of service and robots.txt before scraping.
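
A minimal robots.txt check with the standard library (a best-effort sketch; it does not replace reading the site's terms of service):

from urllib import robotparser
from urllib.parse import urlparse

def is_allowed(url, user_agent="*"):
    """Check robots.txt before fetching a page"""  # (#1:Best-effort check)
    parts = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()  # (#2:Fetch and parse robots.txt)
    except Exception:
        return False  # (#3:Assumption: treat an unreachable robots.txt as "don't scrape")
    return rp.can_fetch(user_agent, url)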

Synthetic Data Generation

Use Cases

  • Rare events (accidents, defects)
  • Privacy-sensitive data (faces)
  • Dangerous scenarios (autonomous driving)
  • Perfect ground truth labels

Tools

  • Blender: 3D rendering
  • Unity/Unreal: Game engines
  • NVIDIA Omniverse: Simulation
  • Albumentations: Augmentations

Pro tip: Mix synthetic with real data. Domain adaptation helps bridge the gap.

Legal Considerations

Copyright

Images are automatically copyrighted. Check usage rights before using them.

GDPR

Personal data (such as faces) requires consent in the EU. Anonymize it or get permission.

Licensing

Understand Creative Commons, MIT, Apache licenses for datasets.

Common Dataset Licenses

License | Commercial Use | Attribution | Share-Alike
CC0 (Public Domain) | Yes | No | No
CC BY | Yes | Yes | No
CC BY-SA | Yes | Yes | Yes
CC BY-NC | No | Yes | No
Research Only | No | Varies | Varies

Important: Always document the license of each dataset you use for compliance.

Annotation Tools Comparison

Tool | Type | Best For | Cost
LabelImg | Desktop | Simple bbox annotation | Free
CVAT | Web-based | Video, team projects | Free (self-hosted)
Label Studio | Web-based | Multi-modal data | Free / Enterprise
Roboflow | Cloud | End-to-end workflow | Freemium
V7 Labs | Cloud | AI-assisted annotation | Paid

Why Do We Need Different Annotation Formats?

The Problem

Different tools and frameworks expect data in specific formats:

  • Training frameworks have their own parsers
  • Annotation tools export in different formats
  • Pre-trained models expect specific structures

Format Ecosystems

Format | Used By
COCO | Detectron2, MMDetection, benchmarks
YOLO | Ultralytics, Darknet, real-time apps
Pascal VOC | TensorFlow OD API, legacy tools

Key insight: Understanding formats lets you use any dataset with any framework. Format conversion is a common preprocessing step.

COCO JSON Format

{
  "images": [  // (#1:List of all images)
    {
      "id": 1,
      "file_name": "image_001.jpg",
      "width": 640,
      "height": 480
    }
  ],
  "annotations": [  // (#2:All annotations)
    {
      "id": 1,
      "image_id": 1,  // (#3:Links to image)
      "category_id": 0,
      "bbox": [100, 50, 200, 150],  // (#4:[x, y, width, height])
      "area": 30000,
      "segmentation": [[...]]  // (#5:Polygon points)
    }
  ],
  "categories": [  // (#6:Class definitions)
    {"id": 0, "name": "cat"},
    {"id": 1, "name": "dog"}
  ]
}

YOLO TXT Format

Label File (image_001.txt)

# class center_x center_y width height
0 0.45 0.52 0.31 0.42
1 0.72 0.38 0.15 0.25
0 0.21 0.65 0.18 0.30

All values normalized to 0-1 relative to image dimensions

data.yaml

train: ./train/images
val: ./valid/images
test: ./test/images

nc: 2  # number of classes
names: ['cat', 'dog']

Pascal VOC XML Format

<annotation>
  <filename>image_001.jpg</filename>  <!-- (#1:Image file name) -->
  <size>
    <width>640</width>
    <height>480</height>
    <depth>3</depth>  <!-- (#2:Number of channels) -->
  </size>
  <object>
    <name>cat</name>  <!-- (#3:Class label) -->
    <bndbox>
      <xmin>100</xmin>  <!-- (#4:Absolute pixel values) -->
      <ymin>50</ymin>
      <xmax>300</xmax>
      <ymax>200</ymax>
    </bndbox>
    <difficult>0</difficult>  <!-- (#5:Hard example flag) -->
  </object>
</annotation>
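
Pascal VOC files are plain XML, so they can be read with the standard library; a minimal parsing sketch matching the fields shown above:

import xml.etree.ElementTree as ET

def parse_voc(xml_path):
    """Parse one Pascal VOC annotation file"""  # (#1:Standard-library XML parsing)
    root = ET.parse(xml_path).getroot()
    size = root.find('size')
    width = int(size.find('width').text)
    height = int(size.find('height').text)
    boxes = []
    for obj in root.findall('object'):  # (#2:One entry per object)
        name = obj.find('name').text
        bb = obj.find('bndbox')
        box = [int(float(bb.find(k).text)) for k in ('xmin', 'ymin', 'xmax', 'ymax')]  # (#3:Absolute pixels)
        boxes.append((name, box))
    return root.find('filename').text, (width, height), boxes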

Converting Between Formats

import json
import os

def coco_to_yolo(coco_json, output_dir):
    """Convert COCO format annotations to YOLO format labels"""  # (#1:Conversion function)
    with open(coco_json) as f:
        data = json.load(f)

    os.makedirs(output_dir, exist_ok=True)

    images = {img['id']: img for img in data['images']}  # (#2:Per-image size lookup)

    for ann in data['annotations']:
        img_info = images[ann['image_id']]
        img_width, img_height = img_info['width'], img_info['height']
        x, y, w, h = ann['bbox']  # (#3:COCO uses [x, y, w, h])

        # Convert to YOLO: center_x, center_y, width, height (normalized)
        cx = (x + w/2) / img_width  # (#4:Normalize to 0-1)
        cy = (y + h/2) / img_height
        nw = w / img_width
        nh = h / img_height

        # Write YOLO format, one label file per image
        stem = os.path.splitext(img_info['file_name'])[0]
        with open(f"{output_dir}/{stem}.txt", 'a') as f:
            f.write(f"{ann['category_id']} {cx:.6f} {cy:.6f} {nw:.6f} {nh:.6f}\n")  # (#5:YOLO line)

Quick Exercise: Annotation Formats

A bounding box is defined in COCO format as:

{"bbox": [120, 80, 200, 150], "image_id": 1}

The image is 640x480 pixels.

Question 1

What are the top-left coordinates (x1, y1) of this box?

Hint: COCO format = [x, y, w, h]

Question 2

What is the center point (cx, cy) in normalized YOLO coordinates?

Formula: cx = (x + w/2) / img_width

Question 3

What would be the full YOLO line for class 0?

Format: class cx cy w h

Annotation Best Practices

Do

  • Create clear labeling guidelines
  • Use tight bounding boxes
  • Include occluded objects
  • Double-check difficult cases
  • Use consistent naming conventions

Don't

  • Skip edge cases
  • Leave gaps around objects
  • Include excessive background
  • Mix annotation styles
  • Ignore ambiguous objects

Quality tip: Have multiple annotators label the same images and measure inter-annotator agreement.
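
One simple way to quantify agreement for bounding boxes is the average best-match IoU between two annotators; a minimal sketch (assumes [x1, y1, x2, y2] corner boxes):

def box_iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes"""  # (#1:Overlap metric)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union else 0.0

def mean_best_iou(boxes_a, boxes_b):
    """Average best-match IoU between two annotators on one image"""  # (#2:Rough agreement score)
    if not boxes_a or not boxes_b:
        return 0.0
    return sum(max(box_iou(a, b) for b in boxes_b) for a in boxes_a) / len(boxes_a)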

Visual Inspection Techniques

import matplotlib.pyplot as plt
import cv2
import os
import random

def visualize_samples(image_dir, num_samples=9):
    """Display random samples from dataset"""  # (#1:Quick dataset overview)
    images = os.listdir(image_dir)
    samples = random.sample(images, min(num_samples, len(images)))

    fig, axes = plt.subplots(3, 3, figsize=(12, 12))
    for ax, img_name in zip(axes.flat, samples):
        img = cv2.imread(os.path.join(image_dir, img_name))
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # (#2:BGR to RGB)
        ax.imshow(img)
        ax.set_title(img_name[:20])  # (#3:Truncate long names)
        ax.axis('off')
    plt.tight_layout()
    plt.show()

Visualizing Annotations

def draw_yolo_boxes(image_path, label_path, class_names):
    """Draw YOLO format bboxes on image"""  # (#1:Verify annotations)
    img = cv2.imread(image_path)
    h, w = img.shape[:2]

    with open(label_path) as f:
        for line in f:
            parts = line.strip().split()
            cls_id = int(parts[0])
            cx, cy, bw, bh = map(float, parts[1:])  # (#2:Parse YOLO format)

            # Convert to pixel coordinates
            x1 = int((cx - bw/2) * w)  # (#3:Denormalize)
            y1 = int((cy - bh/2) * h)
            x2 = int((cx + bw/2) * w)
            y2 = int((cy + bh/2) * h)

            cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)  # (#4:Draw bbox)
            cv2.putText(img, class_names[cls_id], (x1, y1-10),
                       cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
    return img
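
Example usage with matplotlib (file paths are hypothetical and follow the YOLO folder layout shown later):

img = draw_yolo_boxes("train/images/img001.jpg", "train/labels/img001.txt", ['cat', 'dog'])
plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))  # convert BGR before display
plt.axis('off')
plt.show()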

Statistical Analysis

import numpy as np
from collections import Counter

def analyze_dataset(label_dir):
    """Compute dataset statistics"""  # (#1:Essential analysis)
    class_counts = Counter()
    bbox_sizes = []

    for label_file in os.listdir(label_dir):
        with open(os.path.join(label_dir, label_file)) as f:
            for line in f:
                parts = line.strip().split()
                class_counts[int(parts[0])] += 1  # (#2:Count classes)
                w, h = float(parts[3]), float(parts[4])
                bbox_sizes.append(w * h)  # (#3:Bbox area)

    print("Class Distribution:")
    for cls, count in sorted(class_counts.items()):
        print(f"  Class {cls}: {count} ({count/sum(class_counts.values())*100:.1f}%)")

    print(f"\nBbox Statistics:")
    print(f"  Mean area: {np.mean(bbox_sizes):.4f}")  # (#4:Size statistics)
    print(f"  Std area: {np.std(bbox_sizes):.4f}")

Image Dimensions Analysis

import cv2
import matplotlib.pyplot as plt
from collections import defaultdict

def analyze_dimensions(image_dir):
    """Analyze image sizes in dataset"""  # (#1:Size distribution)
    widths, heights = [], []
    aspect_ratios = []

    for img_name in os.listdir(image_dir):
        img = cv2.imread(os.path.join(image_dir, img_name))
        if img is not None:
            h, w = img.shape[:2]
            widths.append(w)
            heights.append(h)
            aspect_ratios.append(w/h)  # (#2:Aspect ratio)

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    axes[0].hist(widths, bins=30)  # (#3:Width distribution)
    axes[0].set_title('Width Distribution')
    axes[1].hist(heights, bins=30)  # (#4:Height distribution)
    axes[1].set_title('Height Distribution')
    axes[2].hist(aspect_ratios, bins=30)
    axes[2].set_title('Aspect Ratio Distribution')
    plt.tight_layout()
    plt.show()

Color Distribution Analysis

def analyze_color_distribution(image_dir, sample_size=100):
    """Analyze color channels across dataset"""  # (#1:Color statistics)
    all_means = {'R': [], 'G': [], 'B': []}

    file_names = os.listdir(image_dir)
    images = random.sample(file_names, min(sample_size, len(file_names)))

    for img_name in images:
        img = cv2.imread(os.path.join(image_dir, img_name))
        if img is None:  # skip unreadable files
            continue
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        all_means['R'].append(img[:,:,0].mean())  # (#2:Channel means)
        all_means['G'].append(img[:,:,1].mean())
        all_means['B'].append(img[:,:,2].mean())

    print("Channel Statistics (for normalization):")
    for channel, values in all_means.items():
        print(f"  {channel}: mean={np.mean(values)/255:.3f}, "
              f"std={np.std(values)/255:.3f}")  # (#3:For normalization)
Use case: These statistics help set proper normalization values for training.
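
A sketch of how the printed values could be plugged into torchvision's normalization transform; the numbers below are placeholders, not computed results:

from torchvision import transforms

normalize = transforms.Normalize(  # (#1:Plug in your dataset's statistics)
    mean=[0.48, 0.46, 0.41],  # placeholder per-channel means
    std=[0.23, 0.22, 0.22]    # placeholder per-channel stds
)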

Quality Checks

Blur Detection

Identify out-of-focus images using Laplacian variance

Noise Detection

Find images with excessive noise or compression artifacts

Duplicate Detection

Remove exact or near-duplicate images

Blur Detection

def detect_blur(image_path, threshold=100):
    """Detect blurry images using Laplacian variance"""  # (#1:Sharpness metric)
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    laplacian_var = cv2.Laplacian(img, cv2.CV_64F).var()  # (#2:Edge detection)

    return laplacian_var < threshold, laplacian_var  # (#3:Lower = blurrier)

def find_blurry_images(image_dir, threshold=100):
    """Find all blurry images in directory"""
    blurry = []
    for img_name in os.listdir(image_dir):
        img_path = os.path.join(image_dir, img_name)
        is_blurry, score = detect_blur(img_path, threshold)
        if is_blurry:
            blurry.append((img_name, score))  # (#4:Track blurry images)

    print(f"Found {len(blurry)} blurry images")
    return sorted(blurry, key=lambda x: x[1])  # (#5:Sort by blur score)
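
The noise check listed under Quality Checks can be approximated by comparing an image with a median-filtered copy; a rough heuristic sketch (the resulting score needs a dataset-specific threshold):

def estimate_noise(image_path):
    """Rough noise score: mean deviation from a median-filtered copy"""  # (#1:Heuristic, not a calibrated metric)
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        return 0.0  # unreadable file
    denoised = cv2.medianBlur(img, 3)  # (#2:3x3 median filter)
    return float(np.abs(img.astype(np.float32) - denoised.astype(np.float32)).mean())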

Duplicate Detection

import hashlib
from PIL import Image
import imagehash  # (#1:pip install imagehash)

def find_exact_duplicates(image_dir):
    """Find exact duplicates using MD5 hash"""
    hashes = {}
    duplicates = []

    for img_name in os.listdir(image_dir):
        with open(os.path.join(image_dir, img_name), 'rb') as f:
            img_hash = hashlib.md5(f.read()).hexdigest()  # (#2:File hash)

        if img_hash in hashes:
            duplicates.append((img_name, hashes[img_hash]))
        else:
            hashes[img_hash] = img_name
    return duplicates

def find_similar_images(image_dir, threshold=5):
    """Find similar images using perceptual hash"""  # (#3:Near-duplicates)
    hashes = {}
    for img_name in os.listdir(image_dir):
        img = Image.open(os.path.join(image_dir, img_name))
        hashes[img_name] = imagehash.phash(img)  # (#4:Perceptual hash)

    similar = []
    names = list(hashes)
    for i, a in enumerate(names):
        for b in names[i+1:]:
            if hashes[a] - hashes[b] < threshold:  # (#5:Hash distance below threshold = similar)
                similar.append((a, b))
    return similar

Handling Class Imbalance

Example of an imbalanced dataset (ratio 45:4:1): Class A has 900 samples (90%), Class B has 80 (8%), Class C has 20 (2%). Left unaddressed, this biases the model toward Class A. Balancing strategies: oversampling (duplicate minority samples), class weights (w = 1/frequency), and focal loss (γ = 2.0).

The Problem

  • Model ignores minority classes
  • Poor recall on rare classes

Solutions

  • Resampling - Over/undersample
  • Class weights - Penalize errors
  • Focal loss - Focus on hard examples

Resampling Techniques

from sklearn.utils import resample
import numpy as np

def oversample_minority(X, y, target_count=None):
    """Oversample minority classes to balance dataset"""  # (#1:Balance classes)
    classes, counts = np.unique(y, return_counts=True)
    max_count = target_count or counts.max()

    X_balanced, y_balanced = [], []

    for cls in classes:
        X_cls = X[y == cls]
        y_cls = y[y == cls]

        if len(X_cls) < max_count:
            X_resampled, y_resampled = resample(  # (#2:Sklearn resample)
                X_cls, y_cls,
                replace=True,  # (#3:With replacement)
                n_samples=max_count,
                random_state=42
            )
        else:
            X_resampled, y_resampled = X_cls, y_cls

        X_balanced.extend(X_resampled)
        y_balanced.extend(y_resampled)

    return np.array(X_balanced), np.array(y_balanced)

Class Weights & Focal Loss

from sklearn.utils.class_weight import compute_class_weight
import torch
import torch.nn as nn

# Compute balanced class weights
class_weights = compute_class_weight(  # (#1:Auto-compute weights)
    class_weight='balanced',
    classes=np.unique(y_train),
    y=y_train
)

# Use in loss function (PyTorch)
weights = torch.tensor(class_weights, dtype=torch.float32)
criterion = nn.CrossEntropyLoss(weight=weights)  # (#2:Weighted loss)

# Focal Loss - focuses on hard examples
class FocalLoss(nn.Module):
    def __init__(self, alpha=1, gamma=2):  # (#3:gamma controls focus)
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, inputs, targets):
        ce_loss = nn.CrossEntropyLoss(reduction='none')(inputs, targets)
        pt = torch.exp(-ce_loss)  # (#4:Probability of correct class)
        focal_loss = self.alpha * (1-pt)**self.gamma * ce_loss
        return focal_loss.mean()
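
A toy usage example of the class above (random logits and targets just to exercise the loss):

logits = torch.randn(4, 3, requires_grad=True)  # 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 0])
criterion = FocalLoss(alpha=1, gamma=2)
loss = criterion(logits, targets)
loss.backward()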

Quick Exercise: Class Weights

Given a dataset with the following class distribution:

Question 1

What is the imbalance ratio between Cat and Bird?

Answer: 500/100 = ?

Question 2

If using inverse frequency weights, what weight should Bird have?

Formula: weight = total / (n_classes * class_count)

Question 3

How many Bird images would you need to oversample to match Cat?

Think: Current + needed = target

Data Organization

YOLO Structure

dataset/
  train/
    images/
      img001.jpg
      img002.jpg
    labels/
      img001.txt
      img002.txt
  valid/
    images/
    labels/
  test/
    images/
    labels/
  data.yaml

ImageFolder Structure

dataset/
  train/
    class_a/
      img001.jpg
      img002.jpg
    class_b/
      img001.jpg
      img002.jpg
  valid/
    class_a/
    class_b/
  test/
    class_a/
    class_b/

Tip: PyTorch's ImageFolder automatically creates labels from directory names.
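
A minimal sketch of loading the ImageFolder layout above with torchvision (the resize size is an assumption):

from torchvision import datasets, transforms
from torch.utils.data import DataLoader

transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_ds = datasets.ImageFolder('dataset/train', transform=transform)  # (#1:Labels from folder names)
print(train_ds.class_to_idx)  # e.g. {'class_a': 0, 'class_b': 1}
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)  # (#2:Ready for training)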

Data Splitting Strategies

from sklearn.model_selection import train_test_split, StratifiedKFold

# Simple split (80/10/10)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # (#1:Preserve class ratio)
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

# K-Fold Cross Validation
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # (#2:5-fold CV)

for fold, (train_idx, val_idx) in enumerate(kfold.split(X, y)):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    print(f"Fold {fold}: Train={len(train_idx)}, Val={len(val_idx)}")  # (#3:Train each fold)

Data Splitting Best Practices

Stratified Splits

Maintain class distribution across all splits

No Data Leakage

Same object should not appear in train and test

Time-based Splits

For temporal data, split chronologically

Dataset Size | Recommended Split
Small (<1K) | K-fold cross-validation
Medium (1K-10K) | 70/15/15 or 80/10/10
Large (>10K) | 90/5/5 or fixed val/test
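
To enforce the no-leakage rule, related images (e.g. frames from the same video or photos of the same object) can share a group id so they land in the same split; a sketch with scikit-learn's GroupShuffleSplit (the group ids here are illustrative):

from sklearn.model_selection import GroupShuffleSplit
import numpy as np

X = np.arange(10)                                  # 10 images
y = np.zeros(10, dtype=int)                        # dummy labels
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])  # (#1:Same object/video = same group)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))  # (#2:No group crosses the split)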

Data Versioning Best Practices

What to Version

  • Raw data - Original, unprocessed images
  • Annotations - Labels in standard formats
  • Processing scripts - Code that transforms data
  • Configuration - Split ratios, parameters

What NOT to Version

  • Processed data - Can be regenerated
  • Augmented images - Generated on-the-fly
  • Intermediate artifacts - Temporary files

Industry best practice: Share only raw data + processing scripts. Anyone can reproduce the exact processed dataset.

Data Storage Solutions

Solution | Best For | Versioning
Cloud Storage (S3, GCS) | Large datasets, team collaboration | Object versioning, lifecycle policies
Hugging Face Datasets | Public datasets, ML community | Git-based, automatic splits
Git LFS | Small-medium datasets | Git integration, simple workflow

Tip: Document your data lineage: source, collection date, preprocessing steps, and any known issues.

Reproducibility Workflow

# data_config.yaml - Version this file in Git
"""
dataset:
  name: traffic_signs_v2
  raw_source: s3://bucket/raw/traffic_signs/
  train_split: 0.7
  val_split: 0.15
  test_split: 0.15
  seed: 42
  preprocessing:
    resize: [224, 224]
    normalize: imagenet
"""

# prepare_data.py - Deterministic processing
import random
import numpy as np

def prepare_dataset(config):
    # Set seeds for reproducibility  # (#1:Deterministic splits)
    random.seed(config['seed'])
    np.random.seed(config['seed'])
    
    # Download raw data
    raw_data = download_raw(config['raw_source'])  # (#2:Fetch raw only)
    
    # Apply processing (can always be re-run)
    processed = preprocess(raw_data, config['preprocessing'])  # (#3:Reproducible)
    
    # Split deterministically
    return split_data(processed, config)  # (#4:Same splits every time)

Hands-on Lab: Dataset Preparation

Objectives

Exercises

  1. Load a detection dataset and visualize annotations
  2. Calculate and plot class distribution
  3. Find and remove low-quality images
  4. Convert COCO annotations to YOLO format

Lab: Getting Started

# Option 1: Use Roboflow
from roboflow import Roboflow
rf = Roboflow(api_key="your_key")  # (#1:Free account)
project = rf.workspace().project("coco-dataset")
dataset = project.version(1).download("coco")

# Option 2: Use Kaggle
!kaggle datasets download -d dataset-name  # (#2:Kaggle CLI)

# Option 3: Use torchvision
from torchvision.datasets import VOCDetection
dataset = VOCDetection(  # (#3:Pascal VOC)
    root='./data',
    year='2012',
    image_set='train',
    download=True
)

Key Takeaways

Quality Over Quantity

Well-annotated data is more valuable than large, noisy datasets

Know Your Data

Statistical analysis reveals issues before training

Version Everything

Track data changes like code for reproducibility

Next Session Preview

Session 3: Data Augmentation

Preparation: Install albumentations: pip install albumentations

Resources

Type | Resource
Datasets | Roboflow Universe
Tool | LabelImg GitHub
Tool | CVAT - Computer Vision Annotation Tool
Documentation | DVC Documentation
Guide | COCO Data Format
Article | Training Data Best Practices

Questions?

Lab Time

Download a dataset and perform exploratory analysis

Practical Work

Complete the data preparation exercises

Challenge

Create quality check scripts for your own dataset
