Collecting, annotating, and organizing datasets for CV projects
Public datasets: ready-to-use, curated datasets for common tasks
Self-collected data: domain-specific data you gather yourself
Synthetic data: computer-generated data for rare scenarios
Key insight: Data quality matters more than quantity. A smaller, well-annotated dataset often outperforms a large, noisy one.
| Platform | Strengths | Best For |
|---|---|---|
| Kaggle | Competitions, community notebooks | Learning, benchmarking |
| Google Dataset Search | Broad coverage, research focus | Academic datasets |
| Roboflow Universe | Pre-annotated, multiple formats | Object detection, quick start |
| Hugging Face Datasets | Easy loading, standardized | Integration with transformers |
| Papers With Code | Linked to research papers | State-of-the-art benchmarks |
# Kaggle API
import kaggle # (#1:Install: pip install kaggle)
kaggle.api.dataset_download_files('username/dataset-name', unzip=True)
# Hugging Face Datasets
from datasets import load_dataset # (#2:pip install datasets)
dataset = load_dataset("cifar10")
# Roboflow
from roboflow import Roboflow # (#3:pip install roboflow)
rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace("workspace").project("project")
dataset = project.version(1).download("yolov8") # (#4:Multiple format options)
import requests
from bs4 import BeautifulSoup # (#1:pip install beautifulsoup4)
import os
def download_images(url, save_dir, limit=100):
os.makedirs(save_dir, exist_ok=True) # (#2:Create directory)
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
images = soup.find_all('img') # (#3:Find all img tags)
for i, img in enumerate(images[:limit]):
img_url = img.get('src')
if img_url and img_url.startswith('http'):
try:
img_data = requests.get(img_url).content
with open(f"{save_dir}/image_{i}.jpg", 'wb') as f:
f.write(img_data) # (#4:Save image)
except Exception as e:
print(f"Error: {e}")
Warning: Always check website terms of service and robots.txt before scraping.
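You can automate part of that check with the standard library; a minimal sketch using urllib.robotparser (the URL and user agent are placeholders), reusing the download_images function above:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_scrape(page_url, user_agent="*"):
    """Return True if the site's robots.txt permits fetching this URL"""
    parts = urlparse(page_url)
    rp = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # Download and parse robots.txt
    return rp.can_fetch(user_agent, page_url)

# Only scrape pages the site allows
if can_scrape("https://example.com/gallery"):
    download_images("https://example.com/gallery", "raw_images")
```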
Pro tip: Mix synthetic with real data; domain adaptation techniques help bridge the synthetic-to-real gap.
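A common starting point for synthetic data is compositing object cut-outs onto varied backgrounds. A minimal OpenCV sketch (paths are placeholders, and it assumes the object crop is smaller than the background):

```python
import random
import cv2

def composite(background_path, object_path, out_path):
    """Paste an object crop at a random position; returns the YOLO bbox"""
    bg = cv2.imread(background_path)
    obj = cv2.imread(object_path)
    bh, bw = bg.shape[:2]
    oh, ow = obj.shape[:2]
    x = random.randint(0, bw - ow)  # Random top-left corner
    y = random.randint(0, bh - oh)
    bg[y:y+oh, x:x+ow] = obj        # Naive paste (no blending)
    cv2.imwrite(out_path, bg)
    # The paste location gives the ground-truth box for free
    return [(x + ow/2)/bw, (y + oh/2)/bh, ow/bw, oh/bh]
```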
Images are automatically protected by copyright; check usage rights before using them.
Personal data such as faces requires consent under the EU's GDPR; anonymize it or obtain permission.
Understand the Creative Commons, MIT, and Apache licenses commonly attached to datasets.
| License | Commercial Use | Attribution | Share-Alike |
|---|---|---|---|
| CC0 (Public Domain) | Yes | No | No |
| CC BY | Yes | Yes | No |
| CC BY-SA | Yes | Yes | Yes |
| CC BY-NC | No | Yes | No |
| Research Only | No | Varies | Varies |
Important: Always document the license of each dataset you use for compliance.
| Tool | Type | Best For | Cost |
|---|---|---|---|
| LabelImg | Desktop | Simple bbox annotation | Free |
| CVAT | Web-based | Video, team projects | Free (self-hosted) |
| Label Studio | Web-based | Multi-modal data | Free / Enterprise |
| Roboflow | Cloud | End-to-end workflow | Freemium |
| V7 Labs | Cloud | AI-assisted annotation | Paid |
Different tools and frameworks expect data in specific formats:
| Format | Used By |
|---|---|
| COCO | Detectron2, MMDetection, benchmarks |
| YOLO | Ultralytics, Darknet, real-time apps |
| Pascal VOC | TensorFlow OD API, legacy tools |
Key insight: Understanding formats lets you use any dataset with any framework. Format conversion is a common preprocessing step.
{
"images": [ // (#1:List of all images)
{
"id": 1,
"file_name": "image_001.jpg",
"width": 640,
"height": 480
}
],
"annotations": [ // (#2:All annotations)
{
"id": 1,
"image_id": 1, // (#3:Links to image)
"category_id": 0,
"bbox": [100, 50, 200, 150], // (#4:[x, y, width, height])
"area": 30000,
"segmentation": [[...]] // (#5:Polygon points)
}
],
"categories": [ // (#6:Class definitions)
{"id": 0, "name": "cat"},
{"id": 1, "name": "dog"}
]
}
# class center_x center_y width height
0 0.45 0.52 0.31 0.42
1 0.72 0.38 0.15 0.25
0 0.21 0.65 0.18 0.30
All values normalized to 0-1 relative to image dimensions
train: ./train/images
val: ./valid/images
test: ./test/images
nc: 2 # number of classes
names: ['cat', 'dog']
<annotation>
<filename>image_001.jpg</filename> <!-- (#1:Image file name) -->
<size>
<width>640</width>
<height>480</height>
<depth>3</depth> <!-- (#2:Number of channels) -->
</size>
<object>
<name>cat</name> <!-- (#3:Class label) -->
<bndbox>
<xmin>100</xmin> <!-- (#4:Absolute pixel values) -->
<ymin>50</ymin>
<xmax>300</xmax>
<ymax>200</ymax>
</bndbox>
<difficult>0</difficult> <!-- (#5:Hard example flag) -->
</object>
</annotation>
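Since VOC annotations are plain XML, the standard library can read them; a minimal sketch with xml.etree, following the schema above:

```python
import xml.etree.ElementTree as ET

def parse_voc(xml_path):
    """Extract image size and (class, xmin, ymin, xmax, ymax) boxes"""
    root = ET.parse(xml_path).getroot()
    width = int(root.find('size/width').text)
    height = int(root.find('size/height').text)
    boxes = []
    for obj in root.findall('object'):
        bb = obj.find('bndbox')
        boxes.append((obj.find('name').text,
                      int(bb.find('xmin').text), int(bb.find('ymin').text),
                      int(bb.find('xmax').text), int(bb.find('ymax').text)))
    return width, height, boxes
```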
import json
import os
def coco_to_yolo(coco_json, output_dir):
    """Convert COCO format to YOLO format"""  # (#1:Conversion function)
    with open(coco_json) as f:
        data = json.load(f)
    os.makedirs(output_dir, exist_ok=True)
    images = {img['id']: img for img in data['images']}  # (#2:Per-image sizes)
    for ann in data['annotations']:
        img = images[ann['image_id']]
        x, y, w, h = ann['bbox']  # (#3:COCO uses [x, y, w, h])
        # Convert to YOLO: center_x, center_y, width, height (normalized)
        cx = (x + w/2) / img['width']  # (#4:Normalize to 0-1)
        cy = (y + h/2) / img['height']
        nw = w / img['width']
        nh = h / img['height']
        # Write YOLO format, one label file per image
        stem = os.path.splitext(img['file_name'])[0]
        with open(f"{output_dir}/{stem}.txt", 'a') as f:
            f.write(f"{ann['category_id']} {cx:.6f} {cy:.6f} {nw:.6f} {nh:.6f}\n")  # (#5:YOLO line)
A bounding box is defined in COCO format as:
{"bbox": [120, 80, 200, 150], "image_id": 1}
The image is 640x480 pixels.
What are the top-left coordinates (x1, y1) of this box?
Hint: COCO format = [x, y, w, h]
What is the center point (cx, cy) in normalized YOLO coordinates?
Formula: cx = (x + w/2) / img_width
What would be the full YOLO line for class 0?
Format: class cx cy w h
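To verify your answers, the arithmetic can be scripted directly (values taken from the exercise above):

```python
x, y, w, h = 120, 80, 200, 150   # COCO bbox: [x, y, w, h]
img_w, img_h = 640, 480

print("Top-left:", (x, y))                                # Q1
print("Center:", ((x + w/2) / img_w, (y + h/2) / img_h))  # Q2
print(f"YOLO: 0 {(x + w/2)/img_w:.6f} {(y + h/2)/img_h:.6f} "
      f"{w/img_w:.6f} {h/img_h:.6f}")                     # Q3
```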
Quality tip: Have multiple annotators label the same images and measure inter-annotator agreement.
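A simple way to quantify agreement on bounding boxes is to count matches above an IoU threshold; a minimal sketch, assuming each annotator's boxes are given as [x1, y1, x2, y2] lists:

```python
def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes"""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2]-a[0]) * (a[3]-a[1]) + (b[2]-b[0]) * (b[3]-b[1]) - inter
    return inter / (union + 1e-9)

def box_agreement(boxes_a, boxes_b, thresh=0.5):
    """Fraction of annotator A's boxes that annotator B also drew"""
    hits = sum(any(iou(a, b) >= thresh for b in boxes_b) for a in boxes_a)
    return hits / max(len(boxes_a), 1)
```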
import matplotlib.pyplot as plt
import cv2
import os
import random
def visualize_samples(image_dir, num_samples=9):
"""Display random samples from dataset""" # (#1:Quick dataset overview)
images = os.listdir(image_dir)
samples = random.sample(images, min(num_samples, len(images)))
fig, axes = plt.subplots(3, 3, figsize=(12, 12))
for ax, img_name in zip(axes.flat, samples):
img = cv2.imread(os.path.join(image_dir, img_name))
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) # (#2:BGR to RGB)
ax.imshow(img)
ax.set_title(img_name[:20]) # (#3:Truncate long names)
ax.axis('off')
plt.tight_layout()
plt.show()
def draw_yolo_boxes(image_path, label_path, class_names):
"""Draw YOLO format bboxes on image""" # (#1:Verify annotations)
img = cv2.imread(image_path)
h, w = img.shape[:2]
with open(label_path) as f:
for line in f:
parts = line.strip().split()
cls_id = int(parts[0])
cx, cy, bw, bh = map(float, parts[1:]) # (#2:Parse YOLO format)
# Convert to pixel coordinates
x1 = int((cx - bw/2) * w) # (#3:Denormalize)
y1 = int((cy - bh/2) * h)
x2 = int((cx + bw/2) * w)
y2 = int((cy + bh/2) * h)
cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2) # (#4:Draw bbox)
cv2.putText(img, class_names[cls_id], (x1, y1-10),
cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
return img
import numpy as np
from collections import Counter
def analyze_dataset(label_dir):
"""Compute dataset statistics""" # (#1:Essential analysis)
class_counts = Counter()
bbox_sizes = []
for label_file in os.listdir(label_dir):
with open(os.path.join(label_dir, label_file)) as f:
for line in f:
parts = line.strip().split()
class_counts[int(parts[0])] += 1 # (#2:Count classes)
w, h = float(parts[3]), float(parts[4])
bbox_sizes.append(w * h) # (#3:Bbox area)
print("Class Distribution:")
for cls, count in sorted(class_counts.items()):
print(f" Class {cls}: {count} ({count/sum(class_counts.values())*100:.1f}%)")
print(f"\nBbox Statistics:")
print(f" Mean area: {np.mean(bbox_sizes):.4f}") # (#4:Size statistics)
print(f" Std area: {np.std(bbox_sizes):.4f}")
import cv2
import matplotlib.pyplot as plt
from collections import defaultdict
def analyze_dimensions(image_dir):
"""Analyze image sizes in dataset""" # (#1:Size distribution)
widths, heights = [], []
aspect_ratios = []
for img_name in os.listdir(image_dir):
img = cv2.imread(os.path.join(image_dir, img_name))
if img is not None:
h, w = img.shape[:2]
widths.append(w)
heights.append(h)
aspect_ratios.append(w/h) # (#2:Aspect ratio)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].hist(widths, bins=30) # (#3:Width distribution)
axes[0].set_title('Width Distribution')
axes[1].hist(heights, bins=30) # (#4:Height distribution)
axes[1].set_title('Height Distribution')
axes[2].hist(aspect_ratios, bins=30)
axes[2].set_title('Aspect Ratio Distribution')
plt.tight_layout()
plt.show()
def analyze_color_distribution(image_dir, sample_size=100):
"""Analyze color channels across dataset""" # (#1:Color statistics)
all_means = {'R': [], 'G': [], 'B': []}
images = random.sample(os.listdir(image_dir),
min(sample_size, len(os.listdir(image_dir))))
for img_name in images:
img = cv2.imread(os.path.join(image_dir, img_name))
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
all_means['R'].append(img[:,:,0].mean()) # (#2:Channel means)
all_means['G'].append(img[:,:,1].mean())
all_means['B'].append(img[:,:,2].mean())
print("Channel Statistics (for normalization):")
for channel, values in all_means.items():
print(f" {channel}: mean={np.mean(values)/255:.3f}, "
f"std={np.std(values)/255:.3f}") # (#3:For normalization)
Use case: These statistics help set proper normalization values for training.
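The printed means and stds plug directly into a normalization transform; a minimal torchvision sketch (the numbers are placeholders for your own computed statistics):

```python
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),                         # Scales pixels to [0, 1]
    transforms.Normalize(mean=[0.48, 0.46, 0.41],  # Your computed channel means
                         std=[0.23, 0.22, 0.22]),  # Your computed channel stds
])
```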
Identify out-of-focus images using Laplacian variance
Find images with excessive noise or compression artifacts
Remove exact or near-duplicate images
def detect_blur(image_path, threshold=100):
"""Detect blurry images using Laplacian variance""" # (#1:Sharpness metric)
img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
laplacian_var = cv2.Laplacian(img, cv2.CV_64F).var() # (#2:Edge detection)
return laplacian_var < threshold, laplacian_var # (#3:Lower = blurrier)
def find_blurry_images(image_dir, threshold=100):
"""Find all blurry images in directory"""
blurry = []
for img_name in os.listdir(image_dir):
img_path = os.path.join(image_dir, img_name)
is_blurry, score = detect_blur(img_path, threshold)
if is_blurry:
blurry.append((img_name, score)) # (#4:Track blurry images)
print(f"Found {len(blurry)} blurry images")
return sorted(blurry, key=lambda x: x[1]) # (#5:Sort by blur score)
import hashlib
from PIL import Image
import imagehash # (#1:pip install imagehash)
def find_exact_duplicates(image_dir):
"""Find exact duplicates using MD5 hash"""
hashes = {}
duplicates = []
for img_name in os.listdir(image_dir):
with open(os.path.join(image_dir, img_name), 'rb') as f:
img_hash = hashlib.md5(f.read()).hexdigest() # (#2:File hash)
if img_hash in hashes:
duplicates.append((img_name, hashes[img_hash]))
else:
hashes[img_hash] = img_name
return duplicates
def find_similar_images(image_dir, threshold=5):
    """Find similar images using perceptual hash"""  # (#3:Near-duplicates)
    hashes = {}
    for img_name in os.listdir(image_dir):
        img = Image.open(os.path.join(image_dir, img_name))
        hashes[img_name] = imagehash.phash(img)  # (#4:Perceptual hash)
    # Compare all pairs - Hamming distance below threshold = similar
    names = list(hashes)
    similar = []
    for i, a in enumerate(names):
        for b in names[i+1:]:
            if hashes[a] - hashes[b] < threshold:  # (#5:Subtraction gives distance)
                similar.append((a, b))
    return similar
from sklearn.utils import resample
import numpy as np
def oversample_minority(X, y, target_count=None):
"""Oversample minority classes to balance dataset""" # (#1:Balance classes)
classes, counts = np.unique(y, return_counts=True)
max_count = target_count or counts.max()
X_balanced, y_balanced = [], []
for cls in classes:
X_cls = X[y == cls]
y_cls = y[y == cls]
if len(X_cls) < max_count:
X_resampled, y_resampled = resample( # (#2:Sklearn resample)
X_cls, y_cls,
replace=True, # (#3:With replacement)
n_samples=max_count,
random_state=42
)
else:
X_resampled, y_resampled = X_cls, y_cls
X_balanced.extend(X_resampled)
y_balanced.extend(y_resampled)
return np.array(X_balanced), np.array(y_balanced)
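A quick sanity check on toy data (the shapes are illustrative). Note that oversampling should be applied to the training split only, or the duplicates leak into validation:

```python
X = np.arange(10).reshape(-1, 1)    # 10 samples, 1 feature
y = np.array([0]*7 + [1]*3)         # Class 1 is the minority
X_bal, y_bal = oversample_minority(X, y)
print(np.bincount(y_bal))           # -> [7 7]
```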
from sklearn.utils.class_weight import compute_class_weight
import torch
import torch.nn as nn
# Compute balanced class weights
class_weights = compute_class_weight( # (#1:Auto-compute weights)
class_weight='balanced',
classes=np.unique(y_train),
y=y_train
)
# Use in loss function (PyTorch)
weights = torch.tensor(class_weights, dtype=torch.float32)
criterion = nn.CrossEntropyLoss(weight=weights) # (#2:Weighted loss)
# Focal Loss - focuses on hard examples
class FocalLoss(nn.Module):
def __init__(self, alpha=1, gamma=2): # (#3:gamma controls focus)
super().__init__()
self.alpha = alpha
self.gamma = gamma
def forward(self, inputs, targets):
ce_loss = nn.CrossEntropyLoss(reduction='none')(inputs, targets)
pt = torch.exp(-ce_loss) # (#4:Probability of correct class)
focal_loss = self.alpha * (1-pt)**self.gamma * ce_loss
return focal_loss.mean()
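Drop-in usage mirrors CrossEntropyLoss; a short check with dummy logits (shapes are illustrative):

```python
logits = torch.randn(8, 3)              # Batch of 8, 3 classes
targets = torch.randint(0, 3, (8,))
loss = FocalLoss(alpha=1, gamma=2)(logits, targets)
print(loss.item())
```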
Given a dataset with the following class distribution: Cat: 500 images, Bird: 100 images.
What is the imbalance ratio between Cat and Bird?
Answer: 500/100 = ?
If using inverse frequency weights, what weight should Bird have?
Formula: weight = total / (n_classes * class_count)
How many Bird images would you need to oversample to match Cat?
Think: Current + needed = target
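Assuming the two classes above (Cat: 500, Bird: 100), the answers can be checked directly:

```python
cat, bird = 500, 100
total, n_classes = cat + bird, 2
print("Imbalance ratio:", cat / bird)              # Q1 -> 5.0
print("Bird weight:", total / (n_classes * bird))  # Q2 -> 3.0
print("Images to add:", cat - bird)                # Q3 -> 400
```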
dataset/
train/
images/
img001.jpg
img002.jpg
labels/
img001.txt
img002.txt
valid/
images/
labels/
test/
images/
labels/
data.yaml
dataset/
train/
class_a/
img001.jpg
img002.jpg
class_b/
img001.jpg
img002.jpg
valid/
class_a/
class_b/
test/
class_a/
class_b/
Tip: PyTorch's ImageFolder automatically creates labels from directory names.
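A minimal sketch of loading the classification layout above with ImageFolder (the path matches the tree shown):

```python
from torchvision import datasets, transforms

train_ds = datasets.ImageFolder(
    'dataset/train',                 # Subfolder names become class labels
    transform=transforms.ToTensor()
)
print(train_ds.classes)              # ['class_a', 'class_b']
print(train_ds.class_to_idx)         # {'class_a': 0, 'class_b': 1}
```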
from sklearn.model_selection import train_test_split, StratifiedKFold
# Simple split (80/10/10)
X_train, X_temp, y_train, y_temp = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y # (#1:Preserve class ratio)
)
X_val, X_test, y_val, y_test = train_test_split(
X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)
# K-Fold Cross Validation
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) # (#2:5-fold CV)
for fold, (train_idx, val_idx) in enumerate(kfold.split(X, y)):
X_train, X_val = X[train_idx], X[val_idx]
y_train, y_val = y[train_idx], y[val_idx]
print(f"Fold {fold}: Train={len(train_idx)}, Val={len(val_idx)}") # (#3:Train each fold)
Maintain the class distribution across all splits (stratified splitting)
The same object or scene should not appear in both train and test; see the group-split sketch after the table below
For temporal data, split chronologically rather than randomly
| Dataset Size | Recommended Split |
|---|---|
| Small (<1K) | K-fold cross-validation |
| Medium (1K-10K) | 70/15/15 or 80/10/10 |
| Large (>10K) | 90/5/5 or fixed val/test |
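To enforce the no-leakage rule, split by group rather than by image; a minimal sketch with scikit-learn's GroupShuffleSplit, where `groups` (an assumption here) identifies the object or scene each image came from:

```python
from sklearn.model_selection import GroupShuffleSplit

# groups[i] = ID of the object/scene in image i; no ID crosses the split
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=groups))
```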
Industry best practice: Share only raw data + processing scripts. Anyone can reproduce the exact processed dataset.
| Solution | Best For | Versioning |
|---|---|---|
| Cloud Storage (S3, GCS) | Large datasets, team collaboration | Object versioning, lifecycle policies |
| Hugging Face Datasets | Public datasets, ML community | Git-based, automatic splits |
| Git LFS | Small-medium datasets | Git integration, simple workflow |
Tip: Document your data lineage: source, collection date, preprocessing steps, and any known issues.
# data_config.yaml - Version this file in Git
"""
dataset:
name: traffic_signs_v2
raw_source: s3://bucket/raw/traffic_signs/
train_split: 0.7
val_split: 0.15
test_split: 0.15
seed: 42
preprocessing:
resize: [224, 224]
normalize: imagenet
"""
# prepare_data.py - Deterministic processing
import random
import numpy as np
def prepare_dataset(config):
# Set seeds for reproducibility # (#1:Deterministic splits)
random.seed(config['seed'])
np.random.seed(config['seed'])
    # Download raw data (download_raw, preprocess, and split_data below
    # are project-specific helpers, shown here as a sketch)
    raw_data = download_raw(config['raw_source'])  # (#2:Fetch raw only)
# Apply processing (can always be re-run)
processed = preprocess(raw_data, config['preprocessing']) # (#3:Reproducible)
# Split deterministically
return split_data(processed, config) # (#4:Same splits every time)
# Option 1: Use Roboflow
from roboflow import Roboflow
rf = Roboflow(api_key="your_key") # (#1:Free account)
project = rf.workspace().project("coco-dataset")
dataset = project.version(1).download("coco")
# Option 2: Use Kaggle
!kaggle datasets download -d dataset-name # (#2:Kaggle CLI)
# Option 3: Use torchvision
from torchvision.datasets import VOCDetection
dataset = VOCDetection( # (#3:Pascal VOC)
root='./data',
year='2012',
image_set='train',
download=True
)
Well-annotated data is more valuable than large, noisy datasets
Statistical analysis reveals issues before training
Track data changes like code for reproducibility
Preparation: Install albumentations: pip install albumentations
| Type | Resource |
|---|---|
| Datasets | Roboflow Universe |
| Tool | LabelImg GitHub |
| Tool | CVAT - Computer Vision Annotation Tool |
| Documentation | DVC Documentation |
| Guide | COCO Data Format |
| Article | Training Data Best Practices |
Download a dataset and perform exploratory analysis
Complete the data preparation exercises
Create quality check scripts for your own dataset