Collecting, annotating, and organizing datasets for CV projects
Public datasets: ready-to-use, curated datasets for common tasks
Self-collected data: domain-specific data you gather yourself
Synthetic data: computer-generated data for rare scenarios
Key insight: Data quality matters more than quantity. A smaller, well-annotated dataset often outperforms a large, noisy one.
| Platform | Strengths | Best For |
|---|---|---|
| Kaggle | Competitions, community notebooks | Learning, benchmarking |
| Google Dataset Search | Broad coverage, research focus | Academic datasets |
| Roboflow Universe | Pre-annotated, multiple formats | Object detection, quick start |
| Hugging Face Datasets | Easy loading, standardized | Integration with transformers |
| Papers With Code | Linked to research papers | State-of-the-art benchmarks |
# Kaggle API
import kaggle # (#1:Install: pip install kaggle)
kaggle.api.dataset_download_files('username/dataset-name', unzip=True)
# Hugging Face Datasets
from datasets import load_dataset # (#2:pip install datasets)
dataset = load_dataset("cifar10")
# Roboflow
from roboflow import Roboflow # (#3:pip install roboflow)
rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace("workspace").project("project")
dataset = project.version(1).download("yolov8") # (#4:Multiple format options)
import requests
from bs4 import BeautifulSoup # (#1:pip install beautifulsoup4)
import os
def download_images(url, save_dir, limit=100):
os.makedirs(save_dir, exist_ok=True) # (#2:Create directory)
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
images = soup.find_all('img') # (#3:Find all img tags)
for i, img in enumerate(images[:limit]):
img_url = img.get('src')
if img_url and img_url.startswith('http'):
try:
img_data = requests.get(img_url).content
with open(f"{save_dir}/image_{i}.jpg", 'wb') as f:
f.write(img_data) # (#4:Save image)
except Exception as e:
print(f"Error: {e}")
Warning: Always check website terms of service and robots.txt before scraping.
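You can automate part of that check with the standard library; a minimal sketch using urllib.robotparser (the URL and user agent are placeholders), reusing the download_images function above:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_scrape(page_url, user_agent="*"):
    """Return True if the site's robots.txt permits fetching this URL"""
    parts = urlparse(page_url)
    rp = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # Download and parse robots.txt
    return rp.can_fetch(user_agent, page_url)

# Only scrape pages the site allows
if can_scrape("https://example.com/gallery"):
    download_images("https://example.com/gallery", "raw_images")
```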
Pro tip: Mix synthetic with real data; domain adaptation techniques help bridge the synthetic-to-real gap.
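A common starting point for synthetic data is compositing object cut-outs onto varied backgrounds. A minimal OpenCV sketch (paths are placeholders, and it assumes the object crop is smaller than the background):

```python
import random
import cv2

def composite(background_path, object_path, out_path):
    """Paste an object crop at a random position; returns the YOLO bbox"""
    bg = cv2.imread(background_path)
    obj = cv2.imread(object_path)
    bh, bw = bg.shape[:2]
    oh, ow = obj.shape[:2]
    x = random.randint(0, bw - ow)  # Random top-left corner
    y = random.randint(0, bh - oh)
    bg[y:y+oh, x:x+ow] = obj        # Naive paste (no blending)
    cv2.imwrite(out_path, bg)
    # The paste location gives the ground-truth box for free
    return [(x + ow/2)/bw, (y + oh/2)/bh, ow/bw, oh/bh]
```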
Images are automatically protected by copyright; check usage rights before using them.
Personal data such as faces requires consent under the EU's GDPR; anonymize it or obtain permission.
Understand the Creative Commons, MIT, and Apache licenses commonly attached to datasets.
| License | Commercial Use | Attribution | Share-Alike |
|---|---|---|---|
| CC0 (Public Domain) | Yes | No | No |
| CC BY | Yes | Yes | No |
| CC BY-SA | Yes | Yes | Yes |
| CC BY-NC | No | Yes | No |
| Research Only | No | Varies | Varies |
Important: Always document the license of each dataset you use for compliance.
| Tool | Type | Best For | Cost |
|---|---|---|---|
| LabelImg | Desktop | Simple bbox annotation | Free |
| CVAT | Web-based | Video, team projects | Free (self-hosted) |
| Label Studio | Web-based | Multi-modal data | Free / Enterprise |
| Roboflow | Cloud | End-to-end workflow | Freemium |
| V7 Labs | Cloud | AI-assisted annotation | Paid |
Different tools and frameworks expect data in specific formats:
| Format | Used By |
|---|---|
| COCO | Detectron2, MMDetection, benchmarks |
| YOLO | Ultralytics, Darknet, real-time apps |
| Pascal VOC | TensorFlow OD API, legacy tools |
Key insight: Understanding formats lets you use any dataset with any framework. Format conversion is a common preprocessing step.
{
"images": [ // (#1:List of all images)
{
"id": 1,
"file_name": "image_001.jpg",
"width": 640,
"height": 480
}
],
"annotations": [ // (#2:All annotations)
{
"id": 1,
"image_id": 1, // (#3:Links to image)
"category_id": 0,
"bbox": [100, 50, 200, 150], // (#4:[x, y, width, height])
"area": 30000,
"segmentation": [[...]] // (#5:Polygon points)
}
],
"categories": [ // (#6:Class definitions)
{"id": 0, "name": "cat"},
{"id": 1, "name": "dog"}
]
}
# class center_x center_y width height
0 0.45 0.52 0.31 0.42
1 0.72 0.38 0.15 0.25
0 0.21 0.65 0.18 0.30
All values normalized to 0-1 relative to image dimensions
train: ./train/images
val: ./valid/images
test: ./test/images
nc: 2 # number of classes
names: ['cat', 'dog']
<annotation>
<filename>image_001.jpg</filename> <!-- (#1:Image file name) -->
<size>
<width>640</width>
<height>480</height>
<depth>3</depth> <!-- (#2:Number of channels) -->
</size>
<object>
<name>cat</name> <!-- (#3:Class label) -->
<bndbox>
<xmin>100</xmin> <!-- (#4:Absolute pixel values) -->
<ymin>50</ymin>
<xmax>300</xmax>
<ymax>200</ymax>
</bndbox>
<difficult>0</difficult> <!-- (#5:Hard example flag) -->
</object>
</annotation>
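Since VOC annotations are plain XML, the standard library can read them; a minimal sketch with xml.etree, following the schema above:

```python
import xml.etree.ElementTree as ET

def parse_voc(xml_path):
    """Extract image size and (class, xmin, ymin, xmax, ymax) boxes"""
    root = ET.parse(xml_path).getroot()
    width = int(root.find('size/width').text)
    height = int(root.find('size/height').text)
    boxes = []
    for obj in root.findall('object'):
        bb = obj.find('bndbox')
        boxes.append((obj.find('name').text,
                      int(bb.find('xmin').text), int(bb.find('ymin').text),
                      int(bb.find('xmax').text), int(bb.find('ymax').text)))
    return width, height, boxes
```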
import json
import os
def coco_to_yolo(coco_json, output_dir):
    """Convert COCO format to YOLO format"""  # (#1:Conversion function)
    with open(coco_json) as f:
        data = json.load(f)
    os.makedirs(output_dir, exist_ok=True)
    images = {img['id']: img for img in data['images']}  # (#2:Per-image sizes)
    for ann in data['annotations']:
        img = images[ann['image_id']]
        x, y, w, h = ann['bbox']  # (#3:COCO uses [x, y, w, h])
        # Convert to YOLO: center_x, center_y, width, height (normalized)
        cx = (x + w/2) / img['width']  # (#4:Normalize to 0-1)
        cy = (y + h/2) / img['height']
        nw = w / img['width']
        nh = h / img['height']
        # Write YOLO format, one label file per image
        stem = os.path.splitext(img['file_name'])[0]
        with open(f"{output_dir}/{stem}.txt", 'a') as f:
            f.write(f"{ann['category_id']} {cx:.6f} {cy:.6f} {nw:.6f} {nh:.6f}\n")  # (#5:YOLO line)
A bounding box is defined in COCO format as:
{"bbox": [120, 80, 200, 150], "image_id": 1}
The image is 640x480 pixels.
What are the top-left coordinates (x1, y1) of this box?
Hint: COCO format = [x, y, w, h]
What is the center point (cx, cy) in normalized YOLO coordinates?
Formula: cx = (x + w/2) / img_width
What would be the full YOLO line for class 0?
Format: class cx cy w h
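To verify your answers, the arithmetic can be scripted directly (values taken from the exercise above):

```python
x, y, w, h = 120, 80, 200, 150   # COCO bbox: [x, y, w, h]
img_w, img_h = 640, 480

print("Top-left:", (x, y))                                # Q1
print("Center:", ((x + w/2) / img_w, (y + h/2) / img_h))  # Q2
print(f"YOLO: 0 {(x + w/2)/img_w:.6f} {(y + h/2)/img_h:.6f} "
      f"{w/img_w:.6f} {h/img_h:.6f}")                     # Q3
```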
Quality tip: Have multiple annotators label the same images and measure inter-annotator agreement.
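A simple way to quantify agreement on bounding boxes is to count matches above an IoU threshold; a minimal sketch, assuming each annotator's boxes are given as [x1, y1, x2, y2] lists:

```python
def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes"""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2]-a[0]) * (a[3]-a[1]) + (b[2]-b[0]) * (b[3]-b[1]) - inter
    return inter / (union + 1e-9)

def box_agreement(boxes_a, boxes_b, thresh=0.5):
    """Fraction of annotator A's boxes that annotator B also drew"""
    hits = sum(any(iou(a, b) >= thresh for b in boxes_b) for a in boxes_a)
    return hits / max(len(boxes_a), 1)
```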
import matplotlib.pyplot as plt
import cv2
import os
import random
def visualize_samples(image_dir, num_samples=9):
"""Display random samples from dataset""" # (#1:Quick dataset overview)
images = os.listdir(image_dir)
samples = random.sample(images, min(num_samples, len(images)))
fig, axes = plt.subplots(3, 3, figsize=(12, 12))
for ax, img_name in zip(axes.flat, samples):
img = cv2.imread(os.path.join(image_dir, img_name))
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) # (#2:BGR to RGB)
ax.imshow(img)
ax.set_title(img_name[:20]) # (#3:Truncate long names)
ax.axis('off')
plt.tight_layout()
plt.show()
def draw_yolo_boxes(image_path, label_path, class_names):
"""Draw YOLO format bboxes on image""" # (#1:Verify annotations)
img = cv2.imread(image_path)
h, w = img.shape[:2]
with open(label_path) as f:
for line in f:
parts = line.strip().split()
cls_id = int(parts[0])
cx, cy, bw, bh = map(float, parts[1:]) # (#2:Parse YOLO format)
# Convert to pixel coordinates
x1 = int((cx - bw/2) * w) # (#3:Denormalize)
y1 = int((cy - bh/2) * h)
x2 = int((cx + bw/2) * w)
y2 = int((cy + bh/2) * h)
cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2) # (#4:Draw bbox)
cv2.putText(img, class_names[cls_id], (x1, y1-10),
cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
return img
import numpy as np
from collections import Counter
def analyze_dataset(label_dir):
"""Compute dataset statistics""" # (#1:Essential analysis)
class_counts = Counter()
bbox_sizes = []
for label_file in os.listdir(label_dir):
with open(os.path.join(label_dir, label_file)) as f:
for line in f:
parts = line.strip().split()
class_counts[int(parts[0])] += 1 # (#2:Count classes)
w, h = float(parts[3]), float(parts[4])
bbox_sizes.append(w * h) # (#3:Bbox area)
print("Class Distribution:")
for cls, count in sorted(class_counts.items()):
print(f" Class {cls}: {count} ({count/sum(class_counts.values())*100:.1f}%)")
print(f"\nBbox Statistics:")
print(f" Mean area: {np.mean(bbox_sizes):.4f}") # (#4:Size statistics)
print(f" Std area: {np.std(bbox_sizes):.4f}")
import cv2
import matplotlib.pyplot as plt
from collections import defaultdict
def analyze_dimensions(image_dir):
"""Analyze image sizes in dataset""" # (#1:Size distribution)
widths, heights = [], []
aspect_ratios = []
for img_name in os.listdir(image_dir):
img = cv2.imread(os.path.join(image_dir, img_name))
if img is not None:
h, w = img.shape[:2]
widths.append(w)
heights.append(h)
aspect_ratios.append(w/h) # (#2:Aspect ratio)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].hist(widths, bins=30) # (#3:Width distribution)
axes[0].set_title('Width Distribution')
axes[1].hist(heights, bins=30) # (#4:Height distribution)
axes[1].set_title('Height Distribution')
axes[2].hist(aspect_ratios, bins=30)
axes[2].set_title('Aspect Ratio Distribution')
plt.tight_layout()
plt.show()
def analyze_color_distribution(image_dir, sample_size=100):
"""Analyze color channels across dataset""" # (#1:Color statistics)
all_means = {'R': [], 'G': [], 'B': []}
images = random.sample(os.listdir(image_dir),
min(sample_size, len(os.listdir(image_dir))))
for img_name in images:
img = cv2.imread(os.path.join(image_dir, img_name))
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
all_means['R'].append(img[:,:,0].mean()) # (#2:Channel means)
all_means['G'].append(img[:,:,1].mean())
all_means['B'].append(img[:,:,2].mean())
print("Channel Statistics (for normalization):")
for channel, values in all_means.items():
print(f" {channel}: mean={np.mean(values)/255:.3f}, "
f"std={np.std(values)/255:.3f}") # (#3:For normalization)
Use case: These statistics help set proper normalization values for training.
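The printed means and stds plug directly into a normalization transform; a minimal torchvision sketch (the numbers are placeholders for your own computed statistics):

```python
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),                         # Scales pixels to [0, 1]
    transforms.Normalize(mean=[0.48, 0.46, 0.41],  # Your computed channel means
                         std=[0.23, 0.22, 0.22]),  # Your computed channel stds
])
```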
Identify out-of-focus images using Laplacian variance
Find images with excessive noise or compression artifacts
Remove exact or near-duplicate images
def detect_blur(image_path, threshold=100):
"""Detect blurry images using Laplacian variance""" # (#1:Sharpness metric)
img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
laplacian_var = cv2.Laplacian(img, cv2.CV_64F).var() # (#2:Edge detection)
return laplacian_var < threshold, laplacian_var # (#3:Lower = blurrier)
def find_blurry_images(image_dir, threshold=100):
"""Find all blurry images in directory"""
blurry = []
for img_name in os.listdir(image_dir):
img_path = os.path.join(image_dir, img_name)
is_blurry, score = detect_blur(img_path, threshold)
if is_blurry:
blurry.append((img_name, score)) # (#4:Track blurry images)
print(f"Found {len(blurry)} blurry images")
return sorted(blurry, key=lambda x: x[1]) # (#5:Sort by blur score)
import hashlib
from PIL import Image
import imagehash # (#1:pip install imagehash)
def find_exact_duplicates(image_dir):
"""Find exact duplicates using MD5 hash"""
hashes = {}
duplicates = []
for img_name in os.listdir(image_dir):
with open(os.path.join(image_dir, img_name), 'rb') as f:
img_hash = hashlib.md5(f.read()).hexdigest() # (#2:File hash)
if img_hash in hashes:
duplicates.append((img_name, hashes[img_hash]))
else:
hashes[img_hash] = img_name
return duplicates
def find_similar_images(image_dir, threshold=5):
    """Find similar images using perceptual hash"""  # (#3:Near-duplicates)
    hashes = {}
    for img_name in os.listdir(image_dir):
        img = Image.open(os.path.join(image_dir, img_name))
        hashes[img_name] = imagehash.phash(img)  # (#4:Perceptual hash)
    # Compare all pairs - Hamming distance below threshold = similar
    names = list(hashes)
    similar = []
    for i, a in enumerate(names):
        for b in names[i+1:]:
            if hashes[a] - hashes[b] < threshold:  # (#5:Subtraction gives distance)
                similar.append((a, b))
    return similar
from sklearn.utils import resample
import numpy as np
def oversample_minority(X, y, target_count=None):
"""Oversample minority classes to balance dataset""" # (#1:Balance classes)
classes, counts = np.unique(y, return_counts=True)
max_count = target_count or counts.max()
X_balanced, y_balanced = [], []
for cls in classes:
X_cls = X[y == cls]
y_cls = y[y == cls]
if len(X_cls) < max_count:
X_resampled, y_resampled = resample( # (#2:Sklearn resample)
X_cls, y_cls,
replace=True, # (#3:With replacement)
n_samples=max_count,
random_state=42
)
else:
X_resampled, y_resampled = X_cls, y_cls
X_balanced.extend(X_resampled)
y_balanced.extend(y_resampled)
return np.array(X_balanced), np.array(y_balanced)
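A quick sanity check on toy data (the shapes are illustrative). Note that oversampling should be applied to the training split only, or the duplicates leak into validation:

```python
X = np.arange(10).reshape(-1, 1)    # 10 samples, 1 feature
y = np.array([0]*7 + [1]*3)         # Class 1 is the minority
X_bal, y_bal = oversample_minority(X, y)
print(np.bincount(y_bal))           # -> [7 7]
```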
from sklearn.utils.class_weight import compute_class_weight
import torch
import torch.nn as nn
# Compute balanced class weights
class_weights = compute_class_weight( # (#1:Auto-compute weights)
class_weight='balanced',
classes=np.unique(y_train),
y=y_train
)
# Use in loss function (PyTorch)
weights = torch.tensor(class_weights, dtype=torch.float32)
criterion = nn.CrossEntropyLoss(weight=weights) # (#2:Weighted loss)
# Focal Loss - focuses on hard examples
class FocalLoss(nn.Module):
def __init__(self, alpha=1, gamma=2): # (#3:gamma controls focus)
super().__init__()
self.alpha = alpha
self.gamma = gamma
def forward(self, inputs, targets):
ce_loss = nn.CrossEntropyLoss(reduction='none')(inputs, targets)
pt = torch.exp(-ce_loss) # (#4:Probability of correct class)
focal_loss = self.alpha * (1-pt)**self.gamma * ce_loss
return focal_loss.mean()
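Drop-in usage mirrors CrossEntropyLoss; a short check with dummy logits (shapes are illustrative):

```python
logits = torch.randn(8, 3)              # Batch of 8, 3 classes
targets = torch.randint(0, 3, (8,))
loss = FocalLoss(alpha=1, gamma=2)(logits, targets)
print(loss.item())
```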
Given a dataset with the following class distribution: Cat: 500 images, Bird: 100 images.
What is the imbalance ratio between Cat and Bird?
Answer: 500/100 = ?
If using inverse frequency weights, what weight should Bird have?
Formula: weight = total / (n_classes * class_count)
How many Bird images would you need to oversample to match Cat?
Think: Current + needed = target
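Assuming the two classes above (Cat: 500, Bird: 100), the answers can be checked directly:

```python
cat, bird = 500, 100
total, n_classes = cat + bird, 2
print("Imbalance ratio:", cat / bird)              # Q1 -> 5.0
print("Bird weight:", total / (n_classes * bird))  # Q2 -> 3.0
print("Images to add:", cat - bird)                # Q3 -> 400
```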
dataset/
train/
images/
img001.jpg
img002.jpg
labels/
img001.txt
img002.txt
valid/
images/
labels/
test/
images/
labels/
data.yaml
dataset/
train/
class_a/
img001.jpg
img002.jpg
class_b/
img001.jpg
img002.jpg
valid/
class_a/
class_b/
test/
class_a/
class_b/
Tip: PyTorch's ImageFolder automatically creates labels from directory names.
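A minimal sketch of loading the classification layout above with ImageFolder (the path matches the tree shown):

```python
from torchvision import datasets, transforms

train_ds = datasets.ImageFolder(
    'dataset/train',                 # Subfolder names become class labels
    transform=transforms.ToTensor()
)
print(train_ds.classes)              # ['class_a', 'class_b']
print(train_ds.class_to_idx)         # {'class_a': 0, 'class_b': 1}
```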
from sklearn.model_selection import train_test_split, StratifiedKFold
# Simple split (80/10/10)
X_train, X_temp, y_train, y_temp = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y # (#1:Preserve class ratio)
)
X_val, X_test, y_val, y_test = train_test_split(
X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)
# K-Fold Cross Validation
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) # (#2:5-fold CV)
for fold, (train_idx, val_idx) in enumerate(kfold.split(X, y)):
X_train, X_val = X[train_idx], X[val_idx]
y_train, y_val = y[train_idx], y[val_idx]
print(f"Fold {fold}: Train={len(train_idx)}, Val={len(val_idx)}") # (#3:Train each fold)
Maintain the class distribution across all splits (stratified splitting)
The same object or scene should not appear in both train and test; see the group-split sketch after the table below
For temporal data, split chronologically rather than randomly
| Dataset Size | Recommended Split |
|---|---|
| Small (<1K) | K-fold cross-validation |
| Medium (1K-10K) | 70/15/15 or 80/10/10 |
| Large (>10K) | 90/5/5 or fixed val/test |
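To enforce the no-leakage rule, split by group rather than by image; a minimal sketch with scikit-learn's GroupShuffleSplit, where `groups` (an assumption here) identifies the object or scene each image came from:

```python
from sklearn.model_selection import GroupShuffleSplit

# groups[i] = ID of the object/scene in image i; no ID crosses the split
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=groups))
```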
Industry best practice: Share only raw data + processing scripts. Anyone can reproduce the exact processed dataset.
| Solution | Best For | Versioning |
|---|---|---|
| Cloud Storage (S3, GCS) | Large datasets, team collaboration | Object versioning, lifecycle policies |
| Hugging Face Datasets | Public datasets, ML community | Git-based, automatic splits |
| Git LFS | Small-medium datasets | Git integration, simple workflow |
Tip: Document your data lineage: source, collection date, preprocessing steps, and any known issues.
# data_config.yaml - Version this file in Git
"""
dataset:
name: traffic_signs_v2
raw_source: s3://bucket/raw/traffic_signs/
train_split: 0.7
val_split: 0.15
test_split: 0.15
seed: 42
preprocessing:
resize: [224, 224]
normalize: imagenet
"""
# prepare_data.py - Deterministic processing
import random
import numpy as np
def prepare_dataset(config):
# Set seeds for reproducibility # (#1:Deterministic splits)
random.seed(config['seed'])
np.random.seed(config['seed'])
    # Download raw data (download_raw, preprocess, and split_data below
    # are project-specific helpers, shown here as a sketch)
    raw_data = download_raw(config['raw_source'])  # (#2:Fetch raw only)
# Apply processing (can always be re-run)
processed = preprocess(raw_data, config['preprocessing']) # (#3:Reproducible)
# Split deterministically
return split_data(processed, config) # (#4:Same splits every time)
# Option 1: Use Roboflow
from roboflow import Roboflow
rf = Roboflow(api_key="your_key") # (#1:Free account)
project = rf.workspace().project("coco-dataset")
dataset = project.version(1).download("coco")
# Option 2: Use Kaggle
!kaggle datasets download -d dataset-name # (#2:Kaggle CLI)
# Option 3: Use torchvision
from torchvision.datasets import VOCDetection
dataset = VOCDetection( # (#3:Pascal VOC)
root='./data',
year='2012',
image_set='train',
download=True
)
Well-annotated data is more valuable than large, noisy datasets
Statistical analysis reveals issues before training
Track data changes like code for reproducibility
Preparation: Install albumentations: pip install albumentations
| Type | Resource |
|---|---|
| Datasets | Roboflow Universe |
| Tool | LabelImg GitHub |
| Tool | CVAT - Computer Vision Annotation Tool |
| Documentation | DVC Documentation |
| Guide | COCO Data Format |
| Article | Training Data Best Practices |
Download a dataset and perform exploratory analysis
Complete the data preparation exercises
Create quality check scripts for your own dataset