Computer Vision

Session 4 - CNN Architectures & Modeling

Deep dive into neural network architectures for vision tasks

Today's Agenda

CNN Fundamentals Overview

flowchart LR A(["Input
224x224x3"]) -->|3x3 conv| B["Conv+ReLU
112x112x64"] B -->|2x2 max| C["Pool
56x56x64"] C -->|3x3 conv| D["Conv+ReLU
56x56x128"] D -->|2x2 max| E["Pool
28x28x128"] E -->|layers| F["...
7x7x512"] F -->|global avg| G["GAP
1x1x512"] G -->|dense| H["FC
512→1000"] H -->|softmax| I(["Classes
1000 probs"]) classDef input fill:#78909c,stroke:#546e7a,color:#fff classDef conv fill:#4a90d9,stroke:#2e6da4,color:#fff classDef pool fill:#ff9800,stroke:#ef6c00,color:#fff classDef gap fill:#7cb342,stroke:#558b2f,color:#fff classDef output fill:#9c27b0,stroke:#7b1fa2,color:#fff class A input class B,D,F conv class C,E pool class G,H gap class I output

What is a CNN?

Neural networks designed for grid-like data (images).

Key Components

  • Conv - Features | Pool - Reduce | FC - Classify

Convolution Layer

Mathematical Formula

Output feature map at position (i, j):

y[i,j] = sum(sum(x[i+m, j+n] * k[m,n])) + b

Where:
- x: input feature map
- k: kernel/filter of size (M, N)
- b: bias term
- m, n: kernel indices
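
A quick numeric sanity check of this formula (a minimal sketch with random values; note that PyTorch's conv2d actually computes cross-correlation, which is exactly the sum above):

import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 5, 5)   # single-channel 5x5 input
k = torch.randn(1, 1, 3, 3)   # one 3x3 kernel
b = torch.zeros(1)            # bias

# Manual computation of y[0, 0] following the formula
manual = (x[0, 0, 0:3, 0:3] * k[0, 0]).sum() + b

# Library convolution (no padding, stride 1)
y = F.conv2d(x, k, bias=b)
print(torch.allclose(manual, y[0, 0, 0, 0]))  # True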

Key Parameters

  • Kernel size: 3x3, 5x5, 7x7
  • Stride: Step size (usually 1 or 2)
  • Padding: "same" or "valid"
  • Filters: Number of output channels

Convolution Layer Implementation

import torch.nn as nn

# PyTorch Conv2d
conv_layer = nn.Conv2d(
    in_channels=3,      # (#1:RGB input)
    out_channels=64,    # (#2:Number of filters)
    kernel_size=3,      # (#3:3x3 kernel)
    stride=1,           # (#4:Step size)
    padding=1           # (#5:Same padding)
)

# TensorFlow/Keras equivalent
from tensorflow.keras.layers import Conv2D

conv_layer = Conv2D(
    filters=64,
    kernel_size=(3, 3),
    strides=(1, 1),
    padding='same',     # (#6:Output same size as input)
    activation='relu'   # (#7:Activation included)
)

Convolution Output Size Formula

Formula

Output Size = floor((W - K + 2P) / S) + 1

Where:
- W: Input width/height
- K: Kernel size
- P: Padding
- S: Stride

Examples

Input Kernel Padding Stride Output
32 3 1 1 32
32 3 0 1 30
32 3 1 2 16
224 7 3 2 112
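
The table values can be reproduced with a small helper (a minimal sketch of the formula; the function name is illustrative):

def conv_output_size(w, k, p, s):
    """Output Size = floor((W - K + 2P) / S) + 1"""
    return (w - k + 2 * p) // s + 1

print(conv_output_size(32, 3, 1, 1))   # 32
print(conv_output_size(32, 3, 0, 1))   # 30
print(conv_output_size(32, 3, 1, 2))   # 16
print(conv_output_size(224, 7, 3, 2))  # 112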

Quick Exercise: Calculate Output Size

Formula: Output = floor((W - K + 2P) / S) + 1

Question 1

Input: 64x64, Kernel: 3x3, Padding: 0, Stride: 1

Output size?

Question 2

Input: 64x64, Kernel: 3x3, Padding: 1, Stride: 2

Output size?

Question 3

After Conv(224, k=7, p=3, s=2) then Pool(k=3, s=2)?

Final size? (like ResNet stem)

Pooling Layers

Max Pooling

Takes maximum value in each window. Preserves strongest activations.

nn.MaxPool2d(kernel_size=2, stride=2)

Average Pooling

Computes average value in each window. Smoother features.

nn.AvgPool2d(kernel_size=2, stride=2)

Global Average Pooling

Averages each feature map to a single value per channel. Used before classification.

nn.AdaptiveAvgPool2d((1, 1))

Purpose: Reduces spatial dimensions, provides translation invariance, and reduces parameters.

Pooling Implementation

import torch
import torch.nn as nn

# Input: batch of 4 images, 64 channels, 32x32 spatial
x = torch.randn(4, 64, 32, 32)  # (#1:NCHW format)

# Max Pooling - reduces spatial by 2x
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
out = max_pool(x)  # (#2:Shape: (4, 64, 16, 16))

# Global Average Pooling - reduces to 1x1
gap = nn.AdaptiveAvgPool2d((1, 1))
out = gap(x)  # (#3:Shape: (4, 64, 1, 1))

# Flatten for fully connected layer
out = out.view(out.size(0), -1)  # (#4:Shape: (4, 64))

Fully Connected Layers

Purpose

Dense layers that connect all input neurons to all output neurons.

  • Used for final classification
  • Combines spatial features
  • Learns global patterns

Implementation

# PyTorch Linear layer
fc = nn.Linear(
    in_features=512,
    out_features=10  # num_classes
)

# Keras Dense layer
from tensorflow.keras.layers import Dense
fc = Dense(units=10, activation='softmax')

Modern trend: Replace FC layers with Global Average Pooling to reduce parameters and overfitting.
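
To see why, compare parameter counts for a 7x7x512 feature map feeding a classification head (a rough sketch; the layer sizes mirror a VGG-style head vs a GAP-style head):

import torch.nn as nn

# First FC layer of a VGG-style head alone: flatten 7*7*512, project to 4096
fc = nn.Linear(7 * 7 * 512, 4096)

# GAP followed directly by the classifier: 512 -> 1000
gap_head = nn.Sequential(
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
    nn.Linear(512, 1000),
)

fc_params = sum(p.numel() for p in fc.parameters())
gap_params = sum(p.numel() for p in gap_head.parameters())
print(f"Flatten+FC: {fc_params:,} params")  # ~102.8M
print(f"GAP+FC:     {gap_params:,} params") # ~0.5M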

Activation Functions

Function Formula Range Use Case
ReLU max(0, x) [0, inf) Hidden layers (default)
LeakyReLU max(0.01x, x) (-inf, inf) Avoid dying neurons
Sigmoid 1/(1+e^-x) [0, 1] Binary classification
Softmax e^xi / sum(e^xj) [0, 1], sum=1 Multi-class classification
GELU x * P(X ≤ x) (-inf, inf) Transformers

Why These Activation Functions?

Why ReLU Works

Simple thresholding at zero. Creates sparsity (many zeros) which makes the network more efficient. Gradient is either 0 or 1, avoiding vanishing gradients for positive values.

Why Softmax Sums to 1

Each output e^xi is divided by the sum of all e^xj. This normalization creates a valid probability distribution over all classes.

Sigmoid's Problem

Vanishing gradients: For very large or small inputs, gradient approaches 0. During backprop, gradients get multiplied and shrink to near-zero.

Visualizing Activation Functions

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 100)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# ReLU: f(x) = max(0, x)
axes[0,0].plot(x, np.maximum(0, x), 'b-', linewidth=2)
axes[0,0].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[0,0].axvline(x=0, color='k', linestyle='--', alpha=0.3)
axes[0,0].set_title('ReLU: max(0, x)')
axes[0,0].set_ylim(-1, 5)

# Sigmoid: f(x) = 1/(1+e^-x)
axes[0,1].plot(x, 1/(1+np.exp(-x)), 'r-', linewidth=2)
axes[0,1].axhline(y=0.5, color='k', linestyle='--', alpha=0.3)
axes[0,1].set_title('Sigmoid: 1/(1+e^-x)')
axes[0,1].set_ylim(-0.1, 1.1)

# LeakyReLU: f(x) = max(0.1x, x)
axes[1,0].plot(x, np.where(x > 0, x, 0.1*x), 'g-', linewidth=2)
axes[1,0].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[1,0].set_title('LeakyReLU: max(0.1x, x)')

# Softmax example (3 classes)
logits = np.array([2.0, 1.0, 0.1])
softmax = np.exp(logits) / np.sum(np.exp(logits))
axes[1,1].bar(['Class A', 'Class B', 'Class C'], softmax, color=['blue', 'orange', 'green'])
axes[1,1].set_title(f'Softmax: probabilities sum to {softmax.sum():.1f}')

plt.tight_layout()
plt.savefig('activation_plots.png')

The Dying ReLU Problem

The Problem

  • ReLU outputs 0 for all negative inputs
  • If a neuron's weights cause it to always receive negative inputs, it "dies"
  • Dead neurons have zero gradient and never learn
  • Can happen with high learning rates or bad initialization

Solutions

  • LeakyReLU: Small slope for negatives (0.01x)
  • PReLU: Learnable slope parameter
  • ELU: Exponential for negatives
  • Careful initialization: Use He initialization
# He initialization for ReLU
nn.init.kaiming_normal_(layer.weight, 
                        nonlinearity='relu')

Activation Functions Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(4, 64)  # (#1:Batch of 4, 64 features)

# ReLU - Most common for hidden layers
relu = nn.ReLU()
out = relu(x)  # (#2:Zeros out negatives)

# LeakyReLU - Allows small negative gradient
leaky = nn.LeakyReLU(negative_slope=0.01)
out = leaky(x)  # (#3:Prevents dying ReLU)

# Sigmoid - Binary classification output
out = torch.sigmoid(x)  # (#4:Range [0, 1])

# Softmax - Multi-class classification
out = F.softmax(x, dim=1)  # (#5:Probabilities sum to 1)

# GELU - Used in Transformers
gelu = nn.GELU()
out = gelu(x)  # (#6:Smooth ReLU variant)

VGG-16 Architecture (2014)

Key Ideas

  • Use only 3x3 convolutions
  • Stack multiple conv layers before pooling
  • Double channels after each pool
  • Simple and uniform architecture

Statistics

  • Layers: 16 weight layers
  • Parameters: ~138 million
  • Input: 224x224x3

Architecture Pattern

Input: 224x224x3
Conv3-64 x2 -> Pool -> 112x112x64
Conv3-128 x2 -> Pool -> 56x56x128
Conv3-256 x3 -> Pool -> 28x28x256
Conv3-512 x3 -> Pool -> 14x14x512
Conv3-512 x3 -> Pool -> 7x7x512
Flatten -> FC-4096 -> FC-4096
FC-1000 (Softmax)

VGG-16 Implementation

import torch.nn as nn
import torchvision.models as models

# Load pretrained VGG-16
vgg16 = models.vgg16(weights='IMAGENET1K_V1')  # (#1:Pretrained weights)

# Custom VGG-like block
class VGGBlock(nn.Module):
    def __init__(self, in_ch, out_ch, num_convs):
        super().__init__()
        layers = []
        for i in range(num_convs):
            layers.append(nn.Conv2d(
                in_ch if i == 0 else out_ch,  # (#2:First conv changes channels)
                out_ch, kernel_size=3, padding=1
            ))
            layers.append(nn.ReLU(inplace=True))
        layers.append(nn.MaxPool2d(2, 2))  # (#3:Halve spatial dims)
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)
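
Stacking VGGBlock instances following the pattern above gives a VGG-16-style network (a minimal sketch reusing the class defined above; the classifier head is simplified and omits dropout):

class MiniVGG16(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            VGGBlock(3, 64, 2),     # 224 -> 112
            VGGBlock(64, 128, 2),   # 112 -> 56
            VGGBlock(128, 256, 3),  # 56 -> 28
            VGGBlock(256, 512, 3),  # 28 -> 14
            VGGBlock(512, 512, 3),  # 14 -> 7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),  # 13 conv + 3 FC = 16 weight layers
        )

    def forward(self, x):
        return self.classifier(self.features(x))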

ResNet - Residual Networks (2015)

flowchart TB X(["Input x
H x W x C"]) X -->|"F(x)"| C1["Conv 3x3
+ BatchNorm
+ ReLU"] C1 --> C2["Conv 3x3
+ BatchNorm"] X -->|"identity"| SKIP["Skip
Connection"] C2 --> ADD{{"y = F(x) + x"}} SKIP --> ADD ADD --> R["ReLU"] R --> OUT(["Output
H x W x C"]) classDef input fill:#78909c,stroke:#546e7a,color:#fff classDef conv fill:#4a90d9,stroke:#2e6da4,color:#fff classDef skip fill:#7cb342,stroke:#558b2f,color:#fff classDef add fill:#ff9800,stroke:#ef6c00,color:#fff classDef output fill:#9c27b0,stroke:#7b1fa2,color:#fff class X input class C1,C2 conv class SKIP skip class ADD add class R,OUT output

Skip Connection

Output = F(x) + x

- x: input (identity)
- F(x): learned residual
- +: element-wise addition

Impact: Won ImageNet 2015, enabled 1000+ layer networks!

Residual Block Implementation

import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels,
                               kernel_size=3, stride=stride, padding=1)  # (#1:May downsample)
        self.bn1 = nn.BatchNorm2d(out_channels)  # (#2:Batch normalization)
        self.conv2 = nn.Conv2d(out_channels, out_channels,
                               kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

        # Skip connection with projection if dimensions change
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride),  # (#3:1x1 conv to match dims)
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        identity = self.shortcut(x)  # (#4:Skip connection)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += identity  # (#5:Add residual)
        return self.relu(out)
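
A quick shape check of the block (a sketch; the shapes follow the stride/projection logic above):

import torch

x = torch.randn(4, 64, 56, 56)

block = ResidualBlock(64, 64)            # same shape, identity shortcut
print(block(x).shape)                    # torch.Size([4, 64, 56, 56])

down = ResidualBlock(64, 128, stride=2)  # downsample, 1x1 projection shortcut
print(down(x).shape)                     # torch.Size([4, 128, 28, 28])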

ResNet Variants

Model Layers Parameters Top-1 Acc Block Type
ResNet-18 18 11.7M 69.8% Basic
ResNet-34 34 21.8M 73.3% Basic
ResNet-50 50 25.6M 76.1% Bottleneck
ResNet-101 101 44.5M 77.4% Bottleneck
ResNet-152 152 60.2M 78.3% Bottleneck

Bottleneck block: Uses 1x1 conv to reduce dimensions, 3x3 conv, then 1x1 to expand - more efficient!
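
A bottleneck block in code (a minimal sketch following the 1x1 -> 3x3 -> 1x1 pattern; the 4x channel expansion is the standard choice):

import torch.nn as nn

class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, in_ch, mid_ch, stride=1):
        super().__init__()
        out_ch = mid_ch * self.expansion
        self.layers = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),   # 1x1 reduce channels
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),  # 3x3
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),  # 1x1 expand channels
            nn.BatchNorm2d(out_ch),
        )
        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.layers(x) + self.shortcut(x))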

EfficientNet - Compound Scaling (2019)

Key Innovation

Compound scaling - systematically scale depth, width, and resolution together.

depth: d = alpha^phi
width: w = beta^phi
resolution: r = gamma^phi

Constraint: alpha * beta^2 * gamma^2 ~ 2
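
Plugging in the coefficients reported in the EfficientNet paper (alpha=1.2, beta=1.1, gamma=1.15) shows how each phi scales the network (a minimal sketch):

alpha, beta, gamma = 1.2, 1.1, 1.15  # coefficients from the EfficientNet paper

for phi in range(4):
    d, w, r = alpha ** phi, beta ** phi, gamma ** phi
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")

# Constraint check: alpha * beta^2 * gamma^2 ~ 2
print(alpha * beta**2 * gamma**2)  # ~1.92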

EfficientNet Family

Model Params Top-1
B0 5.3M 77.1%
B3 12M 81.6%
B5 30M 83.6%
B7 66M 84.3%

Key insight: EfficientNet-B0 matches ResNet-50 accuracy with roughly 5x fewer parameters (5.3M vs 25.6M)!

EfficientNet Implementation

import torchvision.models as models
import torch.nn as nn

# Load pretrained EfficientNet
efficientnet = models.efficientnet_b0(weights='IMAGENET1K_V1')  # (#1:5.3M params)

# Modify for custom number of classes
num_classes = 10
efficientnet.classifier = nn.Sequential(
    nn.Dropout(p=0.2, inplace=True),
    nn.Linear(1280, num_classes)  # (#2:Replace final layer)
)

# EfficientNet-V2 (improved version)
efficientnet_v2 = models.efficientnet_v2_s(weights='IMAGENET1K_V1')  # (#3:Faster training)

# Timm library - more options
import timm
model = timm.create_model('efficientnet_b3', pretrained=True, num_classes=10)  # (#4:Easy model creation)

Vision Transformer (ViT) - 2020

Key Ideas

  • Apply Transformer architecture to images
  • Split image into fixed-size patches
  • Treat patches as tokens (like words)
  • Add positional embeddings
  • Use self-attention mechanism

Architecture

1. Split image into patches (16x16)
2. Flatten patches to vectors
3. Linear projection + position embed
4. Add [CLS] token
5. Transformer encoder blocks
6. MLP head on [CLS] for classification
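
Steps 1-4 in a few lines (a sketch; a stride-16 convolution is the usual trick for splitting an image into 16x16 patch tokens):

import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)

# Steps 1-3: 16x16 patches -> 768-dim tokens via a strided conv
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)
tokens = patch_embed(img).flatten(2).transpose(1, 2)  # (1, 196, 768): 196 patch tokens

# Step 4: prepend a learnable [CLS] token and add position embeddings
cls_token = nn.Parameter(torch.zeros(1, 1, 768))
pos_embed = nn.Parameter(torch.zeros(1, 197, 768))
x = torch.cat([cls_token.expand(1, -1, -1), tokens], dim=1) + pos_embed  # (1, 197, 768)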

Trade-off: ViT requires large datasets (JFT-300M) to outperform CNNs. With less data, CNNs still win!

Vision Transformer Implementation

import torch
import torch.nn as nn
from transformers import ViTForImageClassification, ViTImageProcessor

# Using Hugging Face Transformers
model_name = "google/vit-base-patch16-224"  # (#1:16x16 patches, 224x224 input)
processor = ViTImageProcessor.from_pretrained(model_name)  # (#2:Preprocessing)
model = ViTForImageClassification.from_pretrained(model_name)

# Using timm library
import timm
vit = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=10)  # (#3:Easy setup)

# PyTorch native (torchvision)
from torchvision.models import vit_b_16, ViT_B_16_Weights
vit = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)  # (#4:Pretrained)
vit.heads = nn.Linear(768, 10)  # (#5:Custom head)

ConvNeXt and Swin Transformer

ConvNeXt (2022)

Modernized CNN with Transformer-inspired design:

  • Patchify stem (4x4 conv, stride 4)
  • Inverted bottleneck
  • Large kernels (7x7)
  • GELU activation, LayerNorm
  • Matches ViT performance!

Swin Transformer (2021)

Hierarchical Vision Transformer:

  • Shifted windows for efficiency
  • Hierarchical feature maps
  • Linear complexity w.r.t. image size
  • Great for dense prediction tasks
  • State-of-the-art backbone

Quick Exercise: Choose the Architecture

Which architecture would you recommend for each scenario?

Scenario A

Mobile app with 5MB size limit, real-time inference needed.

ResNet-50, EfficientNet-B0, or ViT-B?

Scenario B

Medical imaging with only 500 labeled images, high accuracy required.

Train from scratch or fine-tune?

Scenario C

Production system with 10M+ training images, cost is not a concern.

CNN or Transformer-based?

ConvNeXt and Swin Implementation

import timm
import torch.nn as nn
import torchvision.models as models

# ConvNeXt - CNN competitive with ViT
convnext = models.convnext_tiny(weights='IMAGENET1K_V1')  # (#1:28.6M params)
convnext_base = timm.create_model('convnext_base', pretrained=True)  # (#2:Via timm)

# Swin Transformer
swin = models.swin_t(weights='IMAGENET1K_V1')  # (#3:Swin-Tiny)
swin_base = timm.create_model('swin_base_patch4_window7_224', pretrained=True)  # (#4:Base model)

# Modify for custom classes
num_classes = 10
convnext.classifier[2] = nn.Linear(768, num_classes)  # (#5:ConvNeXt head)
swin.head = nn.Linear(768, num_classes)  # (#6:Swin head)

YOLO - You Only Look Once

Key Innovation

Single-shot detection: One neural network predicts bounding boxes and class probabilities in a single pass.

  • Real-time performance (30+ FPS)
  • End-to-end trainable
  • Global context reasoning

YOLO Evolution

Version Year Key Feature
YOLOv3 2018 Multi-scale predictions
YOLOv5 2020 PyTorch native
YOLOv8 2023 Anchor-free
YOLOv11 2024 Latest improvements

YOLO with Ultralytics

from ultralytics import YOLO

# Load pretrained model
model = YOLO('yolov8n.pt')  # (#1:Nano model - fastest)
# Options: yolov8n, yolov8s, yolov8m, yolov8l, yolov8x

# Run inference
results = model('image.jpg')  # (#2:Single image)
results = model(['img1.jpg', 'img2.jpg'])  # (#3:Batch inference)

# Process results
for result in results:
    boxes = result.boxes  # (#4:Bounding boxes)
    for box in boxes:
        xyxy = box.xyxy[0]  # (#5:x1, y1, x2, y2)
        conf = box.conf[0]  # (#6:Confidence score)
        cls = box.cls[0]    # (#7:Class index)

# Train custom model
model.train(data='custom.yaml', epochs=100, imgsz=640)  # (#8:Fine-tuning)

U-Net for Segmentation (2015)

Architecture

  • Encoder: Contracting path (downsampling)
  • Bottleneck: Lowest resolution
  • Decoder: Expanding path (upsampling)
  • Skip connections: Concatenate encoder features

Key Features

  • Works with limited training data
  • Precise localization
  • Context + localization combined
  • Originally for biomedical imaging
  • Now used across all domains

U-shape: The architecture resembles the letter "U" - hence the name. Skip connections are crucial for preserving spatial information!
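
One decoder step with a skip connection, to make the concatenation concrete (a minimal sketch; the channel sizes are illustrative):

import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """Upsample, concatenate the matching encoder feature map, then convolve."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                    # e.g. 14x14 -> 28x28
        x = torch.cat([x, skip], dim=1)   # skip connection: concatenate encoder features
        return self.conv(x)

x = torch.randn(1, 512, 14, 14)     # bottleneck features
skip = torch.randn(1, 256, 28, 28)  # matching encoder features
print(UpBlock(512, 256, 256)(x, skip).shape)  # torch.Size([1, 256, 28, 28])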

U-Net Implementation

import segmentation_models_pytorch as smp

# Using segmentation_models_pytorch
model = smp.Unet(
    encoder_name="resnet34",        # (#1:Pretrained backbone)
    encoder_weights="imagenet",      # (#2:ImageNet weights)
    in_channels=3,                   # (#3:RGB input)
    classes=1,                       # (#4:Binary segmentation)
)

# U-Net++ (improved skip connections)
model = smp.UnetPlusPlus(
    encoder_name="efficientnet-b3",
    encoder_weights="imagenet",
    classes=5,  # (#5:Multi-class segmentation)
)

# Using Hugging Face
from transformers import SegformerForSemanticSegmentation
model = SegformerForSemanticSegmentation.from_pretrained(
    "nvidia/segformer-b0-finetuned-ade-512-512"  # (#6:Pretrained segmentation)
)

Segment Anything Model (SAM)

What is SAM?

Foundation model for promptable segmentation trained on 11M images and 1B masks.

  • Zero-shot generalization
  • Point, box, or text prompts
  • Trained on SA-1B dataset

Components

  • Image encoder: ViT-H (heavy, one-time)
  • Prompt encoder: Points, boxes, masks
  • Mask decoder: Lightweight, fast

SAM 2 (2024): Video support added!

SAM Implementation

import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load SAM model
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # (#1:ViT-Huge)
predictor = SamPredictor(sam)

# Set image (encodes once)
image = cv2.cvtColor(cv2.imread("image.jpg"), cv2.COLOR_BGR2RGB)  # SAM expects RGB
predictor.set_image(image)  # (#2:Run image encoder)

# Predict with point prompt
input_point = np.array([[500, 375]])  # (#3:Click location)
input_label = np.array([1])  # (#4:1=foreground, 0=background)
masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True  # (#5:Returns 3 masks)
)

# Predict with box prompt
input_box = np.array([100, 100, 400, 400])  # (#6:x1, y1, x2, y2)
masks, _, _ = predictor.predict(box=input_box)

Loss Functions for Computer Vision

Loss Formula Use Case
Binary Cross-Entropy -[y*log(p) + (1-y)*log(1-p)] Binary classification
Categorical CE -sum(y_i * log(p_i)) Multi-class classification
Dice Loss 1 - 2*|X ∩ Y| / (|X| + |Y|) Segmentation
IoU Loss 1 - |X ∩ Y| / |X ∪ Y| Object detection
Focal Loss -alpha*(1-p)^gamma*log(p) Class imbalance

Loss Functions Implementation

import torch
import torch.nn as nn

# Binary Cross-Entropy (with logits)
criterion = nn.BCEWithLogitsLoss()  # (#1:Includes sigmoid)
loss = criterion(predictions, targets)

# Multi-class Cross-Entropy
criterion = nn.CrossEntropyLoss()  # (#2:Includes softmax)
loss = criterion(predictions, targets)  # (#3:targets: class indices)

# Dice Loss for segmentation
def dice_loss(pred, target, smooth=1e-6):
    pred = torch.sigmoid(pred)  # (#4:Convert to probabilities)
    intersection = (pred * target).sum()
    union = pred.sum() + target.sum()
    dice = (2. * intersection + smooth) / (union + smooth)  # (#5:Dice coefficient)
    return 1 - dice  # (#6:Loss = 1 - coefficient)

# Combined loss (common for segmentation)
bce = nn.BCEWithLogitsLoss()(predictions, targets)
loss = 0.5 * bce + 0.5 * dice_loss(predictions, targets)  # (#7:Balance both objectives)
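
Focal loss from the table is not shown above; here is a minimal sketch for the binary case (alpha=0.25 and gamma=2.0 are the common defaults, and alpha is applied uniformly here for brevity):

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: -alpha * (1 - p_t)^gamma * log(p_t)."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')  # -log(p_t)
    p_t = torch.exp(-bce)  # probability assigned to the true class
    return (alpha * (1 - p_t) ** gamma * bce).mean()  # down-weights easy examples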

Optimizers

SGD (+ Momentum)

Classic optimizer. Good generalization but slower convergence.

optim.SGD(params, lr=0.01, momentum=0.9)

Adam

Adaptive learning rates. Fast convergence, works well out-of-box.

optim.Adam(params, lr=0.001)

AdamW

Adam with decoupled weight decay. Best for Transformers.

optim.AdamW(params, lr=0.001, weight_decay=0.01)

Rule of thumb: Start with AdamW for transformers, Adam for CNNs. Fine-tune with SGD+momentum for best final results.

Optimizer Configuration

import torch.optim as optim

# Basic optimizer setup
optimizer = optim.AdamW(
    model.parameters(),
    lr=1e-4,           # (#1:Learning rate)
    weight_decay=0.01  # (#2:L2 regularization)
)

# Different LR for different layers (transfer learning)
optimizer = optim.AdamW([
    {'params': model.backbone.parameters(), 'lr': 1e-5},  # (#3:Pretrained - low LR)
    {'params': model.head.parameters(), 'lr': 1e-3}       # (#4:New layers - high LR)
])

# Training step
optimizer.zero_grad()  # (#5:Clear gradients)
loss = criterion(model(inputs), targets)
loss.backward()        # (#6:Compute gradients)
optimizer.step()       # (#7:Update weights)

Learning Rate Schedulers

Scheduler Description Best For
StepLR Decay by gamma every N epochs Simple decay
CosineAnnealing Cosine curve from max to min LR Transformers, long training
OneCycleLR Warm-up then decay in one cycle Fast convergence
ReduceLROnPlateau Reduce when metric stops improving Adaptive training
WarmupCosine Linear warmup + cosine decay ViT, large batch training

Learning Rate Scheduler Implementation

from torch.optim.lr_scheduler import CosineAnnealingLR, OneCycleLR

# Cosine Annealing
scheduler = CosineAnnealingLR(
    optimizer,
    T_max=100,        # (#1:Total epochs)
    eta_min=1e-6      # (#2:Minimum LR)
)

# OneCycleLR (recommended for fast training)
scheduler = OneCycleLR(
    optimizer,
    max_lr=1e-3,      # (#3:Peak learning rate)
    epochs=100,
    steps_per_epoch=len(train_loader),  # (#4:Total steps)
    pct_start=0.1     # (#5:10% warmup)
)

# Training loop
for epoch in range(epochs):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)  # forward pass and loss
        loss.backward()
        optimizer.step()
        scheduler.step()  # (#6:Update LR each step for OneCycleLR)
    # scheduler.step()  # (#7:Or once per epoch for others)

Complete Training Loop

def train_epoch(model, loader, criterion, optimizer, scheduler, device):
    model.train()  # (#1:Training mode)
    total_loss = 0

    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)  # (#2:Move to GPU)

        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # (#3:Gradient clipping)

        optimizer.step()
        scheduler.step()
        total_loss += loss.item()

    return total_loss / len(loader)

# Validation loop
@torch.no_grad()  # (#4:Disable gradients)
def validate(model, loader, criterion, device):
    model.eval()  # (#5:Evaluation mode)
    # ... similar but no backward pass

Hands-on Lab: CNN Architectures

Objectives

Exercises

  1. Implement a VGG-like network for CIFAR-10
  2. Add residual connections and compare accuracy
  3. Fine-tune EfficientNet-B0 on a custom dataset
  4. Run YOLO inference on sample images
  5. Experiment with different optimizers and schedulers

Key Takeaways

CNN Building Blocks

Conv, Pool, FC layers with activations form the foundation of all vision architectures

Skip Connections

ResNet's residual connections enable training very deep networks effectively

Modern Architectures

ViT and ConvNeXt compete for state-of-the-art; choose based on data size and task

Practical advice: Start with pretrained EfficientNet or ResNet. Use AdamW + cosine scheduler. Fine-tune, don't train from scratch!

Next Session Preview

Session 5: Transfer Learning & Fine-Tuning

Preparation: Download a small custom dataset from Kaggle for fine-tuning practice.

Resources

Type Resource
Paper ResNet - Deep Residual Learning
Paper ViT - An Image is Worth 16x16 Words
Paper ConvNeXt - A ConvNet for the 2020s
Library Ultralytics YOLO
Library segmentation_models_pytorch
Tutorial PyTorch Transfer Learning

Questions?

Lab Time

Build and train CNN models on Google Colab

Practical Work

Implement ResNet and fine-tune EfficientNet

Experiment

Compare optimizers and learning rate schedules

