Deep dive into neural network architectures for vision tasks
Neural networks designed for grid-like data (images).
Output feature map at position (i, j):
y[i,j] = sum(sum(x[i+m, j+n] * k[m,n])) + b
Where:
- x: input feature map
- k: kernel/filter of size (M, N)
- b: bias term
- m, n: kernel indices
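To make the indexing concrete, here is a minimal NumPy sketch of this formula (a hypothetical helper: single channel, no stride or padding):

import numpy as np

def conv2d_naive(x, k, b=0.0):
    # x: (H, W) input, k: (M, N) kernel, b: scalar bias
    M, N = k.shape
    H, W = x.shape
    out = np.zeros((H - M + 1, W - N + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+M, j:j+N] * k) + b
    return out

x = np.arange(25, dtype=float).reshape(5, 5)
k = np.ones((3, 3)) / 9.0           # 3x3 mean filter
print(conv2d_naive(x, k).shape)     # (3, 3)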
import torch.nn as nn
# PyTorch Conv2d
conv_layer = nn.Conv2d(
in_channels=3, # (#1:RGB input)
out_channels=64, # (#2:Number of filters)
kernel_size=3, # (#3:3x3 kernel)
stride=1, # (#4:Step size)
padding=1 # (#5:Same padding)
)
# TensorFlow/Keras equivalent
from tensorflow.keras.layers import Conv2D
conv_layer = Conv2D(
filters=64,
kernel_size=(3, 3),
strides=(1, 1),
padding='same', # (#6:Output same size as input)
activation='relu' # (#7:Activation included)
)
Output Size = floor((W - K + 2P) / S) + 1
Where:
- W: Input width/height
- K: Kernel size
- P: Padding
- S: Stride
| Input | Kernel | Padding | Stride | Output |
|---|---|---|---|---|
| 32 | 3 | 1 | 1 | 32 |
| 32 | 3 | 0 | 1 | 30 |
| 32 | 3 | 1 | 2 | 16 |
| 224 | 7 | 3 | 2 | 112 |
Formula: Output = floor((W - K + 2P) / S) + 1
Input: 64x64, Kernel: 3x3, Padding: 0, Stride: 1
Output size?
Input: 64x64, Kernel: 3x3, Padding: 1, Stride: 2
Output size?
After Conv(224, k=7, p=3, s=2) then Pool(k=3, s=2)?
Final size? (like ResNet stem)
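A quick way to check your answers is to wrap the formula in a small helper (a sketch; the function name is illustrative):

import math

def conv_out_size(w, k, p, s):
    # Output = floor((W - K + 2P) / S) + 1
    return math.floor((w - k + 2 * p) / s) + 1

print(conv_out_size(64, 3, 0, 1))           # exercise 1
print(conv_out_size(64, 3, 1, 2))           # exercise 2
after_conv = conv_out_size(224, 7, 3, 2)    # exercise 3, conv stage
print(conv_out_size(after_conv, 3, 0, 2))   # pool as stated (no padding)
print(conv_out_size(after_conv, 3, 1, 2))   # the actual ResNet stem pools with padding=1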
Takes maximum value in each window. Preserves strongest activations.
nn.MaxPool2d(kernel_size=2, stride=2)
Computes average value in each window. Smoother features.
nn.AvgPool2d(kernel_size=2, stride=2)
Averages entire feature map to single value. Used before classification.
nn.AdaptiveAvgPool2d((1, 1))
Purpose: Reduces spatial dimensions, provides translation invariance, and reduces parameters.
import torch
import torch.nn as nn
# Input: batch of 4 images, 64 channels, 32x32 spatial
x = torch.randn(4, 64, 32, 32) # (#1:NCHW format)
# Max Pooling - reduces spatial by 2x
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
out = max_pool(x) # (#2:Shape: (4, 64, 16, 16))
# Global Average Pooling - reduces to 1x1
gap = nn.AdaptiveAvgPool2d((1, 1))
out = gap(x) # (#3:Shape: (4, 64, 1, 1))
# Flatten for fully connected layer
out = out.view(out.size(0), -1) # (#4:Shape: (4, 64))
Dense layers that connect all input neurons to all output neurons.
# PyTorch Linear layer
fc = nn.Linear(
in_features=512,
out_features=10 # num_classes
)
# Keras Dense layer
from tensorflow.keras.layers import Dense
fc = Dense(units=10, activation='softmax')
Modern trend: Replace FC layers with Global Average Pooling to reduce parameters and overfitting.
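A back-of-the-envelope comparison shows why (illustrative numbers based on VGG-16's final 7x7x512 feature map and a 1000-class head):

# FC head: flatten 7*7*512 features into an FC-4096 layer
fc_params = 7 * 7 * 512 * 4096 + 4096    # ~102.8M weights + biases
# GAP head: average to a 512-vector, then a single 512 -> 1000 classifier
gap_params = 512 * 1000 + 1000           # ~0.5M
print(fc_params, gap_params)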
| Function | Formula | Range | Use Case |
|---|---|---|---|
| ReLU | max(0, x) | [0, inf) | Hidden layers (default) |
| LeakyReLU | max(0.01x, x) | (-inf, inf) | Avoid dying neurons |
| Sigmoid | 1/(1+e^-x) | [0, 1] | Binary classification |
| Softmax | e^xi / sum(e^xj) | [0, 1], sum=1 | Multi-class classification |
| GELU | x * P(X ≤ x) | (-inf, inf) | Transformers |
Simple thresholding at zero. Creates sparsity (many zeros) which makes the network more efficient. Gradient is either 0 or 1, avoiding vanishing gradients for positive values.
Each output e^xi is divided by the sum of all e^xj. This normalization creates a valid probability distribution over all classes.
Vanishing gradients: For very large or small inputs, gradient approaches 0. During backprop, gradients get multiplied and shrink to near-zero.
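A small NumPy illustration of the effect (a sketch, not a full backpropagation): the sigmoid derivative is at most 0.25, so multiplying such factors across many layers drives the gradient toward zero.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = 2.0
grad = sigmoid(x) * (1 - sigmoid(x))   # sigmoid derivative, always <= 0.25
print(grad)                            # ~0.105
print(grad ** 10)                      # after 10 such layers: ~1.6e-10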
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-5, 5, 100)
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
# ReLU: f(x) = max(0, x)
axes[0,0].plot(x, np.maximum(0, x), 'b-', linewidth=2)
axes[0,0].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[0,0].axvline(x=0, color='k', linestyle='--', alpha=0.3)
axes[0,0].set_title('ReLU: max(0, x)')
axes[0,0].set_ylim(-1, 5)
# Sigmoid: f(x) = 1/(1+e^-x)
axes[0,1].plot(x, 1/(1+np.exp(-x)), 'r-', linewidth=2)
axes[0,1].axhline(y=0.5, color='k', linestyle='--', alpha=0.3)
axes[0,1].set_title('Sigmoid: 1/(1+e^-x)')
axes[0,1].set_ylim(-0.1, 1.1)
# LeakyReLU: f(x) = max(0.1x, x)
axes[1,0].plot(x, np.where(x > 0, x, 0.1*x), 'g-', linewidth=2)
axes[1,0].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[1,0].set_title('LeakyReLU: max(0.1x, x)')
# Softmax example (3 classes)
logits = np.array([2.0, 1.0, 0.1])
softmax = np.exp(logits) / np.sum(np.exp(logits))
axes[1,1].bar(['Class A', 'Class B', 'Class C'], softmax, color=['blue', 'orange', 'green'])
axes[1,1].set_title(f'Softmax: probabilities sum to {softmax.sum():.1f}')
plt.tight_layout()
plt.savefig('activation_plots.png')
# He initialization for ReLU layers
import torch.nn as nn
layer = nn.Linear(512, 256)  # example layer
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
import torch
import torch.nn as nn
import torch.nn.functional as F
x = torch.randn(4, 64) # (#1:Batch of 4, 64 features)
# ReLU - Most common for hidden layers
relu = nn.ReLU()
out = relu(x) # (#2:Zeros out negatives)
# LeakyReLU - Allows small negative gradient
leaky = nn.LeakyReLU(negative_slope=0.01)
out = leaky(x) # (#3:Prevents dying ReLU)
# Sigmoid - Binary classification output
out = torch.sigmoid(x) # (#4:Range [0, 1])
# Softmax - Multi-class classification
out = F.softmax(x, dim=1) # (#5:Probabilities sum to 1)
# GELU - Used in Transformers
gelu = nn.GELU()
out = gelu(x) # (#6:Smooth ReLU variant)
Input: 224x224x3
Conv3-64 x2 -> Pool -> 112x112x64
Conv3-128 x2 -> Pool -> 56x56x128
Conv3-256 x3 -> Pool -> 28x28x256
Conv3-512 x3 -> Pool -> 14x14x512
Conv3-512 x3 -> Pool -> 7x7x512
Flatten -> FC-4096 -> FC-4096
FC-1000 (Softmax)
import torch.nn as nn
import torchvision.models as models
# Load pretrained VGG-16
vgg16 = models.vgg16(weights='IMAGENET1K_V1') # (#1:Pretrained weights)
# Custom VGG-like block
class VGGBlock(nn.Module):
    def __init__(self, in_ch, out_ch, num_convs):
        super().__init__()
        layers = []
        for i in range(num_convs):
            layers.append(nn.Conv2d(
                in_ch if i == 0 else out_ch,  # (#2:First conv changes channels)
                out_ch, kernel_size=3, padding=1
            ))
            layers.append(nn.ReLU(inplace=True))
        layers.append(nn.MaxPool2d(2, 2))  # (#3:Halve spatial dims)
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)
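Using this block, the convolutional stages listed above could be assembled as follows (a sketch; the FC classifier head is omitted):

import torch

features = nn.Sequential(
    VGGBlock(3, 64, 2),      # 224 -> 112
    VGGBlock(64, 128, 2),    # 112 -> 56
    VGGBlock(128, 256, 3),   # 56 -> 28
    VGGBlock(256, 512, 3),   # 28 -> 14
    VGGBlock(512, 512, 3),   # 14 -> 7
)
x = torch.randn(1, 3, 224, 224)
print(features(x).shape)     # torch.Size([1, 512, 7, 7])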
Output = F(x) + x
- x: input (identity)
- F(x): learned residual
- +: element-wise addition
Impact: Won ImageNet 2015, enabled 1000+ layer networks!
import torch.nn as nn
class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels,
                               kernel_size=3, stride=stride, padding=1)  # (#1:May downsample)
        self.bn1 = nn.BatchNorm2d(out_channels)  # (#2:Batch normalization)
        self.conv2 = nn.Conv2d(out_channels, out_channels,
                               kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Skip connection with projection if dimensions change
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride),  # (#3:1x1 conv to match dims)
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        identity = self.shortcut(x)  # (#4:Skip connection)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += identity  # (#5:Add residual)
        return self.relu(out)
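A quick shape check of the block (sketch):

import torch

block = ResidualBlock(64, 128, stride=2)   # downsampling block with projection shortcut
x = torch.randn(4, 64, 56, 56)
print(block(x).shape)                      # torch.Size([4, 128, 28, 28])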
| Model | Layers | Parameters | Top-1 Acc | Block Type |
|---|---|---|---|---|
| ResNet-18 | 18 | 11.7M | 69.8% | Basic |
| ResNet-34 | 34 | 21.8M | 73.3% | Basic |
| ResNet-50 | 50 | 25.6M | 76.1% | Bottleneck |
| ResNet-101 | 101 | 44.5M | 77.4% | Bottleneck |
| ResNet-152 | 152 | 60.2M | 78.3% | Bottleneck |
Bottleneck block: Uses 1x1 conv to reduce dimensions, 3x3 conv, then 1x1 to expand - more efficient!
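A minimal sketch of the bottleneck pattern (main branch only; the skip connection works as in the basic block above, and BatchNorm/ReLU are omitted for brevity — torchvision's Bottleneck differs in details):

import torch.nn as nn

def bottleneck(in_ch, mid_ch, out_ch):
    # 1x1 reduce -> 3x3 -> 1x1 expand
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=1),
        nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1),
        nn.Conv2d(mid_ch, out_ch, kernel_size=1),
    )

blk = bottleneck(256, 64, 256)   # ResNet-50-style stage: 256 -> 64 -> 256 channels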
Compound scaling - systematically scale depth, width, and resolution together.
depth: d = alpha^phi
width: w = beta^phi
resolution: r = gamma^phi
Constraint: alpha * beta^2 * gamma^2 ~ 2
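With the coefficients from the EfficientNet paper (alpha=1.2, beta=1.1, gamma=1.15), the constraint can be checked numerically (sketch):

alpha, beta, gamma = 1.2, 1.1, 1.15
print(alpha * beta**2 * gamma**2)          # ~1.92, close to 2
phi = 3
print(alpha**phi, beta**phi, gamma**phi)   # depth, width, resolution multipliers at phi=3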
| Model | Params | Top-1 |
|---|---|---|
| B0 | 5.3M | 77.1% |
| B3 | 12M | 81.6% |
| B5 | 30M | 83.6% |
| B7 | 66M | 84.3% |
Key insight: EfficientNet-B0 matches or exceeds ResNet-50's accuracy (77.1% vs 76.1%) with roughly 5x fewer parameters (5.3M vs 25.6M)!
import torchvision.models as models
import torch.nn as nn
# Load pretrained EfficientNet
efficientnet = models.efficientnet_b0(weights='IMAGENET1K_V1') # (#1:5.3M params)
# Modify for custom number of classes
num_classes = 10
efficientnet.classifier = nn.Sequential(
nn.Dropout(p=0.2, inplace=True),
nn.Linear(1280, num_classes) # (#2:Replace final layer)
)
# EfficientNet-V2 (improved version)
efficientnet_v2 = models.efficientnet_v2_s(weights='IMAGENET1K_V1') # (#3:Faster training)
# Timm library - more options
import timm
model = timm.create_model('efficientnet_b3', pretrained=True, num_classes=10) # (#4:Easy model creation)
1. Split image into patches (16x16)
2. Flatten patches to vectors
3. Linear projection + position embed
4. Add [CLS] token
5. Transformer encoder blocks
6. MLP head on [CLS] for classification
Trade-off: ViT requires large datasets (JFT-300M) to outperform CNNs. With less data, CNNs still win!
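The patch arithmetic for ViT-Base at 224x224 with 16x16 patches, as a quick check (sketch):

img, patch = 224, 16
num_patches = (img // patch) ** 2     # 14 * 14 = 196 patches
patch_dim = patch * patch * 3         # 768 values per flattened RGB patch
seq_len = num_patches + 1             # +1 for the [CLS] token -> 197 tokens
print(num_patches, patch_dim, seq_len)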
import torch
import torch.nn as nn
from transformers import ViTForImageClassification, ViTImageProcessor
# Using Hugging Face Transformers
model_name = "google/vit-base-patch16-224" # (#1:16x16 patches, 224x224 input)
processor = ViTImageProcessor.from_pretrained(model_name) # (#2:Preprocessing)
model = ViTForImageClassification.from_pretrained(model_name)
# Using timm library
import timm
vit = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=10) # (#3:Easy setup)
# PyTorch native (torchvision)
from torchvision.models import vit_b_16, ViT_B_16_Weights
vit = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1) # (#4:Pretrained)
vit.heads = nn.Linear(768, 10) # (#5:Custom head)
ConvNeXt: a modernized CNN that adopts Transformer-inspired design choices (large depthwise kernels, LayerNorm, GELU, inverted bottlenecks) while staying fully convolutional.
Swin Transformer: a hierarchical Vision Transformer that computes self-attention within shifted local windows, producing multi-scale feature maps like a CNN.
Which architecture would you recommend for each scenario?
Mobile app with 5MB size limit, real-time inference needed.
ResNet-50, EfficientNet-B0, or ViT-B?
Medical imaging with only 500 labeled images, high accuracy required.
Train from scratch or fine-tune?
Production system with 10M+ training images, cost is not a concern.
CNN or Transformer-based?
import timm
import torch.nn as nn
import torchvision.models as models
# ConvNeXt - CNN competitive with ViT
convnext = models.convnext_tiny(weights='IMAGENET1K_V1') # (#1:28.6M params)
convnext_base = timm.create_model('convnext_base', pretrained=True) # (#2:Via timm)
# Swin Transformer
swin = models.swin_t(weights='IMAGENET1K_V1') # (#3:Swin-Tiny)
swin_base = timm.create_model('swin_base_patch4_window7_224', pretrained=True) # (#4:Base model)
# Modify for custom classes
num_classes = 10
convnext.classifier[2] = nn.Linear(768, num_classes) # (#5:ConvNeXt head)
swin.head = nn.Linear(768, num_classes) # (#6:Swin head)
Single-shot detection: One neural network predicts bounding boxes and class probabilities in a single pass.
| Version | Year | Key Feature |
|---|---|---|
| YOLOv3 | 2018 | Multi-scale predictions |
| YOLOv5 | 2020 | PyTorch native |
| YOLOv8 | 2023 | Anchor-free |
| YOLOv11 | 2024 | Latest improvements |
from ultralytics import YOLO
# Load pretrained model
model = YOLO('yolov8n.pt') # (#1:Nano model - fastest)
# Options: yolov8n, yolov8s, yolov8m, yolov8l, yolov8x
# Run inference
results = model('image.jpg') # (#2:Single image)
results = model(['img1.jpg', 'img2.jpg']) # (#3:Batch inference)
# Process results
for result in results:
    boxes = result.boxes  # (#4:Bounding boxes)
    for box in boxes:
        xyxy = box.xyxy[0]  # (#5:x1, y1, x2, y2)
        conf = box.conf[0]  # (#6:Confidence score)
        cls = box.cls[0]  # (#7:Class index)
# Train custom model
model.train(data='custom.yaml', epochs=100, imgsz=640) # (#8:Fine-tuning)
U-shape: The architecture resembles letter "U" - hence the name. Skip connections are crucial for preserving spatial information!
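A minimal sketch of one decoder step, showing how a skip connection concatenates encoder features with the upsampled decoder features (names and channel counts are illustrative, not a specific library's API):

import torch
import torch.nn as nn

class UpBlock(nn.Module):
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)  # upsample 2x
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)
        x = torch.cat([x, skip], dim=1)   # skip connection: concat encoder features
        return self.conv(x)

up = UpBlock(256, 128, 128)
out = up(torch.randn(1, 256, 32, 32), torch.randn(1, 128, 64, 64))
print(out.shape)                          # torch.Size([1, 128, 64, 64])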
import segmentation_models_pytorch as smp
# Using segmentation_models_pytorch
model = smp.Unet(
encoder_name="resnet34", # (#1:Pretrained backbone)
encoder_weights="imagenet", # (#2:ImageNet weights)
in_channels=3, # (#3:RGB input)
classes=1, # (#4:Binary segmentation)
)
# U-Net++ (improved skip connections)
model = smp.UnetPlusPlus(
encoder_name="efficientnet-b3",
encoder_weights="imagenet",
classes=5, # (#5:Multi-class segmentation)
)
# Using Hugging Face
from transformers import SegformerForSemanticSegmentation
model = SegformerForSemanticSegmentation.from_pretrained(
"nvidia/segformer-b0-finetuned-ade-512-512" # (#6:Pretrained segmentation)
)
Foundation model for promptable segmentation trained on 11M images and 1B masks.
SAM 2 (2024): Video support added!
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor
# Load SAM model
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth") # (#1:ViT-Huge)
predictor = SamPredictor(sam)
# Set image (encodes once)
image = cv2.imread("image.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) # SAM expects RGB
predictor.set_image(image) # (#2:Run image encoder)
# Predict with point prompt
input_point = np.array([[500, 375]]) # (#3:Click location)
input_label = np.array([1]) # (#4:1=foreground, 0=background)
masks, scores, logits = predictor.predict(
point_coords=input_point,
point_labels=input_label,
multimask_output=True # (#5:Returns 3 masks)
)
# Predict with box prompt
input_box = np.array([100, 100, 400, 400]) # (#6:x1, y1, x2, y2)
masks, _, _ = predictor.predict(box=input_box)
| Loss | Formula | Use Case |
|---|---|---|
| Binary Cross-Entropy | -[y*log(p) + (1-y)*log(1-p)] | Binary classification |
| Categorical CE | -sum(y_i * log(p_i)) | Multi-class classification |
| Dice Loss | 1 - 2*\|X ∩ Y\| / (\|X\| + \|Y\|) | Segmentation |
| IoU Loss | 1 - \|X ∩ Y\| / \|X ∪ Y\| | Object detection |
| Focal Loss | -alpha*(1-p)^gamma*log(p) | Class imbalance |
import torch
import torch.nn as nn
# Binary Cross-Entropy (with logits)
criterion = nn.BCEWithLogitsLoss() # (#1:Includes sigmoid)
loss = criterion(predictions, targets)
# Multi-class Cross-Entropy
criterion = nn.CrossEntropyLoss() # (#2:Includes softmax)
loss = criterion(predictions, targets) # (#3:targets: class indices)
# Dice Loss for segmentation
def dice_loss(pred, target, smooth=1e-6):
    pred = torch.sigmoid(pred)  # (#4:Convert to probabilities)
    intersection = (pred * target).sum()
    union = pred.sum() + target.sum()
    dice = (2. * intersection + smooth) / (union + smooth)  # (#5:Dice coefficient)
    return 1 - dice  # (#6:Loss = 1 - coefficient)

# Combined loss (common for segmentation)
bce = nn.BCEWithLogitsLoss()(predictions, targets)
loss = 0.5 * bce + 0.5 * dice_loss(predictions, targets)  # (#7:Balance both objectives)
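The table above also lists Focal Loss; a minimal binary version of the standard formulation could look like this (a sketch, not a specific library's implementation):

import torch

def focal_loss(pred_logits, target, alpha=0.25, gamma=2.0):
    # target: float tensor of 0s and 1s, same shape as pred_logits
    p = torch.sigmoid(pred_logits)
    pt = p * target + (1 - p) * (1 - target)               # probability assigned to the true class
    alpha_t = alpha * target + (1 - alpha) * (1 - target)  # class-balancing weight
    return (-alpha_t * (1 - pt) ** gamma * torch.log(pt + 1e-8)).mean()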
Classic optimizer. Good generalization but slower convergence.
optim.SGD(params, lr=0.01, momentum=0.9)
Adaptive learning rates. Fast convergence, works well out-of-box.
optim.Adam(params, lr=0.001)
Adam with decoupled weight decay. Best for Transformers.
optim.AdamW(params, lr=0.001, weight_decay=0.01)
Rule of thumb: Start with AdamW for transformers, Adam for CNNs. Fine-tune with SGD+momentum for best final results.
import torch.optim as optim
# Basic optimizer setup
optimizer = optim.AdamW(
model.parameters(),
lr=1e-4, # (#1:Learning rate)
weight_decay=0.01 # (#2:L2 regularization)
)
# Different LR for different layers (transfer learning)
optimizer = optim.AdamW([
{'params': model.backbone.parameters(), 'lr': 1e-5}, # (#3:Pretrained - low LR)
{'params': model.head.parameters(), 'lr': 1e-3} # (#4:New layers - high LR)
])
# Training step
optimizer.zero_grad() # (#5:Clear gradients)
loss = criterion(model(inputs), targets)
loss.backward() # (#6:Compute gradients)
optimizer.step() # (#7:Update weights)
| Scheduler | Description | Best For |
|---|---|---|
| StepLR | Decay by gamma every N epochs | Simple decay |
| CosineAnnealing | Cosine curve from max to min LR | Transformers, long training |
| OneCycleLR | Warm-up then decay in one cycle | Fast convergence |
| ReduceLROnPlateau | Reduce when metric stops improving | Adaptive training |
| WarmupCosine | Linear warmup + cosine decay | ViT, large batch training |
from torch.optim.lr_scheduler import CosineAnnealingLR, OneCycleLR
# Cosine Annealing
scheduler = CosineAnnealingLR(
optimizer,
T_max=100, # (#1:Total epochs)
eta_min=1e-6 # (#2:Minimum LR)
)
# OneCycleLR (recommended for fast training)
scheduler = OneCycleLR(
optimizer,
max_lr=1e-3, # (#3:Peak learning rate)
epochs=100,
steps_per_epoch=len(train_loader), # (#4:Total steps)
pct_start=0.1 # (#5:10% warmup)
)
# Training loop
for epoch in range(epochs):
    for batch in train_loader:
        inputs, targets = batch
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        scheduler.step()  # (#6:Update LR each step for OneCycleLR)
    # scheduler.step()  # (#7:Or once per epoch for others)
import torch

def train_epoch(model, loader, criterion, optimizer, scheduler, device):
    model.train()  # (#1:Training mode)
    total_loss = 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)  # (#2:Move to GPU)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # (#3:Gradient clipping)
        optimizer.step()
        scheduler.step()
        total_loss += loss.item()
    return total_loss / len(loader)

# Validation loop
@torch.no_grad()  # (#4:Disable gradients)
def validate(model, loader, criterion, device):
    model.eval()  # (#5:Evaluation mode)
    # ... similar but no backward pass
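For completeness, one way the validation loop could be filled in (a sketch; the accuracy computation assumes a single-label classification task):

@torch.no_grad()
def validate(model, loader, criterion, device):
    model.eval()
    total_loss, correct, total = 0.0, 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        total_loss += criterion(outputs, labels).item()
        correct += (outputs.argmax(dim=1) == labels).sum().item()
        total += labels.size(0)
    return total_loss / len(loader), correct / total  # average loss, accuracy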
Conv, Pool, FC layers with activations form the foundation of all vision architectures
ResNet's residual connections enable training very deep networks effectively
ViT and ConvNeXt compete for state-of-the-art; choose based on data size and task
Practical advice: Start with pretrained EfficientNet or ResNet. Use AdamW + cosine scheduler. Fine-tune, don't train from scratch!
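Putting that advice together, a compact fine-tuning setup might look like this (a sketch assuming 10 classes and standard torchvision/PyTorch APIs; hyperparameters are illustrative):

import torch.nn as nn
import torch.optim as optim
import torchvision.models as models
from torch.optim.lr_scheduler import CosineAnnealingLR

model = models.efficientnet_b0(weights='IMAGENET1K_V1')
model.classifier[1] = nn.Linear(1280, 10)                  # replace the head for 10 classes

optimizer = optim.AdamW([
    {'params': model.features.parameters(), 'lr': 1e-5},   # pretrained backbone: low LR
    {'params': model.classifier.parameters(), 'lr': 1e-3}  # new head: higher LR
], weight_decay=0.01)
scheduler = CosineAnnealingLR(optimizer, T_max=50, eta_min=1e-6)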
Preparation: Download a small custom dataset from Kaggle for fine-tuning practice.
| Type | Resource |
|---|---|
| Paper | ResNet - Deep Residual Learning |
| Paper | ViT - An Image is Worth 16x16 Words |
| Paper | ConvNeXt - A ConvNet for the 2020s |
| Library | Ultralytics YOLO |
| Library | segmentation_models_pytorch |
| Tutorial | PyTorch Transfer Learning |
Build and train CNN models on Google Colab
Implement ResNet and fine-tune EfficientNet
Compare optimizers and learning rate schedules