Computer Vision

Session 1 - Introduction

Fundamentals, applications, and development environment

Today's Agenda

What is Computer Vision?

Definition

Computer Vision is a field of AI that enables computers to interpret and understand visual information from the world.

It involves acquiring, processing, analyzing, and understanding images to produce numerical or symbolic information.

AI Field Hierarchy

  • Artificial Intelligence - rule-based systems + learning
  • Machine Learning - statistical patterns
  • Deep Learning - neural networks
  • Computer Vision (CV) - applies deep learning to images

Each level is a subset of the one above it.

How Computers "See"

Human Vision

  • Eyes capture light
  • Brain processes patterns
  • Instant recognition
  • Context understanding
  • Abstract reasoning

Computer Vision

  • Sensors capture pixels
  • Algorithms process numbers
  • Pattern matching
  • Statistical inference
  • Requires training data

Key insight: An image is just a matrix of numbers (0-255 per channel) to a computer.
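
To make this concrete, here is a tiny illustrative NumPy sketch (the array and values are made up) showing that a 2x2 RGB image is nothing more than a 2x2x3 grid of 0-255 integers:

import numpy as np

# A 2x2 RGB "image": two rows, two columns, three channel values per pixel
tiny_img = np.array([
    [[255, 0, 0],   [0, 255, 0]],     # red pixel,  green pixel
    [[0, 0, 255],   [255, 255, 255]]  # blue pixel, white pixel
], dtype=np.uint8)

print(tiny_img.shape)  # (2, 2, 3) -> height, width, channels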

Historical Milestones

Year    Milestone             Impact
1957    Mark I Perceptron     First neural network hardware
1980    Neocognitron          Inspiration for CNNs
1998    LeNet-5               First practical CNN (digit recognition)
2012    AlexNet / ImageNet    Deep learning revolution begins
2020+   Vision Transformers   Attention mechanisms for images

The ImageNet Breakthrough (2012)

AlexNet achieved a 15.3% top-5 error rate on ImageNet (previous best: ~26%)

What Changed

Deep CNNs with GPU training proved vastly superior to hand-crafted features

Key Innovations

ReLU activation, dropout regularization, data augmentation, GPU computing

Impact

Started the deep learning revolution that transformed computer vision

Current Challenges in Computer Vision

  • Viewpoint variation - Object orientation
  • Illumination - Lighting conditions
  • Occlusion - Partial visibility
  • Scale variation - Size differences
  • Intra-class variation - Same class, different looks
  • Background clutter - Complex scenes
  • Deformation - Non-rigid objects
  • Domain shift - Training vs. real world

Quick Exercise: Identify the Challenge

For each scenario, identify which CV challenge is most relevant:

Scenario A

A security camera captures faces at night with minimal lighting.

Which challenge?

Scenario B

A self-driving car model trained in California fails in snowy conditions.

Which challenge?

Scenario C

Detecting a person behind a partially open door.

Which challenge?

Applications: Healthcare

Medical Imaging

X-ray, MRI, CT scan analysis for diagnosis

Tumor Detection

Automatic detection and segmentation of tumors

Retinal Analysis

Diabetic retinopathy and disease detection

Real-world impact: AI systems now match or exceed radiologist performance in specific tasks like detecting breast cancer in mammograms.

Applications: Autonomous Systems

Self-Driving Vehicles

  • Lane detection
  • Object recognition (pedestrians, vehicles)
  • Traffic sign recognition
  • Depth estimation

Robotics

  • Pick and place operations
  • Navigation and mapping
  • Quality inspection
  • Warehouse automation

Applications: Other Industries

Industry        Applications
Security        Face recognition, anomaly detection, surveillance
Retail          Visual search, virtual try-on, shelf monitoring
Agriculture     Crop monitoring, disease detection, yield prediction
Manufacturing   Quality control, defect detection, assembly verification
Sports          Player tracking, performance analysis, broadcast
Documents       OCR, document classification, form extraction

Fundamental Computer Vision Tasks

  • Classification - "What is this?" → Label + Confidence
  • Detection - "Where are the objects?" → Boxes + Labels
  • Segmentation - "Which pixels belong to what?" → Pixel-wise Mask

Key difference: Classification answers "what", Detection answers "where", Segmentation answers "which pixel".

Image Classification

Definition

Assigning a single label to an entire image

Types

  • Single-label: One class per image
  • Multi-label: Multiple classes per image
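
A rough illustration of the difference (class names are hypothetical): single-label targets are typically one-hot encoded, while multi-label targets use a multi-hot vector.

import numpy as np

classes = ["cat", "dog", "car"]     # hypothetical class list

# Single-label: exactly one class per image -> one-hot vector
single_label = np.array([0, 1, 0])  # the image is a "dog"

# Multi-label: any subset of classes per image -> multi-hot vector
multi_label = np.array([1, 0, 1])   # both a "cat" and a "car" appear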

Popular Datasets

  • ImageNet: 1000 classes, 1.2M images
  • CIFAR-10/100: 10/100 classes
  • MNIST: Handwritten digits
  • ISIC: Skin lesions

Object Detection

Definition

Locating and classifying multiple objects in an image

Output

  • Bounding boxes (x, y, w, h)
  • Class labels
  • Confidence scores

Key Models

  • YOLO (v5-v11): Real-time detection
  • R-CNN family: Two-stage detectors
  • SSD: Single Shot Detector
  • DETR: Transformer-based
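
As a quick, hedged sketch of how such a detector is used in practice (assuming the ultralytics package is installed and 'image.jpg' exists locally):

from ultralytics import YOLO

model = YOLO("yolov8n.pt")    # small pretrained model, downloaded on first use
results = model("image.jpg")  # run inference; returns one Results object per image

for box in results[0].boxes:  # detected objects in the first (only) image
    print(box.xyxy, box.conf, box.cls)  # corner coordinates, confidence, class ID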

Annotation Formats

COCO JSON:

{
  "images": [{
    "id": 1,
    "file_name": "image.jpg"
  }],
  "annotations": [{
    "id": 1, "image_id": 1,
    "bbox": [x, y, w, h]
  }]
}

YOLO TXT:

# class center_x center_y w h
0 0.5 0.5 0.3 0.4
1 0.2 0.7 0.1 0.15

Values normalized to 0-1
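
Converting between the two formats is straightforward; the helper below is our own sketch (not from a library), mapping a COCO-style pixel box to YOLO's normalized center format:

def coco_to_yolo(bbox, img_w, img_h):
    """Convert a COCO bbox [x, y, w, h] (top-left corner, in pixels)
    to YOLO format (center_x, center_y, w, h), normalized to 0-1."""
    x, y, w, h = bbox
    cx = (x + w / 2) / img_w
    cy = (y + h / 2) / img_h
    return cx, cy, w / img_w, h / img_h

# Example: a 200x150 box with top-left corner (100, 50) in a 640x480 image
print(coco_to_yolo([100, 50, 200, 150], 640, 480))
# (0.3125, 0.2604..., 0.3125, 0.3125)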

Image Segmentation Types

  • Semantic - all "car" pixels share one label → H x W class tensor
  • Instance - each car gets a unique ID → N binary masks
  • Panoptic - "stuff" + "things" combined → full scene parsing
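
A minimal NumPy sketch of the output shapes (image size and instance count are illustrative):

import numpy as np

H, W, N = 480, 640, 3                             # image size, 3 object instances

# Semantic segmentation: one class ID per pixel
semantic_mask = np.zeros((H, W), dtype=np.uint8)  # e.g. 0=background, 1=car, 2=person

# Instance segmentation: one binary mask per detected object
instance_masks = np.zeros((N, H, W), dtype=bool)  # mask i covers only instance i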

Key Models

Quick Exercise: Choose the Right Task

Which CV task would you use for each application?

Application 1

A mobile app that identifies plant species from a photo.

Classification, Detection, or Segmentation?

Application 2

A system that counts all cars in a parking lot image.

Classification, Detection, or Segmentation?

Application 3

Medical imaging tool that highlights tumor regions in MRI scans.

Classification, Detection, or Segmentation?

Image Generation

Creating new images from learned distributions

Applications

  • Data augmentation
  • Style transfer
  • Image inpainting
  • Super-resolution

Key Models

  • GANs: Generative Adversarial Networks
  • VAEs: Variational Autoencoders
  • Diffusion: Stable Diffusion, DALL-E

Python Libraries for Image Processing

# OpenCV - the most widely used computer vision library
import cv2

# PIL/Pillow - basic image manipulation
from PIL import Image

# scikit-image - classical/traditional CV algorithms
from skimage import filters, feature

# NumPy - array operations, the foundation for image arrays
import numpy as np

Deep Learning Frameworks

# TensorFlow/Keras - Google's framework (production-ready)
import tensorflow as tf
from tensorflow import keras  # high-level API

# PyTorch - Meta's framework (research-friendly)
import torch
import torchvision  # CV utilities

# Hugging Face - pretrained state-of-the-art models
from transformers import ViTForImageClassification

Development Environments

Environment          Best For                      GPU Access
Jupyter Notebooks    Interactive development       Local
Google Colab         Free GPU, quick experiments   Free T4/V100
Kaggle Notebooks     Datasets + competitions       Free P100
VS Code              Professional development      Local

Recommendation: Start with Google Colab for this course - free GPU and no setup required!

Hardware Considerations

CPU vs GPU

  • CPU: Good for inference, small models
  • GPU: Essential for training
  • GPUs offer 10-100x speedup for DL

Requirements

  • CUDA: NVIDIA GPU computing
  • cuDNN: Deep learning primitives
  • RAM: 16GB+ recommended
  • VRAM: 8GB+ for training
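
A quick sanity check that your environment actually sees a GPU (assuming PyTorch and/or TensorFlow are installed, as on Colab):

import torch
print("PyTorch CUDA available:", torch.cuda.is_available())

import tensorflow as tf
print("TensorFlow GPUs:", tf.config.list_physical_devices('GPU'))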

Loading and Displaying Images

import cv2
import matplotlib.pyplot as plt

# Load image (OpenCV uses BGR channel order by default)
img_bgr = cv2.imread('image.jpg')  # returns a NumPy array

# Convert to RGB for matplotlib
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)

# Display image
plt.figure(figsize=(10, 8))
plt.imshow(img_rgb)
plt.title('My Image')
plt.axis('off')  # hide axes
plt.show()

# Image properties
print(f"Shape: {img_rgb.shape}")  # (H, W, C)
print(f"Type: {img_rgb.dtype}")   # uint8

Images as NumPy Arrays

import numpy as np

# Image shape: (Height, Width, Channels)
# For RGB: 3 channels
img.shape  # (480, 640, 3)

# Value range: 0-255 for uint8
img.min()  # 0
img.max()  # 255

# Access pixel at (y, x)
pixel = img[100, 200]  # [R, G, B]

# Access channel
red_channel = img[:, :, 0]

Image shape (H, W, C): 480 (height) x 640 (width) x 3 (RGB) = 921,600 pixel values (0-255 each)
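
You can verify the arithmetic directly in NumPy (using a dummy array with the same shape):

import numpy as np

img = np.zeros((480, 640, 3), dtype=np.uint8)  # dummy image with the shape above

print(np.prod(img.shape))  # 921600 pixel values
print(img.nbytes)          # 921600 bytes, since uint8 uses 1 byte per value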

Quick Exercise: Image Arrays

Question 1

An RGB image has shape (1080, 1920, 3). What is the total number of pixel values stored?

Think: Height x Width x Channels = ?

Question 2

If a grayscale image is 256x256 and stored as uint8, how many bytes of memory does it use?

Hint: uint8 = 1 byte per value

Question 3

What color is a pixel with RGB values [255, 0, 0]? What about [0, 255, 0]?

Remember: RGB = Red, Green, Blue

The MNIST Dataset

Overview

  • 70,000 handwritten digit images
  • 60,000 training + 10,000 test
  • 28x28 grayscale images
  • 10 classes (digits 0-9)

Why MNIST?

  • "Hello World" of computer vision
  • Quick to train, easy to understand

Balanced class distribution: ~6,000 training samples per class (digits 0-9)

Fun fact: MNIST was created by Yann LeCun and is derived from NIST handwriting samples.

Loading MNIST Dataset

from tensorflow.keras.datasets import mnist

# Load dataset (downloads automatically)
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Explore shapes
print(f"Training: {X_train.shape}")
print(f"Test: {X_test.shape}")
print(f"Labels: {y_train.shape}")

# Value range
print(f"Min: {X_train.min()}, Max: {X_train.max()}")
Expected Output
Training: (60000, 28, 28)
Test: (10000, 28, 28)
Labels: (60000,)

Min: 0, Max: 255
60K train + 10K test images

Visualizing MNIST Samples

import matplotlib.pyplot as plt

# Display grid of samples
fig, axes = plt.subplots(2, 5, figsize=(12, 5))

for i, ax in enumerate(axes.flat):
    ax.imshow(X_train[i], cmap='gray')
    ax.set_title(f'Label: {y_train[i]}')
    ax.axis('off')

plt.suptitle('MNIST Sample Images')
plt.tight_layout()
plt.show()
Expected Output
A 2x5 grid of 28x28 grayscale digit images with their labels: 5, 0, 4, 1, 9, 2, 1, 3, 1, 4

Hands-on Lab: Getting Started

Objectives

Exercises

  1. Load a sample image and display its properties
  2. Visualize MNIST samples by class
  3. Calculate basic statistics on the dataset
  4. Plot class distribution histogram (a sketch is shown below)
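
A minimal sketch for exercise 4, reusing y_train from the earlier mnist.load_data() call:

import numpy as np
import matplotlib.pyplot as plt

counts = np.bincount(y_train, minlength=10)  # samples per digit class

plt.bar(range(10), counts)
plt.xticks(range(10))
plt.xlabel('Digit class')
plt.ylabel('Number of training samples')
plt.title('MNIST class distribution')
plt.show()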

Key Takeaways

CV is AI for Images

Enables machines to understand visual information through deep learning

Core Tasks

Classification, detection, and segmentation form the foundation

Images = Arrays

Everything is a matrix of numbers that we can process

Next Session Preview

Session 2: Data Preparation & Exploration

Preparation: Create a Kaggle account and explore the Datasets section.

Resources

Type            Resource
Course          CS231n - Stanford CNN Course
Documentation   OpenCV Documentation
Tutorial        PyTorch Tutorials
Dataset         Kaggle MNIST Competition
Paper           ResNet Paper

Questions?

Lab Time

Open Google Colab and start the hands-on exercises

Practical Work

Complete the Getting Started practical work

Kaggle

Sign up and join the MNIST Digit Recognizer competition
