Computer Vision

Session 1 - Introduction

Fundamentals, applications, and development environment

Today's Agenda

What is Computer Vision?

Definition

Computer Vision is a field of AI that enables computers to interpret and understand visual information from the world.

It involves acquiring, processing, analyzing, and understanding images to produce numerical or symbolic information.

AI Field Hierarchy

  • Artificial Intelligence - rule-based systems + learning
  • Machine Learning - statistical patterns
  • Deep Learning - neural networks
  • Computer Vision (CV) - applies deep learning to images

Each level is a subset of the one above it.

How Computers "See"

Human Vision

  • Eyes capture light
  • Brain processes patterns
  • Instant recognition
  • Context understanding
  • Abstract reasoning

Computer Vision

  • Sensors capture pixels
  • Algorithms process numbers
  • Pattern matching
  • Statistical inference
  • Requires training data

Key insight: An image is just a matrix of numbers (0-255 per channel) to a computer.
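
To make this concrete, here is a tiny illustrative NumPy sketch (the array and values are made up) showing that a 2x2 RGB image is nothing more than a 2x2x3 grid of 0-255 integers:

import numpy as np

# A 2x2 RGB "image": two rows, two columns, three channel values per pixel
tiny_img = np.array([
    [[255, 0, 0],   [0, 255, 0]],     # red pixel,  green pixel
    [[0, 0, 255],   [255, 255, 255]]  # blue pixel, white pixel
], dtype=np.uint8)

print(tiny_img.shape)  # (2, 2, 3) -> height, width, channels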

Historical Milestones

Year    Milestone             Impact
1957    Mark I Perceptron     First neural network hardware
1980    Neocognitron          Inspiration for CNNs
1998    LeNet-5               First practical CNN (digit recognition)
2012    AlexNet / ImageNet    Deep learning revolution begins
2020+   Vision Transformers   Attention mechanisms for images

The ImageNet Breakthrough (2012)

AlexNet achieved a 15.3% top-5 error rate on ImageNet (previous best: ~26%)

What Changed

Deep CNNs with GPU training proved vastly superior to hand-crafted features

Key Innovations

ReLU activation, dropout regularization, data augmentation, GPU computing

Impact

Started the deep learning revolution that transformed computer vision

Current Challenges in Computer Vision

  • Viewpoint variation - Object orientation
  • Illumination - Lighting conditions
  • Occlusion - Partial visibility
  • Scale variation - Size differences
  • Intra-class variation - Same class, different looks
  • Background clutter - Complex scenes
  • Deformation - Non-rigid objects
  • Domain shift - Training vs. real world

Quick Exercise: Identify the Challenge

For each scenario, identify which CV challenge is most relevant:

Scenario A

A security camera captures faces at night with minimal lighting.

Which challenge?

Scenario B

A self-driving car model trained in California fails in snowy conditions.

Which challenge?

Scenario C

Detecting a person behind a partially open door.

Which challenge?

Applications: Healthcare

Medical Imaging

X-ray, MRI, CT scan analysis for diagnosis

Tumor Detection

Automatic detection and segmentation of tumors

Retinal Analysis

Diabetic retinopathy and disease detection

Real-world impact: AI systems now match or exceed radiologist performance in specific tasks like detecting breast cancer in mammograms.

Applications: Autonomous Systems

Self-Driving Vehicles

  • Lane detection
  • Object recognition (pedestrians, vehicles)
  • Traffic sign recognition
  • Depth estimation

Robotics

  • Pick and place operations
  • Navigation and mapping
  • Quality inspection
  • Warehouse automation

Applications: Other Industries

Industry        Applications
Security        Face recognition, anomaly detection, surveillance
Retail          Visual search, virtual try-on, shelf monitoring
Agriculture     Crop monitoring, disease detection, yield prediction
Manufacturing   Quality control, defect detection, assembly verification
Sports          Player tracking, performance analysis, broadcast
Documents       OCR, document classification, form extraction

Fundamental Computer Vision Tasks

  • Classification - "What is this?" → Label + Confidence
  • Detection - "Where are the objects?" → Boxes + Labels
  • Segmentation - "Which pixels belong to what?" → Pixel-wise Mask

Key difference: Classification answers "what", Detection answers "where", Segmentation answers "which pixel".

Image Classification

Definition

Assigning a single label to an entire image

Types

  • Single-label: One class per image
  • Multi-label: Multiple classes per image
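
A rough illustration of the difference (class names are hypothetical): single-label targets are typically one-hot encoded, while multi-label targets use a multi-hot vector.

import numpy as np

classes = ["cat", "dog", "car"]     # hypothetical class list

# Single-label: exactly one class per image -> one-hot vector
single_label = np.array([0, 1, 0])  # the image is a "dog"

# Multi-label: any subset of classes per image -> multi-hot vector
multi_label = np.array([1, 0, 1])   # both a "cat" and a "car" appear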

Popular Datasets

  • ImageNet: 1000 classes, 1.2M images
  • CIFAR-10/100: 10/100 classes
  • MNIST: Handwritten digits
  • ISIC: Skin lesions

Object Detection

Definition

Locating and classifying multiple objects in an image

Output

  • Bounding boxes (x, y, w, h)
  • Class labels
  • Confidence scores

Key Models

  • YOLO (v5-v11): Real-time detection
  • R-CNN family: Two-stage detectors
  • SSD: Single Shot Detector
  • DETR: Transformer-based
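
As a quick, hedged sketch of how such a detector is used in practice (assuming the ultralytics package is installed and 'image.jpg' exists locally):

from ultralytics import YOLO

model = YOLO("yolov8n.pt")    # small pretrained model, downloaded on first use
results = model("image.jpg")  # run inference; returns one Results object per image

for box in results[0].boxes:  # detected objects in the first (only) image
    print(box.xyxy, box.conf, box.cls)  # corner coordinates, confidence, class ID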

Annotation Formats

COCO JSON:

{
  "images": [{
    "id": 1,
    "file_name": "image.jpg"
  }],
  "annotations": [{
    "id": 1, "image_id": 1,
    "bbox": [x, y, w, h]
  }]
}

YOLO TXT:

# class center_x center_y w h
0 0.5 0.5 0.3 0.4
1 0.2 0.7 0.1 0.15

Values normalized to 0-1
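
Converting between the two formats is straightforward; the helper below is our own sketch (not from a library), mapping a COCO-style pixel box to YOLO's normalized center format:

def coco_to_yolo(bbox, img_w, img_h):
    """Convert a COCO bbox [x, y, w, h] (top-left corner, in pixels)
    to YOLO format (center_x, center_y, w, h), normalized to 0-1."""
    x, y, w, h = bbox
    cx = (x + w / 2) / img_w
    cy = (y + h / 2) / img_h
    return cx, cy, w / img_w, h / img_h

# Example: a 200x150 box with top-left corner (100, 50) in a 640x480 image
print(coco_to_yolo([100, 50, 200, 150], 640, 480))
# (0.3125, 0.2604..., 0.3125, 0.3125)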

Image Segmentation Types

  • Semantic - all "car" pixels share one label → H x W class tensor
  • Instance - each car gets a unique ID → N binary masks
  • Panoptic - "stuff" + "things" combined → full scene parsing
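
A minimal NumPy sketch of the output shapes (image size and instance count are illustrative):

import numpy as np

H, W, N = 480, 640, 3                             # image size, 3 object instances

# Semantic segmentation: one class ID per pixel
semantic_mask = np.zeros((H, W), dtype=np.uint8)  # e.g. 0=background, 1=car, 2=person

# Instance segmentation: one binary mask per detected object
instance_masks = np.zeros((N, H, W), dtype=bool)  # mask i covers only instance i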

Key Models

Quick Exercise: Choose the Right Task

Which CV task would you use for each application?

Application 1

A mobile app that identifies plant species from a photo.

Classification, Detection, or Segmentation?

Application 2

A system that counts all cars in a parking lot image.

Classification, Detection, or Segmentation?

Application 3

Medical imaging tool that highlights tumor regions in MRI scans.

Classification, Detection, or Segmentation?

Image Generation

Creating new images from learned distributions

Applications

  • Data augmentation
  • Style transfer
  • Image inpainting
  • Super-resolution

Key Models

  • GANs: Generative Adversarial Networks
  • VAEs: Variational Autoencoders
  • Diffusion: Stable Diffusion, DALL-E

Python Libraries for Image Processing

# OpenCV - the most widely used computer vision library
import cv2

# PIL/Pillow - basic image manipulation
from PIL import Image

# scikit-image - classical/traditional CV algorithms
from skimage import filters, feature

# NumPy - array operations, the foundation for image arrays
import numpy as np

Deep Learning Frameworks

# TensorFlow/Keras - Google's framework (production-ready)
import tensorflow as tf
from tensorflow import keras  # high-level API

# PyTorch - Meta's framework (research-friendly)
import torch
import torchvision  # CV utilities

# Hugging Face - pretrained state-of-the-art models
from transformers import ViTForImageClassification

Development Environments

Environment          Best For                      GPU Access
Jupyter Notebooks    Interactive development       Local
Google Colab         Free GPU, quick experiments   Free T4/V100
Kaggle Notebooks     Datasets + competitions       Free P100
VS Code              Professional development      Local

Recommendation: Start with Google Colab for this course - free GPU and no setup required!

Hardware Considerations

CPU vs GPU

  • CPU: Good for inference, small models
  • GPU: Essential for training
  • GPUs offer 10-100x speedup for DL

Requirements

  • CUDA: NVIDIA GPU computing
  • cuDNN: Deep learning primitives
  • RAM: 16GB+ recommended
  • VRAM: 8GB+ for training
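
A quick sanity check that your environment actually sees a GPU (assuming PyTorch and/or TensorFlow are installed, as on Colab):

import torch
print("PyTorch CUDA available:", torch.cuda.is_available())

import tensorflow as tf
print("TensorFlow GPUs:", tf.config.list_physical_devices('GPU'))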

Loading and Displaying Images

import cv2
import matplotlib.pyplot as plt

# Load image (OpenCV uses BGR channel order by default)
img_bgr = cv2.imread('image.jpg')  # returns a NumPy array

# Convert to RGB for matplotlib
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)

# Display image
plt.figure(figsize=(10, 8))
plt.imshow(img_rgb)
plt.title('My Image')
plt.axis('off')  # hide axes
plt.show()

# Image properties
print(f"Shape: {img_rgb.shape}")  # (H, W, C)
print(f"Type: {img_rgb.dtype}")   # uint8

Images as NumPy Arrays

import numpy as np

# Image shape: (Height, Width, Channels)
# For RGB: 3 channels
img.shape  # (480, 640, 3)

# Value range: 0-255 for uint8
img.min()  # 0
img.max()  # 255

# Access pixel at (y, x)
pixel = img[100, 200]  # [R, G, B]

# Access channel
red_channel = img[:, :, 0]

Image shape (H, W, C): 480 (height) x 640 (width) x 3 (RGB) = 921,600 pixel values (0-255 each)
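
You can verify the arithmetic directly in NumPy (using a dummy array with the same shape):

import numpy as np

img = np.zeros((480, 640, 3), dtype=np.uint8)  # dummy image with the shape above

print(np.prod(img.shape))  # 921600 pixel values
print(img.nbytes)          # 921600 bytes, since uint8 uses 1 byte per value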

Quick Exercise: Image Arrays

Question 1

An RGB image has shape (1080, 1920, 3). What is the total number of pixel values stored?

Think: Height x Width x Channels = ?

Question 2

If a grayscale image is 256x256 and stored as uint8, how many bytes of memory does it use?

Hint: uint8 = 1 byte per value

Question 3

What color is a pixel with RGB values [255, 0, 0]? What about [0, 255, 0]?

Remember: RGB = Red, Green, Blue

The MNIST Dataset

Overview

  • 70,000 handwritten digit images
  • 60,000 training + 10,000 test
  • 28x28 grayscale images
  • 10 classes (digits 0-9)

Why MNIST?

  • "Hello World" of computer vision
  • Quick to train, easy to understand

Balanced class distribution: ~6,000 training samples per class (digits 0-9)

Fun fact: MNIST was created by Yann LeCun and is derived from NIST handwriting samples.

Loading MNIST Dataset

from tensorflow.keras.datasets import mnist

# Load dataset (downloads automatically)
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Explore shapes
print(f"Training: {X_train.shape}")
print(f"Test: {X_test.shape}")
print(f"Labels: {y_train.shape}")

# Value range
print(f"Min: {X_train.min()}, Max: {X_train.max()}")
Expected Output
Training: (60000, 28, 28)
Test: (10000, 28, 28)
Labels: (60000,)

Min: 0, Max: 255
60K train + 10K test images

Visualizing MNIST Samples

import matplotlib.pyplot as plt

# Display grid of samples
fig, axes = plt.subplots(2, 5, figsize=(12, 5))

for i, ax in enumerate(axes.flat):
    ax.imshow(X_train[i], cmap='gray')
    ax.set_title(f'Label: {y_train[i]}')
    ax.axis('off')

plt.suptitle('MNIST Sample Images')
plt.tight_layout()
plt.show()
Expected Output
A 2x5 grid of 28x28 grayscale digit images with their labels: 5, 0, 4, 1, 9, 2, 1, 3, 1, 4

Hands-on Lab: Getting Started

Objectives

Exercises

  1. Load a sample image and display its properties
  2. Visualize MNIST samples by class
  3. Calculate basic statistics on the dataset
  4. Plot class distribution histogram (a sketch is shown below)
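
A minimal sketch for exercise 4, reusing y_train from the earlier mnist.load_data() call:

import numpy as np
import matplotlib.pyplot as plt

counts = np.bincount(y_train, minlength=10)  # samples per digit class

plt.bar(range(10), counts)
plt.xticks(range(10))
plt.xlabel('Digit class')
plt.ylabel('Number of training samples')
plt.title('MNIST class distribution')
plt.show()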

Key Takeaways

CV is AI for Images

Enables machines to understand visual information through deep learning

Core Tasks

Classification, detection, and segmentation form the foundation

Images = Arrays

Everything is a matrix of numbers that we can process

Next Session Preview

Session 2: Data Preparation & Exploration

Preparation: Create a Kaggle account and explore the Datasets section.

Resources

Type            Resource
Course          CS231n - Stanford CNN Course
Documentation   OpenCV Documentation
Tutorial        PyTorch Tutorials
Dataset         Kaggle MNIST Competition
Paper           ResNet Paper

Questions?

Lab Time

Open Google Colab and start the hands-on exercises

Practical Work

Complete the Getting Started practical work

Kaggle

Sign up and join the MNIST Digit Recognizer competition
