Fundamentals, applications, and development environment
Computer Vision is a field of AI that enables computers to interpret and understand visual information from the world.
It involves acquiring, processing, analyzing, and understanding images to produce numerical or symbolic information.
Key insight: An image is just a matrix of numbers (0-255 per channel) to a computer.
| Year | Milestone | Impact |
|---|---|---|
| 1957 | Mark I Perceptron | First neural network hardware |
| 1980 | Neocognitron | Inspiration for CNNs |
| 1998 | LeNet-5 | First practical CNN (digit recognition) |
| 2012 | AlexNet / ImageNet | Deep learning revolution begins |
| 2020+ | Vision Transformers | Attention mechanisms for images |
AlexNet achieved a 15.3% top-5 error rate in the 2012 ImageNet challenge (next-best entry: about 26%)
Deep CNNs with GPU training proved vastly superior to hand-crafted features
AlexNet combined ReLU activation, dropout regularization, data augmentation, and GPU computing
Started the deep learning revolution that transformed computer vision
For each scenario, identify which CV challenge is most relevant:
A security camera captures faces at night with minimal lighting.
Which challenge?
A self-driving car model trained in California fails in snowy conditions.
Which challenge?
Detecting a person behind a partially open door.
Which challenge?
X-ray, MRI, CT scan analysis for diagnosis
Automatic detection and segmentation of tumors
Diabetic retinopathy and disease detection
Real-world impact: In some studies, AI systems have matched or exceeded radiologist performance on specific, narrowly defined tasks such as detecting breast cancer in mammograms.
| Industry | Applications |
|---|---|
| Security | Face recognition, anomaly detection, surveillance |
| Retail | Visual search, virtual try-on, shelf monitoring |
| Agriculture | Crop monitoring, disease detection, yield prediction |
| Manufacturing | Quality control, defect detection, assembly verification |
| Sports | Player tracking, performance analysis, broadcast |
| Documents | OCR, document classification, form extraction |
Key difference: Classification answers "what", Detection answers "where", Segmentation answers "which pixel".
Classification: assigning a single label to an entire image
Detection: locating and classifying multiple objects in an image
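To make the "what / where / which pixel" distinction concrete, the sketch below shows what the output of each task might look like as plain Python data. The image size, class names, and box coordinates are hypothetical values chosen for illustration, and filling whole boxes into the mask is a simplification: real segmentation masks follow object outlines.

```python
import numpy as np

# Hypothetical 4x6 image and class names, used only for illustration
height, width = 4, 6
class_names = ["background", "cat", "dog"]

# Classification ("what"): one label for the whole image
classification_output = 1  # index into class_names

# Detection ("where"): one label and one box per object,
# here with boxes given as (x, y, w, h) in pixels
detection_output = [
    {"label": 1, "bbox": (0, 0, 3, 2)},
    {"label": 2, "bbox": (3, 1, 3, 3)},
]

# Segmentation ("which pixel"): one label per pixel,
# so the output has the same height and width as the image
segmentation_output = np.zeros((height, width), dtype=np.uint8)
for det in detection_output:
    x, y, w, h = det["bbox"]
    # Simplification: mark the whole box region with the object's label
    segmentation_output[y:y + h, x:x + w] = det["label"]

print(class_names[classification_output])
print(len(detection_output), "objects")
print(segmentation_output)
```

The amount of output grows with each task: a single number, then a short list of boxes, then a full array with one entry per pixel.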
COCO JSON:
{
  "images": [{
    "id": 1,
    "file_name": "image.jpg"
  }],
  "annotations": [{
    "id": 1, "image_id": 1,
    "bbox": [x, y, w, h]
  }]
}
YOLO TXT:
# class center_x center_y w h
0 0.5 0.5 0.3 0.4
1 0.2 0.7 0.1 0.15
Box values are normalized to 0-1 relative to the image width and height; the class is an integer index
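The two formats describe the same box in different ways: a COCO bbox is the top-left corner plus width and height in pixels, while a YOLO line stores the box center and size as fractions of the image dimensions. A minimal conversion sketch follows. The function names `coco_to_yolo` and `yolo_to_coco` are hypothetical helpers written for this example, not part of any library.

```python
def coco_to_yolo(bbox, img_w, img_h):
    """Convert a COCO box [x_min, y_min, w, h] in pixels
    to YOLO (center_x, center_y, w, h) normalized to 0-1."""
    x_min, y_min, w, h = bbox
    center_x = (x_min + w / 2) / img_w
    center_y = (y_min + h / 2) / img_h
    return center_x, center_y, w / img_w, h / img_h


def yolo_to_coco(box, img_w, img_h):
    """Convert a normalized YOLO box back to COCO pixel coordinates."""
    center_x, center_y, w, h = box
    box_w = w * img_w
    box_h = h * img_h
    x_min = center_x * img_w - box_w / 2
    y_min = center_y * img_h - box_h / 2
    return [x_min, y_min, box_w, box_h]


# A 200x100 pixel box whose top-left corner is at (100, 50) in a 400x200 image
yolo_box = coco_to_yolo([100, 50, 200, 100], img_w=400, img_h=200)
print(yolo_box)                          # (0.5, 0.5, 0.5, 0.5)
print(yolo_to_coco(yolo_box, 400, 200))  # [100.0, 50.0, 200.0, 100.0]
```

Because YOLO values are relative, the same label file remains valid when the image is resized, whereas COCO pixel coordinates have to be rescaled.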
Which CV task would you use for each application?
A mobile app that identifies plant species from a photo.
Classification, Detection, or Segmentation?
A system that counts all cars in a parking lot image.
Classification, Detection, or Segmentation?
Medical imaging tool that highlights tumor regions in MRI scans.
Classification, Detection, or Segmentation?
Image generation: creating new images from learned distributions
# OpenCV - Comprehensive computer vision library
import cv2  # Most used CV library
# PIL/Pillow - Basic image manipulation
from PIL import Image  # Simple image operations
# scikit-image - Classical algorithms
from skimage import filters, feature  # Traditional CV algorithms
# NumPy - Array operations
import numpy as np  # Foundation for image arrays

# TensorFlow/Keras - Google's framework
import tensorflow as tf  # Production-ready
from tensorflow import keras  # High-level API
# PyTorch - Meta's framework
import torch  # Research-friendly
import torchvision  # CV utilities
# Hugging Face - Pretrained models
from transformers import ViTForImageClassification  # State-of-the-art models
| Environment | Best For | GPU Access |
|---|---|---|
| Jupyter Notebooks | Interactive development | Local |
| Google Colab | Free GPU, quick experiments | Free tier, typically T4 (availability varies) |
| Kaggle Notebooks | Datasets + competitions | Free P100 |
| VS Code | Professional development | Local |
Recommendation: Start with Google Colab for this course - free GPU and no setup required!
import cv2
import matplotlib.pyplot as plt

# Load image (OpenCV uses BGR channel order by default)
img_bgr = cv2.imread('image.jpg')  # Returns a NumPy array, or None if the file cannot be read
if img_bgr is None:
    raise FileNotFoundError("Could not read 'image.jpg'")

# Convert to RGB for matplotlib
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)  # BGR to RGB

# Display image
plt.figure(figsize=(10, 8))
plt.imshow(img_rgb)
plt.title('My Image')
plt.axis('off')  # Hide axes
plt.show()

# Image properties
print(f"Shape: {img_rgb.shape}")  # (H, W, C)
print(f"Type: {img_rgb.dtype}")  # uint8
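The BGR-to-RGB conversion is only a reordering of the channel axis. The sketch below illustrates this with NumPy alone, using a hypothetical two-pixel image so that the result is easy to check by eye. For 3-channel images, reversing the last axis should give the same values as `cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)`.

```python
import numpy as np

# A 1x2 image in BGR order: first pixel pure blue, second pixel pure red
img_bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Reversing the last axis swaps the B and R channels (G stays in the middle)
img_rgb = img_bgr[..., ::-1]

print(img_rgb[0, 0])  # blue pixel, now in RGB order: [  0   0 255]
print(img_rgb[0, 1])  # red pixel, now in RGB order:  [255   0   0]
```

Skipping this step is a common reason images look blue-tinted in matplotlib: the data is intact, but red and blue are interpreted the wrong way round.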
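import numpy as np

# Assumes img is an RGB uint8 array, e.g. img_rgb from the loading example
# Image shape: (Height, Width, Channels)
# For RGB: 3 channels
img.shape  # e.g. (480, 640, 3)

# Value range: 0-255 for uint8
img.min()  # at least 0
img.max()  # at most 255

# Access pixel at row y=100, column x=200 (row index comes first)
pixel = img[100, 200]  # [R, G, B]

# Access a single channel (index 0 is red in RGB order)
red_channel = img[:, :, 0]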
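The same properties can be explored without an image file. The sketch below builds a small synthetic RGB image (the 120x160 size and the colors are arbitrary choices for illustration) and checks its shape, number of stored values, and memory use.

```python
import numpy as np

# Synthetic 120x160 RGB image: left half blue, right half white
img = np.zeros((120, 160, 3), dtype=np.uint8)
img[:, :80] = [0, 0, 255]
img[:, 80:] = [255, 255, 255]

print(img.shape)   # (120, 160, 3)
print(img.size)    # 57600 values = 120 * 160 * 3
print(img.nbytes)  # 57600 bytes, since uint8 uses 1 byte per value

# Indexing is [row (y), column (x)], so this reads the pixel at y=10, x=20
print(img[10, 20])   # [  0   0 255]
print(img[10, 100])  # [255 255 255]

# The red channel is a 2D array with one value per pixel
red_channel = img[:, :, 0]
print(red_channel.shape)  # (120, 160)
```

`size` counts values and `nbytes` counts bytes; they coincide here only because each `uint8` value occupies one byte.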
An RGB image has shape (1080, 1920, 3). What is the total number of pixel values stored?
Think: Height x Width x Channels = ?
If a grayscale image is 256x256 and stored as uint8, how many bytes of memory does it use?
Hint: uint8 = 1 byte per value
What color is a pixel with RGB values [255, 0, 0]? What about [0, 255, 0]?
Remember: RGB = Red, Green, Blue
Fun fact: MNIST was assembled by Yann LeCun, Corinna Cortes, and Christopher Burges from NIST handwriting samples.
from tensorflow.keras.datasets import mnist
# Load dataset (downloads automatically)
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# Explore shapes
print(f"Training: {X_train.shape}")
print(f"Test: {X_test.shape}")
print(f"Labels: {y_train.shape}")
# Value range
print(f"Min: {X_train.min()}, Max: {X_train.max()}")
import matplotlib.pyplot as plt

# Display grid of samples
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.flat):
    ax.imshow(X_train[i], cmap='gray')
    ax.set_title(f'Label: {y_train[i]}')
    ax.axis('off')
plt.suptitle('MNIST Sample Images')
plt.tight_layout()
plt.show()
Computer vision enables machines to understand visual information, today largely through deep learning
Classification, detection, and segmentation are the foundational tasks
To a computer, every image is a matrix of numbers that we can process
Preparation: Create a Kaggle account and explore the Datasets section.
| Type | Resource |
|---|---|
| Course | CS231n - Stanford CNN Course |
| Documentation | OpenCV Documentation |
| Tutorial | PyTorch Tutorials |
| Dataset | Kaggle MNIST Competition |
| Paper | ResNet Paper |
Open Google Colab and start the hands-on exercises
Complete the Getting Started practical work
Sign up on Kaggle and join the MNIST Digit Recognizer competition