Session 1: Foundations
Understanding Visual AI
How machines see and interpret visual data
2026 WayUp
18-hour comprehensive program on computer vision for business
| Session | Theme | Format |
|---|---|---|
| 1 | Foundations of Computer Vision | Theory + Demo |
| 2 | Business Applications & Use Cases | Theory + Case Studies |
| 3 | Hands-on: Cloud Vision APIs | Practical Workshop |
| 4 | Custom Models & Transfer Learning | Theory + Practice |
| 5 | Ethics, Governance & Presentations | Discussion + Projects |
| 6 | Deployment & Integration Strategies | Practice + Lab |
What you'll be able to do after completing this course
Explain the fundamental concepts and architectures behind modern computer vision systems
Identify business opportunities where computer vision creates measurable value
Select appropriate computer vision solutions for specific use cases
Build applications using cloud APIs and pre-trained models
Evaluate the costs, benefits, and risks of computer vision implementations
Assess the ethical, legal, and privacy considerations in visual AI deployments
By the end of this session, you will be able to:
Explain how machines "see" and interpret images
Define key terminology and concepts in computer vision
Distinguish between different types of computer vision tasks
Trace the evolution from classical CV to deep learning
3-hour journey through computer vision foundations
Enabling computers to derive meaningful information from visual inputs
Computer vision is a field of artificial intelligence that enables computers to derive meaningful information from digital images, videos, and other visual inputs.
Understanding the differences
How computers process visual information, step by step
Cameras, smartphones, satellites, sensors
JPEG, PNG, RAW, video streams
Preparing images for analysis
Raw Image (1920×1080) → Resize (224×224) → Normalize (0-1 range) → Feature Map (512 channels)
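A minimal sketch of the first two pipeline steps using PIL and NumPy (the file name "photo.jpg" is a placeholder; the final feature-map step requires a trained network and is omitted here):

import numpy as np
from PIL import Image

# Load a raw image, resize it to the model's input size,
# and normalize pixel values to the 0-1 range
img = Image.open("photo.jpg")                    # e.g. 1920×1080 RGB
img = img.resize((224, 224))                     # resize
arr = np.array(img).astype(np.float32) / 255.0   # normalize to 0-1
print(arr.shape, arr.min(), arr.max())           # (224, 224, 3), ~0.0, ~1.0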
At the lowest level, an image is just a matrix of numbers
| Concept | Definition | Example |
|---|---|---|
| Pixel | Smallest unit of a digital image | Single point of color |
| Resolution | Number of pixels (width × height) | 1920 × 1080 = 2.07 megapixels |
| Channels | Color components per pixel | RGB = 3, Grayscale = 1, RGBA = 4 |
| Bit Depth | Values per channel | 8-bit = 0-255, 16-bit = 0-65535 |
| Tensor | Multi-dimensional array in ML | Shape: (height, width, channels) |
Understanding the data structure
Color Image (Height x Width x 3 channels)
┌─────────────────────────────────────┐
│ Red Channel [255] [128] [64] │
│ Green Channel [100] [150] [200] │
│ Blue Channel [ 50] [ 75] [100] │
└─────────────────────────────────────┘
Each pixel = 3 values (R, G, B)
A 1920×1080 RGB image = 1920 × 1080 × 3 = 6,220,800 values
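This arithmetic is easy to verify in NumPy (a quick sketch; the array is just zeros):

import numpy as np

# A Full HD RGB image as a (height, width, channels) array
img = np.zeros((1080, 1920, 3), dtype=np.uint8)
print(img.size)  # 6220800 values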
Grayscale and RGB representations
A grayscale image is a 2D matrix of pixel intensities (0=black, 255=white):
[[  0,  50, 100, 150, 200, 255],
 [ 25,  75, 125, 175, 225, 255],
 [ 50, 100, 150, 200, 250, 255],
 [ 75, 125, 175, 225, 255, 255]]  # 4×6 grayscale image
An RGB color image is a 3D tensor with 3 channels:
# Red Channel:     Green Channel:     Blue Channel:
[[255,   0,   0],  [[  0, 255,   0],  [[  0,   0, 255],
 [128,   0,   0]]   [  0, 128,   0]]   [  0,   0, 128]]
# Combined: Shape = (2, 3, 3) → a 2×3 RGB image
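As a sketch, the same 2×3 image can be assembled in NumPy by filling the three channels along the last axis:

import numpy as np

# Build the 2×3 RGB image above as a (height, width, channels) tensor
img = np.zeros((2, 3, 3), dtype=np.uint8)
img[:, 0, 0] = [255, 128]  # red column: pixels (0,0) and (1,0)
img[:, 1, 1] = [255, 128]  # green column
img[:, 2, 2] = [255, 128]  # blue column
print(img.shape)  # (2, 3, 3)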
Different tasks answer different questions about images
Image-level → "This is a cat"
Object-level → "3 cats at x,y"
Pixel-level → "These pixels are cat"
| Task | Output | Use Case |
|---|---|---|
| Classification | Single label | Content tagging |
| Detection | Boxes + labels | Counting objects |
| Segmentation | Pixel masks | Medical imaging |
Assigning a label to an entire image
Locating and classifying multiple objects using bounding boxes
Object detection combines localization (where is it?) with classification (what is it?)
{
  "class": "person",
  "confidence": 0.95,
  "bbox": {"x": 120, "y": 80, "width": 150, "height": 320}
}
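Many libraries report boxes as corner coordinates instead; converting from the (x, y, width, height) format above is a one-liner (an illustrative helper, not part of any particular API):

def bbox_corners(bbox: dict) -> tuple:
    """Convert (x, y, width, height) to (x1, y1, x2, y2) corners."""
    x1, y1 = bbox["x"], bbox["y"]
    return x1, y1, x1 + bbox["width"], y1 + bbox["height"]

print(bbox_corners({"x": 120, "y": 80, "width": 150, "height": 320}))
# (120, 80, 270, 400)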
Classifying pixels into categories
Output: "These are car pixels"
Output: "This is Car 1, that is Car 2"
| Task | What it tells you | Output |
|---|---|---|
| Classification | "There's a cat in this image" | Single label |
| Object Detection | "There are 3 cats, here are boxes" | Boxes + labels |
| Semantic Segmentation | "These pixels are cat pixels" | Pixel mask |
| Instance Segmentation | "Cat 1 is here, Cat 2 is there" | Per-instance masks |
| Pose Estimation | "The person's arms are raised" | Keypoint coordinates |
| OCR | "The sign says 'STOP'" | Text string |
Detecting key points on objects to understand posture and movement
Pose estimation detects anatomical keypoints (joints, facial features) to understand body position and movement
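The output is typically a set of named keypoint coordinates; a hypothetical example (keypoint names and pixel values are illustrative, and formats vary by library):

# Illustrative pose-estimation output for one person
keypoints = {
    "left_shoulder": (412, 230),
    "left_elbow": (455, 160),
    "left_wrist": (470, 95),  # wrist above shoulder → arm raised
}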
Extracting text from images
Handwriting vs printed text
Multiple languages and scripts
Curved, rotated, or distorted text
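A minimal OCR sketch using the open-source Tesseract engine via pytesseract (assumes both are installed; "sign.jpg" is a placeholder):

from PIL import Image
import pytesseract

# Extract printed text from an image
text = pytesseract.image_to_string(Image.open("sign.jpg"))
print(text)  # e.g. "STOP"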
Specialized detection and analysis of human faces
| Task | Description | Use Case |
|---|---|---|
| Verification (1:1) | Is this the same person? | Phone unlock |
| Identification (1:N) | Who is this person? | Photo tagging |
| Attribute Detection | Age, emotion, accessories | Demographics analysis |
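Verification (1:1) typically reduces to comparing two face embeddings; a minimal sketch, assuming the embeddings come from some face-recognition model and using an illustrative threshold:

import numpy as np

def same_person(emb_a: np.ndarray, emb_b: np.ndarray,
                threshold: float = 0.6) -> bool:
    """Compare two face embeddings by cosine similarity."""
    sim = emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
    return sim > threshold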
From hand-crafted features to learned representations
Edge Detection, Pattern Recognition
SIFT/SURF Features, Haar Cascades
AlexNet Revolution, ResNet (152 layers)
Vision Transformers, Foundation Models
Before deep learning, CV relied on hand-crafted features
AlexNet's victory in ImageNet 2012 changed everything
GPUs: parallel processing made training feasible
Data: ImageNet with 14M+ labeled images
Algorithms: ReLU, Dropout, Batch Normalization
The architecture that powers modern computer vision
Local connectivity: nearby pixels are related
Weight sharing: the same filter is applied across the entire image
Translation invariance: a cat is a cat regardless of its position (see the sketch below)
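A minimal PyTorch sketch of weight sharing: one set of 3×3 filters slides over the whole image (sizes are illustrative):

import torch
import torch.nn as nn

# 16 filters of size 3×3, each applied at every image position
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
x = torch.randn(1, 3, 224, 224)  # a batch of one RGB image
features = conv(x)
print(features.shape)  # torch.Size([1, 16, 224, 224])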
CNNs learn progressively complex features
Edges, Corners, Simple Textures
Parts, Patterns, Structures
Objects, Faces, Categories
Layer 1: Edges → Layer 2: Eyes, Nose → Layer 3: Complete Face
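Stacking convolutional layers grows the receptive field, which is what lets deeper layers respond to larger, more complex patterns; a schematic sketch (channel counts are arbitrary):

import torch.nn as nn

# Each stage sees a larger region of the input than the one before
net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),   # stage 1: edges, textures
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),  # stage 2: parts, patterns
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),  # stage 3: objects
)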
| Architecture | Year | Key Innovation | Impact |
|---|---|---|---|
| AlexNet | 2012 | Deep CNNs + GPU | Started revolution |
| VGG | 2014 | Deeper with 3×3 filters | Simplicity |
| ResNet | 2015 | Skip connections | Very deep networks |
| YOLO | 2016 | Single-shot detection | Real-time (45 FPS) |
| EfficientNet | 2019 | Compound scaling | Accuracy/efficiency |
| ViT | 2020 | Vision Transformers | Attention mechanism |
Applying the Transformer architecture from NLP to computer vision
Each patch attends to every other patch
Performance improves with more data/compute
Pre-trained models work across tasks
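A minimal sketch of the first ViT step, splitting an image into patch tokens (assumes the standard 224×224 input with 16×16 patches):

import torch

# Split a 224×224 image into 14×14 = 196 patches of 16×16 pixels
img = torch.randn(1, 3, 224, 224)
patches = img.unfold(2, 16, 16).unfold(3, 16, 16)  # (1, 3, 14, 14, 16, 16)
# Flatten each patch into a 3 × 16 × 16 = 768-value token
tokens = patches.reshape(1, 3, 196, 256).permute(0, 2, 1, 3).reshape(1, 196, 768)
print(tokens.shape)  # torch.Size([1, 196, 768])
# Self-attention then lets every patch token attend to every other one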
Large pre-trained models that work across many tasks
Foundation Models: Large-scale models trained on diverse data that can be adapted to many downstream tasks with minimal fine-tuning
| Model | Year | Capability |
|---|---|---|
| CLIP (OpenAI) | 2021 | Vision-language, zero-shot classification |
| DALL-E (OpenAI) | 2021 | Text-to-image generation |
| SAM (Meta) | 2023 | Zero-shot segmentation of any object |
| GPT-4 Vision | 2023 | General visual understanding |
| Claude Vision | 2024 | Multimodal reasoning, documents |
Foundation models can perform tasks they weren't explicitly trained for
import torch
import clip
from PIL import Image

# Load pre-trained CLIP and its preprocessing pipeline
model, preprocess = clip.load("ViT-B/32")
image = preprocess(Image.open("dog.jpg")).unsqueeze(0)  # placeholder file
# No training on these specific classes!
classes = ["a photo of a dog", "a photo of a cat", "a photo of a bird"]
# CLIP compares the image embedding with the text embeddings
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(clip.tokenize(classes))
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
# Cosine similarity determines the classification
similarities = (100.0 * image_features @ text_features.T).softmax(dim=-1)
prediction = classes[similarities.argmax()]
# Result: e.g. "a photo of a dog" with 94% confidence
Seeing CV in action with real code examples
Exploring pixel arrays and tensors
Pre-trained ResNet on ImageNet
YOLO for real-time detection
Claude Vision for complex reasoning
import numpy as np
from PIL import Image
# Load and explore an image
img = Image.open("sample.jpg")
img_array = np.array(img)
# Examine the data structure
print(f"Shape: {img_array.shape}") # (height, width, channels)
print(f"Data type: {img_array.dtype}") # uint8 (0-255)
print(f"Min value: {img_array.min()}") # 0 (black)
print(f"Max value: {img_array.max()}") # 255 (white)
print(f"Total pixels: {img_array.size}") # h × w × c
# Access individual pixels
pixel = img_array[100, 150] # Row 100, Column 150
print(f"Pixel RGB: {pixel}") # e.g., [128, 64, 200]
import torch
from torchvision import models, transforms
from PIL import Image

# Load pre-trained ResNet-50 (ImageNet: 1000 classes)
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.eval()

# Standard ImageNet preprocessing
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

# Inference
img = Image.open("golden_retriever.jpg")
input_tensor = preprocess(img).unsqueeze(0)
with torch.no_grad():
    output = model(input_tensor)
# Output: "golden retriever" with 92% confidence
from ultralytics import YOLO

# Load YOLOv8 (You Only Look Once)
model = YOLO('yolov8n.pt')  # 'n' = nano, fast

# Run inference
results = model('street_scene.jpg')

# Process detections
for result in results:
    for box in result.boxes:
        class_id = int(box.cls[0])
        class_name = model.names[class_id]
        confidence = float(box.conf[0])
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        print(f"Detected: {class_name} ({confidence:.1%})")
        print(f"  Location: ({x1:.0f}, {y1:.0f}) to ({x2:.0f}, {y2:.0f})")
# YOLO detects 80+ classes: person, car, bicycle, dog...
import anthropic
import base64

def analyze_image(image_path: str, question: str) -> str:
    """Use Claude Vision to understand images."""
    client = anthropic.Anthropic()
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode()
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64", "media_type": "image/jpeg",
                    "data": image_data}},
                {"type": "text", "text": question}
            ]
        }]
    )
    return response.content[0].text

# Example: Complex reasoning
result = analyze_image("shelf.jpg", "What products are low on stock?")
| Aspect | Classical CV | Deep Learning | Foundation Models |
|---|---|---|---|
| Features | Hand-crafted | Learned from data | Pre-learned, general |
| Data Needed | Small | Large (thousands) | Zero to few |
| Expertise | Domain + CV | ML engineering | Prompt engineering |
| Flexibility | Task-specific | Retrainable | Highly general |
| Compute | Low | High (training) | Medium (inference) |
| Best For | Constrained envs | Custom accuracy | Rapid prototyping |
Duration: 1 hour | Deadline: Before Session 2
Identify 5 real-world applications of CV you encounter this week
Write a 300-word reflection on one impressive application
Short report (1-2 pages) with observations and reflection
Pixels, Tensors, RGB Channels, Resolution
Classification, Detection, Segmentation, OCR, Pose
Classical CV → Deep Learning (2012) → Foundation Models (2021+)
Cloud APIs, Custom Models, Vision LLMs
1. Computer vision enables machines to extract meaning from visual data
2. Images are matrices of pixel values organized as tensors
3. Deep learning revolution (2012) shifted from hand-crafted to learned features
4. Foundation models enable zero-shot capabilities
Let's discuss Computer Vision
Next: Session 2 - Business Applications & Use Cases
2026 WayUp - way-up.io