Session 1: Foundations
Understanding Visual AI
How machines see and interpret visual data
2026 WayUp
18-hour comprehensive program on computer vision for business
| Session | Theme | Format |
|---|---|---|
| 1 | Foundations of Computer Vision | Theory + Demo |
| 2 | Business Applications & Use Cases | Theory + Case Studies |
| 3 | Hands-on: Cloud Vision APIs | Practical Workshop |
| 4 | Custom Models & Transfer Learning | Theory + Practice |
| 5 | Ethics, Governance & Presentations | Discussion + Projects |
| 6 | Deployment & Integration Strategies | Practice + Lab |
What you'll be able to do after completing this course
Explain the fundamental concepts and architectures behind modern computer vision systems
Identify business opportunities where computer vision creates measurable value
Select appropriate computer vision solutions for specific use cases
Build applications using cloud APIs and pre-trained models
Evaluate the costs, benefits, and risks of computer vision implementations
Assess the ethical, legal, and privacy considerations in visual AI deployments
By the end of this session, you will be able to:
Explain how machines "see" and interpret images
Define key terminology and concepts in computer vision
Distinguish between different types of computer vision tasks
Trace the evolution from classical CV to deep learning
3-hour journey through computer vision foundations
Enabling computers to derive meaningful information from visual inputs
Computer vision is a field of artificial intelligence that enables computers to derive meaningful information from digital images, videos, and other visual inputs.
Understanding the differences
How computers process visual information, step by step
Cameras, smartphones, satellites, sensors
JPEG, PNG, RAW, video streams
Preparing images for analysis
Raw Image (1920×1080) → Resize (224×224) → Normalize (0-1 range) → Feature Map (512 channels)
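A minimal sketch of the first two pipeline steps using PIL and NumPy (the file name "photo.jpg" is a placeholder; the final feature-map step requires a trained network and is omitted here):

import numpy as np
from PIL import Image

# Load a raw image, resize it to the model's input size,
# and normalize pixel values to the 0-1 range
img = Image.open("photo.jpg")                    # e.g. 1920×1080 RGB
img = img.resize((224, 224))                     # resize
arr = np.array(img).astype(np.float32) / 255.0   # normalize to 0-1
print(arr.shape, arr.min(), arr.max())           # (224, 224, 3), ~0.0, ~1.0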
At the lowest level, an image is just a matrix of numbers
| Concept | Definition | Example |
|---|---|---|
| Pixel | Smallest unit of a digital image | Single point of color |
| Resolution | Number of pixels (width × height) | 1920 × 1080 = 2.07 megapixels |
| Channels | Color components per pixel | RGB = 3, Grayscale = 1, RGBA = 4 |
| Bit Depth | Values per channel | 8-bit = 0-255, 16-bit = 0-65535 |
| Tensor | Multi-dimensional array in ML | Shape: (height, width, channels) |
Understanding the data structure
Color Image (Height x Width x 3 channels)
┌─────────────────────────────────────┐
│ Red Channel [255] [128] [64] │
│ Green Channel [100] [150] [200] │
│ Blue Channel [ 50] [ 75] [100] │
└─────────────────────────────────────┘
Each pixel = 3 values (R, G, B)
A 1920×1080 RGB image = 1920 × 1080 × 3 = 6,220,800 values
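This arithmetic is easy to verify in NumPy (a quick sketch; the array is just zeros):

import numpy as np

# A Full HD RGB image as a (height, width, channels) array
img = np.zeros((1080, 1920, 3), dtype=np.uint8)
print(img.size)  # 6220800 values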
Grayscale and RGB representations
A grayscale image is a 2D matrix of pixel intensities (0=black, 255=white):
[[  0,  50, 100, 150, 200, 255],
 [ 25,  75, 125, 175, 225, 255],
 [ 50, 100, 150, 200, 250, 255],
 [ 75, 125, 175, 225, 255, 255]]  # 4×6 grayscale image
An RGB color image is a 3D tensor with 3 channels:
# Red Channel:     Green Channel:     Blue Channel:
[[255,   0,   0],  [[  0, 255,   0],  [[  0,   0, 255],
 [128,   0,   0]]   [  0, 128,   0]]   [  0,   0, 128]]
# Combined: Shape = (2, 3, 3) → a 2×3 RGB image
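As a sketch, the same 2×3 image can be assembled in NumPy by filling the three channels along the last axis:

import numpy as np

# Build the 2×3 RGB image above as a (height, width, channels) tensor
img = np.zeros((2, 3, 3), dtype=np.uint8)
img[:, 0, 0] = [255, 128]  # red column: pixels (0,0) and (1,0)
img[:, 1, 1] = [255, 128]  # green column
img[:, 2, 2] = [255, 128]  # blue column
print(img.shape)  # (2, 3, 3)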
Different tasks answer different questions about images
Image-level → "This is a cat"
Object-level → "3 cats at x,y"
Pixel-level → "These pixels are cat"
| Task | Output | Use Case |
|---|---|---|
| Classification | Single label | Content tagging |
| Detection | Boxes + labels | Counting objects |
| Segmentation | Pixel masks | Medical imaging |
Assigning a label to an entire image
Locating and classifying multiple objects using bounding boxes
Object detection combines localization (where is it?) with classification (what is it?)
{
  "class": "person",
  "confidence": 0.95,
  "bbox": {"x": 120, "y": 80, "width": 150, "height": 320}
}
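Many libraries report boxes as corner coordinates instead; converting from the (x, y, width, height) format above is a one-liner (an illustrative helper, not part of any particular API):

def bbox_corners(bbox: dict) -> tuple:
    """Convert (x, y, width, height) to (x1, y1, x2, y2) corners."""
    x1, y1 = bbox["x"], bbox["y"]
    return x1, y1, x1 + bbox["width"], y1 + bbox["height"]

print(bbox_corners({"x": 120, "y": 80, "width": 150, "height": 320}))
# (120, 80, 270, 400)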
Classifying pixels into categories
Output: "These are car pixels"
Output: "This is Car 1, that is Car 2"
| Task | What it tells you | Output |
|---|---|---|
| Classification | "There's a cat in this image" | Single label |
| Object Detection | "There are 3 cats, here are boxes" | Boxes + labels |
| Semantic Segmentation | "These pixels are cat pixels" | Pixel mask |
| Instance Segmentation | "Cat 1 is here, Cat 2 is there" | Per-instance masks |
| Pose Estimation | "The person's arms are raised" | Keypoint coordinates |
| OCR | "The sign says 'STOP'" | Text string |
Detecting key points on objects to understand posture and movement
Pose estimation detects anatomical keypoints (joints, facial features) to understand body position and movement
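The output is typically a set of named keypoint coordinates; a hypothetical example (keypoint names and pixel values are illustrative, and formats vary by library):

# Illustrative pose-estimation output for one person
keypoints = {
    "left_shoulder": (412, 230),
    "left_elbow": (455, 160),
    "left_wrist": (470, 95),  # wrist above shoulder → arm raised
}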
Extracting text from images
Handwriting vs printed text
Multiple languages and scripts
Curved, rotated, or distorted text
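A minimal OCR sketch using the open-source Tesseract engine via pytesseract (assumes both are installed; "sign.jpg" is a placeholder):

from PIL import Image
import pytesseract

# Extract printed text from an image
text = pytesseract.image_to_string(Image.open("sign.jpg"))
print(text)  # e.g. "STOP"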
Specialized detection and analysis of human faces
| Task | Description | Use Case |
|---|---|---|
| Verification (1:1) | Is this the same person? | Phone unlock |
| Identification (1:N) | Who is this person? | Photo tagging |
| Attribute Detection | Age, emotion, accessories | Demographics analysis |
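Verification (1:1) typically reduces to comparing two face embeddings; a minimal sketch, assuming the embeddings come from some face-recognition model and using an illustrative threshold:

import numpy as np

def same_person(emb_a: np.ndarray, emb_b: np.ndarray,
                threshold: float = 0.6) -> bool:
    """Compare two face embeddings by cosine similarity."""
    sim = emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
    return sim > threshold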
From hand-crafted features to learned representations
Edge Detection, Pattern Recognition
SIFT/SURF Features, Haar Cascades
AlexNet Revolution, ResNet (152 layers)
Vision Transformers, Foundation Models
Before deep learning, CV relied on hand-crafted features
AlexNet's victory in ImageNet 2012 changed everything
GPUs: parallel processing made training feasible
Data: ImageNet with 14M+ labeled images
Algorithms: ReLU, Dropout, Batch Normalization
The architecture that powers modern computer vision
Local connectivity: nearby pixels are related
Weight sharing: the same filter is applied across the entire image
Translation invariance: a cat is a cat regardless of its position (see the sketch below)
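A minimal PyTorch sketch of weight sharing: one set of 3×3 filters slides over the whole image (sizes are illustrative):

import torch
import torch.nn as nn

# 16 filters of size 3×3, each applied at every image position
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
x = torch.randn(1, 3, 224, 224)  # a batch of one RGB image
features = conv(x)
print(features.shape)  # torch.Size([1, 16, 224, 224])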
CNNs learn progressively complex features
Edges, Corners, Simple Textures
Parts, Patterns, Structures
Objects, Faces, Categories
Layer 1: Edges → Layer 2: Eyes, Nose → Layer 3: Complete Face
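Stacking convolutional layers grows the receptive field, which is what lets deeper layers respond to larger, more complex patterns; a schematic sketch (channel counts are arbitrary):

import torch.nn as nn

# Each stage sees a larger region of the input than the one before
net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),   # stage 1: edges, textures
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),  # stage 2: parts, patterns
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),  # stage 3: objects
)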
| Architecture | Year | Key Innovation | Impact |
|---|---|---|---|
| AlexNet | 2012 | Deep CNNs + GPU | Started revolution |
| VGG | 2014 | Deeper with 3×3 filters | Simplicity |
| ResNet | 2015 | Skip connections | Very deep networks |
| YOLO | 2016 | Single-shot detection | Real-time (45 FPS) |
| EfficientNet | 2019 | Compound scaling | Accuracy/efficiency |
| ViT | 2020 | Vision Transformers | Attention mechanism |
Applying the Transformer architecture from NLP to computer vision
Each patch attends to every other patch
Performance improves with more data/compute
Pre-trained models work across tasks
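A minimal sketch of the first ViT step, splitting an image into patch tokens (assumes the standard 224×224 input with 16×16 patches):

import torch

# Split a 224×224 image into 14×14 = 196 patches of 16×16 pixels
img = torch.randn(1, 3, 224, 224)
patches = img.unfold(2, 16, 16).unfold(3, 16, 16)  # (1, 3, 14, 14, 16, 16)
# Flatten each patch into a 3 × 16 × 16 = 768-value token
tokens = patches.reshape(1, 3, 196, 256).permute(0, 2, 1, 3).reshape(1, 196, 768)
print(tokens.shape)  # torch.Size([1, 196, 768])
# Self-attention then lets every patch token attend to every other one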
Large pre-trained models that work across many tasks
Foundation Models: Large-scale models trained on diverse data that can be adapted to many downstream tasks with minimal fine-tuning
| Model | Year | Capability |
|---|---|---|
| CLIP (OpenAI) | 2021 | Vision-language, zero-shot classification |
| DALL-E (OpenAI) | 2021 | Text-to-image generation |
| SAM (Meta) | 2023 | Zero-shot segmentation of any object |
| GPT-4 Vision | 2023 | General visual understanding |
| Claude Vision | 2024 | Multimodal reasoning, documents |
Foundation models can perform tasks they weren't explicitly trained for
import torch
import clip
from PIL import Image

# Load pre-trained CLIP and its preprocessing pipeline
model, preprocess = clip.load("ViT-B/32")
image = preprocess(Image.open("dog.jpg")).unsqueeze(0)  # placeholder file
# No training on these specific classes!
classes = ["a photo of a dog", "a photo of a cat", "a photo of a bird"]
# CLIP compares the image embedding with the text embeddings
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(clip.tokenize(classes))
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
# Cosine similarity determines the classification
similarities = (100.0 * image_features @ text_features.T).softmax(dim=-1)
prediction = classes[similarities.argmax()]
# Result: e.g. "a photo of a dog" with 94% confidence
Seeing CV in action with real code examples
Exploring pixel arrays and tensors
Pre-trained ResNet on ImageNet
YOLO for real-time detection
Claude Vision for complex reasoning
import numpy as np
from PIL import Image
# Load and explore an image
img = Image.open("sample.jpg")
img_array = np.array(img)
# Examine the data structure
print(f"Shape: {img_array.shape}") # (height, width, channels)
print(f"Data type: {img_array.dtype}") # uint8 (0-255)
print(f"Min value: {img_array.min()}") # 0 (black)
print(f"Max value: {img_array.max()}") # 255 (white)
print(f"Total pixels: {img_array.size}") # h × w × c
# Access individual pixels
pixel = img_array[100, 150] # Row 100, Column 150
print(f"Pixel RGB: {pixel}") # e.g., [128, 64, 200]
import torch
from torchvision import models, transforms
from PIL import Image

# Load pre-trained ResNet-50 (ImageNet: 1000 classes)
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.eval()

# Standard ImageNet preprocessing
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

# Inference
img = Image.open("golden_retriever.jpg")
input_tensor = preprocess(img).unsqueeze(0)
with torch.no_grad():
    output = model(input_tensor)
# Output: "golden retriever" with 92% confidence
from ultralytics import YOLO

# Load YOLOv8 (You Only Look Once)
model = YOLO('yolov8n.pt')  # 'n' = nano, fast

# Run inference
results = model('street_scene.jpg')

# Process detections
for result in results:
    for box in result.boxes:
        class_id = int(box.cls[0])
        class_name = model.names[class_id]
        confidence = float(box.conf[0])
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        print(f"Detected: {class_name} ({confidence:.1%})")
        print(f"  Location: ({x1:.0f}, {y1:.0f}) to ({x2:.0f}, {y2:.0f})")
# YOLO detects 80+ classes: person, car, bicycle, dog...
import anthropic
import base64

def analyze_image(image_path: str, question: str) -> str:
    """Use Claude Vision to understand images."""
    client = anthropic.Anthropic()
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode()
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64", "media_type": "image/jpeg",
                    "data": image_data}},
                {"type": "text", "text": question}
            ]
        }]
    )
    return response.content[0].text

# Example: Complex reasoning
result = analyze_image("shelf.jpg", "What products are low on stock?")
| Aspect | Classical CV | Deep Learning | Foundation Models |
|---|---|---|---|
| Features | Hand-crafted | Learned from data | Pre-learned, general |
| Data Needed | Small | Large (thousands) | Zero to few |
| Expertise | Domain + CV | ML engineering | Prompt engineering |
| Flexibility | Task-specific | Retrainable | Highly general |
| Compute | Low | High (training) | Medium (inference) |
| Best For | Constrained envs | Custom accuracy | Rapid prototyping |
Duration: 1 hour | Deadline: Before Session 2
Identify 5 real-world applications of CV you encounter this week
Write a 300-word reflection on one impressive application
Short report (1-2 pages) with observations and reflection
Pixels, Tensors, RGB Channels, Resolution
Classification, Detection, Segmentation, OCR, Pose
Classical CV → Deep Learning (2012) → Foundation Models (2021+)
Cloud APIs, Custom Models, Vision LLMs
1. Computer vision enables machines to extract meaning from visual data
2. Images are matrices of pixel values organized as tensors
3. Deep learning revolution (2012) shifted from hand-crafted to learned features
4. Foundation models enable zero-shot capabilities
Let's discuss Computer Vision
Next: Session 2 - Business Applications & Use Cases
2026 WayUp - way-up.io