Computer Vision for Business

Session 1: Foundations

Understanding Visual AI

How machines see and interpret visual data

Course Overview

18-hour comprehensive program on computer vision for business

1. Foundations: Core concepts
2. Business: Use cases
3. Cloud APIs: Hands-on
4. Custom: Your models
5. Ethics: Responsible AI
6. Deploy: Production

Format: 3 hours per session | Theory + Hands-on + Assignments

Course Sessions

Session | Theme                                | Format
--------|--------------------------------------|----------------------
1       | Foundations of Computer Vision       | Theory + Demo
2       | Business Applications & Use Cases    | Theory + Case Studies
3       | Hands-on: Cloud Vision APIs          | Practical Workshop
4       | Custom Models & Transfer Learning    | Theory + Practice
5       | Ethics, Governance & Presentations   | Discussion + Projects
6       | Deployment & Integration Strategies  | Practice + Lab

Prerequisites: Basic Python programming, fundamental understanding of machine learning concepts

Course Learning Objectives

What you'll be able to do after completing this course

1. Understand: the fundamental concepts and architectures behind modern computer vision systems
2. Identify: business opportunities where computer vision creates measurable value
3. Evaluate: select appropriate computer vision solutions for specific use cases
4. Prototype: build applications using cloud APIs and pre-trained models
5. Analyze: costs, benefits, and risks of computer vision implementations
6. Navigate: ethical, legal, and privacy considerations in visual AI deployments

Session 1: Learning Objectives

By the end of this session, you will be able to:

  • Understand how machines "see" and interpret images
  • Master key terminology and concepts in computer vision
  • Recognize the different types of computer vision tasks
  • Trace the evolution from classical CV to deep learning

Hands-on: Live demonstrations of CV in action

Session 1 Structure

3-hour journey through computer vision foundations

1. Theory (90 min)
2. Demos (60 min)
3. Q&A (30 min)

Part 1: Theory (90 min)

  • Core Concepts
  • Key Terminology
  • CV Task Types
  • Deep Learning Evolution

Part 2: Demos (60 min)

  • Image as Data
  • Classification
  • Object Detection
  • Vision LLMs

What is Computer Vision?

Enabling computers to derive meaningful information from visual inputs

Computer vision is a field of artificial intelligence that enables computers to derive meaningful information from digital images, videos, and other visual inputs.

Input (visual data) → Analysis (process) → Understanding (extract meaning) → Action (make decisions)

Goal: Bridge the gap between pixels and meaning

Human vs. Machine Vision

Understanding the differences

Human Vision

  • Evolved over millions of years
  • Instinctive pattern recognition
  • Context-aware interpretation
  • Handles variability naturally

Machine Vision

  • Must be explicitly trained
  • Learns patterns from data
  • Requires structured inputs
  • Struggles with edge cases

Processing Pipeline

Human: Light → Retina → Visual Cortex → Understanding
Machine: Light → Sensor → Neural Network → Prediction

The Visual Processing Pipeline

How computers process visual information, step by step

Acquisition (capture) → Preprocessing (clean) → Feature Extraction → Interpretation (analyze) → Decision (output)

Input Sources

Cameras, smartphones, satellites, sensors

Digital Formats

JPEG, PNG, RAW, video streams

Key considerations: Resolution, frame rate, lighting conditions, sensor quality

Preprocessing & Feature Extraction

Preparing images for analysis

Preprocessing

  • Resize / crop
  • Normalize pixel values
  • Noise reduction
  • Color correction
  • Data augmentation

Feature Extraction

  • Edge detection
  • Texture patterns
  • Shape descriptors
  • Color histograms
  • Deep features (CNN)

Example Pipeline

Raw Image (1920×1080) → Resize (224×224) → Normalize (0-1 range) → Feature Map (512 channels)
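
A minimal sketch of the first two steps with PIL and NumPy (the 512-channel feature map would come from a CNN, covered later; "photo.jpg" is an assumed local file):

import numpy as np
from PIL import Image

# Resize and normalize, matching the pipeline above
img = Image.open("photo.jpg").convert("RGB")     # raw image, e.g. 1920×1080
img = img.resize((224, 224))                     # resize to model input size
arr = np.asarray(img, dtype=np.float32) / 255.0  # normalize to the 0-1 range
print(arr.shape)  # (224, 224, 3)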

Digital Image Representation

At the lowest level, an image is just a matrix of numbers

Concept    | Definition                        | Example
-----------|-----------------------------------|----------------------------------
Pixel      | Smallest unit of a digital image  | Single point of color
Resolution | Number of pixels (width × height) | 1920 × 1080 = 2.07 megapixels
Channels   | Color components per pixel        | RGB = 3, Grayscale = 1, RGBA = 4
Bit Depth  | Values per channel                | 8-bit = 0-255, 16-bit = 0-65535
Tensor     | Multi-dimensional array in ML     | Shape: (height, width, channels)

Image as a 3D Tensor

Understanding the data structure

Color Image (Height × Width × 3 channels)

┌─────────────────────────────────────┐
│  Red Channel    [255] [128] [64]    │
│  Green Channel  [100] [150] [200]   │
│  Blue Channel   [ 50] [ 75] [100]   │
└─────────────────────────────────────┘

Each pixel = 3 values (R, G, B)

Key Insight

A 1920×1080 RGB image = 1920 × 1080 × 3 = 6,220,800 values
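
You can check this arithmetic directly in NumPy (a synthetic all-black image used for illustration):

import numpy as np

# A synthetic Full HD RGB image: (height, width, channels)
img = np.zeros((1080, 1920, 3), dtype=np.uint8)
print(img.shape)  # (1080, 1920, 3)
print(img.size)   # 6220800: one 8-bit value per channel per pixel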

Image as a Matrix

Grayscale and RGB representations

A grayscale image is a 2D matrix of pixel intensities (0=black, 255=white):

[[  0,  50, 100, 150, 200, 255],
 [ 25,  75, 125, 175, 225, 255],
 [ 50, 100, 150, 200, 250, 255],
 [ 75, 125, 175, 225, 255, 255]]  # 4×6 grayscale image

An RGB color image is a 3D tensor with 3 channels:

# Red Channel:    Green Channel:   Blue Channel:
[[255, 0, 0],   [[0, 255, 0],    [[0, 0, 255],
 [128, 0, 0]]    [0, 128, 0]]     [0, 0, 128]]

# Combined: Shape = (2, 3, 3) → 2×3 RGB image

Computer Vision Task Types

Different tasks answer different questions about images

1. Classification: image-level → "This is a cat"
2. Detection: object-level → "3 cats at x,y"
3. Segmentation: pixel-level → "These pixels are cat"

Task           | Output         | Use Case
---------------|----------------|------------------
Classification | Single label   | Content tagging
Detection      | Boxes + labels | Counting objects
Segmentation   | Pixel masks    | Medical imaging

CV Task: Image Classification

Assigning a label to an entire image

Input (image) → Classifier (process) → Probabilities (scores) → Label ("cat")

Single-Label

  • One category per image
  • Mutually exclusive classes
  • Example: "cat" vs "dog"

Multi-Label

  • Multiple categories per image
  • Non-exclusive labels
  • Example: "sunny", "beach", "people"

CV Task: Object Detection

Locating and classifying multiple objects using bounding boxes

Object detection combines localization (where is it?) with classification (what is it?)

Image (input) → Detector (process) → Boxes + labels (person, car, dog)

Output for each detected object:

{
  "class": "person",
  "confidence": 0.95,
  "bbox": {"x": 120, "y": 80, "width": 150, "height": 320}
}
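
A detection like this can be drawn straight onto the image. A minimal sketch with PIL (the file names and detection dict are assumed):

from PIL import Image, ImageDraw

det = {"class": "person", "confidence": 0.95,
       "bbox": {"x": 120, "y": 80, "width": 150, "height": 320}}

img = Image.open("street_scene.jpg")
draw = ImageDraw.Draw(img)
b = det["bbox"]
# Rectangle takes (left, top, right, bottom) in pixel coordinates
draw.rectangle([b["x"], b["y"], b["x"] + b["width"], b["y"] + b["height"]],
               outline="red", width=3)
draw.text((b["x"], b["y"] - 12), f'{det["class"]} {det["confidence"]:.0%}', fill="red")
img.save("annotated.jpg")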

CV Task: Segmentation

Classifying pixels into categories

Semantic Segmentation

  • All 'road' pixels same color
  • All 'car' pixels same color
  • No instance distinction

Output: "These are car pixels"

Instance Segmentation

  • Car #1 different color
  • Car #2 different color
  • Each object unique

Output: "This is Car 1, that is Car 2"

Use cases: Medical imaging (semantic), autonomous driving (instance)
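
At the data level, the outputs are simple arrays. A toy sketch (the class IDs are made up):

import numpy as np

# Semantic mask: one class ID per pixel (0=background, 1=road, 2=car)
semantic = np.array([[1, 1, 0, 0],
                     [1, 2, 2, 0],
                     [1, 2, 2, 0]])

# Instance segmentation adds one boolean mask per object
car_1 = (semantic == 2)                  # here: the single car's pixels
print(semantic.shape, int(car_1.sum()))  # (3, 4) 4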

CV Tasks Comparison

Task                  | What it tells you                  | Output
----------------------|------------------------------------|---------------------
Classification        | "There's a cat in this image"      | Single label
Object Detection      | "There are 3 cats, here are boxes" | Boxes + labels
Semantic Segmentation | "These pixels are cat pixels"      | Pixel mask
Instance Segmentation | "Cat 1 is here, Cat 2 is there"    | Per-instance masks
Pose Estimation       | "The person's arms are raised"     | Keypoint coordinates
OCR                   | "The sign says 'STOP'"             | Text string

CV Task: Pose Estimation

Detecting key points on objects to understand posture and movement

Pose estimation detects anatomical keypoints (joints, facial features) to understand body position and movement

Human Keypoints

  • Head, shoulders, elbows
  • Wrists, hips, knees, ankles
  • 17+ keypoints tracked

Applications

  • Sports analytics
  • Physical therapy
  • Gaming, fitness apps
  • Gesture control

Real-time capability: Modern models can track poses at 30+ FPS
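
A quick sketch with Ultralytics' pose variant of YOLOv8 (the same library as the detection demo later; "athlete.jpg" is an assumed local file):

from ultralytics import YOLO

# YOLOv8 pose model: detects people plus their 17 COCO keypoints
model = YOLO("yolov8n-pose.pt")
results = model("athlete.jpg")

for result in results:
    # keypoints.xy has shape (num_people, 17, 2): an x,y pair per joint
    for person in result.keypoints.xy:
        print(f"{len(person)} keypoints, nose at {person[0].tolist()}")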

CV Task: Optical Character Recognition

Extracting text from images

Image (with text) → Detect (find text regions) → Recognize (read characters) → Post-process (clean) → Digital text

Challenges:

  • Handwriting vs. printed text
  • Multiple languages and scripts
  • Curved, rotated, or distorted text
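
For printed text, the open-source Tesseract engine is a common baseline. A minimal sketch via the pytesseract wrapper (assumes Tesseract is installed and "sign.jpg" exists):

import pytesseract
from PIL import Image

# Detection, recognition, and post-processing in a single call
text = pytesseract.image_to_string(Image.open("sign.jpg"))
print(text)  # e.g., "STOP"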

CV Task: Face Analysis

Specialized detection and analysis of human faces

Detection (find faces) → Landmarks (key points) → Analysis (attributes)

Task                 | Description                | Use Case
---------------------|----------------------------|-----------------------
Verification (1:1)   | Is this the same person?   | Phone unlock
Identification (1:N) | Who is this person?        | Photo tagging
Attribute Detection  | Age, emotion, accessories  | Demographics analysis
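
The detection step alone can be done with classical tools. A sketch using OpenCV's bundled Haar cascade, the Viola-Jones approach covered under classical CV below ("group_photo.jpg" is assumed):

import cv2

# Classical Haar-cascade face detector (the cascade file ships with OpenCV)
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("group_photo.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:  # one bounding box per detected face
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)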

Evolution of Computer Vision

From hand-crafted features to learned representations

1. 1960s-1980s: edge detection, pattern recognition
2. 1990s-2000s: SIFT/SURF features, Haar cascades
3. 2012-2015: AlexNet revolution, ResNet (152 layers)
4. 2020+: Vision Transformers, foundation models

Key insight: Each era brought step-change improvements in accuracy and capability

Classical Computer Vision (Pre-2012)

Before deep learning, CV relied on hand-crafted features

Raw Image (input) → Feature Extractor (hand-crafted) → Feature Vector (numbers) → ML Classifier (SVM, etc.) → Prediction (output)

Common Techniques:

  • SIFT/SURF: Scale/rotation invariant keypoints
  • HOG: Histogram of Oriented Gradients
  • Haar Cascades: Face detection (Viola-Jones)
  • Edge Detection: Canny, Sobel operators

Limitation: Required domain expertise to design features for each task
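
These classical operators are still one line away in OpenCV. A quick sketch of Canny edge detection ("photo.jpg" is an assumed local file):

import cv2

# Load as grayscale and run the Canny edge detector
img = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(img, threshold1=100, threshold2=200)
cv2.imwrite("edges.jpg", edges)  # white edges on a black background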

The Deep Learning Revolution (2012)

AlexNet's victory in ImageNet 2012 changed everything

Before 2012

  • Hand-crafted features
  • 26.2% error rate
  • Limited to simple tasks

After 2012

  • Learned features automatically
  • 16.4% error rate (a ~37% relative reduction)
  • Scalable to complex tasks

Key Enablers:

1. GPU Computing: parallel processing made training feasible
2. Large Datasets: ImageNet with 14M+ labeled images
3. Algorithmic Innovations: ReLU, Dropout, Batch Normalization

Convolutional Neural Networks (CNNs)

The architecture that powers modern computer vision

Input (224×224) → Conv+ReLU (learn) → Pool (reduce) → Conv+ReLU (learn) → Flatten (vector) → Dense (classify)

Why CNNs Work:

  • Local Connectivity: nearby pixels are related
  • Parameter Sharing: the same filter is applied across the entire image
  • Translation Invariance: a cat is a cat regardless of its position
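
A minimal PyTorch sketch of the pipeline above (the layer sizes are illustrative, not a production architecture):

import torch
import torch.nn as nn

# Input (224×224) → Conv+ReLU → Pool → Conv+ReLU → Flatten → Dense
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # learn local filters
    nn.ReLU(),
    nn.MaxPool2d(2),                             # reduce 224 → 112
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # reduce 112 → 56
    nn.Flatten(),                                # to one feature vector
    nn.Linear(32 * 56 * 56, 10),                 # classify into 10 classes
)

x = torch.randn(1, 3, 224, 224)  # a batch of one RGB image
print(model(x).shape)            # torch.Size([1, 10])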

Hierarchical Feature Learning

CNNs learn progressively complex features

1. Early Layers: edges, corners, simple textures
2. Middle Layers: parts, patterns, structures
3. Deep Layers: objects, faces, categories

Example: Face Detection

Layer 1: Edges → Layer 2: Eyes, Nose → Layer 3: Complete Face

Key insight: The network automatically discovers the best features for the task

Key Deep Learning Architectures

Architecture | Year | Key Innovation          | Impact
-------------|------|-------------------------|--------------------
AlexNet      | 2012 | Deep CNNs + GPU         | Started revolution
VGG          | 2014 | Deeper with 3×3 filters | Simplicity
ResNet       | 2015 | Skip connections        | Very deep networks
YOLO         | 2016 | Single-shot detection   | Real-time (45 FPS)
EfficientNet | 2019 | Compound scaling        | Accuracy/efficiency
ViT          | 2020 | Vision Transformers     | Attention mechanism

Trend: Models are becoming more accurate, efficient, and specialized

Vision Transformers (ViT) - 2020

Applying the Transformer architecture from NLP to computer vision

Image (input) → Patches (split) → Linear Projection (embed) → + Positional Encoding → Transformer (attend) → Classifier (output)

Advantages:

  • Global Context: each patch attends to every other patch
  • Scalable: performance improves with more data and compute
  • Transferable: pre-trained models work across tasks
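
The "split into patches" step is plain tensor reshaping. A sketch of turning one 224×224 image into 196 patch tokens (16×16 patches, as in the original ViT):

import torch

img = torch.randn(1, 3, 224, 224)  # one RGB image

# Cut into non-overlapping 16×16 patches: 14 × 14 = 196 of them
patches = img.unfold(2, 16, 16).unfold(3, 16, 16)  # (1, 3, 14, 14, 16, 16)
tokens = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 196, 3 * 16 * 16)
print(tokens.shape)  # (1, 196, 768): 196 tokens, each a flattened patch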

Foundation Models Era (2021-Present)

Large pre-trained models that work across many tasks

Foundation Models: Large-scale models trained on diverse data that can be adapted to many downstream tasks with minimal fine-tuning

Model           | Year | Capability
----------------|------|-------------------------------------------
CLIP (OpenAI)   | 2021 | Vision-language, zero-shot classification
DALL-E (OpenAI) | 2021 | Text-to-image generation
SAM (Meta)      | 2023 | Zero-shot segmentation of any object
GPT-4 Vision    | 2023 | General visual understanding
Claude Vision   | 2024 | Multimodal reasoning, documents

Key shift: From task-specific models to general-purpose visual AI

Zero-Shot and Few-Shot Learning

Foundation models can perform tasks they weren't explicitly trained for

Traditional ML: Collect → Label → Train → Use

Zero-Shot: Describe → Use

Zero-Shot Classification with CLIP:

import torch
import clip
from PIL import Image

# Load pre-trained CLIP and its matching image preprocessing
model, preprocess = clip.load("ViT-B/32", device="cpu")

# No training on these specific classes!
classes = ["a photo of a dog", "a photo of a cat", "a photo of a bird"]
image = preprocess(Image.open("pet.jpg")).unsqueeze(0)  # example image path
text = clip.tokenize(classes)

# CLIP compares image embedding with text embeddings
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Cosine similarity determines classification
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarities = (100.0 * image_features @ text_features.T).softmax(dim=-1)
prediction = classes[similarities.argmax().item()]
# Result: "a photo of a dog" with 94% confidence

Part 2: Live Demonstrations

Seeing CV in action with real code examples

1. Image as Data: exploring pixel arrays and tensors
2. Classification: pre-trained ResNet on ImageNet
3. Object Detection: YOLO for real-time detection
4. Vision LLMs: Claude Vision for complex reasoning

Demo 1: Understanding Images as Data

import numpy as np
from PIL import Image

# Load and explore an image
img = Image.open("sample.jpg")
img_array = np.array(img)

# Examine the data structure
print(f"Shape: {img_array.shape}")       # (height, width, channels)
print(f"Data type: {img_array.dtype}")   # uint8 (0-255)
print(f"Min value: {img_array.min()}")   # 0 (black)
print(f"Max value: {img_array.max()}")   # 255 (white)
print(f"Total pixels: {img_array.size}") # h × w × c

# Access individual pixels
pixel = img_array[100, 150]  # Row 100, Column 150
print(f"Pixel RGB: {pixel}")  # e.g., [128, 64, 200]

Key insight: Images are just 3D arrays of numbers that we can manipulate mathematically

Demo 2: Pre-trained Classification

import torch
from torchvision import models, transforms
from PIL import Image

# Load pre-trained ResNet-50 (ImageNet: 1000 classes)
weights = models.ResNet50_Weights.IMAGENET1K_V2
model = models.resnet50(weights=weights)
model.eval()

# Standard ImageNet preprocessing
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

# Inference
img = Image.open("golden_retriever.jpg").convert("RGB")
input_tensor = preprocess(img).unsqueeze(0)
with torch.no_grad():
    probs = model(input_tensor).softmax(dim=1)[0]

# Decode the top prediction using the class names bundled with the weights
top = probs.argmax().item()
print(f"{weights.meta['categories'][top]}: {probs[top].item():.1%}")
# e.g., "golden retriever" with 92% confidence

Demo 3: Object Detection with YOLO

from ultralytics import YOLO

# Load YOLOv8 (You Only Look Once)
model = YOLO('yolov8n.pt')  # 'n' = nano, fast

# Run inference
results = model('street_scene.jpg')

# Process detections
for result in results:
    for box in result.boxes:
        class_id = int(box.cls[0])
        class_name = model.names[class_id]
        confidence = float(box.conf[0])
        x1, y1, x2, y2 = box.xyxy[0].tolist()

        print(f"Detected: {class_name} ({confidence:.1%})")
        print(f"  Location: ({x1:.0f}, {y1:.0f}) to ({x2:.0f}, {y2:.0f})")

# YOLO detects 80+ classes: person, car, bicycle, dog...

Performance: YOLOv8n runs at 45+ FPS on modern GPUs

Demo 4: Vision-Language Understanding

import anthropic
import base64

def analyze_image(image_path: str, question: str) -> str:
    """Use Claude Vision to understand images."""
    client = anthropic.Anthropic()

    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode()

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64", "media_type": "image/jpeg",
                    "data": image_data}},
                {"type": "text", "text": question}
            ]
        }]
    )
    return response.content[0].text

# Example: Complex reasoning
result = analyze_image("shelf.jpg", "What products are low on stock?")

Comparing CV Approaches

Aspect      | Classical CV     | Deep Learning     | Foundation Models
------------|------------------|-------------------|----------------------
Features    | Hand-crafted     | Learned from data | Pre-learned, general
Data Needed | Small            | Large (thousands) | Zero to few
Expertise   | Domain + CV      | ML engineering    | Prompt engineering
Flexibility | Task-specific    | Retrainable       | Highly general
Compute     | Low              | High (training)   | Medium (inference)
Best For    | Constrained envs | Custom accuracy   | Rapid prototyping

Business impact: Rapid prototyping without collecting training data

Self-Practice Assignment 1

Duration: 1 hour | Deadline: Before Session 2

Task: Computer Vision in the Wild

Part 1 (30 min): Observation. Identify 5 real-world applications of CV you encounter this week.

Part 2 (30 min): Reflection. Write a 300-word reflection on one impressive application.

For Each Application, Note:

  • Where you found it
  • What CV task it performs
  • The business value it provides
  • Any limitations observed

Deliverable

Short report (1-2 pages) with observations and reflection

Session 1 Summary

1. Images as Data: pixels, tensors, RGB channels, resolution
2. CV Tasks: classification, detection, segmentation, OCR, pose
3. Evolution: classical CV → deep learning (2012) → foundation models (2021)
4. Approaches: cloud APIs, custom models, vision LLMs

Key Takeaways

1. Computer vision enables machines to extract meaning from visual data
2. Images are matrices of pixel values organized as tensors
3. The deep learning revolution (2012) shifted CV from hand-crafted to learned features
4. Foundation models enable zero-shot capabilities without task-specific training

Next Session: Business Applications & Use Cases - Identifying opportunities and ROI

Resources

Books

  • "Deep Learning for Vision Systems" - Mohamed Elgendy
  • "Computer Vision: Algorithms and Applications" - Szeliski (free online)

Online Courses

  • Stanford CS231n: Convolutional Neural Networks
  • Fast.ai Practical Deep Learning for Coders

Libraries

  • PyTorch / TensorFlow: Deep learning frameworks
  • OpenCV: Classical computer vision
  • Ultralytics (YOLO): Object detection
  • Hugging Face: Pre-trained models

Questions?

Let's discuss Computer Vision

Next: Session 2 - Business Applications & Use Cases
