Computer Vision for Business

Session 3: Hands-on Cloud Vision APIs

Building CV applications with cloud APIs from Google, AWS, Azure, and Anthropic

Session 3: Learning Objectives

Today's goals (3 hours)

Use Cloud APIs

Effectively leverage cloud-based CV services

Compare Providers

Understand offerings from major providers

Build Applications

Create functional CV-powered tools

Session Structure: Overview (45min) + Hands-on Labs (1h45) + Optimization (30min)

Session 3 Roadmap

Our journey today

1. Cloud API Overview (30min)
2. Provider Comparison (15min)
3. Lab 1: OCR (25min)
4. Lab 2: Products (30min)
5. Lab 3: Scenes (30min)
6. Lab 4: Full App (20min)
7. Cost Optimization (30min)

Why Use Cloud Vision APIs?

Benefits over building from scratch

1 No ML Expertise

Pre-trained models, production-ready from day one

No need to hire data scientists or ML engineers

2 Fast Time-to-Market

Minutes to integrate, instant scaling

Focus on business logic, not infrastructure

3 Cost Efficiency

Pay per use, no infrastructure costs

Start small, scale automatically

4 Enterprise Ready

SLAs, security certifications

Compliance and support included

How Cloud Vision APIs Work

The request-response cycle

1. Your App: load & encode the image
2. API Request: HTTP POST
3. AI Model: process the image
4. Analysis: extract features
5. Results: generate analysis
6. API Reply: JSON response
7. Your App: parse & use the results
Key Insight: All cloud vision APIs follow this pattern - differences are in pricing, features, and model quality
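
To make the cycle concrete, here is a minimal sketch of steps 1-7 as a raw HTTP call against Anthropic's Messages API (the SDK used in today's labs wraps exactly this request). It assumes an ANTHROPIC_API_KEY environment variable and a local receipt.jpg.

import base64
import os
import requests

# Steps 1-2: load, encode, and POST the image
with open("receipt.jpg", "rb") as f:
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    },
    json={
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 512,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": "image/jpeg",
                    "data": image_b64,
                }},
                {"type": "text",
                 "text": "Describe this image in one sentence."},
            ],
        }],
    },
)

# Steps 6-7: parse the JSON reply
print(response.json()["content"][0]["text"])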

The Cloud Vision Landscape

Major players in the ecosystem

Traditional CV APIs

Specialized, predefined tasks with structured outputs

Google Cloud Vision Amazon Rekognition Azure Computer Vision

Multimodal LLMs

Open-ended reasoning, natural language interaction

Claude Vision GPT-4 Vision Google Gemini

Specialized Services

Purpose-built for specific use cases

Document AI Face APIs AutoML / Custom Labels

Google Cloud Vision

Comprehensive traditional CV API

Strengths

  • OCR: Industry-leading text extraction
  • Labels: 10,000+ object categories
  • Web Entities: Find similar images online
  • Document AI: Structured extraction

Best For

  • Document processing
  • Content moderation
  • Landmark recognition
  • Product search

Pricing: ~$1.50/1,000 images
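
A minimal OCR sketch with the google-cloud-vision client library, assuming pip install google-cloud-vision and configured Google Cloud credentials (e.g. GOOGLE_APPLICATION_CREDENTIALS):

from google.cloud import vision

def google_ocr(image_path: str) -> str:
    """Extract text with Google Cloud Vision's text detection."""
    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.text_detection(image=image)
    # The first annotation holds the full extracted text
    if response.text_annotations:
        return response.text_annotations[0].description
    return ""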

Amazon Rekognition

AWS's computer vision service

Strengths

  • Face Analysis: Emotions, age, attributes
  • Celebrity Recognition: 100K+ celebrities
  • Video Analysis: Real-time streaming
  • Custom Labels: Train your own models

Best For

  • Security & surveillance
  • Media & entertainment
  • User verification
  • Video content analysis

Pricing: Tier-based, volume discounts
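
A minimal face-analysis sketch with boto3, assuming pip install boto3 and configured AWS credentials:

import boto3

def rekognition_faces(image_path: str) -> list:
    """Detect faces and their attributes with Amazon Rekognition."""
    client = boto3.client("rekognition")
    with open(image_path, "rb") as f:
        response = client.detect_faces(
            Image={"Bytes": f.read()},
            Attributes=["ALL"],  # emotions, age range, etc.
        )
    return response["FaceDetails"]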

Azure Computer Vision

Microsoft's vision intelligence

Strengths

  • Dense Captioning: Detailed descriptions
  • Spatial Analysis: People counting, zones
  • Read API: Multi-language OCR
  • Image Analysis 4.0: Florence foundation model

Best For

  • Accessibility (alt text)
  • Retail analytics
  • Multi-language documents
  • Enterprise integration

Pricing: Competitive, good free tier
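
A minimal captioning sketch against the Azure Computer Vision v3.2 REST API; AZURE_CV_ENDPOINT and AZURE_CV_KEY are assumed environment variables holding your resource's endpoint and subscription key:

import os
import requests

def azure_caption(image_path: str) -> str:
    """Get an image caption from Azure Computer Vision."""
    endpoint = os.environ["AZURE_CV_ENDPOINT"]
    key = os.environ["AZURE_CV_KEY"]
    with open(image_path, "rb") as f:
        response = requests.post(
            f"{endpoint}/vision/v3.2/analyze",
            params={"visualFeatures": "Description"},
            headers={
                "Ocp-Apim-Subscription-Key": key,
                "Content-Type": "application/octet-stream",
            },
            data=f.read(),
        )
    response.raise_for_status()
    captions = response.json()["description"]["captions"]
    return captions[0]["text"] if captions else ""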

Multimodal LLMs for Vision

The new generation: Claude, GPT-4V, Gemini

Flexible Input

Any image type Natural language prompt Context & examples

Advanced Reasoning

Understands context Follows instructions Multi-step reasoning

Structured Output

JSON on request Natural language Code generation
Key Advantage: No predefined tasks - ask anything about images!

Provider Comparison

Side-by-side comparison

Provider | Type | Best Features | Pricing Model
Google Cloud Vision | Traditional | OCR, labels, web entities | Per image
Amazon Rekognition | Traditional | Face, video, custom labels | Tiered
Azure CV | Traditional | Captioning, spatial, reading | Per transaction
Claude/GPT-4V/Gemini | Multimodal LLM | Open-ended reasoning | Per token

Choosing the Right Service

Match your needs to the right API

Document/OCR Processing

Recommended: Google Cloud Vision, Azure Read API

Best for receipts, invoices, forms, ID cards

Face Analysis & Verification

Recommended: Amazon Rekognition

Best for security, user verification, demographics

Open-ended Analysis

Recommended: Claude Vision, GPT-4 Vision

Best for complex reasoning, custom tasks, products

Video & Streaming

Recommended: Amazon Rekognition Video

Best for surveillance, live streams, media analysis

Use Case to Service Mapping

Real-world applications matched to APIs

Use Case | Recommended Service
Receipt/Invoice OCR | Google Document AI, Azure Form Recognizer (now Document Intelligence)
Face verification | Amazon Rekognition Face
Product analysis | Claude Vision, GPT-4 Vision
Content moderation | Google SafeSearch, Amazon Rekognition Moderation
Retail foot traffic | Azure Spatial Analysis
Complex reasoning | Claude Vision, Gemini Pro Vision

Hands-on Labs Overview

What we'll build today

1 Lab 1: OCR

Receipt Processing → Extract text → Parse structure

2 Lab 2: Products

Product Cataloging → Analyze attributes → Generate metadata

3 Lab 3: Scenes

Retail Analysis → Understand context → Business insights

4 Lab 4: Full App

ExpenseTracker → Combine skills → Production code

Lab Setup

Environment preparation

# Install required packages
# pip install anthropic pillow requests

import os
from pathlib import Path
import base64

# Create working directory
WORK_DIR = Path("cv_workshop")
WORK_DIR.mkdir(exist_ok=True)

# Sample images for testing
SAMPLE_IMAGES = [
    "receipt.jpg",      # For OCR
    "products.jpg",     # For object detection
    "storefront.jpg",   # For scene analysis
    "document.pdf"      # For document processing
]

Image Encoding Helper

Converting images to Base64 for APIs

Image File (JPG/PNG/WebP) → Read Binary ("rb" mode) → Base64 Encode (standard_b64encode) → UTF-8 String (ready for API)

def encode_image(image_path: str) -> str:
    """Encode image to base64 string for API calls."""
    with open(image_path, "rb") as image_file:
        return base64.standard_b64encode(
            image_file.read()
        ).decode("utf-8")

# Usage
image_data = encode_image("receipt.jpg")
# Returns: "/9j/4AAQSkZJRg..." (base64 string)

Lab 1: Document Processing with OCR

Extract structured data from receipts

Receipt Image (input) → Claude Vision (API call) → Analyze & Extract → Structured JSON (output)

Extracts

  • Vendor Name
  • Date
  • Line Items
  • Total/Tax
Why Claude for OCR? Handles messy layouts, understands context, outputs clean JSON

Lab 1: OCR Function Setup

import anthropic
from pathlib import Path

def extract_receipt_data(image_path: str) -> str:
    """Extract structured receipt data; returns Claude's JSON string."""
    client = anthropic.Anthropic()

    # Determine media type from extension
    ext = Path(image_path).suffix.lower()
    media_types = {
        '.jpg': 'image/jpeg',
        '.jpeg': 'image/jpeg',
        '.png': 'image/png',
        '.webp': 'image/webp'
    }
    media_type = media_types.get(ext, 'image/jpeg')

    # Encode image to base64
    image_data = encode_image(image_path)

Lab 1: Making the API Call

    # Call Claude Vision API
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": "..."  # Prompt on next slide
                }
            ]
        }]
    )

    return response.content[0].text

Lab 1: The OCR Prompt

Structured prompts get structured results

# The prompt that extracts structured data
prompt = """
Analyze this receipt and extract as JSON:
{
    "vendor_name": "",
    "date": "",
    "items": [
        {"name": "", "price": 0.00}
    ],
    "subtotal": 0.00,
    "tax": 0.00,
    "total": 0.00
}
Return only valid JSON, no additional text.
If any field is unclear, use null.
"""
Pro Tip: Show the exact JSON schema you want - Claude follows it precisely

Lab 1: Sample Output

What Claude returns from a restaurant receipt

{
    "vendor_name": "Café de Flore",
    "date": "2024-03-15",
    "items": [
        {"name": "Espresso", "price": 4.50},
        {"name": "Croissant", "price": 3.20},
        {"name": "Orange Juice", "price": 5.00}
    ],
    "subtotal": 12.70,
    "tax": 1.27,
    "total": 13.97
}

Lab 2: Product Analysis & Cataloging

Automate e-commerce product metadata

Product Photo (input) → Claude Vision (analyze) → Extract Data → Listing Ready (output)

Category Classification

Automatic product categorization

Attribute Extraction

Colors, materials, sizes

Marketing Copy

Compelling descriptions

SEO Keywords

Search optimization

Lab 2: Product Analysis Code

def analyze_product(image_path: str) -> str:
    """Analyze a product image for e-commerce cataloging."""
    client = anthropic.Anthropic()
    image_data = encode_image(image_path)

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1500,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": "image/jpeg",
                    "data": image_data
                }},
                {"type": "text", "text": """..."""}  # Next slide
            ]
        }]
    )
    return response.content[0].text

Lab 2: Product Analysis Prompt

prompt = """
Analyze this product for e-commerce listing.
Return JSON with:

{
    "category": "Main category",
    "subcategory": "Specific subcategory",
    "attributes": {
        "colors": [],
        "materials": [],
        "size_estimate": ""
    },
    "marketing_title": "Max 10 words, compelling",
    "description": "2-3 sentences, benefit-focused",
    "search_keywords": ["10", "relevant", "keywords"]
}

Be accurate about visible attributes.
Don't guess what you can't see clearly.
"""

Lab 2: Sample Product Output

{
    "category": "Fashion",
    "subcategory": "Women's Handbags",
    "attributes": {
        "colors": ["cognac brown", "gold accents"],
        "materials": ["leather", "metal hardware"],
        "size_estimate": "medium, ~30cm width"
    },
    "marketing_title": "Elegant Cognac Leather Tote with Gold Hardware",
    "description": "A sophisticated everyday tote crafted from
    premium cognac leather. Features secure zip closure and
    elegant gold-tone hardware for timeless style.",
    "search_keywords": ["leather tote", "brown handbag",
    "cognac bag", "gold hardware", "women's purse", ...]
}

Lab 2: Batch Product Processing

Process entire catalogs efficiently

Product Folder (input) → For Each Image (loop) → Analyze (API call) → Complete Catalog (output)

def process_product_catalog(image_folder: str) -> list:
    """Process all product images in a folder."""
    results = []
    image_extensions = {'.jpg', '.jpeg', '.png', '.webp'}

    for image_path in Path(image_folder).iterdir():
        if image_path.suffix.lower() in image_extensions:
            print(f"Processing: {image_path.name}")
            analysis = analyze_product(str(image_path))
            results.append({
                "filename": image_path.name,
                "analysis": analysis
            })
    return results
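
One way to run it over a folder and persist the catalog (the folder and output file names here are illustrative):

import json
from pathlib import Path

catalog = process_product_catalog("product_images")
Path("catalog.json").write_text(json.dumps(catalog, indent=2))
print(f"Cataloged {len(catalog)} products")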

Lab 3: Scene Understanding

Analyze retail environments for business insights

Environment Analysis

Store type, size, layout, ambiance

Customer Insights

Count, demographics, behaviors

Merchandising Review

Displays, signage, product visibility

Operations Assessment

Cleanliness, safety, staff, queues

Lab 3: Scene Analysis Code

def analyze_retail_scene(image_path: str) -> str:
    """Analyze a retail environment for insights."""
    client = anthropic.Anthropic()
    image_data = encode_image(image_path)

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": "image/jpeg",
                    "data": image_data
                }},
                {"type": "text", "text": """..."""}
            ]
        }]
    )
    return response.content[0].text

Lab 3: Scene Analysis Prompt

prompt = """
Analyze this retail environment comprehensively:

1. SCENE: Store type, estimated size, time of day,
   overall ambiance

2. CUSTOMERS: Approximate count, visible demographics,
   current activities and behaviors

3. MERCHANDISING: Display quality, signage effectiveness,
   product visibility, promotional materials

4. OPERATIONS: Cleanliness score (1-10), safety hazards,
   staff visibility, queue management

5. RECOMMENDATIONS: List 3 specific, actionable
   improvements with expected business impact

Be specific, quantitative where possible, and
business-focused in your analysis.
"""

Lab 3: Business Applications

How retail scene analysis drives value

Store Audits

  • Compliance checking
  • Brand standards
  • Mystery shopping

Operations

  • Queue monitoring
  • Staff allocation
  • Peak hour analysis

Marketing

  • Display effectiveness
  • Signage visibility
  • Promotion impact

Safety

  • Hazard detection
  • Crowd density
  • Emergency paths

Lab 4: Building a Complete Application

ExpenseTracker: CV-powered expense management

Receipt Photos (input) → OCR Extraction (process) → Auto-Categorization (classify) → Data Validation (verify) → Expense Database (store) → Reports & Analytics (output)

Lab 4: ExpenseTracker Architecture

Class-based design for maintainability

ExpenseTracker Class

Properties: expenses: List, client: Anthropic

Public: add_receipt(), get_summary()

Private: _extract_receipt(), _categorize()

Expense Data Structure

Fields: id, timestamp, image_path

Data: category, extracted data dict

Lab 4: ExpenseTracker Class

import json
from datetime import datetime

class ExpenseTracker:
    """Expense tracking using computer vision."""

    def __init__(self):
        self.expenses = []
        self.client = anthropic.Anthropic()

    def add_receipt(self, image_path: str,
                    category: str | None = None) -> dict:
        """Process a receipt and add to expenses."""
        # Extract data using CV
        extracted = self._extract_receipt(image_path)

        # Parse JSON response
        try:
            receipt_data = json.loads(extracted)
        except json.JSONDecodeError:
            receipt_data = {"raw": extracted, "error": True}

Lab 4: Adding Receipts

        # Create expense record
        expense = {
            "id": len(self.expenses) + 1,
            "timestamp": datetime.now().isoformat(),
            "image_path": image_path,
            "category": category or self._categorize(receipt_data),
            "data": receipt_data
        }
        self.expenses.append(expense)
        return expense

    def _extract_receipt(self, image_path: str) -> str:
        """Use CV to extract receipt data."""
        # Reuse our extract_receipt_data function
        return extract_receipt_data(image_path)

Lab 4: Auto-Categorization

Keyword-based category assignment

    def _categorize(self, data: dict) -> str:
        """Auto-categorize based on vendor name."""
        vendor = data.get("vendor_name", "").lower()

        categories = {
            "restaurant": ["restaurant", "cafe", "pizza",
                          "burger", "sushi", "bistro"],
            "grocery": ["supermarket", "grocery", "carrefour",
                       "auchan", "monoprix"],
            "transport": ["uber", "taxi", "sncf", "ratp"],
            "office": ["staples", "office", "amazon"],
        }

        for cat, keywords in categories.items():
            if any(kw in vendor for kw in keywords):
                return cat
        return "other"

Lab 4: Generating Summaries

    def get_summary(self) -> dict:
        """Generate expense summary by category."""
        by_category = {}

        for expense in self.expenses:
            cat = expense["category"]
            total = expense["data"].get("total", 0) or 0

            if cat not in by_category:
                by_category[cat] = {"count": 0, "total": 0}

            by_category[cat]["count"] += 1
            by_category[cat]["total"] += float(total)

        return {
            "total_expenses": len(self.expenses),
            "by_category": by_category,
            "grand_total": sum(
                c["total"] for c in by_category.values()
            )
        }

Lab 4: Using ExpenseTracker

# Initialize tracker
tracker = ExpenseTracker()

# Add receipts
tracker.add_receipt("lunch_receipt.jpg")
tracker.add_receipt("office_supplies.jpg")
tracker.add_receipt("taxi_receipt.jpg")

# Get summary
summary = tracker.get_summary()
print(f"Total expenses: {summary['total_expenses']}")
print(f"Grand total: €{summary['grand_total']:.2f}")
print("By category:")
for cat, stats in summary["by_category"].items():
    print(f"  {cat}: {stats['count']} expense, €{stats['total']:.2f}")

# Output:
# Total expenses: 3
# Grand total: €127.45
# By category:
#   restaurant: 1 expense, €23.50
#   office: 1 expense, €89.95
#   transport: 1 expense, €14.00

API Performance Comparison

Measuring and comparing API performance

Test Images (input) → APIs 1/2/3 (parallel test) → Measure (metrics) → Best Choice (decision)

Latency

Response time per image

Accuracy

Correctness of results

Cost

Total cost at expected volume

API Benchmarking Framework

import time
from typing import Callable

def benchmark_api(
    api_function: Callable,
    test_images: list,
    runs: int = 3
) -> dict:
    """Benchmark an API across multiple images."""
    results = {
        "timings": [], "successes": 0, "failures": 0
    }

    for image in test_images:
        for _ in range(runs):
            start = time.time()
            try:
                api_function(image)
                results["successes"] += 1
            except Exception:
                results["failures"] += 1
            results["timings"].append(time.time() - start)

    results["avg_time"] = sum(results["timings"]) / len(results["timings"])
    return results
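
For example, to benchmark the Lab 1 extractor (receipt2.jpg is a hypothetical second sample image):

stats = benchmark_api(extract_receipt_data,
                      ["receipt.jpg", "receipt2.jpg"])
print(f"Avg: {stats['avg_time']:.2f}s | "
      f"{stats['successes']} ok, {stats['failures']} failed")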

Cost Optimization Strategies

Reducing API costs at scale

Raw Image (input) → Preprocess (optimize) → Cache Check (hit/miss) → API Call (if needed)

Resize & Compress

Reduce file size

Crop ROI

Focus on relevant areas

Return Cached

Skip duplicate calls

Store Result

Cache for future use

Cost Optimization Techniques

Practical strategies for production

1 Image Preprocessing

Resize to optimal dimensions - many APIs charge by size

Can reduce costs by 50-90% with minimal quality loss

2 Result Caching

Hash images, store results - avoid duplicate processing

Especially valuable for user-uploaded duplicate content

3 Batch Processing

Use batch endpoints for volume discounts when available (see the sketch after this list)

Process during off-peak hours for lower rates
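
As one concrete option, recent versions of the anthropic SDK expose a Message Batches API that processes requests asynchronously at a discount. A rough sketch - parameter shapes may differ by SDK version, so check the current docs:

import anthropic

client = anthropic.Anthropic()

# Submit one request per sample image (prompt elided here -
# reuse the image + text content blocks from Lab 1)
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"img-{i}",
            "params": {
                "model": "claude-sonnet-4-20250514",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": "..."}],
            },
        }
        for i in range(len(SAMPLE_IMAGES))
    ]
)
# Poll batch.processing_status until "ended", then fetch results
print(batch.id, batch.processing_status)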

Tiered Processing Strategy

Use cheap APIs for filtering, premium for analysis

All Images (100%) → Tier 1: Fast Filter ($0.001/image) → Quality Check (pass/fail) → Tier 2: Analysis ($0.01/image)
Example: Use basic classification to filter out 80% of images before expensive multimodal analysis
Approach | Cost per 1,000 images
All images through Tier 2 | $10.00
Tiered processing (20% pass to Tier 2) | $3.00
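
A sketch of the routing logic; cheap_filter and premium_analyze are hypothetical stand-ins for your Tier 1 and Tier 2 services:

def tiered_analysis(image_paths: list) -> list:
    """Send only images that pass the cheap filter to the expensive tier."""
    results = []
    for path in image_paths:
        # Tier 1 (~$0.001/image): hypothetical fast classifier
        if not cheap_filter(path):
            continue  # ~80% of images stop here
        # Tier 2 (~$0.01/image): hypothetical premium analysis
        results.append(premium_analyze(path))
    return results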

Image Preprocessing for Cost Savings

from PIL import Image
import io

def optimize_image(image_path: str,
                   max_size: int = 1024,
                   quality: int = 85) -> bytes:
    """Optimize image size before API call."""
    with Image.open(image_path) as img:
        # Resize if too large
        if max(img.size) > max_size:
            ratio = max_size / max(img.size)
            new_size = tuple(int(d * ratio) for d in img.size)
            img = img.resize(new_size, Image.LANCZOS)

        # Convert to RGB if needed
        if img.mode in ('RGBA', 'P'):
            img = img.convert('RGB')

        # Save to bytes with compression
        buffer = io.BytesIO()
        img.save(buffer, format='JPEG', quality=quality)
        return buffer.getvalue()

# Can reduce file size by 70-90% while maintaining quality
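
Since encode_image reads from disk, base64-encode the optimized bytes directly before sending them to the API:

import base64

# Encode the optimized bytes (instead of the original file)
optimized_b64 = base64.standard_b64encode(
    optimize_image("receipt.jpg")
).decode("utf-8")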

Simple Result Caching

import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("api_cache")
CACHE_DIR.mkdir(exist_ok=True)

def get_cached_or_call(image_path: str, api_func) -> dict:
    """Check cache before calling API."""
    # Generate hash of image content
    with open(image_path, "rb") as f:
        image_hash = hashlib.md5(f.read()).hexdigest()

    cache_file = CACHE_DIR / f"{image_hash}.json"

    # Return cached if exists
    if cache_file.exists():
        return json.loads(cache_file.read_text())

    # Call API and cache result
    result = api_func(image_path)
    cache_file.write_text(json.dumps(result))
    return result

Robust Error Handling

Production-ready API calls

import time
from typing import Optional

import anthropic

def call_api_with_retry(
    func,
    max_retries: int = 3,
    base_delay: float = 1.0
) -> Optional[dict]:
    """Call API with exponential backoff retry."""
    for attempt in range(max_retries):
        try:
            return func()
        except anthropic.RateLimitError:
            delay = base_delay * (2 ** attempt)
            print(f"Rate limited. Waiting {delay}s...")
            time.sleep(delay)
        except anthropic.APIError as e:
            print(f"API error: {e}")
            if attempt == max_retries - 1:
                raise
    return None
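
For example, wrapping the Lab 1 extractor:

result = call_api_with_retry(
    lambda: extract_receipt_data("receipt.jpg")
)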

Security Best Practices

Protecting your API integration

Credentials Management

  • Environment variables (see the sketch below)
  • Secret managers (AWS Secrets, Azure Key Vault)
  • Never commit keys to version control
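
A minimal sketch of the environment-variable approach (the anthropic client also reads ANTHROPIC_API_KEY automatically if no key is passed):

import os
import anthropic

# The key lives in the environment, never in source control
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])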

Data Privacy

  • PII detection and redaction
  • Image anonymization before API calls
  • Data retention policies

Network Security

HTTPS only, VPC endpoints, IP allowlisting

Monitoring

Usage alerts, anomaly detection, audit logging

Self-Practice Assignment 3

Duration: 1.5 hours | Deadline: Before Session 4

1 Choose a Use Case (15 min)

  • Select a business problem from Session 2
  • Or propose your own application idea

2 Implementation (45 min)

  • Build a working prototype using cloud APIs
  • Process at least 5 sample images
  • Include basic error handling

3 Documentation (30 min)

  • Document code with comments
  • Write README explaining usage
  • Include sample outputs

Deliverable: GitHub repository or zip file with code and documentation

Assignment Project Ideas

Inspiration for your CV application

Business Card Scanner

Contact extraction, CRM integration

Menu Scanner

Price extraction, allergen detection

Plant Identifier

Species recognition, disease detection

Fashion Analyzer

Style classification, similar items search

Session 3 Summary

Key takeaways from today

Cloud APIs Are Powerful

No ML expertise needed, instant production-ready, multiple providers

Choose Wisely

Traditional vs Multimodal, match use case to API, consider cost structure

Build Smart

Structured prompts, error handling, caching & optimization

Practice Makes Perfect

Start simple, iterate quickly, document everything

Looking Ahead: Session 4

Custom Models & Transfer Learning

Session 3: Cloud APIs (today) → Session 4: Custom Models (next)

When to Build Your Own

Decision criteria for custom models

Transfer Learning

Leverage pre-trained models

PyTorch Training

Hands-on model training

AutoML Platforms

No-code alternatives

Next Session: When and how to build custom CV models for specialized needs
