Practical Work 3

Building with Cloud Vision APIs

Hands-on integration with Google Cloud Vision and Claude Vision

Duration: 2 hours
Difficulty: Intermediate
Session 3 - Cloud APIs

Objectives

By the end of this practical work, you will be able to:

  • Set up and authenticate with cloud vision API providers
  • Extract text from documents using OCR
  • Analyze images for objects and scenes
  • Use Claude Vision for complex visual reasoning
  • Compare results and choose the right API for different tasks

Prerequisites

  • Python 3.9+ installed (the code uses built-in generic type hints such as tuple[str, str])
  • Google Cloud account with Vision API enabled (free tier available)
  • Anthropic API key for Claude (free credits available)
  • Sample images for testing (receipts, products, scenes)

Install required packages:

pip install google-cloud-vision anthropic pillow python-dotenv

API Keys: Never commit API keys to version control. Use environment variables or a .env file.

Instructions

Step 1: Set Up Your Environment

Create a project structure and configure API credentials:

# Create project directory
mkdir cv-api-lab
cd cv-api-lab

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Create .env file for credentials
touch .env

Add your credentials to .env:

# .env
GOOGLE_APPLICATION_CREDENTIALS=path/to/your/service-account.json
ANTHROPIC_API_KEY=sk-ant-your-key-here

Create a utility file to load credentials:

# utils.py
import os
from dotenv import load_dotenv

load_dotenv()

def get_anthropic_key():
    return os.getenv("ANTHROPIC_API_KEY")

Step 2: Extract Text with Google Cloud Vision OCR

Create a script to extract text from receipts or documents:

# ocr_google.py
from google.cloud import vision
from dotenv import load_dotenv

load_dotenv()  # makes GOOGLE_APPLICATION_CREDENTIALS from .env visible to the client

def extract_text(image_path: str) -> dict:
    """Extract text from an image using Google Cloud Vision."""
    client = vision.ImageAnnotatorClient()

    with open(image_path, "rb") as image_file:
        content = image_file.read()

    image = vision.Image(content=content)

    # Perform text detection and fail fast on API errors
    response = client.text_detection(image=image)
    if response.error.message:
        raise Exception(f"API Error: {response.error.message}")

    texts = response.text_annotations

    result = {
        "full_text": texts[0].description if texts else "",
        "words": [],
        "confidence": None
    }

    # Extract individual words with bounding boxes
    for text in texts[1:]:  # Skip first (full text)
        vertices = text.bounding_poly.vertices
        result["words"].append({
            "text": text.description,
            "bounds": [(v.x, v.y) for v in vertices]
        })

    return result

# Test it
if __name__ == "__main__":
    result = extract_text("receipt.jpg")
    print("Extracted Text:")
    print(result["full_text"])
    print(f"\nFound {len(result['words'])} words")

Step 3: Analyze Images with Claude Vision

Use Claude for more complex visual understanding:

# analyze_claude.py
import anthropic
import base64
from pathlib import Path
from dotenv import load_dotenv

load_dotenv()  # makes ANTHROPIC_API_KEY from .env visible to the client

def encode_image(image_path: str) -> tuple[str, str]:
    """Encode image to base64 and detect media type."""
    path = Path(image_path)
    suffix = path.suffix.lower()

    media_types = {
        ".jpg": "image/jpeg",
        ".jpeg": "image/jpeg",
        ".png": "image/png",
        ".gif": "image/gif",
        ".webp": "image/webp"
    }

    media_type = media_types.get(suffix, "image/jpeg")

    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    return image_data, media_type


def analyze_image(image_path: str, prompt: str) -> str:
    """Analyze an image using Claude Vision."""
    client = anthropic.Anthropic()

    image_data, media_type = encode_image(image_path)

    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": image_data,
                        },
                    },
                    {
                        "type": "text",
                        "text": prompt
                    }
                ],
            }
        ],
    )

    return message.content[0].text


# Test it
if __name__ == "__main__":
    result = analyze_image(
        "product.jpg",
        "Describe this product in detail. Include: category, brand if visible, "
        "key features, estimated price range, and suggested uses."
    )
    print(result)

Step 4: Build a Receipt Parser

Use Claude Vision to read and parse a receipt into structured data in a single step:

# receipt_parser.py
import json
from analyze_claude import analyze_image

def parse_receipt(image_path: str) -> dict:
    """Parse a receipt image into structured data."""

    prompt = """Analyze this receipt image and extract the following information
in JSON format:

{
    "store_name": "Name of the store",
    "date": "Date of purchase (YYYY-MM-DD format)",
    "items": [
        {"name": "Item name", "quantity": 1, "price": 0.00}
    ],
    "subtotal": 0.00,
    "tax": 0.00,
    "total": 0.00,
    "payment_method": "cash/card/other"
}

If any field cannot be determined, use null.
Return ONLY the JSON, no additional text."""

    response = analyze_image(image_path, prompt)

    # Parse the JSON response
    try:
        # Clean up response if needed
        json_str = response.strip()
        if json_str.startswith("```"):
            json_str = json_str.split("```")[1]
            if json_str.startswith("json"):
                json_str = json_str[4:]
        data = json.loads(json_str)
    except json.JSONDecodeError:
        data = {"raw_response": response, "error": "Could not parse JSON"}

    return data


if __name__ == "__main__":
    receipt_data = parse_receipt("receipt.jpg")
    print(json.dumps(receipt_data, indent=2))

Step 5: Build a Product Analyzer

Create a tool to analyze product images for e-commerce:

# product_analyzer.py
from analyze_claude import analyze_image
import json

def analyze_product(image_path: str) -> dict:
    """Analyze a product image for e-commerce cataloging."""

    prompt = """Analyze this product image for an e-commerce catalog.
Provide the following in JSON format:

{
    "category": "Main product category",
    "subcategory": "Specific subcategory",
    "title": "Suggested product title (max 80 chars)",
    "description": "Product description (2-3 sentences)",
    "key_features": ["Feature 1", "Feature 2", "Feature 3"],
    "colors": ["Primary color", "Secondary color if any"],
    "materials": ["Material if identifiable"],
    "condition": "new/used/refurbished",
    "suggested_tags": ["tag1", "tag2", "tag3"],
    "quality_issues": ["Any visible defects or concerns"]
}

Return ONLY valid JSON."""

    response = analyze_image(image_path, prompt)

    try:
        json_str = response.strip()
        if json_str.startswith("```"):
            json_str = json_str.split("```")[1]
            if json_str.startswith("json"):
                json_str = json_str[4:]
        return json.loads(json_str)
    except json.JSONDecodeError:
        return {"raw_response": response}


if __name__ == "__main__":
    product = analyze_product("product.jpg")
    print(json.dumps(product, indent=2))
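Steps 4 and 5 share identical fence-stripping code. If you want to avoid the duplication, one option is a small shared helper; the module name `json_utils` and function name `extract_json` below are suggestions, not part of the lab files:

```python
# json_utils.py -- shared helper for parsing JSON out of a model response
import json

def extract_json(response: str) -> dict:
    """Strip optional ```json fences and parse; fall back to raw text on failure."""
    json_str = response.strip()
    if json_str.startswith("```"):
        json_str = json_str.split("```")[1]  # keep only the fenced body
        if json_str.startswith("json"):
            json_str = json_str[4:]          # drop the language tag
    try:
        return json.loads(json_str)
    except json.JSONDecodeError:
        return {"raw_response": response, "error": "Could not parse JSON"}
```

Both parse_receipt and analyze_product could then end with `return extract_json(response)`.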

Step 6: Compare API Results

Create a comparison script to evaluate different APIs:

# compare_apis.py
import time
from ocr_google import extract_text
from analyze_claude import analyze_image

def compare_ocr(image_path: str) -> dict:
    """Compare OCR results from different providers."""
    results = {}

    # Google Cloud Vision
    start = time.time()
    try:
        google_result = extract_text(image_path)
        results["google"] = {
            "text": google_result["full_text"],
            "word_count": len(google_result["words"]),
            "time_ms": (time.time() - start) * 1000
        }
    except Exception as e:
        results["google"] = {"error": str(e)}

    # Claude Vision
    start = time.time()
    try:
        claude_result = analyze_image(
            image_path,
            "Extract ALL text from this image. Return only the text, "
            "preserving the original layout as much as possible."
        )
        results["claude"] = {
            "text": claude_result,
            "word_count": len(claude_result.split()),
            "time_ms": (time.time() - start) * 1000
        }
    except Exception as e:
        results["claude"] = {"error": str(e)}

    return results


if __name__ == "__main__":
    comparison = compare_ocr("receipt.jpg")

    print("=== Google Cloud Vision ===")
    if "error" not in comparison["google"]:
        print(f"Time: {comparison['google']['time_ms']:.0f}ms")
        print(f"Words: {comparison['google']['word_count']}")
        print(comparison["google"]["text"][:500])
    else:
        print(f"Error: {comparison['google']['error']}")

    print("\n=== Claude Vision ===")
    if "error" not in comparison["claude"]:
        print(f"Time: {comparison['claude']['time_ms']:.0f}ms")
        print(f"Words: {comparison['claude']['word_count']}")
        print(comparison["claude"]["text"][:500])
    else:
        print(f"Error: {comparison['claude']['error']}")
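For the comparison report, it can help to quantify how closely the two OCR outputs agree. One rough, provider-agnostic measure is word-level Jaccard similarity; the helper below (`word_overlap` is an illustrative name, not part of the lab) ignores word order and case:

```python
# similarity.py -- rough agreement score between two OCR outputs
def word_overlap(text_a: str, text_b: str) -> float:
    """Jaccard similarity over lowercased word sets (0.0 = disjoint, 1.0 = identical)."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty texts agree trivially
    return len(words_a & words_b) / len(words_a | words_b)
```

You could call `word_overlap(comparison["google"]["text"], comparison["claude"]["text"])` at the end of compare_apis.py; note this ignores layout and punctuation, so treat it as a rough signal only.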

Step 7: Build a Mini Application

Combine everything into a simple command-line application:

# cv_app.py
import argparse
import json
from receipt_parser import parse_receipt
from product_analyzer import analyze_product
from compare_apis import compare_ocr

def main():
    parser = argparse.ArgumentParser(description="CV API Tools")
    parser.add_argument("command", choices=["receipt", "product", "compare"])
    parser.add_argument("image", help="Path to image file")
    parser.add_argument("--output", "-o", help="Output file (JSON)")

    args = parser.parse_args()

    if args.command == "receipt":
        result = parse_receipt(args.image)
    elif args.command == "product":
        result = analyze_product(args.image)
    elif args.command == "compare":
        result = compare_ocr(args.image)

    output = json.dumps(result, indent=2)
    print(output)

    if args.output:
        with open(args.output, "w") as f:
            f.write(output)
        print(f"\nResults saved to {args.output}")


if __name__ == "__main__":
    main()

Usage:

# Parse a receipt
python cv_app.py receipt receipt.jpg -o receipt_data.json

# Analyze a product
python cv_app.py product product.jpg -o product_data.json

# Compare APIs
python cv_app.py compare document.jpg

Expected Output

After completing this practical work, you should have:

  • A working project with API integrations
  • Receipt parser that extracts structured data
  • Product analyzer for e-commerce cataloging
  • Comparison results showing API performance differences
  • A command-line tool combining all functionality

Deliverables

  • Complete project folder with all Python files
  • Sample outputs for at least 3 different images
  • Brief report (1 page) comparing the APIs: when to use each, pros/cons

Bonus Challenges

  • Challenge 1: Add AWS Rekognition as a third comparison option
  • Challenge 2: Build a simple web interface using Streamlit or Gradio
  • Challenge 3: Add error handling and retry logic for API failures
  • Challenge 4: Implement caching to avoid re-processing the same images
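As a starting point for Challenge 3, a generic retry decorator with exponential backoff might look like this; `with_retries` is an illustrative name, and a production version should catch only the transient error types raised by each SDK rather than bare Exception:

```python
# retry.py -- sketch of retry logic with exponential backoff (Challenge 3)
import functools
import time

def with_retries(max_attempts: int = 3, base_delay: float = 1.0):
    """Retry the wrapped function, doubling the delay after each failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    time.sleep(delay)
                    delay *= 2
        return wrapper
    return decorator
```

For example, decorating extract_text or analyze_image with `@with_retries(max_attempts=3)` would retry transient API failures before giving up.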