Computer Vision

Session 6 - Production Deployment

Model optimization, serving infrastructure, and final project

Today's Agenda

Production Deployment Pipeline

flowchart LR
    A(["Train
500MB FP32"]) -->|export| B["Optimize
INT8 125MB
4x smaller"]
    B -->|package| C["Docker
2GB image"]
    C -->|scale| D["K8s/Cloud
3 replicas"]
    D -->|observe| E{"Monitor
P99 < 100ms
drift < 5%"}
    E -->|degrade| F["Retrain
Weekly/Monthly"]
    F -->|improve| A

    classDef train fill:#78909c,stroke:#546e7a,color:#fff
    classDef optimize fill:#4a90d9,stroke:#2e6da4,color:#fff
    classDef container fill:#7cb342,stroke:#558b2f,color:#fff
    classDef deploy fill:#9c27b0,stroke:#7b1fa2,color:#fff
    classDef monitor fill:#ff9800,stroke:#ef6c00,color:#fff
    classDef retrain fill:#f44336,stroke:#c62828,color:#fff
    class A train
    class B optimize
    class C container
    class D deploy
    class E monitor
    class F retrain

Model Quantization

What is Quantization?

Converting model weights from 32-bit floats to lower precision (16-bit, 8-bit, or 4-bit)

flowchart LR
    A["FP32 Model
100 MB"] -->|Quantize| B["INT8 Model
25 MB"]
    B -->|"4x smaller"| C["Deploy"]

    style A fill:#e74c3c,color:#fff
    style B fill:#27ae60,color:#fff
    style C fill:#3498db,color:#fff

Types

  • Post-Training Quantization: Applied after training
  • Quantization-Aware Training: Simulate quantization during training (see sketch below)
  • Dynamic Quantization: Weights only
  • Static Quantization: Weights + activations
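
A minimal quantization-aware training sketch using the tensorflow_model_optimization package (assumes a compiled Keras model and the X_train/y_train arrays used in later examples):

import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Wrap the existing Keras model with fake-quantization nodes
qat_model = tfmot.quantization.keras.quantize_model(model)

# Re-compile and fine-tune so the weights adapt to quantization noise
qat_model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
qat_model.fit(X_train, y_train, epochs=2, validation_split=0.1)

# Export: the converter folds the fake-quant nodes into real INT8 ops
converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_qat_model = converter.convert()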

TensorFlow Lite Quantization

import tensorflow as tf
import numpy as np

# Load your trained model
model = tf.keras.models.load_model('model.h5')  # (#1:Load Keras model)

# Create TFLite converter
converter = tf.lite.TFLiteConverter.from_keras_model(model)  # (#2:Initialize converter)

# Post-training quantization
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # (#3:Enable optimization)

# Full integer quantization (requires representative dataset)
def representative_dataset():  # (#4:Calibration data)
    for i in range(100):
        yield [X_train[i:i+1].astype(np.float32)]

converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]  # (#5:INT8 precision)

# Convert and save
tflite_model = converter.convert()
with open('model_quantized.tflite', 'wb') as f:  # (#6:Save TFLite model)
    f.write(tflite_model)

PyTorch Quantization

import torch
from torch.quantization import quantize_dynamic, prepare, convert

# Dynamic quantization (easiest)
model_fp32 = load_model()
model_int8 = quantize_dynamic(  # (#1:Dynamic quant)
    model_fp32,
    {torch.nn.Linear},  # (#2:Layers to quantize; conv layers need static quantization)
    dtype=torch.qint8
)

# Static quantization (better accuracy)
model_fp32.qconfig = torch.quantization.get_default_qconfig('fbgemm')  # (#3:Set config)
model_prepared = prepare(model_fp32)  # (#4:Prepare model)

# Calibrate with representative data
with torch.no_grad():
    for batch in calibration_loader:  # (#5:Run calibration)
        model_prepared(batch)

model_quantized = convert(model_prepared)  # (#6:Convert to INT8)

# Save quantized model
torch.save(model_quantized.state_dict(), 'model_quantized.pt')

Quick Exercise: Model Size Calculation

A ResNet-50 model has 25.6 million parameters stored as FP32.

Question 1

What is the model size in MB with FP32 weights?

Hint: FP32 = 4 bytes per parameter

Question 2

What is the model size after INT8 quantization?

Hint: INT8 = 1 byte per parameter

Question 3

If we also prune 50% of weights, what's the final size?

Think: Pruning + Quantization combined
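
A worked check of the arithmetic (sketch; the pruned figure assumes the removed weights are stored sparsely or compressed away rather than kept as zeros):

params = 25.6e6  # ResNet-50 parameters

fp32_mb = params * 4 / 1e6        # 4 bytes/param  -> 102.4 MB
int8_mb = params * 1 / 1e6        # 1 byte/param   -> 25.6 MB
pruned_int8_mb = int8_mb * 0.5    # 50% pruned     -> 12.8 MB (if zeros aren't stored)

print(fp32_mb, int8_mb, pruned_int8_mb)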

Model Pruning

What is Pruning?

Removing unimportant weights (near-zero values) from the model

Types

  • Unstructured: Remove individual weights
  • Structured: Remove entire filters/neurons
  • Magnitude-based: Remove smallest weights

Benefits

  • Smaller model size
  • Faster inference (with sparse ops)
  • Often 50-90% sparsity with minimal accuracy loss

Note: Requires sparse inference support for speed gains

Pruning with TensorFlow

import tensorflow_model_optimization as tfmot

# Define pruning parameters
pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(  # (#1:Gradual pruning)
        initial_sparsity=0.0,
        final_sparsity=0.5,  # (#2:50% weights pruned)
        begin_step=1000,
        end_step=5000
    )
}

# Apply pruning to model
model_for_pruning = tfmot.sparsity.keras.prune_low_magnitude(  # (#3:Wrap model)
    model, **pruning_params
)

# Compile and train with pruning callbacks
model_for_pruning.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
callbacks = [tfmot.sparsity.keras.UpdatePruningStep()]  # (#4:Update pruning)

model_for_pruning.fit(X_train, y_train, epochs=10, callbacks=callbacks)

# Strip pruning wrappers for deployment
final_model = tfmot.sparsity.keras.strip_pruning(model_for_pruning)  # (#5:Clean model)
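
Pruning only pays off in file size once the zeroed weights are compressed or stored sparsely; a common sanity check (sketch, using final_model from above) is to export and gzip the result:

import gzip, os, shutil

# Save the stripped model, then gzip it: long runs of zeroed weights compress well
final_model.save('pruned_model.h5', include_optimizer=False)
with open('pruned_model.h5', 'rb') as f_in, gzip.open('pruned_model.h5.gz', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)

print("Compressed size: %.2f MB" % (os.path.getsize('pruned_model.h5.gz') / 1e6))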

ONNX: Open Neural Network Exchange

flowchart LR
    subgraph Train["Training Frameworks"]
        A[PyTorch]
        B[TensorFlow]
        C[Keras]
    end
    subgraph Format["Universal Format"]
        D[ONNX Model]
    end
    subgraph Deploy["Deployment Targets"]
        E[ONNX Runtime]
        F[TensorRT]
        G[OpenVINO]
        H[CoreML]
    end
    A --> D
    B --> D
    C --> D
    D --> E
    D --> F
    D --> G
    D --> H

    style D fill:#3498db,color:#fff
    

ONNX enables: Train once in any framework, deploy everywhere with optimized inference

Exporting to ONNX

# PyTorch to ONNX
import torch
import torch.onnx

model = load_pytorch_model()
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)  # (#1:Example input shape)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    export_params=True,  # (#2:Include weights)
    opset_version=13,  # (#3:ONNX version)
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={  # (#4:Variable batch size)
        'input': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    }
)

# Verify ONNX model
import onnx
model_onnx = onnx.load("model.onnx")
onnx.checker.check_model(model_onnx)  # (#5:Validate model)

ONNX Runtime Inference

import onnxruntime as ort
import numpy as np

# Create inference session
session = ort.InferenceSession(  # (#1:Load ONNX model)
    "model.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']  # (#2:GPU first, CPU fallback)
)

# Get input/output names
input_name = session.get_inputs()[0].name  # (#3:Input tensor name)
output_name = session.get_outputs()[0].name

# Prepare input
image = preprocess_image("test.jpg")  # (#4:Your preprocessing)
input_data = np.expand_dims(image, axis=0).astype(np.float32)

# Run inference
outputs = session.run([output_name], {input_name: input_data})  # (#5:Run prediction)
predictions = outputs[0]

# Post-process
class_id = np.argmax(predictions)
confidence = np.max(predictions)
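
A rough latency check before deploying (sketch; session, input_name, output_name, and input_data as above, with warm-up runs excluded from timing):

import time

# Warm-up runs so lazy initialization doesn't skew the timing
for _ in range(10):
    session.run([output_name], {input_name: input_data})

# Time repeated single-image calls
latencies = []
for _ in range(100):
    start = time.perf_counter()
    session.run([output_name], {input_name: input_data})
    latencies.append((time.perf_counter() - start) * 1000)  # milliseconds

print(f"p50: {np.percentile(latencies, 50):.1f} ms, p99: {np.percentile(latencies, 99):.1f} ms")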

TensorRT Optimization

What is TensorRT?

NVIDIA's high-performance deep learning inference optimizer

Optimizations

  • Layer fusion
  • Kernel auto-tuning
  • Precision calibration (FP16/INT8)
  • Memory optimization

Performance Gains

  • 2-6x faster than native frameworks
  • Up to 40x faster with INT8
  • Reduced GPU memory

Requires: NVIDIA GPU (Pascal or newer)

Converting to TensorRT

# Method 1: Using trtexec CLI
# trtexec --onnx=model.onnx --saveEngine=model.trt --fp16

# Method 2: Using Python API
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)  # (#1:TRT logger)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)  # (#2:Explicit batch)
)
parser = trt.OnnxParser(network, logger)

# Parse ONNX model
with open("model.onnx", "rb") as f:
    parser.parse(f.read())  # (#3:Load ONNX)

# Configure builder
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # (#4:Enable FP16)
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # (#5:1GB workspace)

# Build engine
engine = builder.build_serialized_network(network, config)  # (#6:Build TRT engine)
with open("model.trt", "wb") as f:
    f.write(engine)

FastAPI for ML APIs

Why FastAPI?

  • High performance (async support)
  • Automatic OpenAPI documentation
  • Type hints & validation
  • Easy to learn and use

Key Features

  • Automatic request validation
  • JSON serialization
  • File upload handling
  • Background tasks
  • WebSocket support

FastAPI REST API for Image Classification

# app.py
from fastapi import FastAPI, File, UploadFile, HTTPException
from fastapi.responses import JSONResponse
import torch
import torchvision.transforms as transforms
from PIL import Image
import io

app = FastAPI(title="CV Model API")  # (#1:Create FastAPI app)

# Load model at startup
model = None

@app.on_event("startup")  # (#2:Load on startup)
async def load_model():
    global model
    model = torch.load("model.pt")
    model.eval()

transform = transforms.Compose([  # (#3:Preprocessing)
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

FastAPI REST API (continued)

CLASSES = ["cat", "dog", "bird", "car", "plane"]  # (#1:Class labels)

@app.post("/predict")  # (#2:POST endpoint)
async def predict(file: UploadFile = File(...)):
    if not file.content_type.startswith("image/"):
        raise HTTPException(400, "File must be an image")  # (#3:Validate input)

    contents = await file.read()
    image = Image.open(io.BytesIO(contents)).convert("RGB")  # (#4:Load image)

    input_tensor = transform(image).unsqueeze(0)  # (#5:Preprocess)

    with torch.no_grad():
        outputs = model(input_tensor)
        probabilities = torch.softmax(outputs, dim=1)  # (#6:Get probabilities)
        confidence, predicted = torch.max(probabilities, 1)

    return {
        "class": CLASSES[predicted.item()],
        "confidence": confidence.item(),
        "all_probabilities": {
            CLASSES[i]: prob.item()
            for i, prob in enumerate(probabilities[0])
        }
    }

@app.get("/health")  # (#7:Health check)
async def health():
    return {"status": "healthy", "model_loaded": model is not None}

Running FastAPI

# Install dependencies
pip install fastapi uvicorn python-multipart pillow torch torchvision

# Run development server
uvicorn app:app --reload --host 0.0.0.0 --port 8000  # (#1:Dev server)

# Production with Gunicorn
pip install gunicorn
gunicorn app:app -w 4 -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000  # (#2:Production)

Testing the API

# Using curl
curl -X POST "http://localhost:8000/predict" \
     -F "file=@test_image.jpg"

# API docs available at: http://localhost:8000/docs
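
The same call from Python with the requests library (sketch; URL and file path are examples):

import requests

with open("test_image.jpg", "rb") as f:
    response = requests.post(
        "http://localhost:8000/predict",
        files={"file": ("test_image.jpg", f, "image/jpeg")}
    )

print(response.status_code)
print(response.json())  # {"class": ..., "confidence": ..., "all_probabilities": {...}}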

Quick Exercise: Choose Deployment Architecture

flowchart TB
    A{"Requirements?"} -->|"Simple API
Low traffic"| B["FastAPI"]
    A -->|"TF models
Auto-batching"| C["TF Serving"]
    A -->|"Multi-framework
GPU cluster"| D["Triton"]

    style B fill:#00b894,color:#fff
    style C fill:#0984e3,color:#fff
    style D fill:#6c5ce7,color:#fff

Scenario A

Startup MVP, 1 PyTorch model, ~100 req/day

Answer: FastAPI

Scenario B

Enterprise: 5 TF models, A/B testing, 10k req/sec

Answer: TF Serving

Scenario C

PyTorch + ONNX + TensorRT, GPU cluster

Answer: Triton

TensorFlow Serving

Features

  • Production-grade serving
  • Model versioning
  • Automatic batching
  • gRPC & REST APIs
  • GPU support

Model Format

Requires SavedModel format:

model_dir/
  1/                # Version
    saved_model.pb
    variables/
      variables.data
      variables.index

TensorFlow Serving with Docker

# First, save model in SavedModel format
model.save('models/cv_model/1')  # (#1:Version 1)
# Pull TF Serving image
docker pull tensorflow/serving:latest-gpu  # (#2:GPU version)

# Run TF Serving container
docker run -d --name tf-serving \
  -p 8501:8501 \
  -p 8500:8500 \
  --gpus all \
  -v "$(pwd)/models:/models" \
  -e MODEL_NAME=cv_model \
  tensorflow/serving:latest-gpu  # (#3:Start serving)

# Test REST API
curl -X POST http://localhost:8501/v1/models/cv_model:predict \
  -H "Content-Type: application/json" \
  -d '{"instances": [[...image_data...]]}' # (#4:REST prediction)

NVIDIA Triton Inference Server

Features

  • Multi-framework: TF, PyTorch, ONNX, TensorRT
  • Dynamic batching
  • Model ensemble
  • Concurrent model execution
  • Model versioning

Model Repository

model_repository/
  model_name/
    config.pbtxt
    1/
      model.onnx

Triton Configuration

# config.pbtxt
name: "image_classifier"
platform: "onnxruntime_onnx"  # (#1:Backend)
max_batch_size: 32  # (#2:Max batch)

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]  # (#3:Input shape)
  }
]

output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]  # (#4:Output classes)
  }
]

dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]  # (#5:Batch sizes)
  max_queue_delay_microseconds: 100
}

instance_group [
  { count: 2, kind: KIND_GPU }  # (#6:GPU instances)
]

Running Triton Server

# Pull Triton image
docker pull nvcr.io/nvidia/tritonserver:23.10-py3  # (#1:Official image)

# Run Triton server
docker run --gpus all -d --name triton \
  -p 8000:8000 \
  -p 8001:8001 \
  -p 8002:8002 \
  -v $(pwd)/model_repository:/models \
  nvcr.io/nvidia/tritonserver:23.10-py3 \
  tritonserver --model-repository=/models  # (#2:Start server)

# Check model status
curl localhost:8000/v2/models/image_classifier  # (#3:Health check)

Ports: 8000 (HTTP), 8001 (gRPC), 8002 (Metrics)
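
From the client side, the tritonclient package wraps the HTTP protocol; a minimal sketch against the config above (random input stands in for a preprocessed image):

import numpy as np
import tritonclient.http as httpclient

# Connect to Triton's HTTP endpoint
client = httpclient.InferenceServerClient(url="localhost:8000")

# Input must match the name, shape, and dtype declared in config.pbtxt
input_data = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input", list(input_data.shape), "FP32")
infer_input.set_data_from_numpy(input_data)

infer_output = httpclient.InferRequestedOutput("output")
response = client.infer("image_classifier", inputs=[infer_input], outputs=[infer_output])

print(response.as_numpy("output").argmax())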

Docker for ML Applications

Why Docker?

  • Reproducibility: Same environment everywhere
  • Isolation: No dependency conflicts
  • Portability: Run anywhere
  • Scalability: Easy orchestration

ML-Specific Benefits

  • Pin CUDA/cuDNN versions
  • Package model with code
  • Consistent preprocessing
  • GPU passthrough support

Dockerfile for CV Model

# Use NVIDIA CUDA base image
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04  # (#1:CUDA base)

# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV DEBIAN_FRONTEND=noninteractive

# Install system dependencies
RUN apt-get update && apt-get install -y \
    python3.10 python3-pip curl libgl1-mesa-glx libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*  # (#2:System deps; curl needed for HEALTHCHECK)

# Set working directory
WORKDIR /app

# Copy requirements first (for caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt  # (#3:Python deps)

# Copy application code
COPY app/ ./app/
COPY models/ ./models/  # (#4:Include model)

# Expose port
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=10s \
  CMD curl -f http://localhost:8000/health || exit 1  # (#5:Health check)

# Run application
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]  # (#6:Start app)

Multi-stage Build for Smaller Images

# Stage 1: Builder
FROM python:3.10-slim as builder  # (#1:Build stage)

WORKDIR /app
COPY requirements.txt .

RUN pip wheel --no-cache-dir --wheel-dir /wheels \
    -r requirements.txt  # (#2:Build wheels)

# Stage 2: Runtime
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04  # (#3:Runtime stage)

# Install Python
RUN apt-get update && apt-get install -y python3.10 python3-pip \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Copy wheels from builder
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/*  # (#4:Install from wheels)

# Copy only necessary files
COPY app/ ./app/
COPY models/ ./models/

EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

GPU Docker Containers

# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list \
  | sudo tee /etc/apt/sources.list.d/nvidia-docker.list  # (#1:Add repo)

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit  # (#2:Install toolkit)
sudo systemctl restart docker

# Run container with GPU access
docker run --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi  # (#3:Test GPU)

# Run with specific GPUs
docker run --gpus '"device=0,1"' my-ml-app  # (#4:Select GPUs)

# Docker Compose with GPU
# In docker-compose.yml:
#   deploy:
#     resources:
#       reservations:
#         devices:
#           - capabilities: [gpu]  # (#5:Compose GPU)
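
A fuller docker-compose.yml sketch of the same GPU reservation (image name is an example; field names follow the Compose spec):

# docker-compose.yml
services:
  cv-model:
    image: cv-model-api:latest
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]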

Kubernetes for ML Deployment

flowchart TB
    subgraph K8s["Kubernetes Cluster"]
        LB["LoadBalancer"] --> SVC["Service"]
        SVC --> P1["Pod 1"]
        SVC --> P2["Pod 2"]
        SVC --> P3["Pod 3"]
        HPA["HPA"] -.->|"scale"| DEP["Deployment"]
        DEP --> P1
        DEP --> P2
        DEP --> P3
    end
    Client --> LB

    style LB fill:#e67e22,color:#fff
    style SVC fill:#3498db,color:#fff
    style DEP fill:#27ae60,color:#fff
    style HPA fill:#9b59b6,color:#fff
    

Why Kubernetes?

  • Auto-scaling: Handle varying load
  • Self-healing: Restart failed pods
  • Rolling updates: Zero-downtime

Key Concepts

  • Pod: Smallest deployable unit
  • Deployment: Manages replicas
  • Service: Network endpoint

Kubernetes Deployment YAML

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cv-model-api
spec:
  replicas: 3  # (#1:3 replicas)
  selector:
    matchLabels:
      app: cv-model
  template:
    metadata:
      labels:
        app: cv-model
    spec:
      containers:
      - name: cv-model
        image: registry.example.com/cv-model:v1.0  # (#2:Container image)
        ports:
        - containerPort: 8000
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
            nvidia.com/gpu: 1  # (#3:GPU request)
          limits:
            memory: "8Gi"
            nvidia.com/gpu: 1
        livenessProbe:  # (#4:Health probes)
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
        readinessProbe:
          httpGet:
            path: /health
            port: 8000

Kubernetes Service & HPA

# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: cv-model-service
spec:
  selector:
    app: cv-model
  ports:
  - port: 80
    targetPort: 8000
  type: LoadBalancer  # (#1:External access)
---
# hpa.yaml - Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cv-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cv-model-api
  minReplicas: 2  # (#2:Min pods)
  maxReplicas: 10  # (#3:Max pods)
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # (#4:Scale at 70% CPU)
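
Applying and checking the manifests with kubectl (default namespace assumed):

# Apply all manifests
kubectl apply -f deployment.yaml -f service.yaml -f hpa.yaml

# Watch pods come up and check autoscaler status
kubectl get pods -l app=cv-model
kubectl get hpa cv-model-hpa

# Follow logs from one replica
kubectl logs deployment/cv-model-api --tail=50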

MLflow: Experiment Tracking

flowchart LR
    A["Experiment"] -->|"log"| B["MLflow Tracking"]
    B -->|"register"| C["Model Registry"]
    C -->|"Staging"| D["Validate"]
    D -->|"Production"| E["Deploy"]
    E -->|"monitor"| F["Retrain"]
    F --> A

    style B fill:#0194e2,color:#fff
    style C fill:#27ae60,color:#fff
    style E fill:#e67e22,color:#fff
    

MLflow Components

  • Tracking: Log parameters, metrics
  • Projects: Reproducible runs
  • Registry: Model versioning

Benefits

  • Compare experiments
  • Reproduce results
  • Share models

MLflow Experiment Tracking

import mlflow
import mlflow.pytorch

# Set tracking URI (can be remote server)
mlflow.set_tracking_uri("http://mlflow-server:5000")  # (#1:MLflow server)
mlflow.set_experiment("cv-classification")  # (#2:Experiment name)

with mlflow.start_run(run_name="resnet50-v1"):  # (#3:Start run)
    # Log parameters
    mlflow.log_param("model", "resnet50")
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("batch_size", 32)
    mlflow.log_param("epochs", 50)  # (#4:Log params)

    # Train model...
    for epoch in range(epochs):
        train_loss, val_loss, val_acc = train_epoch(...)
        mlflow.log_metrics({  # (#5:Log metrics)
            "train_loss": train_loss,
            "val_loss": val_loss,
            "val_accuracy": val_acc
        }, step=epoch)

    # Log model
    mlflow.pytorch.log_model(model, "model")  # (#6:Save model)

    # Log artifacts
    mlflow.log_artifact("confusion_matrix.png")  # (#7:Save artifacts)

MLflow Model Registry

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register a model
model_uri = "runs:/abc123/model"  # (#1:Run ID + artifact path)
model_version = mlflow.register_model(
    model_uri,
    "image-classifier"  # (#2:Model name)
)

# Transition model stage
client.transition_model_version_stage(
    name="image-classifier",
    version=model_version.version,
    stage="Staging"  # (#3:Staging/Production/Archived)
)

# Load model by stage
model = mlflow.pytorch.load_model(
    "models:/image-classifier/Production"  # (#4:Load prod model)
)

# Add model description
client.update_model_version(
    name="image-classifier",
    version=model_version.version,
    description="ResNet50 trained on custom dataset, 95% accuracy"  # (#5:Document)

Production Monitoring

What to Monitor

  • System metrics: CPU, GPU, memory
  • Application metrics: Latency, throughput
  • Model metrics: Predictions distribution
  • Data drift: Input distribution changes

Tools

  • Prometheus: Metrics collection
  • Grafana: Visualization
  • Evidently: ML monitoring
  • WhyLabs: Data observability

Data Drift Detection

from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset

# Define column mapping
column_mapping = ColumnMapping(
    prediction='prediction',
    target='actual_label'
)

# Create drift report
report = Report(metrics=[
    DataDriftPreset(),  # (#1:Input drift)
    TargetDriftPreset()  # (#2:Prediction drift)
])

# Generate report
report.run(
    reference_data=training_data,  # (#3:Reference distribution)
    current_data=production_data,  # (#4:Current distribution)
    column_mapping=column_mapping
)

# Save report
report.save_html("drift_report.html")  # (#5:HTML report)

# Get results programmatically
results = report.as_dict()
if results['metrics'][0]['result']['dataset_drift']:  # (#6:Check drift)
    alert_team("Data drift detected!")

Prometheus Metrics in FastAPI

from prometheus_client import Counter, Histogram, generate_latest
from fastapi import FastAPI, Response, UploadFile

app = FastAPI()

# Define metrics
PREDICTIONS = Counter(
    'model_predictions_total',
    'Total predictions',
    ['model', 'class']  # (#1:Labels)
)

LATENCY = Histogram(
    'model_inference_latency_seconds',
    'Inference latency',
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0]  # (#2:Buckets)
)

@app.post("/predict")
async def predict(file: UploadFile):
    with LATENCY.time():  # (#3:Measure latency)
        result = model.predict(file)

    PREDICTIONS.labels('resnet50', result['class']).inc()  # (#4:Count; positional labels since "class" is a Python keyword)
    return result

@app.get("/metrics")  # (#5:Metrics endpoint)
async def metrics():
    return Response(generate_latest(), media_type="text/plain")
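
On the Prometheus side, a standard scrape job picks up the /metrics endpoint (prometheus.yml sketch; the target is an example, in Kubernetes it would be the Service DNS name):

# prometheus.yml
scrape_configs:
  - job_name: "cv-model-api"
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:8000"]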

CI/CD for ML Models

flowchart LR
    A["Push Code"] --> B["Test"]
    B --> C["Build Image"]
    C --> D["Push Registry"]
    D --> E["Deploy Staging"]
    E --> F{"Validate?"}
    F -->|Pass| G["Deploy Prod"]
    F -->|Fail| H["Rollback"]

    style B fill:#3498db,color:#fff
    style C fill:#9b59b6,color:#fff
    style D fill:#e67e22,color:#fff
    style G fill:#27ae60,color:#fff
    style H fill:#e74c3c,color:#fff
    

CI/CD Stages

  • Test: Unit + model validation
  • Build: Docker image
  • Deploy: Staging → Production

ML-Specific

  • Model performance tests
  • Data validation
  • Canary deployments

GitHub Actions CI/CD Pipeline

# .github/workflows/ml-pipeline.yml
name: ML Model CI/CD
on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.10'
    - name: Install dependencies
      run: pip install -r requirements.txt  # (#1:Install deps)
    - name: Run tests
      run: pytest tests/ --cov=app  # (#2:Run tests)
    - name: Model validation
      run: python scripts/validate_model.py  # (#3:Validate model)

GitHub Actions CI/CD (continued)

  build-and-push:
    needs: test
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - name: Login to Registry
      uses: docker/login-action@v3
      with:
        registry: ghcr.io
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}  # (#1:Auth to registry)
    - name: Build and push
      uses: docker/build-push-action@v5
      with:
        push: true
        tags: ghcr.io/${{ github.repository }}:${{ github.sha }}  # (#2:Tag with SHA)

  deploy:
    needs: build-and-push
    runs-on: ubuntu-latest
    steps:
    - name: Deploy to Kubernetes
      uses: azure/k8s-deploy@v4
      with:
        manifests: k8s/
        images: ghcr.io/${{ github.repository }}:${{ github.sha }}  # (#3:Deploy)
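
The scripts/validate_model.py step is project-specific; one possible sketch of such a gate (the accuracy threshold, model path, and data-loading helper are placeholders):

# scripts/validate_model.py (illustrative sketch)
import sys
import torch

ACCURACY_THRESHOLD = 0.90  # hypothetical gate; tune per project

def evaluate(model, loader):
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total

if __name__ == "__main__":
    model = torch.load("models/model.pt")
    model.eval()
    loader = build_validation_loader()  # placeholder: project-specific data loading
    accuracy = evaluate(model, loader)
    print(f"Validation accuracy: {accuracy:.3f}")
    if accuracy < ACCURACY_THRESHOLD:
        sys.exit(1)  # fail the CI job if the model regressed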

Edge Deployment

flowchart TB
    A{"Edge Platform?"} -->|"Mobile"| B["TFLite / CoreML"]
    A -->|"Low Power"| C["Raspberry Pi
TFLite"]
    A -->|"GPU Required"| D["NVIDIA Jetson
TensorRT"]
    A -->|"Ultra Low Power"| E["Google Coral
Edge TPU"]

    style B fill:#3498db,color:#fff
    style C fill:#e74c3c,color:#fff
    style D fill:#27ae60,color:#fff
    style E fill:#f39c12,color:#fff

Why Edge?

  • Low latency: No network trip
  • Privacy: Data stays local
  • Offline: No internet needed

Edge Platforms

  • Mobile: iOS, Android
  • GPU Edge: NVIDIA Jetson
  • TPU Edge: Google Coral

TensorFlow Lite on Edge

# TFLite inference on edge device
import tflite_runtime.interpreter as tflite
import numpy as np
from PIL import Image

# Load TFLite model
interpreter = tflite.Interpreter(model_path="model.tflite")  # (#1:Load model)
interpreter.allocate_tensors()  # (#2:Allocate memory)

# Get input/output details
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Preprocess image
image = Image.open("test.jpg").resize((224, 224))
input_data = np.expand_dims(np.array(image), axis=0)
input_data = (input_data / 255.0).astype(np.float32)  # (#3:Normalize)

# Run inference
interpreter.set_tensor(input_details[0]['index'], input_data)  # (#4:Set input)
interpreter.invoke()  # (#5:Run inference)
output_data = interpreter.get_tensor(output_details[0]['index'])  # (#6:Get output)

predicted_class = np.argmax(output_data)

NVIDIA Jetson Deployment

Jetson Family

  • Jetson Nano: Entry-level, 472 GFLOPS
  • Jetson Xavier NX: 21 TOPS AI
  • Jetson AGX Orin: 275 TOPS AI

Deployment Steps

  1. Flash JetPack OS
  2. Convert the model to TensorRT (trtexec)
  3. Deploy the optimized engine
  4. Run with DeepStream SDK

# Convert ONNX to TensorRT on Jetson
/usr/src/tensorrt/bin/trtexec \
  --onnx=model.onnx \
  --saveEngine=model.trt \
  --fp16 \
  --workspace=1024  # (#1:Build TRT engine)

Google Coral TPU

# Coral Edge TPU inference
from pycoral.utils.edgetpu import make_interpreter
from pycoral.adapters import common, classify
from PIL import Image

# Load Edge TPU model (must be compiled for Edge TPU)
interpreter = make_interpreter('model_edgetpu.tflite')  # (#1:Edge TPU model)
interpreter.allocate_tensors()

# Get model input size
size = common.input_size(interpreter)  # (#2:Input size)

# Load and preprocess image
image = Image.open('test.jpg').resize(size, Image.LANCZOS)
common.set_input(interpreter, image)  # (#3:Set input)

# Run inference
interpreter.invoke()  # (#4:Run on TPU)

# Get classification results
classes = classify.get_classes(interpreter, top_k=3)  # (#5:Top 3 predictions)
for c in classes:
    print(f"Class {c.id}: {c.score:.4f}")

Performance: Up to 4 TOPS at 2W power consumption

Final Project Requirements

Model Development

  • Custom dataset (min 1000 images)
  • Data augmentation pipeline
  • Transfer learning or custom architecture
  • Performance optimization

Production Deployment

  • REST API with FastAPI
  • Docker containerization
  • Model optimization (quantization/ONNX)
  • Basic monitoring

Documentation

  • README with setup instructions
  • API documentation
  • Model card (performance, limitations)
  • Demo video (2-3 minutes)

Suggested Project Ideas

Project                              Task                        Difficulty
Plant Disease Detection              Classification              Intermediate
Traffic Sign Recognition             Classification              Beginner
Face Mask Detection                  Object Detection            Intermediate
Document Layout Analysis             Segmentation                Advanced
Defect Detection (Manufacturing)     Detection/Classification    Intermediate
Medical Image Analysis               Segmentation                Advanced

Tip: Choose a project aligned with your industry interest!

Evaluation Criteria

Category         Weight    Key Points
Model Quality    30%       Accuracy, robustness, appropriate architecture
Data Pipeline    20%       Data collection, augmentation, preprocessing
Deployment       25%       API design, containerization, optimization
Documentation    15%       Code quality, README, model card
Presentation     10%       Demo, explanation of choices

Key Takeaways

Optimize Before Deploy

Quantization, pruning, and ONNX/TensorRT can dramatically improve inference speed

Containerize Everything

Docker ensures reproducibility and simplifies deployment to any environment

Monitor in Production

Track system metrics, model performance, and data drift continuously

Resources

Type             Resource
Documentation    TensorFlow Lite Guide
Documentation    ONNX Runtime
Tool             MLflow Documentation
Framework        FastAPI Documentation
Platform         NVIDIA TensorRT
Tutorial         Triton Tutorials
Course           Full Stack Deep Learning

Questions?

Project Time

Start working on your final project

Office Hours

Schedule time for project guidance

Submission

GitHub repository with documentation

Deadline: Final project due 2 weeks after last session. Good luck!
