Computer Vision

Session 6 - Production Deployment

Model optimization, serving infrastructure, and final project

Today's Agenda

Production Deployment Pipeline

flowchart LR
    A(["Train
500MB FP32"]) -->|export| B["Optimize
INT8 125MB
4x smaller"]
    B -->|package| C["Docker
2GB image"]
    C -->|scale| D["K8s/Cloud
3 replicas"]
    D -->|observe| E{"Monitor
P99 < 100ms
drift < 5%"}
    E -->|degrade| F["Retrain
Weekly/Monthly"]
    F -->|improve| A

    classDef train fill:#78909c,stroke:#546e7a,color:#fff
    classDef optimize fill:#4a90d9,stroke:#2e6da4,color:#fff
    classDef container fill:#7cb342,stroke:#558b2f,color:#fff
    classDef deploy fill:#9c27b0,stroke:#7b1fa2,color:#fff
    classDef monitor fill:#ff9800,stroke:#ef6c00,color:#fff
    classDef retrain fill:#f44336,stroke:#c62828,color:#fff
    class A train
    class B optimize
    class C container
    class D deploy
    class E monitor
    class F retrain

Model Quantization

What is Quantization?

Converting model weights from 32-bit floats to lower precision (16-bit, 8-bit, or 4-bit)

flowchart LR
    A["FP32 Model
100 MB"] -->|Quantize| B["INT8 Model
25 MB"]
    B -->|"4x smaller"| C["Deploy"]

    style A fill:#e74c3c,color:#fff
    style B fill:#27ae60,color:#fff
    style C fill:#3498db,color:#fff

Types

  • Post-Training Quantization: Applied after training
  • Quantization-Aware Training: Simulate quantization during training (see sketch below)
  • Dynamic Quantization: Weights only
  • Static Quantization: Weights + activations
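
A minimal quantization-aware training sketch using the tensorflow_model_optimization package (assumes a compiled Keras model and the X_train/y_train arrays used in later examples):

import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Wrap the existing Keras model with fake-quantization nodes
qat_model = tfmot.quantization.keras.quantize_model(model)

# Re-compile and fine-tune so the weights adapt to quantization noise
qat_model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
qat_model.fit(X_train, y_train, epochs=2, validation_split=0.1)

# Export: the converter folds the fake-quant nodes into real INT8 ops
converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_qat_model = converter.convert()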

TensorFlow Lite Quantization

import tensorflow as tf
import numpy as np

# Load your trained model
model = tf.keras.models.load_model('model.h5')  # (#1:Load Keras model)

# Create TFLite converter
converter = tf.lite.TFLiteConverter.from_keras_model(model)  # (#2:Initialize converter)

# Post-training quantization
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # (#3:Enable optimization)

# Full integer quantization (requires representative dataset)
def representative_dataset():  # (#4:Calibration data)
    for i in range(100):
        yield [X_train[i:i+1].astype(np.float32)]

converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]  # (#5:INT8 precision)

# Convert and save
tflite_model = converter.convert()
with open('model_quantized.tflite', 'wb') as f:  # (#6:Save TFLite model)
    f.write(tflite_model)

PyTorch Quantization

import torch
from torch.quantization import quantize_dynamic, prepare, convert

# Dynamic quantization (easiest)
model_fp32 = load_model()
model_int8 = quantize_dynamic(  # (#1:Dynamic quant)
    model_fp32,
    {torch.nn.Linear},  # (#2:Layers to quantize; conv layers need static quantization)
    dtype=torch.qint8
)

# Static quantization (better accuracy)
model_fp32.qconfig = torch.quantization.get_default_qconfig('fbgemm')  # (#3:Set config)
model_prepared = prepare(model_fp32)  # (#4:Prepare model)

# Calibrate with representative data
with torch.no_grad():
    for batch in calibration_loader:  # (#5:Run calibration)
        model_prepared(batch)

model_quantized = convert(model_prepared)  # (#6:Convert to INT8)

# Save quantized model
torch.save(model_quantized.state_dict(), 'model_quantized.pt')

Quick Exercise: Model Size Calculation

A ResNet-50 model has 25.6 million parameters stored as FP32.

Question 1

What is the model size in MB with FP32 weights?

Hint: FP32 = 4 bytes per parameter

Question 2

What is the model size after INT8 quantization?

Hint: INT8 = 1 byte per parameter

Question 3

If we also prune 50% of weights, what's the final size?

Think: Pruning + Quantization combined
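
A worked check of the arithmetic (sketch; the pruned figure assumes the removed weights are stored sparsely or compressed away rather than kept as zeros):

params = 25.6e6  # ResNet-50 parameters

fp32_mb = params * 4 / 1e6        # 4 bytes/param  -> 102.4 MB
int8_mb = params * 1 / 1e6        # 1 byte/param   -> 25.6 MB
pruned_int8_mb = int8_mb * 0.5    # 50% pruned     -> 12.8 MB (if zeros aren't stored)

print(fp32_mb, int8_mb, pruned_int8_mb)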

Model Pruning

What is Pruning?

Removing unimportant weights (near-zero values) from the model

Types

  • Unstructured: Remove individual weights
  • Structured: Remove entire filters/neurons
  • Magnitude-based: Remove smallest weights

Benefits

  • Smaller model size
  • Faster inference (with sparse ops)
  • Often 50-90% sparsity with minimal accuracy loss

Note: Requires sparse inference support for speed gains

Pruning with TensorFlow

import tensorflow_model_optimization as tfmot

# Define pruning parameters
pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(  # (#1:Gradual pruning)
        initial_sparsity=0.0,
        final_sparsity=0.5,  # (#2:50% weights pruned)
        begin_step=1000,
        end_step=5000
    )
}

# Apply pruning to model
model_for_pruning = tfmot.sparsity.keras.prune_low_magnitude(  # (#3:Wrap model)
    model, **pruning_params
)

# Compile and train with pruning callbacks
model_for_pruning.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
callbacks = [tfmot.sparsity.keras.UpdatePruningStep()]  # (#4:Update pruning)

model_for_pruning.fit(X_train, y_train, epochs=10, callbacks=callbacks)

# Strip pruning wrappers for deployment
final_model = tfmot.sparsity.keras.strip_pruning(model_for_pruning)  # (#5:Clean model)
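
Pruning only pays off in file size once the zeroed weights are compressed or stored sparsely; a common sanity check (sketch, using final_model from above) is to export and gzip the result:

import gzip, os, shutil

# Save the stripped model, then gzip it: long runs of zeroed weights compress well
final_model.save('pruned_model.h5', include_optimizer=False)
with open('pruned_model.h5', 'rb') as f_in, gzip.open('pruned_model.h5.gz', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)

print("Compressed size: %.2f MB" % (os.path.getsize('pruned_model.h5.gz') / 1e6))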

ONNX: Open Neural Network Exchange

flowchart LR
    subgraph Train["Training Frameworks"]
        A[PyTorch]
        B[TensorFlow]
        C[Keras]
    end
    subgraph Format["Universal Format"]
        D[ONNX Model]
    end
    subgraph Deploy["Deployment Targets"]
        E[ONNX Runtime]
        F[TensorRT]
        G[OpenVINO]
        H[CoreML]
    end
    A --> D
    B --> D
    C --> D
    D --> E
    D --> F
    D --> G
    D --> H

    style D fill:#3498db,color:#fff
    

ONNX enables: Train once in any framework, deploy everywhere with optimized inference

Exporting to ONNX

# PyTorch to ONNX
import torch
import torch.onnx

model = load_pytorch_model()
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)  # (#1:Example input shape)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    export_params=True,  # (#2:Include weights)
    opset_version=13,  # (#3:ONNX version)
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={  # (#4:Variable batch size)
        'input': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    }
)

# Verify ONNX model
import onnx
model_onnx = onnx.load("model.onnx")
onnx.checker.check_model(model_onnx)  # (#5:Validate model)

ONNX Runtime Inference

import onnxruntime as ort
import numpy as np

# Create inference session
session = ort.InferenceSession(  # (#1:Load ONNX model)
    "model.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']  # (#2:GPU first, CPU fallback)
)

# Get input/output names
input_name = session.get_inputs()[0].name  # (#3:Input tensor name)
output_name = session.get_outputs()[0].name

# Prepare input
image = preprocess_image("test.jpg")  # (#4:Your preprocessing)
input_data = np.expand_dims(image, axis=0).astype(np.float32)

# Run inference
outputs = session.run([output_name], {input_name: input_data})  # (#5:Run prediction)
predictions = outputs[0]

# Post-process
class_id = np.argmax(predictions)
confidence = np.max(predictions)
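
A rough latency check before deploying (sketch; session, input_name, output_name, and input_data as above, with warm-up runs excluded from timing):

import time

# Warm-up runs so lazy initialization doesn't skew the timing
for _ in range(10):
    session.run([output_name], {input_name: input_data})

# Time repeated single-image calls
latencies = []
for _ in range(100):
    start = time.perf_counter()
    session.run([output_name], {input_name: input_data})
    latencies.append((time.perf_counter() - start) * 1000)  # milliseconds

print(f"p50: {np.percentile(latencies, 50):.1f} ms, p99: {np.percentile(latencies, 99):.1f} ms")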

TensorRT Optimization

What is TensorRT?

NVIDIA's high-performance deep learning inference optimizer

Optimizations

  • Layer fusion
  • Kernel auto-tuning
  • Precision calibration (FP16/INT8)
  • Memory optimization

Performance Gains

  • 2-6x faster than native frameworks
  • Up to 40x faster with INT8
  • Reduced GPU memory

Requires: NVIDIA GPU (Pascal or newer)

Converting to TensorRT

# Method 1: Using trtexec CLI
# trtexec --onnx=model.onnx --saveEngine=model.trt --fp16

# Method 2: Using Python API
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)  # (#1:TRT logger)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)  # (#2:Explicit batch)
)
parser = trt.OnnxParser(network, logger)

# Parse ONNX model
with open("model.onnx", "rb") as f:
    parser.parse(f.read())  # (#3:Load ONNX)

# Configure builder
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # (#4:Enable FP16)
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # (#5:1GB workspace)

# Build engine
engine = builder.build_serialized_network(network, config)  # (#6:Build TRT engine)
with open("model.trt", "wb") as f:
    f.write(engine)

FastAPI for ML APIs

Why FastAPI?

  • High performance (async support)
  • Automatic OpenAPI documentation
  • Type hints & validation
  • Easy to learn and use

Key Features

  • Automatic request validation
  • JSON serialization
  • File upload handling
  • Background tasks
  • WebSocket support

FastAPI REST API for Image Classification

# app.py
from fastapi import FastAPI, File, UploadFile, HTTPException
from fastapi.responses import JSONResponse
import torch
import torchvision.transforms as transforms
from PIL import Image
import io

app = FastAPI(title="CV Model API")  # (#1:Create FastAPI app)

# Load model at startup
model = None

@app.on_event("startup")  # (#2:Load on startup)
async def load_model():
    global model
    model = torch.load("model.pt")
    model.eval()

transform = transforms.Compose([  # (#3:Preprocessing)
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

FastAPI REST API (continued)

CLASSES = ["cat", "dog", "bird", "car", "plane"]  # (#1:Class labels)

@app.post("/predict")  # (#2:POST endpoint)
async def predict(file: UploadFile = File(...)):
    if not file.content_type.startswith("image/"):
        raise HTTPException(400, "File must be an image")  # (#3:Validate input)

    contents = await file.read()
    image = Image.open(io.BytesIO(contents)).convert("RGB")  # (#4:Load image)

    input_tensor = transform(image).unsqueeze(0)  # (#5:Preprocess)

    with torch.no_grad():
        outputs = model(input_tensor)
        probabilities = torch.softmax(outputs, dim=1)  # (#6:Get probabilities)
        confidence, predicted = torch.max(probabilities, 1)

    return {
        "class": CLASSES[predicted.item()],
        "confidence": confidence.item(),
        "all_probabilities": {
            CLASSES[i]: prob.item()
            for i, prob in enumerate(probabilities[0])
        }
    }

@app.get("/health")  # (#7:Health check)
async def health():
    return {"status": "healthy", "model_loaded": model is not None}

Running FastAPI

# Install dependencies
pip install fastapi uvicorn python-multipart pillow torch torchvision

# Run development server
uvicorn app:app --reload --host 0.0.0.0 --port 8000  # (#1:Dev server)

# Production with Gunicorn
pip install gunicorn
gunicorn app:app -w 4 -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000  # (#2:Production)

Testing the API

# Using curl
curl -X POST "http://localhost:8000/predict" \
     -F "file=@test_image.jpg"

# API docs available at: http://localhost:8000/docs
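
The same call from Python with the requests library (sketch; URL and file path are examples):

import requests

with open("test_image.jpg", "rb") as f:
    response = requests.post(
        "http://localhost:8000/predict",
        files={"file": ("test_image.jpg", f, "image/jpeg")}
    )

print(response.status_code)
print(response.json())  # {"class": ..., "confidence": ..., "all_probabilities": {...}}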

Quick Exercise: Choose Deployment Architecture

flowchart TB
    A{"Requirements?"} -->|"Simple API
Low traffic"| B["FastAPI"]
    A -->|"TF models
Auto-batching"| C["TF Serving"]
    A -->|"Multi-framework
GPU cluster"| D["Triton"]

    style B fill:#00b894,color:#fff
    style C fill:#0984e3,color:#fff
    style D fill:#6c5ce7,color:#fff

Scenario A

Startup MVP, 1 PyTorch model, ~100 req/day

Answer: FastAPI

Scenario B

Enterprise: 5 TF models, A/B testing, 10k req/sec

Answer: TF Serving

Scenario C

PyTorch + ONNX + TensorRT, GPU cluster

Answer: Triton

TensorFlow Serving

Features

  • Production-grade serving
  • Model versioning
  • Automatic batching
  • gRPC & REST APIs
  • GPU support

Model Format

Requires SavedModel format:

model_dir/
  1/                # Version
    saved_model.pb
    variables/
      variables.data
      variables.index

TensorFlow Serving with Docker

# First, save model in SavedModel format
model.save('models/cv_model/1')  # (#1:Version 1)
# Pull TF Serving image
docker pull tensorflow/serving:latest-gpu  # (#2:GPU version)

# Run TF Serving container
docker run -d --name tf-serving \
  -p 8501:8501 \
  -p 8500:8500 \
  --gpus all \
  -v "$(pwd)/models:/models" \
  -e MODEL_NAME=cv_model \
  tensorflow/serving:latest-gpu  # (#3:Start serving)

# Test REST API
curl -X POST http://localhost:8501/v1/models/cv_model:predict \
  -H "Content-Type: application/json" \
  -d '{"instances": [[...image_data...]]}' # (#4:REST prediction)

NVIDIA Triton Inference Server

Features

  • Multi-framework: TF, PyTorch, ONNX, TensorRT
  • Dynamic batching
  • Model ensemble
  • Concurrent model execution
  • Model versioning

Model Repository

model_repository/
  model_name/
    config.pbtxt
    1/
      model.onnx

Triton Configuration

# config.pbtxt
name: "image_classifier"
platform: "onnxruntime_onnx"  # (#1:Backend)
max_batch_size: 32  # (#2:Max batch)

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]  # (#3:Input shape)
  }
]

output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]  # (#4:Output classes)
  }
]

dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]  # (#5:Batch sizes)
  max_queue_delay_microseconds: 100
}

instance_group [
  { count: 2, kind: KIND_GPU }  # (#6:GPU instances)
]

Running Triton Server

# Pull Triton image
docker pull nvcr.io/nvidia/tritonserver:23.10-py3  # (#1:Official image)

# Run Triton server
docker run --gpus all -d --name triton \
  -p 8000:8000 \
  -p 8001:8001 \
  -p 8002:8002 \
  -v $(pwd)/model_repository:/models \
  nvcr.io/nvidia/tritonserver:23.10-py3 \
  tritonserver --model-repository=/models  # (#2:Start server)

# Check model status
curl localhost:8000/v2/models/image_classifier  # (#3:Health check)

Ports: 8000 (HTTP), 8001 (gRPC), 8002 (Metrics)
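
From the client side, the tritonclient package wraps the HTTP protocol; a minimal sketch against the config above (random input stands in for a preprocessed image):

import numpy as np
import tritonclient.http as httpclient

# Connect to Triton's HTTP endpoint
client = httpclient.InferenceServerClient(url="localhost:8000")

# Input must match the name, shape, and dtype declared in config.pbtxt
input_data = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input", list(input_data.shape), "FP32")
infer_input.set_data_from_numpy(input_data)

infer_output = httpclient.InferRequestedOutput("output")
response = client.infer("image_classifier", inputs=[infer_input], outputs=[infer_output])

print(response.as_numpy("output").argmax())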

Docker for ML Applications

Why Docker?

  • Reproducibility: Same environment everywhere
  • Isolation: No dependency conflicts
  • Portability: Run anywhere
  • Scalability: Easy orchestration

ML-Specific Benefits

  • Pin CUDA/cuDNN versions
  • Package model with code
  • Consistent preprocessing
  • GPU passthrough support

Dockerfile for CV Model

# Use NVIDIA CUDA base image
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04  # (#1:CUDA base)

# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV DEBIAN_FRONTEND=noninteractive

# Install system dependencies
RUN apt-get update && apt-get install -y \
    python3.10 python3-pip curl libgl1-mesa-glx libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*  # (#2:System deps; curl needed for HEALTHCHECK)

# Set working directory
WORKDIR /app

# Copy requirements first (for caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt  # (#3:Python deps)

# Copy application code
COPY app/ ./app/
COPY models/ ./models/  # (#4:Include model)

# Expose port
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=10s \
  CMD curl -f http://localhost:8000/health || exit 1  # (#5:Health check)

# Run application
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]  # (#6:Start app)

Multi-stage Build for Smaller Images

# Stage 1: Builder
FROM python:3.10-slim as builder  # (#1:Build stage)

WORKDIR /app
COPY requirements.txt .

RUN pip wheel --no-cache-dir --wheel-dir /wheels \
    -r requirements.txt  # (#2:Build wheels)

# Stage 2: Runtime
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04  # (#3:Runtime stage)

# Install Python
RUN apt-get update && apt-get install -y python3.10 python3-pip \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Copy wheels from builder
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/*  # (#4:Install from wheels)

# Copy only necessary files
COPY app/ ./app/
COPY models/ ./models/

EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

GPU Docker Containers

# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list \
  | sudo tee /etc/apt/sources.list.d/nvidia-docker.list  # (#1:Add repo)

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit  # (#2:Install toolkit)
sudo systemctl restart docker

# Run container with GPU access
docker run --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi  # (#3:Test GPU)

# Run with specific GPUs
docker run --gpus '"device=0,1"' my-ml-app  # (#4:Select GPUs)

# Docker Compose with GPU
# In docker-compose.yml:
#   deploy:
#     resources:
#       reservations:
#         devices:
#           - capabilities: [gpu]  # (#5:Compose GPU)
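
A fuller docker-compose.yml sketch of the same GPU reservation (image name is an example; field names follow the Compose spec):

# docker-compose.yml
services:
  cv-model:
    image: cv-model-api:latest
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]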

Kubernetes for ML Deployment

flowchart TB
    subgraph K8s["Kubernetes Cluster"]
        LB["LoadBalancer"] --> SVC["Service"]
        SVC --> P1["Pod 1"]
        SVC --> P2["Pod 2"]
        SVC --> P3["Pod 3"]
        HPA["HPA"] -.->|"scale"| DEP["Deployment"]
        DEP --> P1
        DEP --> P2
        DEP --> P3
    end
    Client --> LB

    style LB fill:#e67e22,color:#fff
    style SVC fill:#3498db,color:#fff
    style DEP fill:#27ae60,color:#fff
    style HPA fill:#9b59b6,color:#fff
    

Why Kubernetes?

  • Auto-scaling: Handle varying load
  • Self-healing: Restart failed pods
  • Rolling updates: Zero-downtime

Key Concepts

  • Pod: Smallest deployable unit
  • Deployment: Manages replicas
  • Service: Network endpoint

Kubernetes Deployment YAML

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cv-model-api
spec:
  replicas: 3  # (#1:3 replicas)
  selector:
    matchLabels:
      app: cv-model
  template:
    metadata:
      labels:
        app: cv-model
    spec:
      containers:
      - name: cv-model
        image: registry.example.com/cv-model:v1.0  # (#2:Container image)
        ports:
        - containerPort: 8000
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
            nvidia.com/gpu: 1  # (#3:GPU request)
          limits:
            memory: "8Gi"
            nvidia.com/gpu: 1
        livenessProbe:  # (#4:Health probes)
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
        readinessProbe:
          httpGet:
            path: /health
            port: 8000

Kubernetes Service & HPA

# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: cv-model-service
spec:
  selector:
    app: cv-model
  ports:
  - port: 80
    targetPort: 8000
  type: LoadBalancer  # (#1:External access)
---
# hpa.yaml - Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cv-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cv-model-api
  minReplicas: 2  # (#2:Min pods)
  maxReplicas: 10  # (#3:Max pods)
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # (#4:Scale at 70% CPU)
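
Applying and checking the manifests with kubectl (default namespace assumed):

# Apply all manifests
kubectl apply -f deployment.yaml -f service.yaml -f hpa.yaml

# Watch pods come up and check autoscaler status
kubectl get pods -l app=cv-model
kubectl get hpa cv-model-hpa

# Follow logs from one replica
kubectl logs deployment/cv-model-api --tail=50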

MLflow: Experiment Tracking

flowchart LR
    A["Experiment"] -->|"log"| B["MLflow Tracking"]
    B -->|"register"| C["Model Registry"]
    C -->|"Staging"| D["Validate"]
    D -->|"Production"| E["Deploy"]
    E -->|"monitor"| F["Retrain"]
    F --> A

    style B fill:#0194e2,color:#fff
    style C fill:#27ae60,color:#fff
    style E fill:#e67e22,color:#fff
    

MLflow Components

  • Tracking: Log parameters, metrics
  • Projects: Reproducible runs
  • Registry: Model versioning

Benefits

  • Compare experiments
  • Reproduce results
  • Share models

MLflow Experiment Tracking

import mlflow
import mlflow.pytorch

# Set tracking URI (can be remote server)
mlflow.set_tracking_uri("http://mlflow-server:5000")  # (#1:MLflow server)
mlflow.set_experiment("cv-classification")  # (#2:Experiment name)

with mlflow.start_run(run_name="resnet50-v1"):  # (#3:Start run)
    # Log parameters
    mlflow.log_param("model", "resnet50")
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("batch_size", 32)
    mlflow.log_param("epochs", 50)  # (#4:Log params)

    # Train model...
    for epoch in range(epochs):
        train_loss, val_loss, val_acc = train_epoch(...)
        mlflow.log_metrics({  # (#5:Log metrics)
            "train_loss": train_loss,
            "val_loss": val_loss,
            "val_accuracy": val_acc
        }, step=epoch)

    # Log model
    mlflow.pytorch.log_model(model, "model")  # (#6:Save model)

    # Log artifacts
    mlflow.log_artifact("confusion_matrix.png")  # (#7:Save artifacts)

MLflow Model Registry

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register a model
model_uri = "runs:/abc123/model"  # (#1:Run ID + artifact path)
model_version = mlflow.register_model(
    model_uri,
    "image-classifier"  # (#2:Model name)
)

# Transition model stage
client.transition_model_version_stage(
    name="image-classifier",
    version=model_version.version,
    stage="Staging"  # (#3:Staging/Production/Archived)
)

# Load model by stage
model = mlflow.pytorch.load_model(
    "models:/image-classifier/Production"  # (#4:Load prod model)
)

# Add model description
client.update_model_version(
    name="image-classifier",
    version=model_version.version,
    description="ResNet50 trained on custom dataset, 95% accuracy"  # (#5:Document)

Production Monitoring

What to Monitor

  • System metrics: CPU, GPU, memory
  • Application metrics: Latency, throughput
  • Model metrics: Predictions distribution
  • Data drift: Input distribution changes

Tools

  • Prometheus: Metrics collection
  • Grafana: Visualization
  • Evidently: ML monitoring
  • WhyLabs: Data observability

Data Drift Detection

from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset

# Define column mapping
column_mapping = ColumnMapping(
    prediction='prediction',
    target='actual_label'
)

# Create drift report
report = Report(metrics=[
    DataDriftPreset(),  # (#1:Input drift)
    TargetDriftPreset()  # (#2:Prediction drift)
])

# Generate report
report.run(
    reference_data=training_data,  # (#3:Reference distribution)
    current_data=production_data,  # (#4:Current distribution)
    column_mapping=column_mapping
)

# Save report
report.save_html("drift_report.html")  # (#5:HTML report)

# Get results programmatically
results = report.as_dict()
if results['metrics'][0]['result']['dataset_drift']:  # (#6:Check drift)
    alert_team("Data drift detected!")

Prometheus Metrics in FastAPI

from prometheus_client import Counter, Histogram, generate_latest
from fastapi import FastAPI, Response, UploadFile

app = FastAPI()

# Define metrics
PREDICTIONS = Counter(
    'model_predictions_total',
    'Total predictions',
    ['model', 'class']  # (#1:Labels)
)

LATENCY = Histogram(
    'model_inference_latency_seconds',
    'Inference latency',
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0]  # (#2:Buckets)
)

@app.post("/predict")
async def predict(file: UploadFile):
    with LATENCY.time():  # (#3:Measure latency)
        result = model.predict(file)

    PREDICTIONS.labels('resnet50', result['class']).inc()  # (#4:Count; positional labels since "class" is a Python keyword)
    return result

@app.get("/metrics")  # (#5:Metrics endpoint)
async def metrics():
    return Response(generate_latest(), media_type="text/plain")
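
On the Prometheus side, a standard scrape job picks up the /metrics endpoint (prometheus.yml sketch; the target is an example, in Kubernetes it would be the Service DNS name):

# prometheus.yml
scrape_configs:
  - job_name: "cv-model-api"
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:8000"]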

CI/CD for ML Models

flowchart LR
    A["Push Code"] --> B["Test"]
    B --> C["Build Image"]
    C --> D["Push Registry"]
    D --> E["Deploy Staging"]
    E --> F{"Validate?"}
    F -->|Pass| G["Deploy Prod"]
    F -->|Fail| H["Rollback"]

    style B fill:#3498db,color:#fff
    style C fill:#9b59b6,color:#fff
    style D fill:#e67e22,color:#fff
    style G fill:#27ae60,color:#fff
    style H fill:#e74c3c,color:#fff
    

CI/CD Stages

  • Test: Unit + model validation
  • Build: Docker image
  • Deploy: Staging → Production

ML-Specific

  • Model performance tests
  • Data validation
  • Canary deployments

GitHub Actions CI/CD Pipeline

# .github/workflows/ml-pipeline.yml
name: ML Model CI/CD
on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.10'
    - name: Install dependencies
      run: pip install -r requirements.txt  # (#1:Install deps)
    - name: Run tests
      run: pytest tests/ --cov=app  # (#2:Run tests)
    - name: Model validation
      run: python scripts/validate_model.py  # (#3:Validate model)

GitHub Actions CI/CD (continued)

  build-and-push:
    needs: test
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - name: Login to Registry
      uses: docker/login-action@v3
      with:
        registry: ghcr.io
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}  # (#1:Auth to registry)
    - name: Build and push
      uses: docker/build-push-action@v5
      with:
        push: true
        tags: ghcr.io/${{ github.repository }}:${{ github.sha }}  # (#2:Tag with SHA)

  deploy:
    needs: build-and-push
    runs-on: ubuntu-latest
    steps:
    - name: Deploy to Kubernetes
      uses: azure/k8s-deploy@v4
      with:
        manifests: k8s/
        images: ghcr.io/${{ github.repository }}:${{ github.sha }}  # (#3:Deploy)
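
The scripts/validate_model.py step is project-specific; one possible sketch of such a gate (the accuracy threshold, model path, and data-loading helper are placeholders):

# scripts/validate_model.py (illustrative sketch)
import sys
import torch

ACCURACY_THRESHOLD = 0.90  # hypothetical gate; tune per project

def evaluate(model, loader):
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total

if __name__ == "__main__":
    model = torch.load("models/model.pt")
    model.eval()
    loader = build_validation_loader()  # placeholder: project-specific data loading
    accuracy = evaluate(model, loader)
    print(f"Validation accuracy: {accuracy:.3f}")
    if accuracy < ACCURACY_THRESHOLD:
        sys.exit(1)  # fail the CI job if the model regressed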

Edge Deployment

flowchart TB
    A{"Edge Platform?"} -->|"Mobile"| B["TFLite / CoreML"]
    A -->|"Low Power"| C["Raspberry Pi
TFLite"]
    A -->|"GPU Required"| D["NVIDIA Jetson
TensorRT"]
    A -->|"Ultra Low Power"| E["Google Coral
Edge TPU"]

    style B fill:#3498db,color:#fff
    style C fill:#e74c3c,color:#fff
    style D fill:#27ae60,color:#fff
    style E fill:#f39c12,color:#fff

Why Edge?

  • Low latency: No network trip
  • Privacy: Data stays local
  • Offline: No internet needed

Edge Platforms

  • Mobile: iOS, Android
  • GPU Edge: NVIDIA Jetson
  • TPU Edge: Google Coral

TensorFlow Lite on Edge

# TFLite inference on edge device
import tflite_runtime.interpreter as tflite
import numpy as np
from PIL import Image

# Load TFLite model
interpreter = tflite.Interpreter(model_path="model.tflite")  # (#1:Load model)
interpreter.allocate_tensors()  # (#2:Allocate memory)

# Get input/output details
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Preprocess image
image = Image.open("test.jpg").resize((224, 224))
input_data = np.expand_dims(np.array(image), axis=0)
input_data = (input_data / 255.0).astype(np.float32)  # (#3:Normalize)

# Run inference
interpreter.set_tensor(input_details[0]['index'], input_data)  # (#4:Set input)
interpreter.invoke()  # (#5:Run inference)
output_data = interpreter.get_tensor(output_details[0]['index'])  # (#6:Get output)

predicted_class = np.argmax(output_data)

NVIDIA Jetson Deployment

Jetson Family

  • Jetson Nano: Entry-level, 472 GFLOPS
  • Jetson Xavier NX: 21 TOPS AI
  • Jetson AGX Orin: 275 TOPS AI

Deployment Steps

  1. Flash JetPack OS
  2. Convert the model to TensorRT (trtexec)
  3. Deploy the optimized engine
  4. Run with DeepStream SDK

# Convert ONNX to TensorRT on Jetson
/usr/src/tensorrt/bin/trtexec \
  --onnx=model.onnx \
  --saveEngine=model.trt \
  --fp16 \
  --workspace=1024  # (#1:Build TRT engine)

Google Coral TPU

# Coral Edge TPU inference
from pycoral.utils.edgetpu import make_interpreter
from pycoral.adapters import common, classify
from PIL import Image

# Load Edge TPU model (must be compiled for Edge TPU)
interpreter = make_interpreter('model_edgetpu.tflite')  # (#1:Edge TPU model)
interpreter.allocate_tensors()

# Get model input size
size = common.input_size(interpreter)  # (#2:Input size)

# Load and preprocess image
image = Image.open('test.jpg').resize(size, Image.LANCZOS)
common.set_input(interpreter, image)  # (#3:Set input)

# Run inference
interpreter.invoke()  # (#4:Run on TPU)

# Get classification results
classes = classify.get_classes(interpreter, top_k=3)  # (#5:Top 3 predictions)
for c in classes:
    print(f"Class {c.id}: {c.score:.4f}")

Performance: Up to 4 TOPS at 2W power consumption

Final Project Requirements

Model Development

  • Custom dataset (min 1000 images)
  • Data augmentation pipeline
  • Transfer learning or custom architecture
  • Performance optimization

Production Deployment

  • REST API with FastAPI
  • Docker containerization
  • Model optimization (quantization/ONNX)
  • Basic monitoring

Documentation

  • README with setup instructions
  • API documentation
  • Model card (performance, limitations)
  • Demo video (2-3 minutes)

Suggested Project Ideas

Project                              Task                        Difficulty
Plant Disease Detection              Classification              Intermediate
Traffic Sign Recognition             Classification              Beginner
Face Mask Detection                  Object Detection            Intermediate
Document Layout Analysis             Segmentation                Advanced
Defect Detection (Manufacturing)     Detection/Classification    Intermediate
Medical Image Analysis               Segmentation                Advanced

Tip: Choose a project aligned with your industry interest!

Evaluation Criteria

Category         Weight    Key Points
Model Quality    30%       Accuracy, robustness, appropriate architecture
Data Pipeline    20%       Data collection, augmentation, preprocessing
Deployment       25%       API design, containerization, optimization
Documentation    15%       Code quality, README, model card
Presentation     10%       Demo, explanation of choices

Key Takeaways

Optimize Before Deploy

Quantization, pruning, and ONNX/TensorRT can dramatically improve inference speed

Containerize Everything

Docker ensures reproducibility and simplifies deployment to any environment

Monitor in Production

Track system metrics, model performance, and data drift continuously

Resources

Type             Resource
Documentation    TensorFlow Lite Guide
Documentation    ONNX Runtime
Tool             MLflow Documentation
Framework        FastAPI Documentation
Platform         NVIDIA TensorRT
Tutorial         Triton Tutorials
Course           Full Stack Deep Learning

Questions?

Project Time

Start working on your final project

Office Hours

Schedule time for project guidance

Submission

GitHub repository with documentation

Deadline: Final project due 2 weeks after last session. Good luck!
