Model optimization, serving infrastructure, and final project
flowchart LR
A(["Train
500MB FP32"]) -->|export| B["Optimize
INT8 125MB
4x smaller"]
B -->|package| C["Docker
2GB image"]
C -->|scale| D["K8s/Cloud
3 replicas"]
D -->|observe| E{"Monitor
P99 < 100ms
drift < 5%"}
E -->|degrade| F["Retrain
Weekly/Monthly"]
F -->|improve| A
classDef train fill:#78909c,stroke:#546e7a,color:#fff
classDef optimize fill:#4a90d9,stroke:#2e6da4,color:#fff
classDef container fill:#7cb342,stroke:#558b2f,color:#fff
classDef deploy fill:#9c27b0,stroke:#7b1fa2,color:#fff
classDef monitor fill:#ff9800,stroke:#ef6c00,color:#fff
classDef retrain fill:#f44336,stroke:#c62828,color:#fff
class A train
class B optimize
class C container
class D deploy
class E monitor
class F retrain
Converting model weights from 32-bit floats to lower precision (16-bit, 8-bit, or 4-bit)
flowchart LR
A["FP32 Model
100 MB"] -->|Quantize| B["INT8 Model
25 MB"]
B -->|"4x smaller"| C["Deploy"]
style A fill:#e74c3c,color:#fff
style B fill:#27ae60,color:#fff
style C fill:#3498db,color:#fff
import tensorflow as tf
import numpy as np
# Load your trained model
model = tf.keras.models.load_model('model.h5') # (#1:Load Keras model)
# Create TFLite converter
converter = tf.lite.TFLiteConverter.from_keras_model(model) # (#2:Initialize converter)
# Post-training quantization
converter.optimizations = [tf.lite.Optimize.DEFAULT] # (#3:Enable optimization)
# Full integer quantization (requires representative dataset)
def representative_dataset(): # (#4:Calibration data)
    for i in range(100):
        yield [X_train[i:i+1].astype(np.float32)]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8] # (#5:INT8 precision)
# Convert and save
tflite_model = converter.convert()
with open('model_quantized.tflite', 'wb') as f: # (#6:Save TFLite model)
    f.write(tflite_model)
import torch
from torch.quantization import quantize_dynamic, prepare, convert
# Dynamic quantization (easiest; supports Linear and LSTM layers)
model_fp32 = load_model()
model_int8 = quantize_dynamic( # (#1:Dynamic quant)
    model_fp32,
    {torch.nn.Linear}, # (#2:Layers to quantize)
    dtype=torch.qint8
)
# Static quantization (better accuracy)
model_fp32.eval() # eval mode required before preparing for post-training quantization
model_fp32.qconfig = torch.quantization.get_default_qconfig('fbgemm') # (#3:Set config)
model_prepared = prepare(model_fp32) # (#4:Prepare model)
# Calibrate with representative data
with torch.no_grad():
    for batch in calibration_loader: # (#5:Run calibration)
        model_prepared(batch)
model_quantized = convert(model_prepared) # (#6:Convert to INT8)
# Save quantized model
torch.save(model_quantized.state_dict(), 'model_quantized.pt')
A ResNet-50 model has 25.6 million parameters stored as FP32.
What is the model size in MB with FP32 weights?
Hint: FP32 = 4 bytes per parameter
What is the model size after INT8 quantization?
Hint: INT8 = 1 byte per parameter
If we also prune 50% of weights, what's the final size?
Think: Pruning + Quantization combined
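A quick way to check the arithmetic above (a minimal sketch, using 1 MB = 10^6 bytes):
params = 25.6e6  # ResNet-50 parameter count
fp32_mb = params * 4 / 1e6        # FP32: 4 bytes per parameter
int8_mb = params * 1 / 1e6        # INT8: 1 byte per parameter
pruned_int8_mb = int8_mb * 0.5    # 50% of weights removed on top of INT8
print(f"FP32: {fp32_mb:.1f} MB, INT8: {int8_mb:.1f} MB, pruned+INT8: {pruned_int8_mb:.1f} MB")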
Removing unimportant weights (near-zero values) from the model
Note: Requires sparse inference support for speed gains
import tensorflow_model_optimization as tfmot
# Define pruning parameters
pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay( # (#1:Gradual pruning)
        initial_sparsity=0.0,
        final_sparsity=0.5, # (#2:50% weights pruned)
        begin_step=1000,
        end_step=5000
    )
}
# Apply pruning to model
model_for_pruning = tfmot.sparsity.keras.prune_low_magnitude( # (#3:Wrap model)
    model, **pruning_params
)
# Compile and train with pruning callbacks
model_for_pruning.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
callbacks = [tfmot.sparsity.keras.UpdatePruningStep()] # (#4:Update pruning)
model_for_pruning.fit(X_train, y_train, epochs=10, callbacks=callbacks)
# Strip pruning wrappers for deployment
final_model = tfmot.sparsity.keras.strip_pruning(model_for_pruning) # (#5:Clean model)
flowchart LR
subgraph Train["Training Frameworks"]
A[PyTorch]
B[TensorFlow]
C[Keras]
end
subgraph Format["Universal Format"]
D[ONNX Model]
end
subgraph Deploy["Deployment Targets"]
E[ONNX Runtime]
F[TensorRT]
G[OpenVINO]
H[CoreML]
end
A --> D
B --> D
C --> D
D --> E
D --> F
D --> G
D --> H
style D fill:#3498db,color:#fff
ONNX enables: Train once in any framework, deploy everywhere with optimized inference
# PyTorch to ONNX
import torch
import torch.onnx
model = load_pytorch_model()
model.eval()
dummy_input = torch.randn(1, 3, 224, 224) # (#1:Example input shape)
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    export_params=True, # (#2:Include weights)
    opset_version=13, # (#3:ONNX version)
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={ # (#4:Variable batch size)
        'input': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    }
)
# Verify ONNX model
import onnx
model_onnx = onnx.load("model.onnx")
onnx.checker.check_model(model_onnx) # (#5:Validate model)
import onnxruntime as ort
import numpy as np
# Create inference session
session = ort.InferenceSession( # (#1:Load ONNX model)
    "model.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider'] # (#2:GPU first, CPU fallback)
)
# Get input/output names
input_name = session.get_inputs()[0].name # (#3:Input tensor name)
output_name = session.get_outputs()[0].name
# Prepare input
image = preprocess_image("test.jpg") # (#4:Your preprocessing)
input_data = np.expand_dims(image, axis=0).astype(np.float32)
# Run inference
outputs = session.run([output_name], {input_name: input_data}) # (#5:Run prediction)
predictions = outputs[0]
# Post-process
class_id = np.argmax(predictions)
confidence = np.max(predictions)
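To check whether an exported model meets a latency target (for example the P99 < 100 ms goal from the overview diagram), a short timing loop is enough. A minimal sketch, assuming the model.onnx exported above with a 1x3x224x224 float32 input:
import time
import numpy as np
import onnxruntime as ort
session = ort.InferenceSession("model.onnx", providers=['CPUExecutionProvider'])
input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)  # shape assumed from the export example
for _ in range(10):                      # warm-up runs
    session.run(None, {input_name: dummy})
latencies = []
for _ in range(100):                     # timed runs
    start = time.perf_counter()
    session.run(None, {input_name: dummy})
    latencies.append(time.perf_counter() - start)
print(f"mean: {np.mean(latencies)*1e3:.1f} ms, p99: {np.percentile(latencies, 99)*1e3:.1f} ms")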
NVIDIA's high-performance deep learning inference optimizer
Requires: NVIDIA GPU (Pascal or newer)
# Method 1: Using trtexec CLI
# trtexec --onnx=model.onnx --saveEngine=model.trt --fp16
# Method 2: Using Python API
import tensorrt as trt
logger = trt.Logger(trt.Logger.WARNING) # (#1:TRT logger)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH) # (#2:Explicit batch)
)
parser = trt.OnnxParser(network, logger)
# Parse ONNX model
with open("model.onnx", "rb") as f:
    parser.parse(f.read()) # (#3:Load ONNX)
# Configure builder
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16) # (#4:Enable FP16)
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30) # (#5:1GB workspace)
# Build engine
engine = builder.build_serialized_network(network, config) # (#6:Build TRT engine)
with open("model.trt", "wb") as f:
    f.write(engine)
# app.py
from fastapi import FastAPI, File, UploadFile, HTTPException
from fastapi.responses import JSONResponse
import torch
import torchvision.transforms as transforms
from PIL import Image
import io
app = FastAPI(title="CV Model API") # (#1:Create FastAPI app)
# Load model at startup
model = None
@app.on_event("startup") # (#2:Load on startup)
async def load_model():
    global model
    model = torch.load("model.pt")
    model.eval()
transform = transforms.Compose([ # (#3:Preprocessing)
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
CLASSES = ["cat", "dog", "bird", "car", "plane"] # (#1:Class labels)
@app.post("/predict") # (#2:POST endpoint)
async def predict(file: UploadFile = File(...)):
    if not file.content_type.startswith("image/"):
        raise HTTPException(400, "File must be an image") # (#3:Validate input)
    contents = await file.read()
    image = Image.open(io.BytesIO(contents)).convert("RGB") # (#4:Load image)
    input_tensor = transform(image).unsqueeze(0) # (#5:Preprocess)
    with torch.no_grad():
        outputs = model(input_tensor)
        probabilities = torch.softmax(outputs, dim=1) # (#6:Get probabilities)
        confidence, predicted = torch.max(probabilities, 1)
    return {
        "class": CLASSES[predicted.item()],
        "confidence": confidence.item(),
        "all_probabilities": {
            CLASSES[i]: prob.item()
            for i, prob in enumerate(probabilities[0])
        }
    }
@app.get("/health") # (#7:Health check)
async def health():
    return {"status": "healthy", "model_loaded": model is not None}
# Install dependencies
pip install fastapi uvicorn python-multipart pillow torch torchvision
# Run development server
uvicorn app:app --reload --host 0.0.0.0 --port 8000 # (#1:Dev server)
# Production with Gunicorn
pip install gunicorn
gunicorn app:app -w 4 -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000 # (#2:Production)
# Using curl
curl -X POST "http://localhost:8000/predict" \
-F "file=@test_image.jpg"
# API docs available at: http://localhost:8000/docs
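The same endpoint can be called from Python; a minimal client sketch using the requests library (test_image.jpg is a placeholder path):
import requests
with open("test_image.jpg", "rb") as f:
    response = requests.post(
        "http://localhost:8000/predict",
        files={"file": ("test_image.jpg", f, "image/jpeg")}  # multipart upload, matches UploadFile
    )
print(response.status_code, response.json())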
flowchart TB
A{"Requirements?"} -->|"Simple API
Low traffic"| B["FastAPI"]
A -->|"TF models
Auto-batching"| C["TF Serving"]
A -->|"Multi-framework
GPU cluster"| D["Triton"]
style B fill:#00b894,color:#fff
style C fill:#0984e3,color:#fff
style D fill:#6c5ce7,color:#fff
Startup MVP, 1 PyTorch model, ~100 req/day
Answer: FastAPI
Enterprise: 5 TF models, A/B testing, 10k req/sec
Answer: TF Serving
PyTorch + ONNX + TensorRT, GPU cluster
Answer: Triton
Requires SavedModel format:
model_dir/
  1/                      # Version
    saved_model.pb
    variables/
      variables.data
      variables.index
# First, save model in SavedModel format
model.save('models/cv_model/1') # (#1:Version 1)
# Pull TF Serving image
docker pull tensorflow/serving:latest-gpu # (#2:GPU version)
# Run TF Serving container
docker run -d --name tf-serving \
-p 8501:8501 \
-p 8500:8500 \
--gpus all \
-v "$(pwd)/models:/models" \
-e MODEL_NAME=cv_model \
tensorflow/serving:latest-gpu # (#3:Start serving)
# Test REST API
curl -X POST http://localhost:8501/v1/models/cv_model:predict \
-H "Content-Type: application/json" \
-d '{"instances": [[...image_data...]]}' # (#4:REST prediction)
model_repository/
  model_name/
    config.pbtxt
    1/
      model.onnx
# config.pbtxt
name: "image_classifier"
platform: "onnxruntime_onnx" # (#1:Backend)
max_batch_size: 32 # (#2:Max batch)
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ] # (#3:Input shape)
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ] # (#4:Output classes)
  }
]
dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ] # (#5:Batch sizes)
  max_queue_delay_microseconds: 100
}
instance_group [
  { count: 2, kind: KIND_GPU } # (#6:GPU instances)
]
# Pull Triton image
docker pull nvcr.io/nvidia/tritonserver:23.10-py3 # (#1:Official image)
# Run Triton server
docker run --gpus all -d --name triton \
-p 8000:8000 \
-p 8001:8001 \
-p 8002:8002 \
-v $(pwd)/model_repository:/models \
nvcr.io/nvidia/tritonserver:23.10-py3 \
tritonserver --model-repository=/models # (#2:Start server)
# Check model status
curl localhost:8000/v2/models/image_classifier # (#3:Health check)
Ports: 8000 (HTTP), 8001 (gRPC), 8002 (Metrics)
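A minimal Python client sketch using the tritonclient package, assuming the image_classifier config above (FP32 input named "input", output named "output"):
import numpy as np
import tritonclient.http as httpclient
client = httpclient.InferenceServerClient(url="localhost:8000")
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)   # stand-in for a preprocessed image
infer_input = httpclient.InferInput("input", list(batch.shape), "FP32")
infer_input.set_data_from_numpy(batch)
result = client.infer("image_classifier", inputs=[infer_input])
output = result.as_numpy("output")
print(output.argmax(axis=1))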
# Use NVIDIA CUDA base image
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04 # (#1:CUDA base)
# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV DEBIAN_FRONTEND=noninteractive
# Install system dependencies
RUN apt-get update && apt-get install -y \
    python3.10 python3-pip curl libgl1-mesa-glx libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/* # (#2:System deps, curl needed for HEALTHCHECK)
# Set working directory
WORKDIR /app
# Copy requirements first (for caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt # (#3:Python deps)
# Copy application code
COPY app/ ./app/
COPY models/ ./models/ # (#4:Include model)
# Expose port
EXPOSE 8000
# Health check
HEALTHCHECK --interval=30s --timeout=10s \
CMD curl -f http://localhost:8000/health || exit 1 # (#5:Health check)
# Run application
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"] # (#6:Start app)
# Stage 1: Builder
FROM python:3.10-slim as builder # (#1:Build stage)
WORKDIR /app
COPY requirements.txt .
RUN pip wheel --no-cache-dir --wheel-dir /wheels \
-r requirements.txt # (#2:Build wheels)
# Stage 2: Runtime
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04 # (#3:Runtime stage)
# Install Python
RUN apt-get update && apt-get install -y python3.10 python3-pip \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
# Copy wheels from builder
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/* # (#4:Install from wheels)
# Copy only necessary files
COPY app/ ./app/
COPY models/ ./models/
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list \
| sudo tee /etc/apt/sources.list.d/nvidia-docker.list # (#1:Add repo)
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit # (#2:Install toolkit)
sudo systemctl restart docker
# Run container with GPU access
docker run --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi # (#3:Test GPU)
# Run with specific GPUs
docker run --gpus '"device=0,1"' my-ml-app # (#4:Select GPUs)
# Docker Compose with GPU
# In docker-compose.yml:
#   deploy:
#     resources:
#       reservations:
#         devices:
#           - capabilities: [gpu] # (#5:Compose GPU)
flowchart TB
subgraph K8s["Kubernetes Cluster"]
LB["LoadBalancer"] --> SVC["Service"]
SVC --> P1["Pod 1"]
SVC --> P2["Pod 2"]
SVC --> P3["Pod 3"]
HPA["HPA"] -.->|"scale"| DEP["Deployment"]
DEP --> P1
DEP --> P2
DEP --> P3
end
Client --> LB
style LB fill:#e67e22,color:#fff
style SVC fill:#3498db,color:#fff
style DEP fill:#27ae60,color:#fff
style HPA fill:#9b59b6,color:#fff
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cv-model-api
spec:
  replicas: 3 # (#1:3 replicas)
  selector:
    matchLabels:
      app: cv-model
  template:
    metadata:
      labels:
        app: cv-model
    spec:
      containers:
      - name: cv-model
        image: registry.example.com/cv-model:v1.0 # (#2:Container image)
        ports:
        - containerPort: 8000
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
            nvidia.com/gpu: 1 # (#3:GPU request)
          limits:
            memory: "8Gi"
            nvidia.com/gpu: 1
        livenessProbe: # (#4:Health probes)
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: cv-model-service
spec:
  selector:
    app: cv-model
  ports:
  - port: 80
    targetPort: 8000
  type: LoadBalancer # (#1:External access)
---
# hpa.yaml - Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cv-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cv-model-api
  minReplicas: 2 # (#2:Min pods)
  maxReplicas: 10 # (#3:Max pods)
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70 # (#4:Scale at 70% CPU)
flowchart LR
A["Experiment"] -->|"log"| B["MLflow Tracking"]
B -->|"register"| C["Model Registry"]
C -->|"Staging"| D["Validate"]
D -->|"Production"| E["Deploy"]
E -->|"monitor"| F["Retrain"]
F --> A
style B fill:#0194e2,color:#fff
style C fill:#27ae60,color:#fff
style E fill:#e67e22,color:#fff
import mlflow
import mlflow.pytorch
# Set tracking URI (can be remote server)
mlflow.set_tracking_uri("http://mlflow-server:5000") # (#1:MLflow server)
mlflow.set_experiment("cv-classification") # (#2:Experiment name)
with mlflow.start_run(run_name="resnet50-v1"): # (#3:Start run)
    # Log parameters
    mlflow.log_param("model", "resnet50")
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("batch_size", 32)
    mlflow.log_param("epochs", 50) # (#4:Log params)
    # Train model...
    for epoch in range(epochs):
        train_loss, val_loss, val_acc = train_epoch(...)
        mlflow.log_metrics({ # (#5:Log metrics)
            "train_loss": train_loss,
            "val_loss": val_loss,
            "val_accuracy": val_acc
        }, step=epoch)
    # Log model
    mlflow.pytorch.log_model(model, "model") # (#6:Save model)
    # Log artifacts
    mlflow.log_artifact("confusion_matrix.png") # (#7:Save artifacts)
import mlflow
from mlflow.tracking import MlflowClient
client = MlflowClient()
# Register a model
model_uri = "runs:/abc123/model" # (#1:Run ID + artifact path)
model_version = mlflow.register_model(
    model_uri,
    "image-classifier" # (#2:Model name)
)
# Transition model stage
client.transition_model_version_stage(
    name="image-classifier",
    version=model_version.version,
    stage="Staging" # (#3:Staging/Production/Archived)
)
# Load model by stage
model = mlflow.pytorch.load_model(
    "models:/image-classifier/Production" # (#4:Load prod model)
)
# Add model description
client.update_model_version(
    name="image-classifier",
    version=model_version.version,
    description="ResNet50 trained on custom dataset, 95% accuracy" # (#5:Document)
)
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset
# Define column mapping
column_mapping = ColumnMapping(
    prediction='prediction',
    target='actual_label'
)
# Create drift report
report = Report(metrics=[
    DataDriftPreset(), # (#1:Input drift)
    TargetDriftPreset() # (#2:Prediction drift)
])
# Generate report
report.run(
    reference_data=training_data, # (#3:Reference distribution)
    current_data=production_data, # (#4:Current distribution)
    column_mapping=column_mapping
)
# Save report
report.save_html("drift_report.html") # (#5:HTML report)
# Get results programmatically
results = report.as_dict()
if results['metrics'][0]['result']['dataset_drift']: # (#6:Check drift)
    alert_team("Data drift detected!")
from prometheus_client import Counter, Histogram, generate_latest
from fastapi import FastAPI, Response, UploadFile
app = FastAPI()
# Define metrics
PREDICTIONS = Counter(
    'model_predictions_total',
    'Total predictions',
    ['model', 'class'] # (#1:Labels)
)
LATENCY = Histogram(
    'model_inference_latency_seconds',
    'Inference latency',
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0] # (#2:Buckets)
)
@app.post("/predict")
async def predict(file: UploadFile):
    with LATENCY.time(): # (#3:Measure latency)
        result = model.predict(file)
    PREDICTIONS.labels('resnet50', result['class']).inc() # (#4:Count)
    return result
@app.get("/metrics") # (#5:Metrics endpoint)
async def metrics():
    return Response(generate_latest(), media_type="text/plain")
flowchart LR
A["Push Code"] --> B["Test"]
B --> C["Build Image"]
C --> D["Push Registry"]
D --> E["Deploy Staging"]
E --> F{"Validate?"}
F -->|Pass| G["Deploy Prod"]
F -->|Fail| H["Rollback"]
style B fill:#3498db,color:#fff
style C fill:#9b59b6,color:#fff
style D fill:#e67e22,color:#fff
style G fill:#27ae60,color:#fff
style H fill:#e74c3c,color:#fff
# .github/workflows/ml-pipeline.yml
name: ML Model CI/CD
on:
  push:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: pip install -r requirements.txt # (#1:Install deps)
      - name: Run tests
        run: pytest tests/ --cov=app # (#2:Run tests)
      - name: Model validation
        run: python scripts/validate_model.py # (#3:Validate model)
  build-and-push:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Login to Registry
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }} # (#1:Auth to registry)
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/${{ github.repository }}:${{ github.sha }} # (#2:Tag with SHA)
  deploy:
    needs: build-and-push
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to Kubernetes
        uses: azure/k8s-deploy@v4
        with:
          manifests: k8s/
          images: ghcr.io/${{ github.repository }}:${{ github.sha }} # (#3:Deploy)
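The "Model validation" step in the test job runs scripts/validate_model.py; one possible shape for that gate is sketched below (the threshold, model path, and build_validation_loader helper are hypothetical and project-specific):
# scripts/validate_model.py (sketch)
import sys
import torch
ACCURACY_THRESHOLD = 0.90  # assumed minimum acceptable accuracy
def evaluate(model, loader):
    correct = total = 0
    model.eval()
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total
if __name__ == "__main__":
    model = torch.load("models/model.pt")   # path is an assumption
    loader = build_validation_loader()      # hypothetical project-specific data loader
    accuracy = evaluate(model, loader)
    print(f"Validation accuracy: {accuracy:.3f}")
    if accuracy < ACCURACY_THRESHOLD:
        sys.exit(1)  # non-zero exit fails the CI job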
flowchart TB
A{"Edge Platform?"} -->|"Mobile"| B["TFLite / CoreML"]
A -->|"Low Power"| C["Raspberry Pi
TFLite"]
A -->|"GPU Required"| D["NVIDIA Jetson
TensorRT"]
A -->|"Ultra Low Power"| E["Google Coral
Edge TPU"]
style B fill:#3498db,color:#fff
style C fill:#e74c3c,color:#fff
style D fill:#27ae60,color:#fff
style E fill:#f39c12,color:#fff
# TFLite inference on edge device
import tflite_runtime.interpreter as tflite
import numpy as np
from PIL import Image
# Load TFLite model
interpreter = tflite.Interpreter(model_path="model.tflite") # (#1:Load model)
interpreter.allocate_tensors() # (#2:Allocate memory)
# Get input/output details
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
# Preprocess image
image = Image.open("test.jpg").resize((224, 224))
input_data = np.expand_dims(np.array(image), axis=0)
input_data = (input_data / 255.0).astype(np.float32) # (#3:Normalize)
# Run inference
interpreter.set_tensor(input_details[0]['index'], input_data) # (#4:Set input)
interpreter.invoke() # (#5:Run inference)
output_data = interpreter.get_tensor(output_details[0]['index']) # (#6:Get output)
predicted_class = np.argmax(output_data)
# Convert ONNX to TensorRT on Jetson
/usr/src/tensorrt/bin/trtexec \
--onnx=model.onnx \
--saveEngine=model.trt \
--fp16 \
--workspace=1024 # (#1:Build TRT engine)
# Coral Edge TPU inference
from pycoral.utils.edgetpu import make_interpreter
from pycoral.adapters import common, classify
from PIL import Image
# Load Edge TPU model (must be compiled for Edge TPU)
interpreter = make_interpreter('model_edgetpu.tflite') # (#1:Edge TPU model)
interpreter.allocate_tensors()
# Get model input size
size = common.input_size(interpreter) # (#2:Input size)
# Load and preprocess image
image = Image.open('test.jpg').resize(size, Image.LANCZOS)
common.set_input(interpreter, image) # (#3:Set input)
# Run inference
interpreter.invoke() # (#4:Run on TPU)
# Get classification results
classes = classify.get_classes(interpreter, top_k=3) # (#5:Top 3 predictions)
for c in classes:
    print(f"Class {c.id}: {c.score:.4f}")
Performance: Up to 4 TOPS at 2W power consumption
| Project | Task | Difficulty |
|---|---|---|
| Plant Disease Detection | Classification | Intermediate |
| Traffic Sign Recognition | Classification | Beginner |
| Face Mask Detection | Object Detection | Intermediate |
| Document Layout Analysis | Segmentation | Advanced |
| Defect Detection (Manufacturing) | Detection/Classification | Intermediate |
| Medical Image Analysis | Segmentation | Advanced |
Tip: Choose a project aligned with your industry interest!
| Category | Weight | Key Points |
|---|---|---|
| Model Quality | 30% | Accuracy, robustness, appropriate architecture |
| Data Pipeline | 20% | Data collection, augmentation, preprocessing |
| Deployment | 25% | API design, containerization, optimization |
| Documentation | 15% | Code quality, README, model card |
| Presentation | 10% | Demo, explanation of choices |
Quantization, pruning, and ONNX/TensorRT can dramatically improve inference speed
Docker ensures reproducibility and simplifies deployment to any environment
Track system metrics, model performance, and data drift continuously
| Type | Resource |
|---|---|
| Documentation | TensorFlow Lite Guide |
| Documentation | ONNX Runtime |
| Tool | MLflow Documentation |
| Framework | FastAPI Documentation |
| Platform | NVIDIA TensorRT |
| Tutorial | Triton Tutorials |
| Course | Full Stack Deep Learning |
Start working on your final project
Schedule time for project guidance
GitHub repository with documentation
Deadline: Final project due 2 weeks after last session. Good luck!