1. Inference Options

After training, you deploy a model to generate predictions (inference). SageMaker offers multiple deployment options (Real-Time Endpoints, Serverless Inference, Batch Transform, and Asynchronous Inference), chosen based on latency, cost, and traffic requirements.

2. Endpoint Features


Multi-Model Endpoints (MME)

  1. Host THOUSANDS of models on a single endpoint
  2. Models are loaded/unloaded dynamically from S3 based on demand
  3. Saves cost: share infrastructure across many models
  4. Use for: multi-tenant SaaS (one model per customer), many similar models
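The dynamic loading described above is driven by the `TargetModel` parameter on each request. A minimal sketch of the request parameters for the `sagemaker-runtime` `invoke_endpoint` call (the endpoint and model artifact names are hypothetical):

```python
import json

def mme_request(endpoint_name: str, target_model: str, payload: dict) -> dict:
    """Build kwargs for sagemaker-runtime invoke_endpoint against a
    multi-model endpoint. TargetModel picks which S3 artifact to serve
    (relative to the endpoint's model data prefix); SageMaker loads it
    on first use and may evict it later under memory pressure."""
    return {
        "EndpointName": endpoint_name,
        "TargetModel": target_model,       # e.g. one artifact per tenant
        "ContentType": "application/json",
        "Body": json.dumps(payload),
    }

# Hypothetical names: one shared endpoint, a per-customer model artifact.
req = mme_request("churn-mme", "customer-42/model.tar.gz", {"features": [1, 2, 3]})
# boto3.client("sagemaker-runtime").invoke_endpoint(**req)  # the actual call
```

The caller, not the endpoint configuration, decides which model handles each request, which is what lets thousands of artifacts share one set of instances.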


Multi-Container Endpoints

  1. Run multiple containers on one endpoint (serial or parallel)
  2. Serial: request goes through container A, then B (inference pipeline)
  3. Use for: pre-processing → model → post-processing in one endpoint
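A serial pipeline is declared at model-creation time via the `Containers` list and `InferenceExecutionConfig` mode. A sketch of the `create_model` parameters (the model name, role ARN, and image URIs are hypothetical placeholders):

```python
def pipeline_model(name: str, role_arn: str, image_uris: list) -> dict:
    """Build kwargs for sagemaker create_model describing a serial
    inference pipeline: each request flows through the containers in
    list order (preprocess -> model -> postprocess)."""
    return {
        "ModelName": name,
        "ExecutionRoleArn": role_arn,
        "Containers": [{"Image": uri} for uri in image_uris],
        # "Direct" instead of "Serial" lets clients invoke each
        # container individually on the same endpoint.
        "InferenceExecutionConfig": {"Mode": "Serial"},
    }

req = pipeline_model(
    "fraud-pipeline",
    "arn:aws:iam::123456789012:role/SageMakerRole",
    ["123456789012.dkr.ecr.us-east-1.amazonaws.com/preprocess:latest",
     "123456789012.dkr.ecr.us-east-1.amazonaws.com/xgb-model:latest",
     "123456789012.dkr.ecr.us-east-1.amazonaws.com/postprocess:latest"],
)
```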


Production Variants & Shadow Testing

  1. Deploy multiple model versions on one endpoint with traffic splitting
  2. A/B testing: send 90% to Model-A, 10% to Model-B
  3. Shadow mode: send 100% to production model, duplicate to shadow model for comparison
  4. Use for: safe model updates, compare model performance in production
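The 90/10 split above is expressed as variant weights in the endpoint configuration. A sketch of the `create_endpoint_config` parameters (config, variant, and model names are hypothetical):

```python
def ab_endpoint_config(name: str, weights: dict) -> dict:
    """Build kwargs for sagemaker create_endpoint_config with an A/B
    split: traffic is routed in proportion to each variant's
    InitialVariantWeight."""
    return {
        "EndpointConfigName": name,
        "ProductionVariants": [
            {"VariantName": vname, "ModelName": model,
             "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
             "InitialVariantWeight": w}
            for vname, (model, w) in weights.items()
        ],
        # Shadow mode instead uses a ShadowProductionVariants list:
        # production keeps 100% of traffic, the shadow gets a copy
        # and its responses are discarded.
    }

cfg = ab_endpoint_config("ab-test-config", {"Model-A": ("model-a", 0.9),
                                            "Model-B": ("model-b", 0.1)})
```

Weights can later be shifted with `update_endpoint_weights_and_capacities` without redeploying, which is what makes gradual rollouts safe.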


Auto Scaling

  1. Scale endpoint instances based on: InvocationsPerInstance, CPUUtilization, or custom metrics
  2. Target tracking: maintain target invocations per instance
  3. Scheduled scaling: scale up before known traffic spikes
  4. Scale-in cooldown: wait period after a scaling activity before instances are removed, preventing premature scale-in right after a scale-out
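SageMaker endpoint scaling goes through Application Auto Scaling. A sketch of the `put_scaling_policy` parameters for target tracking on invocations per instance (endpoint and variant names are hypothetical; the cooldown values are illustrative):

```python
def target_tracking_policy(endpoint: str, variant: str,
                           target_invocations: float) -> dict:
    """Build kwargs for Application Auto Scaling put_scaling_policy:
    keep InvocationsPerInstance near the target by adding or removing
    endpoint instances."""
    return {
        "PolicyName": f"{endpoint}-invocations-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint}/variant/{variant}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_invocations,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
            "ScaleInCooldown": 300,   # wait 5 min before removing capacity
            "ScaleOutCooldown": 60,   # add capacity quickly under load
        },
    }

policy = target_tracking_policy("churn-endpoint", "AllTraffic", 100.0)
```

The asymmetric cooldowns encode the usual policy: scale out fast, scale in slowly.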

3. SageMaker Neo

  1. Compile and optimize ML models for specific hardware
  2. Target: ARM, Intel, NVIDIA, custom edge devices
  3. Up to 2x performance improvement with no accuracy loss
  4. Supports: TensorFlow, PyTorch, MXNet, XGBoost, ONNX
  5. Use for: edge deployment (IoT, mobile), inference cost reduction
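Compilation is requested as a job that names the framework and the target hardware. A sketch of the `create_compilation_job` parameters (the job name, role ARN, S3 paths, and input shape are hypothetical):

```python
def neo_compilation_job(name: str, role_arn: str) -> dict:
    """Build kwargs for sagemaker create_compilation_job: compile a
    trained model artifact for a specific target device. Framework and
    DataInputConfig must match how the model was trained."""
    return {
        "CompilationJobName": name,
        "RoleArn": role_arn,
        "InputConfig": {
            "S3Uri": "s3://my-bucket/model.tar.gz",          # trained artifact
            "DataInputConfig": '{"input": [1, 3, 224, 224]}',  # expected input shape
            "Framework": "PYTORCH",
        },
        "OutputConfig": {
            "S3OutputLocation": "s3://my-bucket/compiled/",
            "TargetDevice": "jetson_nano",                   # e.g. an NVIDIA edge board
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 900},
    }

job = neo_compilation_job("resnet-neo",
                          "arn:aws:iam::123456789012:role/SageMakerRole")
```

The compiled artifact lands in `S3OutputLocation` and runs via the Neo runtime on the named target, rather than the original framework.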

4. SageMaker Edge Manager

  1. Deploy, manage, and monitor models on edge devices
  2. Model packaging for edge hardware
  3. OTA (over-the-air) model updates to the edge fleet
  4. Data capture from the edge for model retraining
  5. Use for: IoT devices, cameras, industrial equipment running ML locally
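Edge Manager consumes a Neo-compiled model through a packaging job. A sketch of the `create_edge_packaging_job` parameters (all names, the version, and the S3 path are hypothetical; the compilation job must already have finished):

```python
def edge_packaging_job(name: str, compilation_job: str, role_arn: str) -> dict:
    """Build kwargs for sagemaker create_edge_packaging_job: take a
    Neo-compiled artifact and package it for deployment to an Edge
    Manager device fleet."""
    return {
        "EdgePackagingJobName": name,
        "CompilationJobName": compilation_job,    # finished Neo job to package
        "ModelName": "defect-detector",           # model identity on the fleet
        "ModelVersion": "1.0",
        "RoleArn": role_arn,
        "OutputConfig": {"S3OutputLocation": "s3://my-bucket/edge-packages/"},
    }

pkg = edge_packaging_job("defect-detector-pkg", "resnet-neo",
                         "arn:aws:iam::123456789012:role/SageMakerRole")
```

The packaged model is then distributed to registered devices, which report status and captured data back for retraining.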

Exam Tip

  1. "Real-time API" = Real-Time Endpoint
  2. "Intermittent traffic, no idle cost" = Serverless Inference
  3. "Offline batch predictions" = Batch Transform
  4. "Large payloads (>6MB)" = Asynchronous Inference
  5. "1000s of models on one endpoint" = Multi-Model Endpoint
  6. "A/B test models" = Production Variants
  7. "Optimize for edge hardware" = SageMaker Neo
  8. "Deploy to IoT devices" = Edge Manager