1. Inference Options
After training, you deploy a model to generate predictions (inference). SageMaker offers four main options, chosen by latency, cost, and traffic pattern: Real-Time Endpoints (persistent, low-latency API), Serverless Inference (intermittent traffic, no idle cost), Asynchronous Inference (large payloads, long-running requests), and Batch Transform (offline predictions over a full dataset).
2. Endpoint Features
Multi-Model Endpoints (MME)
- Host THOUSANDS of models on a single endpoint
- Models are loaded/unloaded dynamically from S3 based on demand (first invocation of a cold model incurs extra load latency)
- Saves cost: share infrastructure across many models
- Use for: multi-tenant SaaS (one model per customer), many similar models
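A minimal sketch of how a multi-model endpoint is invoked: the `TargetModel` parameter of the SageMaker Runtime `InvokeEndpoint` API selects which artifact (relative to the endpoint's S3 model-data prefix) to load and run. Endpoint and model names below are hypothetical placeholders; the request is built as a plain dict so nothing here needs AWS credentials.

```python
import json

# TargetModel picks one of the (possibly thousands of) models hosted on
# this endpoint; it is fetched from S3 and cached on first use.
invoke_args = {
    "EndpointName": "mme-churn-endpoint",        # hypothetical endpoint name
    "ContentType": "application/json",
    "TargetModel": "customer-042/model.tar.gz",  # hypothetical artifact key
    "Body": json.dumps({"features": [0.2, 1.7, 3.1]}),
}

# With credentials configured, the actual call would be:
#   import boto3
#   runtime = boto3.client("sagemaker-runtime")
#   response = runtime.invoke_endpoint(**invoke_args)
#   prediction = json.loads(response["Body"].read())
print(invoke_args["TargetModel"])
```

Swapping `TargetModel` per request is what lets one endpoint serve a different model per tenant without redeploying anything.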
Multi-Container Endpoints
- Run multiple containers on one endpoint (serial pipeline, or direct invocation of a specific container)
- Serial: request goes through container A, then B (inference pipeline)
- Use for: pre-processing → model → post-processing in one endpoint
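The serial pipeline above maps to the `CreateModel` API's `Containers` list plus `InferenceExecutionConfig`. A sketch of the request shape, with hypothetical image URIs and role ARN:

```python
# Serial (inference pipeline) multi-container model definition.
# All image URIs, account IDs, and the role ARN are placeholders.
create_model_args = {
    "ModelName": "preprocess-predict-postprocess",  # hypothetical
    "ExecutionRoleArn": "arn:aws:iam::123456789012:role/SageMakerRole",
    "Containers": [  # executed in order when Mode is "Serial"
        {"Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/preprocess:latest"},
        {"Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/xgb-model:latest"},
        {"Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/postprocess:latest"},
    ],
    # "Serial" chains containers; "Direct" lets a request target one container.
    "InferenceExecutionConfig": {"Mode": "Serial"},
}

# Actual call (needs credentials):
#   boto3.client("sagemaker").create_model(**create_model_args)
```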
Production Variants & Shadow Testing
- Deploy multiple model versions on one endpoint with traffic splitting
- A/B testing: send 90% to Model-A, 10% to Model-B
- Shadow mode: send 100% to production model, duplicate to shadow model for comparison
- Use for: safe model updates, compare model performance in production
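The 90/10 split is expressed as variant weights in `CreateEndpointConfig`. A sketch of that request, with hypothetical names; shadow mode would instead use the separate `ShadowProductionVariants` list, which receives a copy of traffic without affecting live responses.

```python
# Endpoint config splitting live traffic 90/10 between two model versions.
# Endpoint, variant, and model names are hypothetical.
endpoint_config = {
    "EndpointConfigName": "churn-ab-test",
    "ProductionVariants": [
        {
            "VariantName": "Model-A",
            "ModelName": "churn-v1",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 2,
            "InitialVariantWeight": 0.9,  # 90% of traffic
        },
        {
            "VariantName": "Model-B",
            "ModelName": "churn-v2",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.1,  # 10% of traffic
        },
    ],
    # For shadow testing, add "ShadowProductionVariants": [...] here instead
    # of a second production variant.
}

weights = [v["InitialVariantWeight"] for v in endpoint_config["ProductionVariants"]]
print(weights)  # [0.9, 0.1]
```

Weights can later be shifted gradually (e.g. 0.9 → 0.5 → 0.0) via `UpdateEndpointWeightsAndCapacities` without redeploying.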
Auto Scaling
- Scale endpoint instances based on: InvocationsPerInstance, CPUUtilization, or custom metrics
- Target tracking: maintain target invocations per instance
- Scheduled scaling: scale up before known traffic spikes
- Scale-in cooldown: waits after a scaling activity before removing instances, preventing capacity from being dropped too quickly
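Endpoint auto scaling goes through Application Auto Scaling: register the variant as a scalable target, then attach a target-tracking policy on the predefined `SageMakerVariantInvocationsPerInstance` metric. A sketch of both request shapes, with hypothetical names and capacities:

```python
# Register the endpoint variant as a scalable target (names hypothetical).
scaling_target = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/churn-endpoint/variant/Model-A",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "MinCapacity": 1,
    "MaxCapacity": 8,
}

# Target-tracking policy: keep each instance near 1000 invocations/minute.
scaling_policy = {
    "PolicyName": "invocations-target-tracking",
    "ServiceNamespace": "sagemaker",
    "ResourceId": scaling_target["ResourceId"],
    "ScalableDimension": scaling_target["ScalableDimension"],
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "TargetValue": 1000.0,  # target invocations per instance per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,  # seconds before instances may be removed again
        "ScaleOutCooldown": 60,  # seconds before instances may be added again
    },
}

# Actual calls (need credentials):
#   aas = boto3.client("application-autoscaling")
#   aas.register_scalable_target(**scaling_target)
#   aas.put_scaling_policy(**scaling_policy)
```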
3. SageMaker Neo
- Compile and optimize ML models for specific hardware
- Target: ARM, Intel, NVIDIA, custom edge devices
- Up to 2x performance improvement with no accuracy loss
- Supports: TensorFlow, PyTorch, MXNet, XGBoost, ONNX
- Use for: edge deployment (IoT, mobile), inference cost reduction
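Neo compilation is kicked off with `CreateCompilationJob`: you point it at a trained artifact in S3, name the framework, and pick a target device. A sketch of that request, with hypothetical bucket, role, and job names; the exact `DataInputConfig` format is framework-specific.

```python
# Neo compilation job request (names, ARNs, and S3 URIs are placeholders).
compilation_job = {
    "CompilationJobName": "xgb-churn-neo",
    "RoleArn": "arn:aws:iam::123456789012:role/SageMakerRole",
    "InputConfig": {
        "S3Uri": "s3://my-bucket/models/xgb-churn/model.tar.gz",
        "Framework": "XGBOOST",
        # Input shape the compiler assumes; format varies by framework.
        "DataInputConfig": '{"data": [1, 30]}',
    },
    "OutputConfig": {
        "S3OutputLocation": "s3://my-bucket/compiled/",
        "TargetDevice": "jetson_nano",  # compile for specific edge hardware
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 900},
}

# Actual call (needs credentials):
#   boto3.client("sagemaker").create_compilation_job(**compilation_job)
```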
4. SageMaker Edge Manager
- Deploy, manage, and monitor models on edge devices
- Model packaging for edge hardware
- OTA (over-the-air) model updates to the edge fleet
- Data capture from the edge for model retraining
- Use for: IoT devices, cameras, industrial equipment running ML locally
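Edge Manager groups devices into fleets via `CreateDeviceFleet`; the fleet's `OutputConfig` is the S3 location where captured edge data lands for retraining. A sketch of the request shape, with hypothetical fleet name, role, and bucket:

```python
# Edge Manager device fleet definition (all names are placeholders).
device_fleet = {
    "DeviceFleetName": "factory-cameras",
    "RoleArn": "arn:aws:iam::123456789012:role/EdgeManagerRole",
    "OutputConfig": {
        # Data captured on edge devices is uploaded here for retraining.
        "S3OutputLocation": "s3://my-bucket/edge-capture/",
    },
}

# Actual call (needs credentials):
#   boto3.client("sagemaker").create_device_fleet(**device_fleet)
```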
Exam Tip
Inference: "Real-time API" = Real-Time Endpoint. "Intermittent traffic, no idle cost" = Serverless Inference. "Offline batch predictions" = Batch Transform. "Large payloads (>6MB)" = Asynchronous Inference. "1000s of models on one endpoint" = Multi-Model Endpoint. "A/B test models" = Production Variants. "Optimize for edge hardware" = SageMaker Neo. "Deploy to IoT devices" = Edge Manager.