1. Inference Options

After training, you deploy a model to generate predictions (inference). SageMaker offers multiple deployment options (Real-Time Endpoints, Serverless Inference, Batch Transform, and Asynchronous Inference), chosen based on latency, cost, and traffic requirements.

2. Endpoint Features


Multi-Model Endpoints (MME)

  1. Host THOUSANDS of models on a single endpoint
  2. Models are loaded/unloaded dynamically from S3 based on demand
  3. Saves cost: share infrastructure across many models
  4. Use for: multi-tenant SaaS (one model per customer), many similar models
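The dynamic loading described above is driven by the `TargetModel` parameter on each request. A minimal sketch of the request parameters for the `sagemaker-runtime` `invoke_endpoint` call (the endpoint and model artifact names are hypothetical):

```python
import json

def mme_request(endpoint_name: str, target_model: str, payload: dict) -> dict:
    """Build kwargs for sagemaker-runtime invoke_endpoint against a
    multi-model endpoint. TargetModel picks which S3 artifact to serve
    (relative to the endpoint's model data prefix); SageMaker loads it
    on first use and may evict it later under memory pressure."""
    return {
        "EndpointName": endpoint_name,
        "TargetModel": target_model,       # e.g. one artifact per tenant
        "ContentType": "application/json",
        "Body": json.dumps(payload),
    }

# Hypothetical names: one shared endpoint, a per-customer model artifact.
req = mme_request("churn-mme", "customer-42/model.tar.gz", {"features": [1, 2, 3]})
# boto3.client("sagemaker-runtime").invoke_endpoint(**req)  # the actual call
```

The caller, not the endpoint configuration, decides which model handles each request, which is what lets thousands of artifacts share one set of instances.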


Multi-Container Endpoints

  1. Run multiple containers on one endpoint (serial or parallel)
  2. Serial: request goes through container A, then B (inference pipeline)
  3. Use for: pre-processing → model → post-processing in one endpoint
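A serial pipeline is declared at model-creation time via the `Containers` list and `InferenceExecutionConfig` mode. A sketch of the `create_model` parameters (the model name, role ARN, and image URIs are hypothetical placeholders):

```python
def pipeline_model(name: str, role_arn: str, image_uris: list) -> dict:
    """Build kwargs for sagemaker create_model describing a serial
    inference pipeline: each request flows through the containers in
    list order (preprocess -> model -> postprocess)."""
    return {
        "ModelName": name,
        "ExecutionRoleArn": role_arn,
        "Containers": [{"Image": uri} for uri in image_uris],
        # "Direct" instead of "Serial" lets clients invoke each
        # container individually on the same endpoint.
        "InferenceExecutionConfig": {"Mode": "Serial"},
    }

req = pipeline_model(
    "fraud-pipeline",
    "arn:aws:iam::123456789012:role/SageMakerRole",
    ["123456789012.dkr.ecr.us-east-1.amazonaws.com/preprocess:latest",
     "123456789012.dkr.ecr.us-east-1.amazonaws.com/xgb-model:latest",
     "123456789012.dkr.ecr.us-east-1.amazonaws.com/postprocess:latest"],
)
```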


Production Variants & Shadow Testing

  1. Deploy multiple model versions on one endpoint with traffic splitting
  2. A/B testing: send 90% to Model-A, 10% to Model-B
  3. Shadow mode: send 100% to production model, duplicate to shadow model for comparison
  4. Use for: safe model updates, compare model performance in production
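The 90/10 split above is expressed as variant weights in the endpoint configuration. A sketch of the `create_endpoint_config` parameters (config, variant, and model names are hypothetical):

```python
def ab_endpoint_config(name: str, weights: dict) -> dict:
    """Build kwargs for sagemaker create_endpoint_config with an A/B
    split: traffic is routed in proportion to each variant's
    InitialVariantWeight."""
    return {
        "EndpointConfigName": name,
        "ProductionVariants": [
            {"VariantName": vname, "ModelName": model,
             "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
             "InitialVariantWeight": w}
            for vname, (model, w) in weights.items()
        ],
        # Shadow mode instead uses a ShadowProductionVariants list:
        # production keeps 100% of traffic, the shadow gets a copy
        # and its responses are discarded.
    }

cfg = ab_endpoint_config("ab-test-config", {"Model-A": ("model-a", 0.9),
                                            "Model-B": ("model-b", 0.1)})
```

Weights can later be shifted with `update_endpoint_weights_and_capacities` without redeploying, which is what makes gradual rollouts safe.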


Auto Scaling

  1. Scale endpoint instances based on: InvocationsPerInstance, CPUUtilization, or custom metrics
  2. Target tracking: maintain target invocations per instance
  3. Scheduled scaling: scale up before known traffic spikes
  4. Scale-in cooldown: wait period after a scaling activity before instances are removed, preventing premature scale-in right after a scale-out
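SageMaker endpoint scaling goes through Application Auto Scaling. A sketch of the `put_scaling_policy` parameters for target tracking on invocations per instance (endpoint and variant names are hypothetical; the cooldown values are illustrative):

```python
def target_tracking_policy(endpoint: str, variant: str,
                           target_invocations: float) -> dict:
    """Build kwargs for Application Auto Scaling put_scaling_policy:
    keep InvocationsPerInstance near the target by adding or removing
    endpoint instances."""
    return {
        "PolicyName": f"{endpoint}-invocations-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint}/variant/{variant}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_invocations,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
            "ScaleInCooldown": 300,   # wait 5 min before removing capacity
            "ScaleOutCooldown": 60,   # add capacity quickly under load
        },
    }

policy = target_tracking_policy("churn-endpoint", "AllTraffic", 100.0)
```

The asymmetric cooldowns encode the usual policy: scale out fast, scale in slowly.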

3. SageMaker Neo

  1. Compile and optimize ML models for specific hardware
  2. Target: ARM, Intel, NVIDIA, custom edge devices
  3. Up to 2x performance improvement with no accuracy loss
  4. Supports: TensorFlow, PyTorch, MXNet, XGBoost, ONNX
  5. Use for: edge deployment (IoT, mobile), inference cost reduction
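Compilation is requested as a job that names the framework and the target hardware. A sketch of the `create_compilation_job` parameters (the job name, role ARN, S3 paths, and input shape are hypothetical):

```python
def neo_compilation_job(name: str, role_arn: str) -> dict:
    """Build kwargs for sagemaker create_compilation_job: compile a
    trained model artifact for a specific target device. Framework and
    DataInputConfig must match how the model was trained."""
    return {
        "CompilationJobName": name,
        "RoleArn": role_arn,
        "InputConfig": {
            "S3Uri": "s3://my-bucket/model.tar.gz",          # trained artifact
            "DataInputConfig": '{"input": [1, 3, 224, 224]}',  # expected input shape
            "Framework": "PYTORCH",
        },
        "OutputConfig": {
            "S3OutputLocation": "s3://my-bucket/compiled/",
            "TargetDevice": "jetson_nano",                   # e.g. an NVIDIA edge board
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 900},
    }

job = neo_compilation_job("resnet-neo",
                          "arn:aws:iam::123456789012:role/SageMakerRole")
```

The compiled artifact lands in `S3OutputLocation` and runs via the Neo runtime on the named target, rather than the original framework.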

4. SageMaker Edge Manager

  1. Deploy, manage, and monitor models on edge devices
  2. Model packaging for edge hardware
  3. OTA (over-the-air) model updates to the edge fleet
  4. Data capture from the edge for model retraining
  5. Use for: IoT devices, cameras, industrial equipment running ML locally
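Edge Manager consumes a Neo-compiled model through a packaging job. A sketch of the `create_edge_packaging_job` parameters (all names, the version, and the S3 path are hypothetical; the compilation job must already have finished):

```python
def edge_packaging_job(name: str, compilation_job: str, role_arn: str) -> dict:
    """Build kwargs for sagemaker create_edge_packaging_job: take a
    Neo-compiled artifact and package it for deployment to an Edge
    Manager device fleet."""
    return {
        "EdgePackagingJobName": name,
        "CompilationJobName": compilation_job,    # finished Neo job to package
        "ModelName": "defect-detector",           # model identity on the fleet
        "ModelVersion": "1.0",
        "RoleArn": role_arn,
        "OutputConfig": {"S3OutputLocation": "s3://my-bucket/edge-packages/"},
    }

pkg = edge_packaging_job("defect-detector-pkg", "resnet-neo",
                         "arn:aws:iam::123456789012:role/SageMakerRole")
```

The packaged model is then distributed to registered devices, which report status and captured data back for retraining.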

Exam Tip

  1. "Real-time API" = Real-Time Endpoint
  2. "Intermittent traffic, no idle cost" = Serverless Inference
  3. "Offline batch predictions" = Batch Transform
  4. "Large payloads (>6MB)" = Asynchronous Inference
  5. "1000s of models on one endpoint" = Multi-Model Endpoint
  6. "A/B test models" = Production Variants
  7. "Optimize for edge hardware" = SageMaker Neo
  8. "Deploy to IoT devices" = Edge Manager