1. What is Amazon EMR?

Amazon EMR (Elastic MapReduce) is a managed big data platform for processing vast amounts of data using open-source frameworks like Apache Spark, Hadoop, Hive, Presto, HBase, and Flink.


Core Concept
EMR = managed Hadoop/Spark cluster on AWS. You choose the frameworks, EMR provisions the cluster (EC2 instances or EKS pods or Serverless), you run your big data jobs, then terminate the cluster. Use for: large-scale data processing, ML training on big data, log analysis, genomics, ETL at petabyte scale.
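
The transient-cluster pattern above can be sketched as a request builder for boto3's `run_job_flow` API (a real EMR API call). The release label, instance types, bucket paths, and cluster name here are illustrative placeholders, not a definitive configuration:

```python
def transient_spark_cluster(name: str, log_uri: str, script_s3_path: str) -> dict:
    """Request body for boto3 emr.run_job_flow(): a cluster that runs one
    Spark step, then terminates itself -- no idle cost, data stays in S3."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-7.1.0",               # placeholder EMR release
        "Applications": [{"Name": "Spark"}],
        "LogUri": log_uri,                          # e.g. s3://my-bucket/emr-logs/
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,                     # 1 primary + 2 core
            "KeepJobFlowAliveWhenNoSteps": False,   # auto-terminate when steps finish
        },
        "Steps": [{
            "Name": "spark-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",        # EMR's generic command wrapper
                "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_path],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",       # default instance profile
        "ServiceRole": "EMR_DefaultRole",
    }
```

Passing this dict to `boto3.client("emr").run_job_flow(**cfg)` would launch the cluster; because `KeepJobFlowAliveWhenNoSteps` is `False`, it shuts itself down after the step completes.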

2. EMR Deployment Options

  EMR on EC2:     the classic mode; full control over instance types, Spot, HDFS, and cluster config
  EMR on EKS:     submit Spark jobs to an existing Amazon EKS (Kubernetes) cluster; share infrastructure with other workloads
  EMR Serverless: no cluster to size or manage; submit jobs and pay only while they run (no idle cost)

3. EMR on EC2 Architecture

EMR Cluster Architecture:

  Primary Node (Master)
    - Manages the cluster
    - Runs: YARN ResourceManager, HDFS NameNode
    - Coordinates job scheduling

  Core Nodes
    - Run tasks AND store data (HDFS)
    - Runs: YARN NodeManager, HDFS DataNode
    - Losing core nodes = losing HDFS data

  Task Nodes (optional)
    - Run tasks ONLY (no HDFS storage)
    - Can use Spot Instances (interruptible, cheap)
    - Add/remove for scaling compute without affecting storage

  Storage:
    - HDFS (on cluster, temporary, lost on termination)
    - EMRFS (S3-backed, persistent, recommended)
    - Local file system (instance store)
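
The three node roles above map directly onto EMR instance groups. A sketch in the shape boto3's `run_job_flow` expects under `Instances.InstanceGroups` (instance types and counts are illustrative):

```python
def emr_instance_groups(core_count: int = 2, task_count: int = 4) -> list:
    """Primary and core nodes on On-Demand (they hold cluster state and HDFS);
    task nodes on Spot (compute only, so an interruption loses no data)."""
    return [
        {"Name": "primary", "InstanceRole": "MASTER",
         "InstanceType": "m5.xlarge", "InstanceCount": 1, "Market": "ON_DEMAND"},
        {"Name": "core", "InstanceRole": "CORE",
         "InstanceType": "m5.xlarge", "InstanceCount": core_count, "Market": "ON_DEMAND"},
        # Task nodes run no HDFS DataNode, so Spot interruptions are safe
        {"Name": "task", "InstanceRole": "TASK",
         "InstanceType": "m5.xlarge", "InstanceCount": task_count, "Market": "SPOT"},
    ]
```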

HDFS vs EMRFS (S3)

  HDFS:  on-cluster distributed storage (core nodes); fast local I/O; temporary, data is lost when the cluster terminates
  EMRFS: HDFS-compatible access to S3 (s3:// URIs); persistent beyond any cluster's lifecycle; recommended for the data lake pattern

EMRFS Best Practice
Store data in S3 (EMRFS), not HDFS. This decouples storage from compute: you can terminate clusters when not in use and spin up new clusters when needed. Data persists in S3 regardless of cluster lifecycle. This is the modern data lake pattern and dramatically reduces costs.
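
With EMRFS, jobs simply read and write `s3://` URIs in place of HDFS paths. A minimal sketch of an EMR step following this pattern (the bucket names and script path are hypothetical):

```python
def emrfs_spark_step(input_uri: str, output_uri: str) -> dict:
    """An EMR step whose Spark job reads raw data from S3 and writes results
    back to S3 via EMRFS -- nothing lives in HDFS, so the cluster is disposable."""
    return {
        "Name": "etl-s3-to-s3",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                "s3://my-bucket/jobs/etl.py",    # job script itself stored in S3
                "--input", input_uri,            # e.g. s3://my-data-lake/raw/
                "--output", output_uri,          # e.g. s3://my-data-lake/processed/
            ],
        },
    }
```

A list of such steps can be passed to `run_job_flow` (at launch) or `add_job_flow_steps` (on a running cluster).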

4. EMR Frameworks

5. EMR Cost Optimization

  1. Spot Instances for Task Nodes (up to 90% savings, interruptible OK for compute-only)
  2. Reserved Instances for Primary and Core Nodes (long-running clusters)
  3. Transient clusters: spin up for a job, terminate when done (data in S3)
  4. EMR Serverless: no idle costs, pay per job
  5. Auto-scaling: scale Core and Task nodes based on YARN metrics
  6. S3 storage (EMRFS) instead of HDFS: cheaper, persistent, decouples storage from compute
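
Item 5 can be expressed through EMR managed scaling. A sketch in the shape boto3's `put_managed_scaling_policy` expects (the capacity limits are illustrative): capacity above the On-Demand cap is provisioned as Spot, and capacity above the core cap is provisioned as task nodes, combining items 1 and 5 in one policy.

```python
def managed_scaling_policy(min_units: int, max_units: int,
                           max_on_demand: int, max_core: int) -> dict:
    """ManagedScalingPolicy for emr.put_managed_scaling_policy():
    caps total cluster size, the On-Demand share, and the core-node share.
    Scaling beyond max_on_demand uses Spot; beyond max_core uses task nodes."""
    return {
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
            "MaximumOnDemandCapacityUnits": max_on_demand,
            "MaximumCoreCapacityUnits": max_core,
        }
    }
```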

6. Glue ETL vs EMR

  Glue ETL: serverless Spark; no cluster to manage; best for simple-to-medium ETL jobs
  EMR:      full cluster control; broad framework choice (Spark, Hive, Presto, HBase, Flink); Spot/Reserved pricing; best for complex, very large-scale big data

7. Complete Analytics Pipeline

AWS Analytics Pipeline:

Ingest:    Kinesis / Firehose / DMS / DataSync / S3 Upload
Store:     S3 Data Lake (raw zone → processed zone → curated zone)
Catalog:   Glue Data Catalog (crawlers auto-discover schemas)
Transform: Glue ETL (simple/medium) or EMR (complex/big data)
Govern:    Lake Formation (fine-grained access, cross-account sharing)
Query:     Athena (serverless SQL) or Redshift (warehouse)
Visualize: QuickSight (dashboards) or Grafana
ML:        SageMaker (train models on data lake data)
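
At the query stage, the catalog and data lake come together: Athena runs SQL against Glue Data Catalog tables whose data sits in S3. A sketch in the shape boto3's Athena `start_query_execution` expects (the database and bucket names are hypothetical):

```python
def athena_query_request(sql: str, database: str, results_uri: str) -> dict:
    """Request for athena.start_query_execution(): serverless SQL over the
    S3 data lake, using tables registered in the Glue Data Catalog."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},         # Glue Catalog database
        "ResultConfiguration": {"OutputLocation": results_uri},  # s3:// path for results
    }
```
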

Exam Tip
  "Managed Hadoop/Spark cluster" or "custom big data frameworks" = EMR
  "Spot for task nodes" = EMR cost optimization
  "Store data in S3, not HDFS" = EMRFS best practice
  "Serverless Spark" = EMR Serverless or Glue ETL
  "Simple ETL" = Glue; "complex big data" = EMR
  "Interactive SQL on data lake" = Athena (serverless) or Presto on EMR