1 What is a Data Lake?
A data lake is a centralized repository that stores structured and unstructured data at any scale. On AWS, S3 is the foundation of the data lake: you store raw data in S3 and use other services to process, catalog, and analyze it.
Core Concept
Lake Formation = build a secure data lake in days instead of months. It automates the complex steps: data ingestion, cataloging, security (fine-grained access), data transformation, and deduplication. It sits on top of S3 + Glue Data Catalog and adds centralized security and governance.
2 Key Features
- Data Ingestion: Import data from S3, RDS, on-premises databases via blueprints (pre-built workflows).
- Centralized Security: Fine-grained access control at the database, table, column, row, and cell level.
- Data Catalog: Uses AWS Glue Data Catalog as the central metadata store for tables, schemas, and partitions.
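- Governed Tables: ACID transactions on S3 data lake tables, with automatic compaction and deduplication. Note that AWS ended support for governed tables on December 31, 2024; open table formats such as Apache Iceberg are the usual replacement.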
- Data Sharing: Share data across accounts and organizations securely without copying.
- Blueprints: Pre-built workflows to ingest data from: databases (JDBC), CloudTrail logs, and other S3 locations.
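To make the row- and cell-level control in the list above concrete, here is a minimal sketch that builds the request for the Lake Formation `CreateDataCellsFilter` API (exposed in boto3 as `create_data_cells_filter`). The helper name `build_cell_filter`, the account ID, and the database, table, and column names are all hypothetical. The sketch only assembles the request dictionary; it does not call AWS.

```python
def build_cell_filter(catalog_id, database, table, filter_name,
                      row_expression, columns):
    """Build keyword arguments for a CreateDataCellsFilter request.

    A row filter combined with a column list restricts access to
    specific cells: only matching rows, and only the named columns.
    """
    if not columns:
        raise ValueError("list at least one column")
    return {
        "TableData": {
            "TableCatalogId": catalog_id,   # AWS account that owns the catalog
            "DatabaseName": database,
            "TableName": table,
            "Name": filter_name,
            "RowFilter": {"FilterExpression": row_expression},
            "ColumnNames": list(columns),
        }
    }


# Hypothetical example: US orders only, with PII columns left out.
request = build_cell_filter(
    catalog_id="111122223333",
    database="sales_db",
    table="orders",
    filter_name="us_orders_no_pii",
    row_expression="region = 'US'",
    columns=["order_id", "region", "amount"],
)

# With AWS credentials configured, this could be sent as:
#   boto3.client("lakeformation").create_data_cells_filter(**request)
```

The filter by itself grants nothing; a principal would still need SELECT granted on the filter before it takes effect.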
3 Lake Formation Security
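Lake Formation provides centralized permission management for data lake resources, in place of a hand-maintained combination of S3 bucket policies, IAM policies, and Glue Data Catalog resource policies. IAM still controls who may call the Lake Formation and Glue APIs.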
Lake Formation vs IAM/S3 Policies
Without Lake Formation, you manage access through S3 bucket policies, IAM policies, and Glue catalog policies separately, which is complex and error-prone. With Lake Formation, permissions are managed centrally with GRANT/REVOKE on databases, tables, and columns. For S3 locations registered with Lake Formation, the service vends temporary credentials to integrated services, so consumers do not need direct S3 bucket permissions. This makes data lake governance much simpler.
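As an illustration of a centrally managed GRANT, the sketch below builds the arguments for the Lake Formation `GrantPermissions` API (boto3: `grant_permissions`) using a `TableWithColumns` resource for column-level access. The helper name `build_column_grant`, the role ARN, and the database, table, and column names are hypothetical, and nothing is sent to AWS.

```python
def build_column_grant(principal_arn, database, table, columns,
                       permissions=("SELECT",)):
    """Build keyword arguments for a column-level GrantPermissions request."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),  # only these columns are exposed
            }
        },
        "Permissions": list(permissions),
    }


# Hypothetical example: analysts may read two non-sensitive columns.
grant = build_column_grant(
    principal_arn="arn:aws:iam::111122223333:role/AnalystRole",
    database="sales_db",
    table="orders",
    columns=["order_id", "amount"],
)

# With AWS credentials configured, this could be sent as:
#   boto3.client("lakeformation").grant_permissions(**grant)
# The same dictionary passed to revoke_permissions would undo the grant.
```

One grant like this takes the place of separate edits to a bucket policy, an IAM policy, and a catalog policy for the same table.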
4 Lake Formation Architecture
Data Lake Architecture with Lake Formation:
Data Sources:
RDS, S3, On-premises DBs, CloudTrail
|
Lake Formation Blueprints (ingest)
|
S3 Data Lake (raw + processed data)
|
Glue Data Catalog (metadata, schemas)
|
Lake Formation Permissions (column/row/cell/tag-based)
|
Consumers:
Athena (SQL queries)
Redshift Spectrum (warehouse queries)
EMR (Spark/Hadoop)
QuickSight (dashboards)
Exam Tip
Keyword mapping for Lake Formation:
- "Centralized data lake security" = Lake Formation.
- "Column-level access control on S3 data" = Lake Formation.
- "Share data across accounts without copying" = LF cross-account sharing.
- "Fine-grained permissions for data lake" = Lake Formation (not just S3 policies).
- "Tag-based data governance" = LF-TBAC.
Lake Formation uses Glue Data Catalog underneath.
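To show how LF-TBAC (tag-based access control) decides access, here is a simplified, self-contained model of its matching rule: a grant on an LF-tag expression covers a resource when, for every tag key in the expression, the resource carries one of the listed values. The function, tag keys, and table names are hypothetical, and the sketch leaves out tag inheritance from database to table to column.

```python
def tag_expression_matches(expression, resource_tags):
    """Return True if resource_tags satisfy every key in the expression.

    expression:    dict of tag key -> set of allowed values
                   (AND across keys, OR across values)
    resource_tags: dict of tag key -> the single value on the resource
    """
    return all(
        resource_tags.get(key) in allowed_values
        for key, allowed_values in expression.items()
    )


# Hypothetical grant: sales data that is public or internal.
expression = {
    "classification": {"public", "internal"},
    "domain": {"sales"},
}

# Hypothetical tables and their LF-tags.
tables = {
    "orders": {"classification": "internal", "domain": "sales"},
    "salaries": {"classification": "restricted", "domain": "hr"},
    "leads": {"classification": "public", "domain": "marketing"},
}

visible = sorted(
    name for name, tags in tables.items()
    if tag_expression_matches(expression, tags)
)
print(visible)  # ['orders']
```

Because the grant targets tags rather than named tables, a new table tagged `domain=sales` and `classification=internal` would be covered without issuing another grant.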