1. What is CloudWatch?
Amazon CloudWatch is a monitoring and observability service for AWS resources and applications. It collects metrics, logs, and events, providing a unified view of operational health.
Core Concept
CloudWatch is the central monitoring hub for AWS. It automatically collects metrics from 70+ AWS services, such as CPU utilization, disk and network I/O, request counts, errors, and latency. Memory and disk-space utilization on EC2 are not collected automatically; they require the CloudWatch Agent. You can set alarms to trigger actions when thresholds are breached, create dashboards for visualization, and publish custom metrics from your own applications.
2. CloudWatch Metrics
What is a Metric?
- A metric is a time-ordered set of data points representing a variable over time
- Metrics belong to a namespace (e.g., AWS/EC2, AWS/RDS, AWS/Lambda)
- Identified by: namespace + metric name + dimensions
- Dimensions: key-value pairs that categorize metrics (e.g., InstanceId=i-abc123)
- Retention: 1-sec data kept 3 hours, 1-min data kept 15 days, 5-min data kept 63 days, 1-hour data kept 455 days (15 months)
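To make the identity and retention rules above concrete, here is a minimal sketch in plain Python. The helper names `metric_key` and `finest_period_seconds` are hypothetical and are not part of any AWS SDK; the tiers simply encode the retention list above.

```python
from datetime import timedelta

# Retention tiers from the list above: (maximum data age, finest period in seconds).
# The 1-second tier applies only to high-resolution metrics.
RETENTION_TIERS = [
    (timedelta(hours=3), 1),
    (timedelta(days=15), 60),
    (timedelta(days=63), 300),
    (timedelta(days=455), 3600),
]


def finest_period_seconds(age):
    """Return the finest period (in seconds) still retained for data of this age.

    Returns None once the data is older than the longest retention tier.
    """
    for max_age, period in RETENTION_TIERS:
        if age <= max_age:
            return period
    return None


def metric_key(namespace, name, dimensions):
    """Build a hashable identity: namespace + metric name + dimensions."""
    return (namespace, name, tuple(sorted(dimensions.items())))


# Two metrics with the same name but different dimensions are distinct metrics.
a = metric_key("AWS/EC2", "CPUUtilization", {"InstanceId": "i-abc123"})
b = metric_key("AWS/EC2", "CPUUtilization", {"InstanceId": "i-def456"})
```

In this sketch, data that is 30 days old can only be read back at 5-minute granularity or coarser, which is why long-range graphs look smoother than recent ones.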
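Default vs Detailed Monitoring
- Default (basic) monitoring: EC2 metrics at 5-minute intervals, no additional charge
- Detailed monitoring: EC2 metrics at 1-minute intervals, additional charge, enabled per instance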
Common EC2 Metrics (Built-in)
Important Warning
EC2 does NOT natively report memory utilization, disk space, or swap usage to CloudWatch. These require the CloudWatch Agent installed on the instance. This is one of the most tested CloudWatch facts on the exam.
CloudWatch Agent
- Software installed on EC2 (or on-premises servers) to collect additional metrics
- Collects: memory utilization, disk space, swap, per-process metrics, custom app metrics
- Also collects logs (sends to CloudWatch Logs)
- Configuration via JSON file (SSM Parameter Store recommended for config management)
- Uses an IAM role for permissions (e.g., a role named CloudWatchAgentServerRole with the CloudWatchAgentServerPolicy managed policy attached)
- Works on Linux and Windows
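As an illustration of the JSON configuration mentioned above, the following sketch builds a small agent config in Python and serializes it. The function name, log file path, and log group name are placeholders chosen for this example; the measurement names (`mem_used_percent`, `used_percent`, `swap_used_percent`) are the Linux ones, and a real deployment would likely collect more.

```python
import json


def build_agent_config(namespace="CWAgent", interval_seconds=60,
                       log_path="/var/log/myapp/app.log",  # placeholder path
                       log_group="myapp-logs"):            # placeholder log group
    """Return a minimal CloudWatch Agent config as a Python dict (Linux measurements)."""
    return {
        "agent": {"metrics_collection_interval": interval_seconds},
        "metrics": {
            "namespace": namespace,
            "metrics_collected": {
                # The metrics EC2 does not report natively:
                "mem": {"measurement": ["mem_used_percent"]},
                "disk": {"measurement": ["used_percent"], "resources": ["/"]},
                "swap": {"measurement": ["swap_used_percent"]},
            },
        },
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [
                        {
                            "file_path": log_path,
                            "log_group_name": log_group,
                            "log_stream_name": "{instance_id}",
                        }
                    ]
                }
            }
        },
    }


# Serialize to the JSON document the agent reads (or that is stored in SSM Parameter Store).
config_json = json.dumps(build_agent_config(), indent=2)
```

Keeping the config as generated JSON makes it easy to store one copy in SSM Parameter Store and reuse it across a fleet.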
Custom Metrics
- Publish your own metrics via the PutMetricData API
- Example: application-level metrics like active_users, orders_per_minute, queue_depth
- Resolution: Standard (1-minute) or High Resolution (1-second, higher cost)
- StorageResolution parameter: 1 (high-res) or 60 (standard)
- Use for: business metrics, application KPIs, anything not built-in
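A minimal sketch of a PutMetricData request, built as a plain dict so it runs without AWS credentials. The helper name `build_put_metric_data`, the namespace `MyApp/Orders`, and the `Environment` dimension are assumptions made for illustration.

```python
def build_put_metric_data(namespace, name, value, unit="Count",
                          dimensions=None, high_resolution=False):
    """Build keyword arguments for a PutMetricData call with one data point."""
    if namespace.startswith("AWS/"):
        # Namespaces beginning with "AWS/" are reserved for AWS services.
        raise ValueError("custom namespaces must not start with 'AWS/'")
    datum = {
        "MetricName": name,
        "Value": float(value),
        "Unit": unit,
        # 1 = high resolution (1-second), 60 = standard (1-minute)
        "StorageResolution": 1 if high_resolution else 60,
    }
    if dimensions:
        datum["Dimensions"] = [{"Name": k, "Value": v} for k, v in dimensions.items()]
    return {"Namespace": namespace, "MetricData": [datum]}


request = build_put_metric_data(
    "MyApp/Orders", "orders_per_minute", 42, dimensions={"Environment": "prod"}
)
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("cloudwatch").put_metric_data(**request)
```

Passing `high_resolution=True` sets StorageResolution to 1, which is what makes the metric eligible for 1-second granularity.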
3. CloudWatch Alarms
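Alarm States
- OK: the metric is within the defined threshold
- ALARM: the metric has breached the defined threshold
- INSUFFICIENT_DATA: the alarm has just started, or not enough data is available to determine the state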
Alarm Actions
Composite Alarms
- Combine multiple alarms using AND/OR/NOT logic
- Reduces alarm noise: only trigger when multiple conditions are true simultaneously
- Example: ALARM only when CPU > 80% AND Memory > 90% (composite: both alarms in ALARM state)
- Prevents false positives from single-metric spikes
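The CPU-and-memory example above could be expressed as a composite alarm rule roughly like this. The alarm names `cpu-high` and `mem-high` and the helper functions are hypothetical; the sketch only builds the request and does not call AWS.

```python
def alarm_rule(operator, *alarm_names):
    """Join child alarms into a rule such as ALARM("a") AND ALARM("b")."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    if not alarm_names:
        raise ValueError("at least one child alarm name is required")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


def build_composite_alarm(name, rule):
    """Build keyword arguments for a PutCompositeAlarm call."""
    return {"AlarmName": name, "AlarmRule": rule}


# Fires only when both child alarms are in the ALARM state at the same time.
request = build_composite_alarm(
    "cpu-and-memory-high", alarm_rule("AND", "cpu-high", "mem-high")
)
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("cloudwatch").put_composite_alarm(**request)
```

The child alarms (`cpu-high`, `mem-high`) must already exist as ordinary metric alarms; the composite alarm only references them by name.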
EC2 Instance Recovery
- Special alarm action: automatically recover a failed EC2 instance
- Trigger: StatusCheckFailed_System alarm
- Recovery: instance is migrated to new hardware and keeps the same instance ID, private IP, Elastic IP, metadata, and placement group
- Requirements: EBS-backed instance (not Instance Store), same instance type
- Alternative: Auto Scaling Group with min=1 (replaces instead of recovers)
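A recovery alarm along the lines described above might be defined as follows. The alarm name, period, and evaluation count are illustrative choices rather than required values, and the helper `build_recovery_alarm` is hypothetical; the recover action is referenced through the `arn:aws:automate:<region>:ec2:recover` ARN.

```python
def build_recovery_alarm(instance_id, region):
    """Build keyword arguments for a PutMetricAlarm call that recovers the instance."""
    return {
        "AlarmName": f"recover-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,              # evaluate the metric every minute
        "EvaluationPeriods": 2,    # require two consecutive failing periods
        "Threshold": 1.0,          # the metric is 1 when the system check fails
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


request = build_recovery_alarm("i-abc123", "us-east-1")
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**request)
```

Requiring two consecutive failing periods is a hedge against acting on a single transient status-check failure.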
4. CloudWatch Dashboards
- Custom visualizations of metrics from multiple services and Regions
- Global: dashboards can display metrics from any Region
- Auto-refresh: configurable interval (10s, 1m, 5m, etc.)
- Widgets: line charts, stacked area, number, text, log query results
- Shareable: share dashboard URL (anonymous access or authenticated)
- Cost: $3/dashboard/month (first 3 free)
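The sketch below shows one way a dashboard body with widgets from two Regions could be assembled, plus the cost arithmetic from the last bullet. The widget titles, instance IDs, and helper names are placeholders for illustration.

```python
import json


def metric_widget(title, region, metric, x=0, y=0, width=12, height=6,
                  period=300, stat="Average"):
    """Build one line-chart widget; `metric` is [namespace, name, dim_name, dim_value]."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": width, "height": height,
        "properties": {
            "title": title,
            "region": region,      # each widget can point at a different Region
            "metrics": [metric],
            "period": period,
            "stat": stat,
            "view": "timeSeries",
        },
    }


def build_dashboard_body(widgets):
    """Serialize widgets into the JSON string expected as the dashboard body."""
    return json.dumps({"widgets": widgets})


def monthly_dashboard_cost(dashboard_count, price_per_dashboard=3, free_dashboards=3):
    """Cost in USD using the pricing stated above: $3 per dashboard, first 3 free."""
    return max(0, dashboard_count - free_dashboards) * price_per_dashboard


body = build_dashboard_body([
    metric_widget("Web CPU (us-east-1)", "us-east-1",
                  ["AWS/EC2", "CPUUtilization", "InstanceId", "i-abc123"]),
    metric_widget("Web CPU (eu-west-1)", "eu-west-1",
                  ["AWS/EC2", "CPUUtilization", "InstanceId", "i-def456"], x=12),
])
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("cloudwatch").put_dashboard(DashboardName="web-cpu", DashboardBody=body)
```

Under the stated pricing, five dashboards would cost `monthly_dashboard_cost(5)` = $6 per month.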
Exam Tip
CloudWatch Metrics: "Memory/disk utilization on EC2" = CloudWatch Agent (NOT built-in). "1-minute monitoring" = Detailed Monitoring. "1-second metrics" = High Resolution custom metrics. "Recover failed EC2" = StatusCheckFailed_System alarm + EC2 Recovery action. "Reduce alarm noise" = Composite Alarms. "Custom business metrics" = PutMetricData API. Metric retention: 15 months (455 days) for 1-hour data.