Amazon EMR
Managed Spark, Hadoop, and big data processing
EMR / EMR Serverless (Spark/Hadoop scale-out)
Mental model
- EMR = you run big data engines (Spark/Flink/Presto/Hadoop) on managed clusters (EC2).
- EMR Serverless = “run Spark jobs without managing clusters” (still Spark semantics).
What it’s used for in ML/GenAI
- Large-scale batch: embeddings backfills, dedupe, joins, feature generation.
- Transforming massive telemetry/clickstreams into curated datasets.
- Heavy compute workloads that outgrow Glue’s ergonomics or need custom environments.
Knobs that matter (EMR on EC2)
- Instance fleets / groups: mix on-demand + spot for cost.
- Autoscaling: core/task node scaling policies.
- EBS + shuffle: storage and IO tuning can dominate performance.
- Bootstrap actions: install native libs and Python deps; prefer baking a reproducible AMI once the environment is stable.
- Managed scaling: reduces manual tuning; still validate for bursty loads.
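As an illustration of the "configurations JSON" knob mentioned above, here is a minimal Python sketch that builds the EMR `Configurations` structure for the `spark-defaults` classification. The specific property values are placeholder assumptions, not tuning recommendations.

```python
import json


def spark_tuning_configurations(shuffle_partitions: int, executor_memory: str) -> list:
    """Build the EMR `Configurations` list for the spark-defaults
    classification. Values are illustrative placeholders -- tune
    against your own workload, not these numbers."""
    return [
        {
            "Classification": "spark-defaults",
            "Properties": {
                "spark.sql.shuffle.partitions": str(shuffle_partitions),
                "spark.executor.memory": executor_memory,
            },
        }
    ]


# Serialize for use in a cluster definition (e.g. Terraform's
# configurations_json argument, or the Configurations field of RunJobFlow).
configs = spark_tuning_configurations(shuffle_partitions=400, executor_memory="8g")
print(json.dumps(configs, indent=2))
```

The same JSON shape also drives shuffle/IO tuning (e.g. `spark.sql.files.maxPartitionBytes`), which ties back to the EBS + shuffle point above.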
Knobs that matter (EMR Serverless)
- Initial/min/max capacity: controls cost guardrails.
- Job concurrency: cap concurrent job runs so a burst of submissions doesn't stampede capacity (and cost).
- Runtime environment: keep dependencies pinned and reproducible.
Pricing mental model
- EMR on EC2: (EC2 + EBS + network) + an EMR surcharge; biggest lever is Spot + right-sizing.
- EMR Serverless: cost tracks resources used × time (good for spiky batch), but set min/max capacity to prevent runaway spend.
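The pricing model above reduces to back-of-envelope arithmetic. A minimal sketch; every rate below is a placeholder assumption (real prices vary by region, instance type, and Spot market), so the point is the shape of the comparison, not the numbers.

```python
def emr_on_ec2_hourly(ec2_rate: float, emr_surcharge_rate: float,
                      ebs_rate: float, nodes: int) -> float:
    """Hourly cluster cost: (EC2 + EBS + EMR surcharge) x node count.
    Network cost omitted for simplicity."""
    return nodes * (ec2_rate + ebs_rate + emr_surcharge_rate)


def emr_serverless_job(vcpu_hours: float, gb_hours: float,
                       vcpu_rate: float, gb_rate: float) -> float:
    """EMR Serverless job cost: resources actually consumed x time."""
    return vcpu_hours * vcpu_rate + gb_hours * gb_rate


# Placeholder rates (NOT real prices) for a 3-node on-demand cluster.
cluster_hour = emr_on_ec2_hourly(ec2_rate=0.40, emr_surcharge_rate=0.10,
                                 ebs_rate=0.01, nodes=3)
# A spiky job consuming 32 vCPU-hours and 128 GB-hours in total.
job_cost = emr_serverless_job(vcpu_hours=32, gb_hours=128,
                              vcpu_rate=0.05, gb_rate=0.006)
```

The crossover point is the utilization of the cluster: a cluster that sits mostly idle loses to per-job Serverless billing, while a busy steady pipeline (especially on Spot) wins on the cluster side.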
Heuristics: Glue vs EMR vs EMR Serverless
- Glue: easiest managed Spark for common ETL; great default for lake transforms.
- EMR Serverless: Spark with fewer ops when workloads are bursty and you want quicker “submit job and go.”
- EMR on EC2: when you need maximum control, custom tuning, huge steady pipelines, or best cost at scale via Spot + optimized clusters.
Terraform template (EMR cluster skeleton)
resource "aws_emr_cluster" "spark" {
name = "${var.name}-emr"
release_label = var.release_label
applications = ["Spark"]
service_role = var.emr_service_role_arn
ec2_attributes {
instance_profile = var.emr_instance_profile_arn
subnet_id = var.subnet_id
}
master_instance_group {
instance_type = var.master_type
instance_count = 1
}
core_instance_group {
instance_type = var.core_type
instance_count = var.core_count
}
# Optional: scale-out task nodes, autoscaling policies, configurations JSON, bootstrap actions, log_uri
tags = var.tags
}
variable "name" { type = string }
variable "release_label" { type = string default = "emr-7.0.0" }
variable "emr_service_role_arn" { type = string }
variable "emr_instance_profile_arn"{ type = string }
variable "subnet_id" { type = string }
variable "master_type" { type = string default = "m6i.xlarge" }
variable "core_type" { type = string default = "m6i.2xlarge" }
variable "core_count" { type = number default = 2 }
variable "tags" { type = map(string) default = {} }
Terraform template (EMR Serverless application skeleton)
resource "aws_emrserverless_application" "app" {
name = "${var.name}-emr-sls"
release_label = var.release_label
type = "SPARK"
initial_capacity {
initial_capacity_type = "Driver"
initial_capacity_config {
worker_count = 1
worker_configuration {
cpu = "2 vCPU"
memory = "4 GB"
}
}
}
initial_capacity {
initial_capacity_type = "Executor"
initial_capacity_config {
worker_count = 2
worker_configuration {
cpu = "4 vCPU"
memory = "8 GB"
}
}
}
maximum_capacity {
cpu = "64 vCPU"
memory = "128 GB"
}
tags = var.tags
}
# Note: Job runs are typically submitted via CI/CD or Step Functions, not as a persistent Terraform resource.
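As a sketch of that submission path, the snippet below assembles a `start_job_run` request for the EMR Serverless API via boto3, using the `sparkSubmit` job driver. The application ID, role ARN, and script path are hypothetical placeholders.

```python
def build_spark_job_run(application_id: str, role_arn: str, entry_point: str) -> dict:
    """Assemble kwargs for the emr-serverless StartJobRun API.
    Submit with: boto3.client("emr-serverless").start_job_run(**kwargs)."""
    return {
        "applicationId": application_id,
        "executionRoleArn": role_arn,
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": entry_point,
                # Cap executors so one job can't stampede the app's max capacity.
                "sparkSubmitParameters": "--conf spark.dynamicAllocation.maxExecutors=16",
            }
        },
    }


# Hypothetical identifiers -- substitute your own.
kwargs = build_spark_job_run(
    application_id="00f1example",
    role_arn="arn:aws:iam::123456789012:role/emr-serverless-job-role",
    entry_point="s3://my-bucket/jobs/backfill_embeddings.py",
)
# In a CI/CD step or a Step Functions task:
#   import boto3
#   boto3.client("emr-serverless").start_job_run(**kwargs)
```

Keeping submission out of Terraform lets the application (long-lived infra) and the job runs (ephemeral, per-pipeline) evolve independently.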
variable "name" { type = string }
variable "release_label" { type = string default = "emr-7.0.0" }
variable "tags" { type = map(string) default = {} }