AWS Glue
Serverless ETL for data transformation and catalog management
Glue (ETL + Data Catalog)
Mental model
- Glue Data Catalog = your lake’s “metastore” (databases/tables/partitions/schemas); the boto3 peek after this list shows what one table entry holds.
- Glue ETL = managed Spark for batch transformations (S3 → S3) + joins + dedupe + backfills.
- Glue is the “data plane for the lake” when you don’t want to run Spark yourself.
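To make the “metastore” framing concrete, here is a minimal boto3 peek at a Catalog table entry; the database/table names are placeholders, not anything this page creates.

import boto3

glue = boto3.client("glue")

# Hypothetical names; substitute your own database/table.
table = glue.get_table(DatabaseName="analytics", Name="agent_traces")["Table"]

# The Catalog stores the Hive-style metadata that Athena/Spark resolve at query time.
print(table["StorageDescriptor"]["Location"])                        # s3://... data location
print([c["Name"] for c in table["StorageDescriptor"]["Columns"]])    # schema
print([k["Name"] for k in table.get("PartitionKeys", [])])           # partition keys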
What it’s used for in ML/GenAI
- Build curated datasets (bronze→silver→gold) for training/evals.
- Create “analytics-ready” Parquet/Iceberg datasets from raw logs (e.g., agent traces); see the PySpark sketch after this list.
- Maintain partitioned tables for Athena queries.
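A minimal sketch of the raw-logs-to-Parquet step using plain PySpark inside a Glue Spark job; the bucket, prefixes, and column names are illustrative assumptions.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical prefixes; point these at your own raw/curated locations.
raw = spark.read.json("s3://my-lake/raw/agent-traces/")

curated = (
    raw
    .dropDuplicates(["trace_id"])              # dedupe on a stable key
    .withColumn("dt", F.to_date("timestamp"))  # derive a partition column
)

# Partitioned Parquet is what keeps Athena scans (and cost) small downstream.
curated.write.mode("overwrite").partitionBy("dt").parquet(
    "s3://my-lake/silver/agent-traces/"
)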
Knobs that matter (senior knobs)
- Job type: Spark ETL vs Python shell (keep Python shell for lightweight tasks).
- Workers/DPUs: cost + speed; tune for shuffle-heavy workloads.
- Bookmarks: incremental processing; avoids reprocessing already-seen data, but validate correctness yourself (see the sketch after this list).
- Timeout + retries: set sane limits; orchestrate retries and DLQ-style failure handling via Step Functions.
- Connections / VPC: only if you must reach private data sources; add VPC endpoints (S3, Glue, CloudWatch Logs) so traffic doesn’t bleed through a NAT gateway and rack up charges.
- Output format: Parquet + partitioning = biggest downstream cost lever.
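Bookmarks only track progress when reads/writes carry a transformation_ctx and the script calls job.init/job.commit; a minimal sketch, with database/table names as assumptions. The job also needs "--job-bookmark-option" = "job-bookmark-enable" in its default arguments.

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_ctx = GlueContext(SparkContext())
job = Job(glue_ctx)
job.init(args["JOB_NAME"], args)  # bookmark state is loaded here

# transformation_ctx is the key Glue uses to remember what was already processed.
src = glue_ctx.create_dynamic_frame.from_catalog(
    database="analytics", table_name="raw_agent_traces", transformation_ctx="src"
)

glue_ctx.write_dynamic_frame.from_options(
    frame=src,
    connection_type="s3",
    connection_options={"path": "s3://my-lake/silver/agent-traces/"},
    format="parquet",
    transformation_ctx="sink",
)

job.commit()  # the bookmark only advances after a successful commit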
Pricing mental model
- Glue ETL cost is basically DPU-hours: (number of workers × DPUs per worker) × runtime hours, billed per second with a 1-minute minimum on recent Glue versions (worked example after this list).
- Biggest cost drivers: backfills, wide joins, shuffle, and small file output (which explodes downstream query costs).
- Rule of thumb: optimize by (1) reducing shuffle, (2) increasing file sizes, (3) partition smartly.
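A back-of-the-envelope version of that formula; the rate below assumes the common us-east-1 Spark ETL price of about $0.44 per DPU-hour, so check your region.

# G.1X workers are 1 DPU each, G.2X are 2 DPUs each.
workers, dpu_per_worker = 10, 1     # e.g. 10 x G.1X
runtime_hours = 2.0
rate_per_dpu_hour = 0.44            # assumption: us-east-1 Spark ETL rate

cost = workers * dpu_per_worker * runtime_hours * rate_per_dpu_hour
print(f"${cost:.2f} per run")       # ≈ $8.80; a daily backfill at this size adds up fast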
Heuristics
- If the job is simple and small, do it in Lambda/ECS.
- If it’s Spark-worthy (joins/large transforms/backfills), use Glue or EMR.
- If you need fine control, custom libs, or long-running Spark apps → EMR.
Terraform template (Catalog DB + Crawler + Glue Job)
resource "aws_glue_catalog_database" "db" {
name = var.db_name
}
resource "aws_iam_role" "glue" {
name = "${var.name}-glue-role"
assume_role_policy = data.aws_iam_policy_document.glue_assume.json
}
data "aws_iam_policy_document" "glue_assume" {
statement {
effect = "Allow"
principals { type = "Service", identifiers = ["glue.amazonaws.com"] }
actions = ["sts:AssumeRole"]
}
}
resource "aws_iam_role_policy" "glue_policy" {
role = aws_iam_role.glue.id
policy = jsonencode({
Version = "2012-10-17",
Statement = [
{ Effect="Allow", Action=["s3:GetObject","s3:PutObject","s3:ListBucket"], Resource=[var.s3_bucket_arn, "${var.s3_bucket_arn}/*"] },
{ Effect="Allow", Action=["logs:*"], Resource="*" }
]
})
}
resource "aws_glue_crawler" "crawler" {
name = "${var.name}-crawler"
role = aws_iam_role.glue.arn
database_name = aws_glue_catalog_database.db.name
s3_target { path = "s3://${var.s3_bucket_name}/${var.raw_prefix}/" }
# Optional: schedule, schema change policy, table prefix, etc.
}
resource "aws_glue_job" "etl" {
name = "${var.name}-etl"
role_arn = aws_iam_role.glue.arn
command {
name = "glueetl"
script_location = "s3://${var.s3_bucket_name}/${var.scripts_prefix}/job.py"
python_version = "3"
}
glue_version = "4.0"
number_of_workers = 5
worker_type = "G.1X"
timeout = 60
default_arguments = {
"--job-language" = "python"
"--enable-continuous-cloudwatch-log" = "true"
"--enable-metrics" = "true"
}
}
variable "name" { type = string }
variable "db_name" { type = string }
variable "s3_bucket_name" { type = string }
variable "s3_bucket_arn" { type = string }
variable "raw_prefix" { type = string default = "raw" }
variable "scripts_prefix" { type = string default = "glue-scripts" }