Amazon Athena
Serverless SQL queries on S3 data lakes
Mental model
- “Serverless Presto/Trino-style SQL on files in S3.”
- You pay primarily for data scanned, so format + partitioning is everything.
What it’s used for in ML/GenAI
- Ad-hoc analytics on agent telemetry, prompt/response logs, eval results.
- Build datasets for offline analysis and labeling queues.
- Quick slicing: latency by model, tool failure rates, safety outcomes, cohort analysis.
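A minimal sketch of such a slicing query, assuming a partitioned telemetry table named `agent_telemetry` with hypothetical columns `model_id`, `latency_ms`, `tool_status`, and a `dt` partition key:

```sql
-- p95 latency and tool failure rate per model over the last 7 days.
-- The dt predicate prunes partitions so only a week of data is scanned.
SELECT
  model_id,
  approx_percentile(latency_ms, 0.95) AS p95_latency_ms,
  avg(CASE WHEN tool_status = 'error' THEN 1.0 ELSE 0.0 END) AS tool_failure_rate
FROM agent_telemetry
WHERE dt >= cast(date_add('day', -7, current_date) AS varchar)
GROUP BY model_id
ORDER BY p95_latency_ms DESC;
```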
Knobs that matter
- File format: Parquet/ORC beats JSON/CSV by a mile.
- Partitioning: `dt=YYYY-MM-DD/` (and sometimes `tenant_id=`) to prune scans.
- Workgroups: enforce limits, isolate teams, separate prod vs dev.
- Result location: dedicated S3 prefix; enable encryption.
- CTAS / INSERT INTO: create optimized tables from raw data.
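These knobs combine naturally in a single CTAS. A sketch, assuming a raw JSON table `agent_telemetry_raw_json` and a placeholder output location (note Athena caps CTAS at 100 partitions per query; larger backfills need batched INSERT INTO):

```sql
-- One-shot conversion: write Parquet, partitioned by dt.
-- Partition columns must come last in the SELECT list.
CREATE TABLE agent_telemetry_parquet
WITH (
  format            = 'PARQUET',
  write_compression = 'SNAPPY',
  external_location = 's3://my-data-lake/agent-telemetry-parquet/',
  partitioned_by    = ARRAY['dt']
) AS
SELECT model_id, latency_ms, tool_status, dt
FROM agent_telemetry_raw_json;
```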
Pricing mental model
- Cost ≈ TB scanned (on-demand pricing is per TB of data scanned, commonly $5/TB; rate varies by region).
- Senior heuristic: if you’re scanning “the whole bucket,” you’ve already lost—fix partitioning or format.
- Common lever: convert raw JSON to Parquet daily; queries become cheap and fast.
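To put rough numbers on that lever: at ~$5/TB, shrinking a 1 TB JSON scan to ~100 GB of Parquet (an illustrative compression ratio) turns a $5 query into roughly $0.50. A sketch of the daily job, reusing the hypothetical tables from the CTAS example above:

```sql
-- Daily job: append yesterday's raw JSON events as Parquet.
-- Subsequent queries scan compressed columnar data instead of raw JSON.
INSERT INTO agent_telemetry_parquet
SELECT model_id, latency_ms, tool_status, dt
FROM agent_telemetry_raw_json
WHERE dt = cast(date_add('day', -1, current_date) AS varchar);
```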
Heuristics
- Use Athena for interactive querying and lightweight ETL (CTAS).
- If transforms are heavy (joins, complex logic, big backfills), do them in Glue/EMR, then query the output with Athena.
Terraform template (Athena workgroup + database)
resource "aws_s3_bucket" "athena_results" {
bucket = var.athena_results_bucket
}
resource "aws_athena_workgroup" "wg" {
name = var.workgroup_name
configuration {
enforce_workgroup_configuration = true
result_configuration {
output_location = "s3://${aws_s3_bucket.athena_results.bucket}/results/"
encryption_configuration {
encryption_option = "SSE_S3"
}
}
}
}
resource "aws_athena_database" "db" {
name = var.athena_db_name
bucket = var.data_lake_bucket # Athena stores metadata; tables point to S3 paths
}
variable "athena_results_bucket" { type = string }
variable "workgroup_name" { type = string }
variable "athena_db_name" { type = string }
variable "data_lake_bucket" { type = string }