Amazon S3
Object storage for data lakes, models, and artifacts
(Data lake + artifacts backbone)
What seniors use it for (ML/GenAI)
- Raw/processed datasets, parquet lakes, training corpora
- Model artifacts (checkpoints), tokenizer files, eval outputs
- RAG corpora (PDFs/HTML chunks), embeddings exports (often not the online vector DB)
- Logs (app logs → Firehose → S3), audit trails
Config knobs that actually matter
- Block Public Access: almost always ON.
- Encryption:
  - SSE-S3: simplest, low ops.
  - SSE-KMS: governance + key policies + audit; adds KMS call cost/latency.
- Versioning: ON for artifacts / critical data; consider lifecycle cleanup of old versions.
- Lifecycle rules: the #1 cost lever (transition old data → IA/Glacier; expire temp prefixes).
- Object Ownership: BucketOwnerEnforced to avoid ACL mess.
- Access logs / CloudTrail data events: enable for sensitive buckets (cost trade-off).
- Prefix layout: optimize for humans + analytics (e.g., domain/entity/dt=YYYY-MM-DD/...).
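For the access-logs knob, a minimal Terraform sketch of server access logging, assuming the `aws_s3_bucket.data` bucket from the template below and a separate, pre-existing log bucket (here called `aws_s3_bucket.logs`, an assumed name):

```hcl
# Sketch: S3 server access logging to a dedicated log bucket.
# "aws_s3_bucket.logs" is an assumption -- logs should never land in the
# bucket being logged.
resource "aws_s3_bucket_logging" "data" {
  bucket = aws_s3_bucket.data.id

  target_bucket = aws_s3_bucket.logs.id
  target_prefix = "s3-access/data/"
}
```

CloudTrail data events are configured on the trail, not the bucket, and are billed per event; server access logging is the cheaper default for most audits.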
Pricing mental models (back-of-envelope)
- Storage dominates unless you’re doing huge request volumes.
  - Think “~$20–$25 per TB-month” (standard storage ballpark) in many regions.
- Requests: relevant only at scale.
  - Think “a few dollars per 10–100M requests” depending on request type.
- Data egress: the silent killer.
  - Treat as “egress is expensive; avoid moving data out of AWS/region.”
- Cold tiers: cheap storage, not cheap to retrieve.
  - Use for rarely read data only; model retrieval fees/time in design reviews.
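To make the ballparks concrete, a hedged worked example (round assumed numbers, not quoted prices; internet egress runs very roughly $90/TB in many regions):

```latex
\underbrace{50\ \mathrm{TB} \times \$23/\mathrm{TB\text{-}month}}_{\text{standard storage}} \approx \$1150/\text{month}
\qquad
\underbrace{50\ \mathrm{TB} \times \$90/\mathrm{TB}}_{\text{one full read out to the internet}} \approx \$4500
```

One full egress of the dataset costs roughly four months of storing it, which is why keeping compute next to the data usually wins.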
Heuristics
- If data is accessed often → Standard or Intelligent-Tiering.
- If accessed rarely but must be online → Standard-IA.
- If “archive / compliance” → Glacier tiers, but price in retrieval + restore latency.
- If you’re paying for NAT egress to S3 from private subnets: fix your VPC (gateway endpoint).
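The NAT fix in the last heuristic is a one-resource change: an S3 gateway endpoint routes private-subnet traffic to S3 directly (the endpoint itself is free, and the traffic stops metering through the NAT gateway). A minimal sketch, assuming `var.vpc_id`, `var.region`, and `var.private_route_table_ids` exist in your config:

```hcl
# Sketch: S3 gateway endpoint so private-subnet S3 traffic bypasses NAT.
# Variable names here are assumptions.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.private_route_table_ids
}
```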
Terraform template (S3 bucket with sane defaults)
# s3.tf
resource "aws_s3_bucket" "data" {
  bucket = var.bucket_name
  tags   = var.tags
}

resource "aws_s3_bucket_public_access_block" "data" {
  bucket                  = aws_s3_bucket.data.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_s3_bucket_ownership_controls" "data" {
  bucket = aws_s3_bucket.data.id
  rule { object_ownership = "BucketOwnerEnforced" }
}

resource "aws_s3_bucket_versioning" "data" {
  bucket = aws_s3_bucket.data.id
  versioning_configuration { status = "Enabled" }
}

# Choose ONE: SSE-S3 (simpler) OR SSE-KMS (governance)
resource "aws_s3_bucket_server_side_encryption_configuration" "data" {
  bucket = aws_s3_bucket.data.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
      # For SSE-KMS: set sse_algorithm = "aws:kms" and uncomment:
      # kms_master_key_id = aws_kms_key.s3.arn
    }
  }
}

# Lifecycle: tune per prefixes (temp/, logs/, artifacts/)
resource "aws_s3_bucket_lifecycle_configuration" "data" {
  bucket = aws_s3_bucket.data.id

  rule {
    id     = "expire-temp"
    status = "Enabled"
    filter { prefix = "temp/" }
    expiration { days = 7 }
  }

  rule {
    id     = "transition-cold"
    status = "Enabled"
    filter { prefix = "datasets/" }
    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }
    transition {
      days          = 180
      storage_class = "GLACIER"
    }
  }
}
variable "bucket_name" {
  type = string
}

variable "tags" {
  type    = map(string)
  default = {}
}
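Because versioning is Enabled above, deleted and overwritten objects linger as billable noncurrent versions. A sketch of a cleanup rule to add inside the existing aws_s3_bucket_lifecycle_configuration (a bucket has only one lifecycle configuration; the 90-day window is an assumption):

```hcl
# Add inside aws_s3_bucket_lifecycle_configuration "data":
rule {
  id     = "cleanup-old-versions"
  status = "Enabled"
  filter {} # applies bucket-wide
  noncurrent_version_expiration {
    noncurrent_days = 90
  }
}
```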