Databases & Storage

Amazon S3

Object storage for data lakes, models, and artifacts

S3 (data lake + artifacts backbone)

What seniors use it for (ML/GenAI)

  • Raw/processed datasets, parquet lakes, training corpora
  • Model artifacts (checkpoints), tokenizer files, eval outputs
  • RAG corpora (PDFs/HTML chunks), embeddings exports (often not the online vector DB)
  • Logs (app logs → Firehose → S3), audit trails

Config knobs that actually matter

  • Block Public Access: almost always ON.

  • Encryption

    • SSE-S3: simplest, low ops.
    • SSE-KMS: governance + key policies + audit; adds KMS call cost/latency.
  • Versioning: ON for artifacts / critical data; consider lifecycle cleanup of old versions.

  • Lifecycle rules: the #1 cost lever (transition old data → IA/Glacier; expire temp prefixes).

  • Object Ownership: BucketOwnerEnforced to avoid ACL mess.

  • Access logs / CloudTrail data events: enable for sensitive buckets (cost trade-off).

  • Prefix layout: optimize for humans + analytics (e.g., domain/entity/dt=YYYY-MM-DD/...).
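The prefix convention above can be sketched with a tiny helper (the function name and arguments are hypothetical; only the `domain/entity/dt=YYYY-MM-DD/...` layout comes from the note above):

```python
from datetime import date

def s3_key(domain: str, entity: str, dt: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 key: domain/entity/dt=YYYY-MM-DD/filename.
    Hypothetical helper; the dt= partition style is what Athena/Glue-style
    analytics tools expect, and it stays readable for humans."""
    return f"{domain}/{entity}/dt={dt.isoformat()}/{filename}"

print(s3_key("ml", "training-runs", date(2024, 6, 1), "part-0000.parquet"))
# → ml/training-runs/dt=2024-06-01/part-0000.parquet
```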

Pricing mental models (back-of-envelope)

  • Storage dominates unless you’re doing huge request volumes.

    • Think “~$20–$25 per TB-month” (standard storage ballpark) in many regions.
  • Requests: relevant only at scale.

    • Think “under a dollar per million GET-class requests; PUT/LIST-class requests cost roughly an order of magnitude more.”
  • Data egress: the silent killer.

    • Treat as “egress is expensive; avoid moving data out of AWS/region.”
  • Cold tiers: cheap storage, not cheap to retrieve.

    • Use for rarely read data only; model retrieval fees/time in design reviews.
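The mental models above fit in a one-function back-of-envelope calculator. The rates below are illustrative assumptions (ballpark US-region list prices for Standard storage, internet egress, and GET-class requests), not quotes; plug in current pricing for real estimates:

```python
# Assumed ballpark rates (verify against current AWS pricing):
#   Standard storage ~ $23 per TB-month
#   Internet egress  ~ $90 per TB
#   GET-class reqs   ~ $0.40 per million
def s3_monthly_cost(tb_stored: float, tb_egress: float, m_get_requests: float,
                    storage_per_tb: float = 23.0,
                    egress_per_tb: float = 90.0,
                    get_per_million: float = 0.40) -> float:
    """Rough monthly S3 cost in USD (hypothetical helper, not a billing tool)."""
    return (tb_stored * storage_per_tb
            + tb_egress * egress_per_tb
            + m_get_requests * get_per_million)

# 50 TB stored, 2 TB egress, 100M GETs/month:
print(s3_monthly_cost(50, 2, 100))  # → 1370.0 (storage 1150 + egress 180 + requests 40)
```

Note how storage and egress dominate while 100M requests add only tens of dollars, which is the “requests matter only at scale / egress is the silent killer” point in numbers.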

Heuristics

  • If data is accessed often → Standard or Intelligent-Tiering.
  • If accessed rarely but must be online → Standard-IA.
  • If “archive / compliance” → Glacier tiers, but price in retrieval + restore latency.
  • If you’re paying for NAT egress to S3 from private subnets: fix your VPC (gateway endpoint).
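The NAT fix in the last bullet is one resource. A sketch, assuming your VPC ID, region, and private route table IDs are exposed as variables (names here are hypothetical); the gateway endpoint itself is free and S3 traffic through it bypasses NAT processing charges:

```hcl
# Hypothetical variable names; adapt to your module's inputs.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.private_route_table_ids
}
```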

Terraform template (S3 bucket with sane defaults)

# s3.tf
resource "aws_s3_bucket" "data" {
  bucket = var.bucket_name
  tags   = var.tags
}

resource "aws_s3_bucket_public_access_block" "data" {
  bucket                  = aws_s3_bucket.data.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_s3_bucket_ownership_controls" "data" {
  bucket = aws_s3_bucket.data.id
  rule { object_ownership = "BucketOwnerEnforced" }
}

resource "aws_s3_bucket_versioning" "data" {
  bucket = aws_s3_bucket.data.id
  versioning_configuration { status = "Enabled" }
}

# Choose ONE: SSE-S3 (simpler) OR SSE-KMS (governance)
resource "aws_s3_bucket_server_side_encryption_configuration" "data" {
  bucket = aws_s3_bucket.data.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
      # For SSE-KMS, switch BOTH lines:
      # sse_algorithm     = "aws:kms"
      # kms_master_key_id = aws_kms_key.s3.arn
    }
    # bucket_key_enabled = true  # with SSE-KMS: caches data keys, cuts KMS call cost
  }
}

# Lifecycle: tune per prefixes (temp/, logs/, artifacts/)
resource "aws_s3_bucket_lifecycle_configuration" "data" {
  bucket = aws_s3_bucket.data.id

  rule {
    id     = "expire-temp"
    status = "Enabled"

    filter { prefix = "temp/" }

    expiration { days = 7 }
  }

  rule {
    id     = "transition-cold"
    status = "Enabled"

    filter { prefix = "datasets/" }

    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }
    transition {
      days          = 180
      storage_class = "GLACIER"
    }
  }
}
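One gotcha with the template above: versioning is enabled, so the `expire-temp` rule only writes delete markers and noncurrent versions keep accruing storage. A rule sketch to add inside the `aws_s3_bucket_lifecycle_configuration` above (the 30-day window is an assumption; tune per retention needs):

```hcl
  rule {
    id     = "purge-noncurrent"
    status = "Enabled"

    filter {} # whole bucket

    # Assumed retention window; permanently deletes old object versions.
    noncurrent_version_expiration { noncurrent_days = 30 }
  }
```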

variable "bucket_name" {
  type = string
}

variable "tags" {
  type    = map(string)
  default = {}
}