Observability

Amazon CloudWatch

Monitoring, logging, and alerting for AWS resources

CloudWatch

(Logs / Metrics / Alarms)

Mental model

  • Default telemetry bus for AWS.
  • Logs are for deep forensics, Metrics/Alarms are for paging, Dashboards are for shared visibility.

Where it’s must-have in GenAI/agents

  • Request logs (redacted), tool-call logs, model latency breakdown, token usage stats, retries/timeouts.
  • “Golden signals”: RPS, p50/p95/p99 latency, 4xx/5xx, throttle rate, queue age, DLQ count, cache hit rate.

Senior knobs (the ones that move outcomes)

Logs

  • Retention: set per log group (cost + compliance).
  • Structured JSON logs (queryable; avoid unstructured blobs).
  • Sampling + redaction: log summaries always, full payloads rarely.
  • Subscription filters: stream logs to S3/OpenSearch/Firehose for long retention and analytics.

Metrics

  • Prefer service metrics first; add custom only for SLIs/SLOs.
  • Cardinality control: never make high-cardinality labels (user_id, request_id) into metrics.
  • EMF (embedded metrics) can be a pragmatic middle ground (metrics from logs).

Alarms

  • Alarm on symptoms (SLO burn / error rate / queue backlog), not every internal metric.
  • Use composite alarms to reduce alert storms.

Pricing mental model (back-of-envelope)

  • Logs ingestion: think ~$0.50 per GB ingested$50 per 100 GB.
  • Logs storage: think ~$0.03 per GB-month$3 per 100 GB-month.
  • Logs Insights: think charged per GB scanned ⇒ optimize with short time windows + indexed fields.
  • Custom metrics: think ~$0.30 per metric-month for small scale ⇒ $30 per 100 metrics-month (cardinality is the silent killer).
  • Alarms: think ~$0.10 per alarm-metric-month ⇒ alarms scale with “how many time series you alarm on”.

Terraform template (logs + metric filter + alarm + dashboard)

resource "aws_cloudwatch_log_group" "app" {
  name              = "/app/${var.name}"
  retention_in_days = 14
}

# Example: create a metric from structured logs (e.g., count errors)
resource "aws_cloudwatch_log_metric_filter" "errors" {
  name           = "${var.name}-error-count"
  log_group_name = aws_cloudwatch_log_group.app.name
  pattern        = "{ $.level = \"ERROR\" }"

  metric_transformation {
    name      = "AppErrorCount"
    namespace = "App/${var.name}"
    value     = "1"
  }
}

resource "aws_cloudwatch_metric_alarm" "error_alarm" {
  alarm_name          = "${var.name}-errors-high"
  namespace           = "App/${var.name}"
  metric_name         = "AppErrorCount"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 5
  threshold           = 10
  comparison_operator = "GreaterThanOrEqualToThreshold"
  treat_missing_data  = "notBreaching"

  alarm_actions = var.alarm_topic_arns
}

resource "aws_cloudwatch_dashboard" "main" {
  dashboard_name = "${var.name}-dashboard"
  dashboard_body = jsonencode({
    widgets = [
      {
        type = "metric",
        width = 12, height = 6, x = 0, y = 0,
        properties = {
          metrics = [[ "App/${var.name}", "AppErrorCount" ]],
          period = 60,
          stat = "Sum",
          title = "Errors/min"
        }
      }
    ]
  })
}

variable "name" { type = string }
variable "alarm_topic_arns" { type = list(string) default = [] }