Observability

Amazon Managed Prometheus

Managed Prometheus for container and Kubernetes monitoring

Prometheus

Mental model

  • AMP is a managed Prometheus backend (remote_write target + long-term store).
  • You still run collectors (Prometheus/ADOT) on EKS/ECS/EC2 to scrape/export metrics.

Where it’s must-have in GenAI/agents

  • Infra + platform metrics at scale: ECS/EKS/EC2, GPU/CPU/mem, queue depth, autoscaling behavior.
  • Best when you want PromQL + alert rules + standard OSS exporters.

Senior knobs

  • Scrape interval: 60s is common; 15s increases cost fast.
  • Label/cardinality hygiene: this is the #1 cost and performance risk.
  • Recording rules: precompute expensive queries; reduce query load.
  • Query limits (QSP-style): prevent a single dashboard from nuking costs.

Pricing mental model

  • Your bill is roughly:

    • Ingestion: priced per samples ingested (10M samples is the unit).
    • Storage: priced per GB-month.
    • Queries: priced by samples processed (PromQL cost).
  • Rule of thumb: high-frequency scrapes + high cardinality = surprise bill.

Terraform template (AMP workspace + basic IAM policy for query/write)

resource "aws_prometheus_workspace" "amp" {
  alias = var.name
  tags  = var.tags
}

# Minimal policy (attach to IRSA role / ECS task role used by collector and/or query clients)
resource "aws_iam_policy" "amp_access" {
  name = "${var.name}-amp-access"
  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      { Effect="Allow", Action=["aps:RemoteWrite"], Resource=aws_prometheus_workspace.amp.arn },
      { Effect="Allow", Action=["aps:QueryMetrics","aps:GetSeries","aps:GetLabels","aps:GetMetricMetadata"], Resource=aws_prometheus_workspace.amp.arn }
    ]
  })
}

variable "name" { type = string }
variable "tags" { type = map(string) default = {} }