# Observability Stack

Complete monitoring architecture for production ML systems.
## Observability stack mental model (GenAI / agentic production)
- Logs answer: “what happened?” (inputs/outputs, tool calls, errors)
- Metrics answer: “is it healthy?” (SLIs/SLOs, rate/latency/error, cost signals)
- Traces answer: “where is time spent?” (end-to-end request + tool spans)
- In agentic systems, observability is also auditability: “why did the agent do that?”, “what did it call?”, “can I replay it?” (a structured log-record sketch follows this list)
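A minimal sketch of a structured log record that serves both debugging and audit needs. The field names and the `log_agent_step` helper are illustrative assumptions, not a fixed schema:

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("inference_gateway")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_agent_step(request_id: str, tool: str, status: str, latency_ms: float) -> None:
    """Emit one JSON log line per agent step so logs double as an audit trail."""
    record = {
        "ts": time.time(),
        "request_id": request_id,  # correlation ID, shared with traces and metrics
        "event": "tool_call",
        "tool": tool,              # which tool the agent invoked
        "status": status,          # "ok" | "error" | "timeout"
        "latency_ms": latency_ms,
    }
    logger.info(json.dumps(record))

log_agent_step(str(uuid.uuid4()), tool="vector_search", status="ok", latency_ms=42.3)
```

One JSON object per line keeps the records queryable in CloudWatch Logs Insights without any extra parsing.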
```mermaid
flowchart LR
    U[User Request] --> GW[Inference Gateway]
    GW -->|logs| CWL[CloudWatch Logs]
    GW -->|metrics| CWM[CloudWatch Metrics/Alarms]
    GW -->|traces| XR[X-Ray Traces]
    GW --> TOOLS["Tool Executors: Lambda/ECS/Bedrock/DB"]
    TOOLS -->|logs/metrics/traces| CWL
    TOOLS --> XR
    EKS[EKS/ECS/EC2 Infra] -->|Prometheus scrape/OTel| AMP[Managed Prometheus]
    AMP --> AMG[Managed Grafana Dashboards/Alerts]
    CWM --> AMG
    XR --> AMG
```
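A minimal tracing sketch with the OpenTelemetry Python SDK: one gateway span wrapping a tool-call child span. The console exporter keeps it self-contained; in the architecture above you would swap in an OTLP exporter pointed at an ADOT collector that feeds X-Ray. Span and attribute names are assumptions:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Console exporter for a self-contained sketch; replace with OTLPSpanExporter
# (pointed at an ADOT collector) to land these spans in X-Ray.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("inference-gateway")

def handle_request(request_id: str) -> None:
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("request_id", request_id)  # correlates with log lines
        with tracer.start_as_current_span("tool.vector_search") as tool_span:
            tool_span.set_attribute("tool.name", "vector_search")
            # ... call the tool; on failure, tool_span.record_exception(exc)

handle_request("req-123")
```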
## Production “minimum viable observability” checklist (GenAI/agents)
- Correlation ID everywhere: a `request_id` propagated through logs, traces, and tool calls (propagation sketch after this list).
- Redaction policy: never log secrets or PII; sample raw payloads only behind an explicit gate (redaction sketch below).
- Three dashboards:
  - Golden signals (RPS / latency / errors)
  - Dependency health (vector DB, LLM provider, tools)
  - Cost signals (token usage, cache hit rate, retries; emission sketch below)
- Paging alarms: error rate + p95 latency + queue age + DLQ depth > 0 (alarm sketch below).
- Tracing: baseline sampling everywhere, plus 100% retention on errors/timeouts (typically a tail-sampling decision, since errors are only known once a span ends).
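A minimal sketch of request-ID propagation using a `contextvars` variable plus a `logging.Filter`, so every log line in a request’s call path carries the same ID. The names (`request_id_var`, `CorrelationFilter`) are illustrative:

```python
import contextvars
import logging
import uuid

# Context variable that follows the request through sync and async call paths.
request_id_var = contextvars.ContextVar("request_id", default="-")

class CorrelationFilter(logging.Filter):
    """Stamp the current request_id onto every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(request_id)s %(name)s %(message)s"))
handler.addFilter(CorrelationFilter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

def handle_request() -> None:
    request_id_var.set(str(uuid.uuid4()))  # set once at the gateway edge
    logging.getLogger("gateway").info("tool call started")  # ID attached automatically

handle_request()
```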
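For the redaction gate, a sketch of a pre-log scrubbing pass. The regexes are illustrative and deliberately narrow; a real policy would be driven by your secret and PII inventory:

```python
import re

# Illustrative patterns; extend to match your own secret/PII inventory.
REDACTIONS = [
    (re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"), "<EMAIL>"),
    (re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"), "<API_KEY>"),  # OpenAI-style key shape
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<SSN>"),
    (re.compile(r"(?i)bearer\s+[A-Za-z0-9._~+/=-]+"), "Bearer <TOKEN>"),
]

def redact(text: str) -> str:
    """Scrub known secret/PII shapes before a payload is ever logged."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text

print(redact("user bob@example.com sent Bearer eyJabc.def"))
# -> "user <EMAIL> sent Bearer <TOKEN>"
```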
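For the cost-signals dashboard, a sketch of publishing token usage as custom CloudWatch metrics with boto3; the `GenAI/Gateway` namespace, metric names, and `Model` dimension are assumptions:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def emit_token_usage(model: str, prompt_tokens: int, completion_tokens: int) -> None:
    """Publish per-request token counts so cost can be charted and alarmed on."""
    cloudwatch.put_metric_data(
        Namespace="GenAI/Gateway",  # assumed custom namespace
        MetricData=[
            {
                "MetricName": "PromptTokens",
                "Dimensions": [{"Name": "Model", "Value": model}],
                "Value": float(prompt_tokens),
                "Unit": "Count",
            },
            {
                "MetricName": "CompletionTokens",
                "Dimensions": [{"Name": "Model", "Value": model}],
                "Value": float(completion_tokens),
                "Unit": "Count",
            },
        ],
    )
```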
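And one way to wire the p95 latency paging alarm, again with boto3. Metric names, thresholds, and the SNS topic ARN are placeholders; percentile statistics go through `ExtendedStatistic`:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Page when p95 gateway latency exceeds 2s for 5 consecutive minutes.
cloudwatch.put_metric_alarm(
    AlarmName="gateway-p95-latency-high",
    Namespace="GenAI/Gateway",            # assumed custom namespace
    MetricName="RequestLatency",          # assumed metric, emitted in seconds
    ExtendedStatistic="p95",              # percentiles use ExtendedStatistic, not Statistic
    Period=60,
    EvaluationPeriods=5,
    Threshold=2.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-page"],  # placeholder ARN
)
```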