Events in Agentic AI Systems
Architectural patterns for event-driven agent coordination
SQS vs SNS vs EventBridge for Agentic AI Systems
The “job-to-be-done” mental model
- SQS = Work queue (buffer + backpressure + DLQ + pull-based workers)
- SNS = Broadcast (fanout notifications, “tell everyone something happened”)
- EventBridge = Event router + auditability (content-based routing, cross-account, archive/replay, schema discipline)
Quick decision map (ask these in order)
1) Is there work that must be executed reliably (tools, jobs, agents)?
✅ Yes → SQS (usually Standard)
- You want: buffering, retry control, DLQ, worker autoscaling, “don’t drop tasks.”
2) Is it “notify many consumers” (alerts, downstream systems, UI updates)?
✅ Yes → SNS
- You want: fanout, simple filtering, push delivery (often SNS → SQS per consumer for reliability).
3) Do you need “evented architecture” features (routing by event content, governance, replay, cross-account)?
✅ Yes → EventBridge
- You want: event bus, rules, archive/replay, consistent event envelope, easier multi-consumer routing.
Most agentic systems use all three together: EventBridge as the backbone, SQS for execution, SNS for notifications.
What each service does in an Agentic AI architecture
SQS in agents: “Tool execution engine”
Used for
- Tool calls:
ToolInvocationRequested→ tool worker - Long-running steps: scraping, RAG ingestion, batch embedding, report generation
- Rate-limited external APIs (queue smooths bursts)
Why
- Deterministic retries + DLQ = robust tool execution
- Workers can autoscale off queue depth
- Handles spikes without overprovisioning inference/gateway
Senior heuristics
- Standard + idempotency is default (accept duplicates)
- FIFO only for strict per-entity ordering (e.g., same
user_idstate updates) - Visibility timeout ≈ p95 tool runtime + buffer, DLQ
maxReceiveCount3–10
SNS in agents: “Notifications + fanout”
Used for
- Human notifications: “Approval needed”, “Tool failed”, “Job complete”
- Multi-team consumers: monitoring, billing, analytics, email/slack, webhook consumers
Why
- Cheap/simple broadcast; subscribers can opt in/out
- With SNS → SQS per consumer, each consumer gets independent buffering + DLQ
Senior heuristics
- Use SNS for “tell N parties,” not as the durable work queue
- Keep payload small; include IDs and fetch details from a store if needed
EventBridge in agents: “Event bus + traceability + replay”
Used for
- Domain events:
AgentRunStarted,ToolInvocationCompleted,HITLApproved,PolicyViolationDetected - Routing: send events to the right executors/observers based on
detail-type,tool_name,tenant_id, etc. - Audit: archive/replay (and/or route all events to S3)
Why
- Gives you event-sourced audit trail patterns
- Encourages a stable event envelope and schema versioning
- Best for multi-consumer, cross-account, multi-environment routing
Senior heuristics
- Treat EventBridge as the control plane for events; SQS as data plane for execution
- Archive/replay + immutable sink (S3) = strong incident recovery story
The canonical “agentic plumbing” pattern
Recommended architecture
- Orchestrator emits events to EventBridge
- Rules route execution events to SQS queues (by tool type / tenant / priority)
- Tool workers poll SQS, execute, emit completion events back to EventBridge
- Observers (metrics, analytics, audit) subscribe off EventBridge
- SNS handles “notify humans/external systems,” often via SNS→SQS
flowchart LR
O[Agent Orchestrator / Planner] -->|Publish events| EB[(EventBridge Bus)]
EB -->|Rule: tool=search| Q1[[SQS: tool.search]]
EB -->|Rule: tool=db| Q2[[SQS: tool.db]]
EB -->|Rule: tool=email| Q3[[SQS: tool.email]]
Q1 --> W1[Tool Worker: Search]
Q2 --> W2[Tool Worker: DB/CRM]
Q3 --> W3[Tool Worker: Email/API]
W1 -->|ToolInvocationCompleted/Failed| EB
W2 -->|ToolInvocationCompleted/Failed| EB
W3 -->|ToolInvocationCompleted/Failed| EB
EB -->|All events| AUD[(Audit sink: S3 / OpenSearch)]
EB -->|Rule: HITL required| SNS[(SNS Topic)]
SNS --> N1[Human/Slack/Email]
SNS --> QH[[SQS: approvals]] --> HITL[Approval Service] --> EB
Traceability + event-sourced audit: what to standardize
Event envelope (put this on every event)
trace_id(end-to-end),conversation_id,agent_run_idstep_id(per plan step),tool_call_ididempotency_key(for dedupe),causation_id/parent_event_idtenant_id,schema_version,timestamp,producer,env
Audit strategy (choose one, often both)
- EventBridge Archive + Replay for operational reprocessing
- Immutable sink to S3 (partitioned by dt, tenant, event_type) for compliance + analytics
Retries & failure handling: who should own it?
Tool execution (SQS-owned)
- Retries: worker-level + visibility timeout + DLQ
- Poison pills: DLQ + alarm; store failure reason + payload hash
Routing delivery (EventBridge-owned)
- EventBridge retries target delivery, but don’t rely on it for long-running execution semantics
Notifications (SNS-owned)
- SNS retries delivery, but for reliability prefer SNS → SQS for consumers that must not miss notifications
The “choose this if…” summary (agentic-specific)
- SQS: “I must execute tool calls reliably, at scale, with backpressure + DLQ.”
- SNS: “I must notify multiple systems/humans; each consumer can process independently.”
- EventBridge: “I want a governable event backbone: routing, replay, cross-account, consistent audit trail.”