YAML-based task suite structure for systematic agent evaluation across multiple dimensions.
Each evaluation task is defined as a YAML file following a standardized schema. Tasks specify inputs, environment configuration, graders, metrics, and CI gates.
Field            Description
task_id          Unique identifier
suite            golden | open_ended | adversarial | failure_replays
description      Task description
inputs           Task-specific inputs
graders          Array of grader configurations
tracked_metrics  Metrics to collect

Four specialized suites for different evaluation purposes, each with specific grading strategies and budget constraints.

Golden suite: Deterministic regression tests with code-based assertions
Path: evals/golden
Default Trials: 3 (pass^k)
Graders: Deterministic tests, static analysis, state checks
Budget Limits: Strict (20 steps max)
Example Task: fix-auth-bypass_1.yaml

Adversarial suite: Security testing and attack scenarios
Path: evals/adversarial
Default Trials: 10 (pass^k)
Graders: Deterministic safety checks, policy validators
Budget Limits: Very strict (8 steps)
Example Task: prompt-injection_tool-misuse_1.yaml

Open-ended suite: Capability evals with subjective rubric scoring
Path: evals/open_ended
Default Trials: 5 (pass@k)
Graders: LLM rubric scoring, model-based evaluation
Budget Limits: Relaxed (12 steps)
Example Task: support-chat_resolution_1.yaml

Failure replays suite: Production incident reproduction tests
Path: evals/failure_replays
Default Trials: 5 (pass^k)
Graders: Trajectory metrics, robustness checks
Budget Limits: Medium (10 steps)
Example Task: incident-2026-01-xx_looping-retries_1.yaml

Three-tier pipeline with progressive evaluation at merge, nightly, and release stages.
CI merge gate: Must pass before merging to main. Includes the golden, adversarial, and failure replay suites.
Fail conditions:
Budget constraints:
Nightly: Comprehensive evaluation of all four suites (golden, adversarial, open-ended, failure replays) for trend monitoring.
Release candidate: Full-suite validation before production deployment. All four suites must pass with strict thresholds.
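
The stage-to-suite mapping described above can be sketched in a few lines of Python. This is only an illustration, not the project's actual runner: the `tasks_for_stage` helper is hypothetical, the stage keys reuse the flag names from the `gates` object in the task schema, and treating a missing gate flag as "not selected" is an assumption.

```python
from typing import Dict, List

# Suites that run at each pipeline stage, as described in the text.
# Keys reuse the flag names from the schema's "gates" object.
STAGE_SUITES: Dict[str, List[str]] = {
    "ci_merge_gate": ["golden", "adversarial", "failure_replays"],
    "nightly": ["golden", "adversarial", "open_ended", "failure_replays"],
    "release_candidate": ["golden", "adversarial", "open_ended", "failure_replays"],
}


def tasks_for_stage(tasks: List[dict], stage: str) -> List[dict]:
    """Select tasks for a stage (hypothetical helper).

    Assumption: a task opts in via gates[stage] == True, and a missing
    flag counts as False. A real runner may use different defaults.
    """
    allowed = STAGE_SUITES[stage]
    return [
        t for t in tasks
        if t["suite"] in allowed and t.get("gates", {}).get(stage, False)
    ]


# Two abbreviated task records, using example task names from the suites above.
tasks = [
    {
        "task_id": "fix-auth-bypass_1",
        "suite": "golden",
        "gates": {"ci_merge_gate": True, "nightly": True, "release_candidate": True},
    },
    {
        "task_id": "support-chat_resolution_1",
        "suite": "open_ended",
        "gates": {"nightly": True, "release_candidate": True},
    },
]

merge_ids = [t["task_id"] for t in tasks_for_stage(tasks, "ci_merge_gate")]
nightly_ids = [t["task_id"] for t in tasks_for_stage(tasks, "nightly")]
```

Under these assumptions only the golden task is selected for the merge gate, while both tasks run nightly.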
pass@k: At least one success in k trials (OR logic)
Used for: Open-ended tasks where any valid solution counts

pass^k: All k trials must succeed (AND logic)
Used for: Safety and regression tests requiring consistency
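
To make the difference between the two metrics concrete, here is a minimal Python sketch. The function names and the `SUITE_DEFAULTS` table are illustrative, not an existing API. The default values are copied from the suite descriptions above, and the probability helper assumes trials are independent with the same per-trial success rate.

```python
from typing import Dict, Sequence, Tuple


def pass_at_k(outcomes: Sequence[bool]) -> bool:
    """pass@k: at least one of the k trials succeeded (OR logic)."""
    return any(outcomes)


def pass_hat_k(outcomes: Sequence[bool]) -> bool:
    """pass^k: every one of the k trials succeeded (AND logic)."""
    return all(outcomes)


# Default trial count and metric per suite, copied from the suite descriptions.
SUITE_DEFAULTS: Dict[str, Tuple[int, str]] = {
    "golden": (3, "pass^k"),
    "adversarial": (10, "pass^k"),
    "open_ended": (5, "pass@k"),
    "failure_replays": (5, "pass^k"),
}


def task_passes(suite: str, outcomes: Sequence[bool]) -> bool:
    """Apply the suite's default metric to a list of per-trial outcomes."""
    k, metric = SUITE_DEFAULTS[suite]
    if len(outcomes) != k:
        raise ValueError(f"{suite} expects {k} trials, got {len(outcomes)}")
    return pass_at_k(outcomes) if metric == "pass@k" else pass_hat_k(outcomes)


def expected_rates(p: float, k: int) -> Tuple[float, float]:
    """Return (P[pass@k], P[pass^k]) assuming independent trials with success rate p."""
    return 1 - (1 - p) ** k, p ** k


# One failed trial out of three fails a golden task under pass^k,
# while one success out of five is enough for an open-ended task under pass@k.
golden_ok = task_passes("golden", [True, True, False])
open_ended_ok = task_passes("open_ended", [False, False, True, False, False])
at_k, hat_k = expected_rates(0.9, 10)
```

Under the independence assumption, an agent that succeeds on 90% of individual trials passes a 10-trial pass^k task only about 35% of the time, which is why pass^k suits checks that require consistency.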
Full JSON schema defining the structure of evaluation tasks. All tasks must validate against this schema.
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Agent Eval Task",
  "type": "object",
  "required": ["task_id", "suite", "description", "inputs", "graders", "tracked_metrics"],
  "properties": {
    "task_id": { "type": "string" },
    "suite": {
      "type": "string",
      "enum": ["golden", "open_ended", "adversarial", "failure_replays"]
    },
    "description": { "type": "string" },
    "tags": { "type": "array", "items": { "type": "string" } },
    "inputs": { "type": "object" },
    "environment": {
      "type": "object",
      "properties": {
        "sandbox": { "type": "string" },
        "reset": { "type": "object" },
        "budgets": { "type": "object" }
      }
    },
    "trials": {
      "type": "object",
      "properties": {
        "k": { "type": "integer", "minimum": 1 },
        "metric": { "type": "string", "enum": ["pass@k", "pass^k"] }
      }
    },
    "graders": { "type": "array", "minItems": 1 },
    "tracked_metrics": { "type": "array", "minItems": 1 },
    "gates": {
      "type": "object",
      "properties": {
        "ci_merge_gate": { "type": "boolean" },
        "nightly": { "type": "boolean" },
        "release_candidate": { "type": "boolean" }
      }
    }
  }
}
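
As a rough illustration of what the schema enforces, the sketch below hand-checks a parsed task (for example, the dictionary a YAML loader would return) using only the Python standard library. It covers only a subset of the schema: required fields, the string and object types, the `suite` and `metric` enums, the non-empty arrays, and `trials.k`. The `validate_task` helper is hypothetical, and the sample task's description and the contents of `inputs`, `budgets`, `graders`, and `tracked_metrics` are invented placeholders, since the schema leaves their inner shape open. In practice a full JSON Schema validator, such as the third-party `jsonschema` package, would be the more robust choice.

```python
from typing import Any, Dict, List

REQUIRED = ["task_id", "suite", "description", "inputs", "graders", "tracked_metrics"]
SUITES = {"golden", "open_ended", "adversarial", "failure_replays"}
METRICS = {"pass@k", "pass^k"}


def validate_task(task: Dict[str, Any]) -> List[str]:
    """Return a list of error messages; an empty list means the checks passed."""
    errors: List[str] = []
    for field in REQUIRED:
        if field not in task:
            errors.append(f"missing required field: {field}")
    for field in ("task_id", "suite", "description"):
        if field in task and not isinstance(task[field], str):
            errors.append(f"{field} must be a string")
    if isinstance(task.get("suite"), str) and task["suite"] not in SUITES:
        errors.append(f"suite must be one of {sorted(SUITES)}")
    if "inputs" in task and not isinstance(task["inputs"], dict):
        errors.append("inputs must be an object")
    for field in ("graders", "tracked_metrics"):
        if field in task:
            value = task[field]
            if not isinstance(value, list) or len(value) < 1:
                errors.append(f"{field} must be a non-empty array")
    trials = task.get("trials", {})
    if not isinstance(trials, dict):
        errors.append("trials must be an object")
    else:
        if "k" in trials:
            k = trials["k"]
            # bool is a subclass of int in Python, so exclude it explicitly.
            if not isinstance(k, int) or isinstance(k, bool) or k < 1:
                errors.append("trials.k must be an integer >= 1")
        if "metric" in trials and trials["metric"] not in METRICS:
            errors.append("trials.metric must be 'pass@k' or 'pass^k'")
    return errors


# Sample task built from values in the golden suite description.
# Everything marked "placeholder" is invented for illustration only.
good_task = {
    "task_id": "fix-auth-bypass_1",
    "suite": "golden",
    "description": "Regression check for an authentication bypass fix",  # placeholder
    "inputs": {"repo": "example/repo"},                # placeholder
    "environment": {"budgets": {"max_steps": 20}},     # placeholder key name
    "trials": {"k": 3, "metric": "pass^k"},
    "graders": [{"type": "deterministic_tests"}],      # placeholder
    "tracked_metrics": ["steps"],                      # placeholder
    "gates": {"ci_merge_gate": True, "nightly": True, "release_candidate": True},
}

bad_task = {"task_id": "x", "suite": "smoke", "graders": []}

good_errors = validate_task(good_task)
bad_errors = validate_task(bad_task)
```

Here `good_errors` is empty, while `bad_errors` reports the three missing required fields, the unknown suite, and the empty `graders` array.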