Evaluation Framework

A YAML-based task suite structure for systematic agent evaluation across multiple dimensions.

Task Structure

Each evaluation task is defined as a YAML file following a standardized schema. Tasks specify inputs, environment configuration, graders, metrics, and CI gates.

Required Fields

  • task_id — Unique identifier for the task
  • suite — One of: golden | open_ended | adversarial | failure_replays
  • description — Human-readable task description
  • inputs — Task-specific inputs
  • graders — Array of grader configurations
  • tracked_metrics — Metrics to collect during the run
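A minimal task file covering the required fields might look like the following sketch. The concrete values, input keys, and grader `type` name are illustrative assumptions, not from a real task:

```yaml
# Sketch of a minimal golden-suite task; all values are illustrative.
task_id: fix-auth-bypass_1
suite: golden
description: Patch the auth bypass without breaking existing tests.
inputs:
  repo: example/service            # hypothetical input keys; `inputs` is free-form
  issue: "Session check skipped on /admin routes"
graders:
  - type: deterministic_test       # grader `type` names are assumptions
    command: "pytest tests/test_auth.py"
tracked_metrics:
  - steps_used
  - tool_errors
```

Per the schema, only task_id, suite, description, inputs, graders, and tracked_metrics are required; tags, environment, trials, and gates are optional.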

Evaluation Suites

Four specialized suites for different evaluation purposes, each with specific grading strategies and budget constraints.

Golden Suite (CI required)

Deterministic regression tests with code-based assertions.

  • Path: evals/golden
  • Default trials: 3 (pass^k)
  • Graders: deterministic tests, static analysis, state checks
  • Budget limits: strict (20 steps max)
  • Example task: fix-auth-bypass_1.yaml

Adversarial Suite (CI required)

Security testing and attack scenarios.

  • Path: evals/adversarial
  • Default trials: 10 (pass^k)
  • Graders: deterministic safety checks, policy validators
  • Budget limits: very strict (8 steps max)
  • Example task: prompt-injection_tool-misuse_1.yaml

Open-Ended Suite

Capability evals with subjective rubric scoring.

  • Path: evals/open_ended
  • Default trials: 5 (pass@k)
  • Graders: LLM rubric scoring, model-based evaluation
  • Budget limits: relaxed (12 steps max)
  • Example task: support-chat_resolution_1.yaml
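For open-ended tasks, an LLM rubric grader entry could be expressed along these lines. The document does not fix a grader schema, so the keys (`type`, `model`, `rubric`, `pass_threshold`) and their values are assumptions:

```yaml
# Hypothetical LLM rubric grader entry for an open-ended task.
graders:
  - type: llm_rubric               # assumed grader type name
    model: grader-model-v1         # placeholder model identifier
    rubric:
      - criterion: "Resolved the user's issue"
        weight: 0.6
      - criterion: "Tone is professional and empathetic"
        weight: 0.4
    pass_threshold: 0.8            # assumed aggregate score cutoff
```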

Failure Replays Suite (CI required)

Production incident reproduction tests.

  • Path: evals/failure_replays
  • Default trials: 5 (pass^k)
  • Graders: trajectory metrics, robustness checks
  • Budget limits: medium (10 steps max)
  • Example task: incident-2026-01-xx_looping-retries_1.yaml

CI Gates & Execution Flow

Three-tier pipeline with progressive evaluation at merge, nightly, and release stages.

CI Merge Gate

Must pass before merging to main. Includes golden, adversarial, and failure replay suites.

Fail conditions:

  • Any safety violation
  • Golden regression
  • Forbidden tool called

Budget constraints:

  • Max P95 latency increase: 15 ms
  • Max cost increase: 10%
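A task declares which stages of this pipeline it participates in via the gates object defined in the task schema. A minimal sketch, opting a task into all three stages:

```yaml
# Gate flags from the task schema; this task runs at every stage.
gates:
  ci_merge_gate: true
  nightly: true
  release_candidate: true
```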

Nightly Runs

Comprehensive evaluation including all suites (golden, adversarial, open-ended, failure replays) for trend monitoring.

Release Candidate

Full suite validation before production deployment. All four suites must pass with strict thresholds.

Trial Metrics

pass@k — at least one of k trials succeeds (OR logic). Used for open-ended tasks where any valid solution counts.

pass^k — all k trials must succeed (AND logic). Used for safety and regression tests that require consistent behavior across trials.
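In a task file, the trial count and metric are set through the trials object from the schema; for example, an adversarial task requiring all ten trials to pass:

```yaml
trials:
  k: 10          # number of independent trials
  metric: pass^k # every trial must succeed
```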

Task Schema Reference

Full JSON schema defining the structure of evaluation tasks. All tasks must validate against this schema.

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Agent Eval Task",
  "type": "object",
  "required": ["task_id", "suite", "description", "inputs", "graders", "tracked_metrics"],
  "properties": {
    "task_id": { "type": "string" },
    "suite": {
      "type": "string",
      "enum": ["golden", "open_ended", "adversarial", "failure_replays"]
    },
    "description": { "type": "string" },
    "tags": { "type": "array", "items": { "type": "string" } },
    "inputs": { "type": "object" },
    "environment": {
      "type": "object",
      "properties": {
        "sandbox": { "type": "string" },
        "reset": { "type": "object" },
        "budgets": { "type": "object" }
      }
    },
    "trials": {
      "type": "object",
      "properties": {
        "k": { "type": "integer", "minimum": 1 },
        "metric": { "type": "string", "enum": ["pass@k", "pass^k"] }
      }
    },
    "graders": { "type": "array", "minItems": 1 },
    "tracked_metrics": { "type": "array", "minItems": 1 },
    "gates": {
      "type": "object",
      "properties": {
        "ci_merge_gate": { "type": "boolean" },
        "nightly": { "type": "boolean" },
        "release_candidate": { "type": "boolean" }
      }
    }
  }
}
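As a closing illustration, here is a task sketch intended to validate against the schema above. Everything beyond the schema-defined keys (input contents, environment and grader details, metric names) is hypothetical:

```yaml
# Illustrative failure-replay task; concrete values are assumptions.
task_id: incident-2026-01-xx_looping-retries_1
suite: failure_replays
description: Reproduce the looping-retries incident; the agent must recover.
tags: [incident, retries]
inputs:
  transcript: replays/looping-retries.json   # hypothetical input key
environment:
  sandbox: replay-sandbox                    # hypothetical sandbox name
  budgets:
    max_steps: 10
trials:
  k: 5
  metric: pass^k
graders:
  - type: trajectory_metric                  # assumed grader type
    check: no_repeated_tool_calls
tracked_metrics:
  - steps_used
  - retry_count
gates:
  ci_merge_gate: true
  nightly: true
  release_candidate: true
```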