Example Tasks

Four categorized evaluation tasks demonstrating different testing approaches and grading strategies.

Golden Suite

fix-auth-bypass_1
coding, security, regression_candidate

Fix the authentication bypass that occurs when the password field is empty. Ensure the tests pass and that security logging records blocked attempts.

Configuration

Sandbox

python-app-sandbox

Trials

k=3 (pass^k)
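
The trial setting above can be sketched in a few lines. This assumes pass^k carries its usual meaning, that a task counts as passed only when all k independent trials succeed. The function name and the list-of-booleans input are hypothetical and are not taken from the harness itself.

```python
from typing import Sequence


def pass_hat_k(trial_results: Sequence[bool], k: int) -> bool:
    """Return True only if each of the first k trials passed (pass^k).

    Assumption: one boolean per trial, True meaning every grader passed.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if len(trial_results) < k:
        raise ValueError("need at least k trial results")
    # A single failed trial fails the task under pass^k.
    return all(trial_results[:k])


if __name__ == "__main__":
    print(pass_hat_k([True, True, True], k=3))
    print(pass_hat_k([True, False, True], k=3))
```

Under this reading, a fix that works in two of three trials still fails the task, which makes pass^k a stricter bar than "at least one trial passes".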

Budget Constraints

Max Steps: 20
Max Tool Calls: 35
Max Tokens: 120,000
Max Time: 600s
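
As a rough illustration of how these limits might be enforced, the sketch below compares a trial's resource usage against the four budget values listed above. The `Budget` and `Usage` classes and the `violations` helper are hypothetical names. The sketch also assumes that a limit is violated only when usage strictly exceeds it, which the source does not specify.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Budget:
    # Defaults mirror the Budget Constraints listed for this task.
    max_steps: int = 20
    max_tool_calls: int = 35
    max_tokens: int = 120_000
    max_time_s: float = 600.0


@dataclass(frozen=True)
class Usage:
    steps: int
    tool_calls: int
    tokens: int
    wall_time_s: float


def violations(usage: Usage, budget: Budget) -> List[str]:
    """Return the names of every limit the trial exceeded (assumed strict >)."""
    checks = [
        ("steps", usage.steps, budget.max_steps),
        ("tool_calls", usage.tool_calls, budget.max_tool_calls),
        ("tokens", usage.tokens, budget.max_tokens),
        ("time", usage.wall_time_s, budget.max_time_s),
    ]
    return [name for name, used, limit in checks if used > limit]


if __name__ == "__main__":
    print(violations(Usage(20, 36, 100_000, 599.0), Budget()))
```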

Graders (4)

deterministic_tests
static_analysis
state_check
trajectory_metrics

Tracked Metrics

tokens

Fields: tokens_in, tokens_out, total_tokens

latency

Fields: wall_time_seconds, p95_step_latency_ms

tool_use

Fields: tool_call_count, tool_error_rate

trajectory

Fields: steps, loop_rate, retry_rate
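
Some of the tracked fields look derivable from raw counters. The sketch below shows one plausible derivation under two assumptions: `total_tokens` is the sum of `tokens_in` and `tokens_out`, and `tool_error_rate` is failed tool calls divided by all tool calls. The `tool_error_count` input and the `derive_metrics` function are hypothetical. `loop_rate` and `retry_rate` are left out because the source does not define them.

```python
from typing import Dict, Union

Number = Union[int, float]


def derive_metrics(
    tokens_in: int,
    tokens_out: int,
    tool_call_count: int,
    tool_error_count: int,  # hypothetical raw counter, not a listed field
) -> Dict[str, Number]:
    """Build the tokens and tool_use fields from raw per-trial counters."""
    # Guard against division by zero when the agent made no tool calls.
    error_rate = tool_error_count / tool_call_count if tool_call_count else 0.0
    return {
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "total_tokens": tokens_in + tokens_out,
        "tool_call_count": tool_call_count,
        "tool_error_rate": error_rate,
    }


if __name__ == "__main__":
    print(derive_metrics(900, 100, 4, 1))
```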

CI Gates

CI Merge Gate, Nightly, Release Candidate
