Four categorized evaluation tasks demonstrating different testing approaches and grading strategies.
fix-auth-bypass_1
Fix authentication bypass when password field is empty. Ensure tests pass and security logging records blocked attempts.
Sandbox: python-app-sandbox
Trials: k=3 (pass^k)
deterministic_tests
static_analysis
state_check
trajectory_metrics
tokens
  Fields: tokens_in, tokens_out, total_tokens
latency
  Fields: wall_time_seconds, p95_step_latency_ms
tool_use
  Fields: tool_call_count, tool_error_rate
trajectory
  Fields: steps, loop_rate, retry_rate
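To make the k=3 (pass^k) setting and a couple of the metric fields concrete, here is a minimal sketch of how per-trial results might be aggregated. It assumes pass^k has its usual meaning (the chance that k trials drawn from the recorded ones all succeed, so with exactly 3 trials the task scores 1.0 only if every trial passes). The `TrialResult` class, the `tool_errors` count, the `pass_hat_k` helper, the definitions of `total_tokens` and `tool_error_rate`, and all sample values are hypothetical illustrations, not part of the task definition.

```python
from dataclasses import dataclass
from math import comb


@dataclass
class TrialResult:
    """One trial of a task (hypothetical record; names mirror the listed fields)."""
    passed: bool
    tokens_in: int
    tokens_out: int
    tool_call_count: int
    tool_errors: int  # hypothetical raw count used to derive tool_error_rate

    @property
    def total_tokens(self) -> int:
        # Assumed definition: input tokens plus output tokens.
        return self.tokens_in + self.tokens_out

    @property
    def tool_error_rate(self) -> float:
        # Assumed definition: failed tool calls divided by all tool calls.
        if self.tool_call_count == 0:
            return 0.0
        return self.tool_errors / self.tool_call_count


def pass_hat_k(trials: list[TrialResult], k: int) -> float:
    """Estimate pass^k: the chance that k trials sampled without
    replacement from the recorded trials all passed."""
    n = len(trials)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of trials")
    c = sum(t.passed for t in trials)  # number of passing trials
    return comb(c, k) / comb(n, k)


# Made-up values for three trials of one task.
trials = [
    TrialResult(passed=True, tokens_in=1200, tokens_out=300, tool_call_count=8, tool_errors=1),
    TrialResult(passed=True, tokens_in=1100, tokens_out=280, tool_call_count=7, tool_errors=0),
    TrialResult(passed=False, tokens_in=1500, tokens_out=420, tool_call_count=12, tool_errors=3),
]

print(pass_hat_k(trials, k=3))  # 0.0, because one of the three trials failed
```

Under this reading pass^k is stricter than a pass-at-least-once score: a single failed trial out of three drops the task to zero, which rewards consistency rather than occasional success.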