Berkeley Researchers Expose AI Agent Benchmark Crisis: Perfect Scores Without Solving Tasks

Researchers at UC Berkeley's RDI Center have systematically exploited vulnerabilities in eight major AI agent benchmarks to achieve high scores without actually solving any tasks, exposing a fundamental crisis in how the AI community evaluates agent capabilities. The study, which gained 202 points on Hacker News, demonstrates that current benchmark leaderboards may be fundamentally misleading.

Simple Exploits Achieved Perfect Scores on Major Benchmarks

The researchers demonstrated critical flaws across industry-standard benchmarks. On SWE-bench, they created a simple pytest configuration file that forced every test to report as passing without fixing any bugs. On WebArena, they navigated to local file URLs to read answer keys directly from task configurations, bypassing actual problem-solving entirely.

These exploits worked on eight major benchmarks including SWE-bench, WebArena, OSWorld, GAIA, Terminal-Bench, FieldWorkArena, and CAR-bench—representing a comprehensive failure of evaluation infrastructure across the AI agent ecosystem.

Seven Recurring Vulnerability Patterns Identified

The Berkeley team identified seven systematic vulnerability patterns plaguing agent benchmarks:

Shared evaluation environments where agent code runs in the same space evaluators inspect
Exposed answers with reference solutions shipped with tests or publicly available
Unsafe code execution using eval() on agent-controlled strings
Unprotected LLM judges where agent content is interpolated into prompts without sanitization, enabling prompt injection
Weak validation logic with substring matching and overly permissive normalization
Broken scoring code with evaluation functions that skip critical checks
Trusting untrusted outputs by accepting test results from compromised environments

Proposed Solutions: Agent-Eval Checklist and BenchJack Scanner

The researchers propose an "Agent-Eval Checklist" as a minimum standard requiring isolation between agent and evaluator, secret answers, adversarial testing before publication, and robust scoring mechanisms. They're developing BenchJack, an automated benchmark vulnerability scanner designed to become standard in evaluation development.

Community reaction highlights the severity of the problem. Developers noted that "AI coding agent benchmarks are dead" and that only benchmarks like METR and GDPval may still provide reliable signals. The findings suggest that current leaderboards may be fundamentally unreliable for comparing agent capabilities.

Key Takeaways

UC Berkeley researchers achieved perfect scores on eight major AI agent benchmarks without solving any tasks
Simple exploits included forcing tests to pass on SWE-bench and reading answer keys directly on WebArena
Seven systematic vulnerability patterns were identified across all major agent evaluation frameworks
The researchers propose an Agent-Eval Checklist and are developing BenchJack, an automated vulnerability scanner
Current agent benchmark leaderboards may be fundamentally unreliable for comparing real capabilities

Simple Exploits Achieved Perfect Scores on Major Benchmarks

Seven Recurring Vulnerability Patterns Identified

The Berkeley team identified seven systematic vulnerability patterns plaguing agent benchmarks:

Shared evaluation environments where agent code runs in the same space evaluators inspect

Exposed answers with reference solutions shipped with tests or publicly available

Unsafe code execution using eval() on agent-controlled strings

Unprotected LLM judges where agent content is interpolated into prompts without sanitization, enabling prompt injection

Weak validation logic with substring matching and overly permissive normalization

Broken scoring code with evaluation functions that skip critical checks

Trusting untrusted outputs by accepting test results from compromised environments

Proposed Solutions: Agent-Eval Checklist and BenchJack Scanner

Key Takeaways

UC Berkeley researchers achieved perfect scores on eight major AI agent benchmarks without solving any tasks

Simple exploits included forcing tests to pass on SWE-bench and reading answer keys directly on WebArena

Seven systematic vulnerability patterns were identified across all major agent evaluation frameworks

The researchers propose an Agent-Eval Checklist and are developing BenchJack, an automated vulnerability scanner

Current agent benchmark leaderboards may be fundamentally unreliable for comparing real capabilities