Researchers from Stanford have released WildClawBench, a new benchmark that evaluates AI agents on real-world, long-horizon tasks in native runtime environments. Released via arXiv on May 11, 2026, the benchmark reveals that even the best-performing model, Claude Opus 4.7, achieves only a 62.2% success rate, with every other frontier model falling below 60%.
Real-World Tasks Replace Synthetic Sandboxes
WildClawBench consists of 60 human-authored, bilingual, multimodal tasks spanning six thematic categories. Each task averages approximately 8 minutes of wall-clock time and requires over 20 tool calls. The researchers note that "most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are deployed."
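To make the task description concrete, a task record in a benchmark of this shape might look like the sketch below. The field names and values are purely illustrative assumptions, not WildClawBench's actual schema.

```python
# Hypothetical task record for a WildClawBench-style benchmark.
# All field names and values here are invented for illustration;
# the paper's released task format may differ.
task = {
    "id": "data-cleanup-07",            # made-up identifier
    "category": "data wrangling",       # one of six thematic categories
    "languages": ["en", "zh"],          # tasks are bilingual
    "modalities": ["text", "image"],    # tasks are multimodal
    "avg_wall_clock_minutes": 8,        # benchmark-wide average, per the paper
    "avg_tool_calls": 20,               # benchmark-wide average, per the paper
    "instructions": "Clean the CSV files under /workspace/raw ...",
}
```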
Key technical features include:
- Tasks run inside reproducible Docker containers hosting actual CLI agent harnesses
- Support for OpenClaw, Claude Code, Codex, and Hermes Agent harnesses
- Access to real tools rather than mock services
- Hybrid grading combining deterministic rule-based checks, environment-state auditing, and LLM/VLM judges for semantic verification (see the sketch after this list)
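The following is a minimal sketch of how such a hybrid grader could compose the three verification layers. Everything here, including the function names, the specific checks, and the all-must-pass aggregation rule, is an assumption for illustration rather than the released evaluation code.

```python
import subprocess

def rule_checks(workspace: str) -> bool:
    """Deterministic rule-based check (hypothetical: a required output file exists)."""
    result = subprocess.run(["test", "-f", f"{workspace}/report.pdf"])
    return result.returncode == 0

def audit_environment_state(container_id: str) -> bool:
    """Environment-state audit (hypothetical: a service the task set up is running)."""
    result = subprocess.run(
        ["docker", "exec", container_id, "pgrep", "-f", "nginx"],
        capture_output=True,
    )
    return result.returncode == 0

def llm_judge(transcript: str, rubric: str) -> bool:
    """Semantic verification via an LLM/VLM judge; plug in any judge model here."""
    raise NotImplementedError("stub: call a judge model with transcript + rubric")

def grade(workspace: str, container_id: str, transcript: str, rubric: str) -> bool:
    # One possible aggregation rule: a task passes only if all three layers agree.
    return (
        rule_checks(workspace)
        and audit_environment_state(container_id)
        and llm_judge(transcript, rubric)
    )
```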
Agent Harness Choice Dramatically Impacts Performance
The benchmark revealed significant sensitivity to implementation details. Switching the agent harness alone can shift a single model's performance by up to 18 percentage points, underscoring how much the execution infrastructure, not just the underlying model, shapes measured agent capability.
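One way to picture why this matters: the harness owns the agent loop (prompting, tool dispatch, retries), so two harnesses can drive the same model very differently on the same task. The interface below is a hypothetical abstraction, not the real API of any of the named harnesses.

```python
from typing import Protocol

class Harness(Protocol):
    """Hypothetical interface: each harness runs the same model in its own agent loop."""
    def run(self, model: str, task_prompt: str) -> str: ...

def compare_harnesses(
    harnesses: dict[str, Harness], model: str, task_prompt: str
) -> dict[str, str]:
    # Same model, same task; only the surrounding agent loop changes per run,
    # which is the variable the reported 18-point swing isolates.
    return {name: h.run(model, task_prompt) for name, h in harnesses.items()}
```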
Results Across 19 Frontier Models
The evaluation tested 19 frontier models under the WildClawBench framework. Claude Opus 4.7 achieved the highest overall score at 62.2% under the OpenClaw harness, while every other tested model remained below the 60% threshold. These results suggest that long-horizon agentic tasks in realistic environments remain challenging even for the most advanced language models currently available.
Open-Source Release for Reproducible Research
The research team has released the benchmark tasks, evaluation code, and containerized tooling to support reproducible evaluation of agent systems. This infrastructure enables researchers to test agent capabilities in realistic deployment environments rather than simplified synthetic settings.
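As a rough illustration of what reproducible, containerized evaluation looks like in practice, the snippet below launches a task container with the Docker SDK for Python. The image name, command, and mount path are assumptions; consult the released tooling for the actual entry points.

```python
import docker  # Docker SDK for Python (pip install docker)

client = docker.from_env()

# Hypothetical image and command; the benchmark ships its own containers.
logs = client.containers.run(
    image="wildclawbench/task-env:latest",                     # assumed image name
    command=["python", "run_task.py", "--task", "data-cleanup-07"],  # assumed entry point
    volumes={"/tmp/results": {"bind": "/workspace/results", "mode": "rw"}},
    remove=True,  # discard the container after the run so each trial starts clean
)
print(logs.decode())
```

Pinning the task environment to a container image is what makes runs comparable across labs: every evaluation starts from the same filesystem, tool versions, and services.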
Key Takeaways
- WildClawBench evaluates AI agents on 60 human-authored, bilingual, multimodal tasks averaging 8 minutes and over 20 tool calls each
- Claude Opus 4.7 achieved the highest score at 62.2%, with all other frontier models scoring below 60%
- Switching agent harness implementations can shift performance by up to 18 percentage points for the same model
- Tasks run in Docker containers with real tools and actual CLI harnesses rather than mock services
- The benchmark code, tasks, and containerized tooling have been open-sourced for reproducible research