Researchers from Stanford have released WildClawBench, a new benchmark that evaluates AI agents on real-world, long-horizon tasks in native runtime environments. Released via arXiv on May 11, 2026, the benchmark reveals that even the best-performing model, Claude Opus 4.7, achieves only a 62.2% success rate, with every other frontier model falling below 60%.
Real-World Tasks Replace Synthetic Sandboxes
WildClawBench consists of 60 human-authored, bilingual, multimodal tasks spanning six thematic categories. Each task averages approximately 8 minutes of wall-clock time and requires over 20 tool calls. The researchers note that "most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are deployed."
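To make the task description concrete, a task record in a benchmark of this shape might look like the sketch below. The field names and values are purely illustrative assumptions, not WildClawBench's actual schema.

```python
# Hypothetical task record for a WildClawBench-style benchmark.
# All field names and values here are invented for illustration;
# the paper's released task format may differ.
task = {
    "id": "data-cleanup-07",            # made-up identifier
    "category": "data wrangling",       # one of six thematic categories
    "languages": ["en", "zh"],          # tasks are bilingual
    "modalities": ["text", "image"],    # tasks are multimodal
    "avg_wall_clock_minutes": 8,        # benchmark-wide average, per the paper
    "avg_tool_calls": 20,               # benchmark-wide average, per the paper
    "instructions": "Clean the CSV files under /workspace/raw ...",
}
```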
Key technical features include:
- Tasks run inside reproducible Docker containers hosting actual CLI agent harnesses
- Support for OpenClaw, Claude Code, Codex, and Hermes Agent harnesses
- Access to real tools rather than mock services
- Hybrid grading combining deterministic rule-based checks, environment-state auditing, and LLM/VLM judges for semantic verification (see the sketch after this list)
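The following is a minimal sketch of how such a hybrid grader could compose the three verification layers. Everything here, including the function names, the specific checks, and the all-must-pass aggregation rule, is an assumption for illustration rather than the released evaluation code.

```python
import subprocess

def rule_checks(workspace: str) -> bool:
    """Deterministic rule-based check (hypothetical: a required output file exists)."""
    result = subprocess.run(["test", "-f", f"{workspace}/report.pdf"])
    return result.returncode == 0

def audit_environment_state(container_id: str) -> bool:
    """Environment-state audit (hypothetical: a service the task set up is running)."""
    result = subprocess.run(
        ["docker", "exec", container_id, "pgrep", "-f", "nginx"],
        capture_output=True,
    )
    return result.returncode == 0

def llm_judge(transcript: str, rubric: str) -> bool:
    """Semantic verification via an LLM/VLM judge; plug in any judge model here."""
    raise NotImplementedError("stub: call a judge model with transcript + rubric")

def grade(workspace: str, container_id: str, transcript: str, rubric: str) -> bool:
    # One possible aggregation rule: a task passes only if all three layers agree.
    return (
        rule_checks(workspace)
        and audit_environment_state(container_id)
        and llm_judge(transcript, rubric)
    )
```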
Agent Harness Choice Dramatically Impacts Performance
The benchmark revealed significant sensitivity to implementation details. Switching the agent harness alone can shift a single model's performance by up to 18 percentage points, underscoring how much the execution infrastructure, not just the underlying model, shapes measured agent capability.
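One way to picture why this matters: the harness owns the agent loop (prompting, tool dispatch, retries), so two harnesses can drive the same model very differently on the same task. The interface below is a hypothetical abstraction, not the real API of any of the named harnesses.

```python
from typing import Protocol

class Harness(Protocol):
    """Hypothetical interface: each harness runs the same model in its own agent loop."""
    def run(self, model: str, task_prompt: str) -> str: ...

def compare_harnesses(
    harnesses: dict[str, Harness], model: str, task_prompt: str
) -> dict[str, str]:
    # Same model, same task; only the surrounding agent loop changes per run,
    # which is the variable the reported 18-point swing isolates.
    return {name: h.run(model, task_prompt) for name, h in harnesses.items()}
```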
Results Across 19 Frontier Models
The evaluation tested 19 frontier models under the WildClawBench framework. Claude Opus 4.7 achieved the highest overall score at 62.2% under the OpenClaw harness, while every other tested model remained below the 60% threshold. These results suggest that long-horizon agentic tasks in realistic environments remain challenging even for the most advanced language models currently available.
Open-Source Release for Reproducible Research
The research team has released the benchmark tasks, evaluation code, and containerized tooling to support reproducible evaluation of agent systems. This infrastructure enables researchers to test agent capabilities in realistic deployment environments rather than simplified synthetic settings.
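As a rough illustration of what reproducible, containerized evaluation looks like in practice, the snippet below launches a task container with the Docker SDK for Python. The image name, command, and mount path are assumptions; consult the released tooling for the actual entry points.

```python
import docker  # Docker SDK for Python (pip install docker)

client = docker.from_env()

# Hypothetical image and command; the benchmark ships its own containers.
logs = client.containers.run(
    image="wildclawbench/task-env:latest",                     # assumed image name
    command=["python", "run_task.py", "--task", "data-cleanup-07"],  # assumed entry point
    volumes={"/tmp/results": {"bind": "/workspace/results", "mode": "rw"}},
    remove=True,  # discard the container after the run so each trial starts clean
)
print(logs.decode())
```

Pinning the task environment to a container image is what makes runs comparable across labs: every evaluation starts from the same filesystem, tool versions, and services.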
Key Takeaways
- WildClawBench evaluates AI agents on 60 human-authored, bilingual, multimodal tasks averaging 8 minutes and over 20 tool calls each
- Claude Opus 4.7 achieved the highest score at 62.2%, with all other frontier models scoring below 60%
- Switching agent harness implementations can shift performance by up to 18 percentage points for the same model
- Tasks run in Docker containers with real tools and actual CLI harnesses rather than mock services
- The benchmark code, tasks, and containerized tooling have been open-sourced for reproducible research