A new research paper published on arXiv exposes a critical limitation in current AI agents: despite matching human accuracy on document navigation tasks, they rely on brute-force search rather than strategic reasoning. The MADQA benchmark, containing 2,250 human-authored questions across 800 diverse PDF documents, reveals that agents compensate for weak strategic planning through exhaustive trial-and-error rather than calibrated decision-making.
Agents Match Human Accuracy But Lack Strategic Planning
The research, authored by Łukasz Borchmann, Jordy Van Landeghem, Michał Turski, and more than a dozen co-authors from multiple institutions, introduces a novel evaluation protocol measuring the accuracy-effort trade-off. This framework assesses not just whether agents reach correct answers, but how efficiently they do so. The findings show that while top-performing agents match human searchers in raw accuracy, they succeed on largely different questions and demonstrate fundamentally different problem-solving approaches.
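The paper's exact protocol isn't reproduced here, but the idea of scoring agents on both correctness and cost can be sketched in a few lines. In this illustrative example, `Episode`, `steps`, and the `budget` parameter are assumptions standing in for whatever effort measure the benchmark actually uses:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Episode:
    correct: bool  # did the agent answer this question correctly?
    steps: int     # navigation actions spent (a simple effort proxy)

def accuracy_effort(episodes: list[Episode], budget: int) -> tuple[float, float, float]:
    """Return (overall accuracy, mean effort, accuracy within an effort budget)."""
    acc = mean(e.correct for e in episodes)
    effort = mean(e.steps for e in episodes)
    # Count an answer as a success only if it was both correct and cheap enough.
    acc_at_budget = mean(e.correct and e.steps <= budget for e in episodes)
    return acc, effort, acc_at_budget
```

Under a metric like this, two agents with identical raw accuracy separate cleanly once effort is counted: a strategic searcher keeps its accuracy at a small budget, while a brute-force searcher loses most of its correct answers.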
Most critically, agents fail to close a nearly 20% gap to oracle performance, repeatedly falling into unproductive loops. The gap remains even when agents reach human-level accuracy on certain tasks, indicating that current systems lack the strategic navigation abilities that let humans work efficiently through document collections.
Benchmark Designed to Distinguish Capability Levels
MADQA was specifically designed using Classical Test Theory to maximize discriminative power across varying levels of agentic abilities. Rather than simply measuring average performance, the benchmark clearly distinguishes between different capability levels, making it easier to identify where agents fall short of genuine reasoning.
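MADQA's actual item statistics aren't published in this summary, but "discriminative power" in Classical Test Theory is commonly operationalized as the corrected item-total correlation: how strongly success on one question correlates with success on the rest. A minimal sketch, assuming a binary correct/incorrect response matrix (all names here are illustrative, not from the paper):

```python
from statistics import mean, pstdev

def item_discrimination(responses: list[list[int]]) -> list[float]:
    """Corrected item-total correlation for each question (a standard CTT
    discrimination index).

    responses[s][q] is 1 if solver s answered question q correctly, else 0.
    """
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    out = []
    for q in range(n_items):
        item = [row[q] for row in responses]
        rest = [t - i for t, i in zip(totals, item)]  # total minus this item
        mi, mr = mean(item), mean(rest)
        # Pearson correlation between the item score and the rest-score.
        cov = mean((i - mi) * (r - mr) for i, r in zip(item, rest))
        si, sr = pstdev(item), pstdev(rest)
        out.append(cov / (si * sr) if si and sr else 0.0)
    return out
```

Questions with a high index separate strong solvers from weak ones; questions near zero (everyone succeeds or everyone fails) add little information, which is why a benchmark built to distinguish capability levels would favor the former.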
The researchers evaluated how well multimodal agents navigate document collections, specifically testing whether systems employ strategic reasoning or default to computational brute force. The results challenge the narrative that current agents possess human-like reasoning capabilities, instead revealing their dependence on exhaustive search methods.
Dataset Released to Advance Agent Development
The research team has released both the dataset and evaluation harness to facilitate the transition from brute-force retrieval to calibrated, efficient reasoning in AI systems. By providing standardized tools for measuring the accuracy-effort trade-off, the researchers aim to shift development focus toward agents that demonstrate genuine strategic planning rather than simply compensating through computational power.
Key Takeaways
- MADQA benchmark contains 2,250 questions across 800 PDF documents, designed to test strategic reasoning versus brute-force search in AI agents
- Top agents match human accuracy but succeed on different questions and rely on exhaustive trial-and-error rather than strategic planning
- Agents fail to close a nearly 20% gap to oracle performance, repeatedly falling into unproductive loops
- The benchmark uses Classical Test Theory to maximize discriminative power, clearly distinguishing between different levels of agentic capability
- Researchers released the dataset and evaluation harness to facilitate development of agents with genuine strategic reasoning abilities