Frontier language models including Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 corrupt an average of 25% of document content by the end of long workflows, according to research published on arXiv. The study, authored by Philippe Laban, Tobias Schnabel, and Jennifer Neville, evaluated 19 LLMs using a new benchmark called DELEGATE-52 and found that agentic tool use did not improve reliability.
DELEGATE-52 Benchmark Tests Multi-Step Document Editing
The researchers introduced DELEGATE-52, a benchmark examining how reliably AI systems perform delegated tasks across extended workflows. The benchmark simulates professional document editing scenarios across 52 domains including coding, crystallography, music notation, and other specialized fields. The evaluation tests whether models can faithfully execute tasks without introducing errors that accumulate through extended interactions.
Key findings from the research include:
- Frontier models corrupt an average of 25% of document content by workflow end
- Degradation worsened with larger documents, longer interactions, and presence of distractor files
- Errors tend to be sparse but severe, compounding silently over time
- Agentic tool use did not improve reliability
- Models below the frontier tier performed even worse
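The paper's exact scoring method is not described here, but the headline "25% of document content corrupted" figure can be made concrete with a simple line-level metric: compare the model's final document against the expected (gold) result and count how many lines survived intact. The function name and metric below are illustrative assumptions, not the benchmark's actual implementation.

```python
import difflib

def corruption_rate(expected: str, actual: str) -> float:
    """Fraction of expected lines that are missing or altered in the
    model's output. Illustrative metric only; DELEGATE-52's actual
    scoring may differ."""
    expected_lines = expected.splitlines()
    actual_lines = actual.splitlines()
    if not expected_lines:
        return 0.0
    matcher = difflib.SequenceMatcher(a=expected_lines, b=actual_lines)
    # get_matching_blocks() returns runs of lines preserved verbatim.
    preserved = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - preserved / len(expected_lines)

gold = "line1\nline2\nline3\nline4"
model_output = "line1\nLINE2-corrupted\nline3\nline4"
print(corruption_rate(gold, model_output))  # 0.25
```

A metric like this captures the "sparse but severe" pattern the authors describe: a single corrupted line in a four-line document already registers as 25% loss.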
Implications for AI Agent Reliability
The finding that agentic tool use did not improve results is particularly significant, as it suggests architectural improvements alone will not solve the fundamental problem of models losing track of document state over time. The 25% corruption rate makes these systems unreliable for production use without human verification at each step, directly challenging the premise of AI agents performing extended autonomous work.
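Verification at each step need not be entirely manual. One cheap automated guard, sketched below under assumed names (`verified_apply`, `max_changed_lines` are hypothetical, not from the paper), is to diff the document before and after every agent edit and flag any step whose diff is larger than the requested change should be; silent corruption then surfaces immediately instead of compounding.

```python
import difflib

def step_diff(before: str, after: str) -> list[str]:
    """Unified diff of a single agent step, suitable for human review."""
    return list(difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile="before", tofile="after", lineterm=""))

def verified_apply(doc: str, agent_step, max_changed_lines: int = 5) -> str:
    """Apply one agent edit, rejecting it if the diff touches more lines
    than expected -- a crude guard against sparse, silent corruption
    accumulating across a long workflow."""
    candidate = agent_step(doc)
    changed = [line for line in step_diff(doc, candidate)
               if line.startswith(("+", "-"))
               and not line.startswith(("+++", "---"))]
    if len(changed) > max_changed_lines:
        raise ValueError(
            f"edit touched {len(changed)} diff lines; flagging for review")
    return candidate

# A small, targeted edit passes the guard:
print(verified_apply("a\nb\nc", lambda d: d.replace("b", "B")))
```

The threshold is a policy knob: a step asked to fix one typo should not be rewriting half the file, and the guard converts that expectation into an automatic check.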
The research gained significant attention on Hacker News, reaching 347 points and 133 comments on May 9, 2026. Discussion focused on real-world experiences with LLM document corruption and whether the benchmark accurately reflects actual usage patterns. Developers shared anecdotes of similar corruption issues in production systems, echoing the paper's findings.
Current LLMs Deemed Unreliable for Delegated Work
The researchers concluded that current LLMs are unreliable for delegated work scenarios due to silent document corruption. The sparse but severe nature of errors means problems may not be immediately apparent but compound over time, creating technical debt and potential data loss. The finding applies even to the most capable frontier models, suggesting this is a fundamental limitation rather than a training or scale issue.
The research highlights a critical gap between LLM capabilities in isolated tasks versus extended workflows. While models may perform well on individual editing operations, maintaining document integrity across multi-step processes remains an unsolved challenge that limits practical deployment of AI agents.
Key Takeaways
- Frontier LLMs including Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 corrupt an average of 25% of document content in extended workflows
- The DELEGATE-52 benchmark evaluated 19 LLMs across 52 professional domains to measure reliability in delegated tasks
- Agentic tool use did not improve reliability, suggesting architectural changes alone cannot solve the problem
- Errors are sparse but severe, compounding silently over time and worsening with larger documents and longer interactions
- Current LLMs are deemed unreliable for production delegated work without human verification at each step