Frontier language models including Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 corrupt an average of 25% of document content by the end of long workflows, according to research published on arXiv. The study, authored by Philippe Laban, Tobias Schnabel, and Jennifer Neville, evaluated 19 LLMs using a new benchmark called DELEGATE-52 and found that agentic tool use did not improve reliability.
DELEGATE-52 Benchmark Tests Multi-Step Document Editing
The researchers introduced DELEGATE-52, a benchmark examining how reliably AI systems perform delegated tasks across extended workflows. The benchmark simulates professional document editing scenarios across 52 domains including coding, crystallography, music notation, and other specialized fields. The evaluation tests whether models can faithfully execute tasks without introducing errors that accumulate through extended interactions.
Key findings from the research include:
- Frontier models corrupt an average of 25% of document content by workflow end
- Degradation worsened with larger documents, longer interactions, and presence of distractor files
- Errors tend to be sparse but severe, compounding silently over time
- Agentic tool use did not improve reliability
- Models below the frontier tier performed even worse
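The paper's exact scoring method is not described here, but the headline "25% of document content corrupted" figure can be made concrete with a simple line-level metric: compare the model's final document against the expected (gold) result and count how many lines survived intact. The function name and metric below are illustrative assumptions, not the benchmark's actual implementation.

```python
import difflib

def corruption_rate(expected: str, actual: str) -> float:
    """Fraction of expected lines that are missing or altered in the
    model's output. Illustrative metric only; DELEGATE-52's actual
    scoring may differ."""
    expected_lines = expected.splitlines()
    actual_lines = actual.splitlines()
    if not expected_lines:
        return 0.0
    matcher = difflib.SequenceMatcher(a=expected_lines, b=actual_lines)
    # get_matching_blocks() returns runs of lines preserved verbatim.
    preserved = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - preserved / len(expected_lines)

gold = "line1\nline2\nline3\nline4"
model_output = "line1\nLINE2-corrupted\nline3\nline4"
print(corruption_rate(gold, model_output))  # 0.25
```

A metric like this captures the "sparse but severe" pattern the authors describe: a single corrupted line in a four-line document already registers as 25% loss.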
Implications for AI Agent Reliability
The finding that agentic tool use did not improve results is particularly significant, as it suggests architectural improvements alone will not solve the fundamental problem of models losing track of document state over time. The 25% corruption rate makes these systems unreliable for production use without human verification at each step, directly challenging the premise of AI agents performing extended autonomous work.
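Verification at each step need not be entirely manual. One cheap automated guard, sketched below under assumed names (`verified_apply`, `max_changed_lines` are hypothetical, not from the paper), is to diff the document before and after every agent edit and flag any step whose diff is larger than the requested change should be; silent corruption then surfaces immediately instead of compounding.

```python
import difflib

def step_diff(before: str, after: str) -> list[str]:
    """Unified diff of a single agent step, suitable for human review."""
    return list(difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile="before", tofile="after", lineterm=""))

def verified_apply(doc: str, agent_step, max_changed_lines: int = 5) -> str:
    """Apply one agent edit, rejecting it if the diff touches more lines
    than expected -- a crude guard against sparse, silent corruption
    accumulating across a long workflow."""
    candidate = agent_step(doc)
    changed = [line for line in step_diff(doc, candidate)
               if line.startswith(("+", "-"))
               and not line.startswith(("+++", "---"))]
    if len(changed) > max_changed_lines:
        raise ValueError(
            f"edit touched {len(changed)} diff lines; flagging for review")
    return candidate

# A small, targeted edit passes the guard:
print(verified_apply("a\nb\nc", lambda d: d.replace("b", "B")))
```

The threshold is a policy knob: a step asked to fix one typo should not be rewriting half the file, and the guard converts that expectation into an automatic check.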
The research gained significant attention on Hacker News, reaching 347 points and 133 comments on May 9, 2026. Discussion focused on real-world experiences with LLM document corruption and whether the benchmark accurately reflects actual usage patterns. Developers shared anecdotes of similar corruption issues in production systems, echoing the paper's findings.
Current LLMs Deemed Unreliable for Delegated Work
The researchers concluded that current LLMs are unreliable for delegated work scenarios due to silent document corruption. The sparse but severe nature of errors means problems may not be immediately apparent but compound over time, creating technical debt and potential data loss. The finding applies even to the most capable frontier models, suggesting this is a fundamental limitation rather than a training or scale issue.
The research highlights a critical gap between LLM capabilities in isolated tasks versus extended workflows. While models may perform well on individual editing operations, maintaining document integrity across multi-step processes remains an unsolved challenge that limits practical deployment of AI agents.
Key Takeaways
- Frontier LLMs including Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 corrupt an average of 25% of document content in extended workflows
- The DELEGATE-52 benchmark evaluated 19 LLMs across 52 professional domains to measure reliability in delegated tasks
- Agentic tool use did not improve reliability, suggesting architectural changes alone cannot solve the problem
- Errors are sparse but severe, compounding silently over time and worsening with larger documents and longer interactions
- Current LLMs are deemed unreliable for production delegated work without human verification at each step