Researchers from Johns Hopkins University and the University of Washington have identified a critical weakness in LLM agents: their inability to properly handle instruction conflicts across many privilege levels. In a paper published April 10, 2026, the team introduced ManyIH-Bench, the first benchmark specifically designed to test how well models resolve conflicts between instructions carrying different authority levels. Frontier models achieved only 40% accuracy when conflicts scaled to 12 or more privilege tiers.
Current Privilege Models Assume Too Few Authority Levels
LLM agents receive instructions from numerous sources: system messages, user prompts, tool outputs, environment observations, API responses, configuration files, code comments, and documentation snippets. Each source carries different trust levels and authority, but current approaches assume a fixed, small set of privilege levels—typically fewer than five—with rigid role labels like "system > user."
This simple hierarchy breaks down in real-world agentic scenarios. A coding agent, for example, might encounter conflicting instructions from system prompts, user requirements, tool error messages, test outputs, and inline code comments simultaneously. Current models lack the capability to properly order and resolve these conflicts.
ManyIH-Bench Tests Up to 12 Conflicting Instruction Tiers
The Many-Tier Instruction Hierarchy (ManyIH) paradigm addresses conflicts among instructions with arbitrarily many privilege levels, not just three to five. The associated ManyIH-Bench benchmark includes:
- 853 total agentic tasks: 427 coding tasks and 426 instruction-following tasks
- 12 levels of conflicting instructions with varying privileges in some scenarios
- 46 real-world agent types and scenarios tested
- LLM-generated constraints verified by human evaluators for realistic difficulty
The benchmark creates scenarios where models must identify and follow the highest-privilege instruction while ignoring lower-authority conflicting directives—a fundamental requirement for safe agent deployment.
Frontier Models Struggle With Privilege Escalation
Testing revealed significant performance degradation as the number of privilege tiers increased. State-of-the-art models achieved approximately 40% accuracy when instruction conflicts scaled to many tiers, far below the reliability threshold needed for production deployment.
The research highlights why agentic settings fundamentally change the challenge of instruction hierarchy:
- Stochastic transitions between states introduce unpredictability
- Partial observability limits what models can see about instruction sources
- Long horizons of 100+ turns (100,000 to 1,000,000 tokens) amplify error propagation
- Many instruction sources create complex privilege ordering requirements
Implications for Agent Safety and Deployment
The findings underscore an urgent need for methods that handle fine-grained, scalable instruction conflict resolution. Current instruction hierarchy approaches are inadequate for real-world agent deployment where privilege levels are numerous, dynamic, and context-dependent.
The shift from reasoning to agentic reinforcement learning fundamentally changes the credit assignment landscape. Models must not only understand privilege hierarchies conceptually but maintain that understanding across extended interactions with multiple conflicting information sources.
The paper was authored by Jingyu Zhang, Tianjian Li, William Jurayj, Hongyuan Zhan, Benjamin Van Durme, and Daniel Khashabi.
Key Takeaways
- Frontier LLMs achieve only 40% accuracy when resolving conflicts between 12 or more instruction privilege levels
- ManyIH-Bench introduces 853 agentic tasks testing instruction conflicts across 46 real-world agent scenarios
- Current privilege models assume too few authority levels (typically under 5) for real-world agentic applications
- Agentic settings introduce stochastic transitions, partial observability, and 100+ turn horizons that amplify instruction conflict challenges
- The research highlights an urgent need for scalable instruction hierarchy methods before widespread agent deployment