Many-Tier Instruction Hierarchy Benchmark Exposes LLM Agent Privilege Escalation Failures

Researchers from Johns Hopkins University and the University of Washington have identified a critical weakness in LLM agents: their inability to properly handle instruction conflicts across many privilege levels. In a paper published April 10, 2026, the team introduced ManyIH-Bench, the first benchmark specifically designed to test how well models resolve conflicts between instructions carrying different authority levels. Frontier models achieved only 40% accuracy when conflicts scaled to 12 or more privilege tiers.

Current Privilege Models Assume Too Few Authority Levels

LLM agents receive instructions from numerous sources: system messages, user prompts, tool outputs, environment observations, API responses, configuration files, code comments, and documentation snippets. Each source carries different trust levels and authority, but current approaches assume a fixed, small set of privilege levels—typically fewer than five—with rigid role labels like "system > user."

This simple hierarchy breaks down in real-world agentic scenarios. A coding agent, for example, might encounter conflicting instructions from system prompts, user requirements, tool error messages, test outputs, and inline code comments simultaneously. Current models lack the capability to properly order and resolve these conflicts.

ManyIH-Bench Tests Up to 12 Conflicting Instruction Tiers

The Many-Tier Instruction Hierarchy (ManyIH) paradigm addresses conflicts among instructions with arbitrarily many privilege levels, not just three to five. The associated ManyIH-Bench benchmark includes:

853 total agentic tasks: 427 coding tasks and 426 instruction-following tasks
12 levels of conflicting instructions with varying privileges in some scenarios
46 real-world agent types and scenarios tested
LLM-generated constraints verified by human evaluators for realistic difficulty

The benchmark creates scenarios where models must identify and follow the highest-privilege instruction while ignoring lower-authority conflicting directives—a fundamental requirement for safe agent deployment.

Frontier Models Struggle With Privilege Escalation

Testing revealed significant performance degradation as the number of privilege tiers increased. State-of-the-art models achieved approximately 40% accuracy when instruction conflicts scaled to many tiers, far below the reliability threshold needed for production deployment.

The research highlights why agentic settings fundamentally change the challenge of instruction hierarchy:

Stochastic transitions between states introduce unpredictability
Partial observability limits what models can see about instruction sources
Long horizons of 100+ turns (100,000 to 1,000,000 tokens) amplify error propagation
Many instruction sources create complex privilege ordering requirements

Implications for Agent Safety and Deployment

The findings underscore an urgent need for methods that handle fine-grained, scalable instruction conflict resolution. Current instruction hierarchy approaches are inadequate for real-world agent deployment where privilege levels are numerous, dynamic, and context-dependent.

The shift from reasoning to agentic reinforcement learning fundamentally changes the credit assignment landscape. Models must not only understand privilege hierarchies conceptually but maintain that understanding across extended interactions with multiple conflicting information sources.

The paper was authored by Jingyu Zhang, Tianjian Li, William Jurayj, Hongyuan Zhan, Benjamin Van Durme, and Daniel Khashabi.

Key Takeaways

Frontier LLMs achieve only 40% accuracy when resolving conflicts between 12 or more instruction privilege levels
ManyIH-Bench introduces 853 agentic tasks testing instruction conflicts across 46 real-world agent scenarios
Current privilege models assume too few authority levels (typically under 5) for real-world agentic applications
Agentic settings introduce stochastic transitions, partial observability, and 100+ turn horizons that amplify instruction conflict challenges
The research highlights an urgent need for scalable instruction hierarchy methods before widespread agent deployment

Current Privilege Models Assume Too Few Authority Levels

ManyIH-Bench Tests Up to 12 Conflicting Instruction Tiers

853 total agentic tasks: 427 coding tasks and 426 instruction-following tasks

12 levels of conflicting instructions with varying privileges in some scenarios

46 real-world agent types and scenarios tested

LLM-generated constraints verified by human evaluators for realistic difficulty

Frontier Models Struggle With Privilege Escalation

The research highlights why agentic settings fundamentally change the challenge of instruction hierarchy:

Stochastic transitions between states introduce unpredictability

Partial observability limits what models can see about instruction sources

Long horizons of 100+ turns (100,000 to 1,000,000 tokens) amplify error propagation

Many instruction sources create complex privilege ordering requirements

Implications for Agent Safety and Deployment

The paper was authored by Jingyu Zhang, Tianjian Li, William Jurayj, Hongyuan Zhan, Benjamin Van Durme, and Daniel Khashabi.

Key Takeaways

Frontier LLMs achieve only 40% accuracy when resolving conflicts between 12 or more instruction privilege levels

ManyIH-Bench introduces 853 agentic tasks testing instruction conflicts across 46 real-world agent scenarios

Current privilege models assume too few authority levels (typically under 5) for real-world agentic applications

Agentic settings introduce stochastic transitions, partial observability, and 100+ turn horizons that amplify instruction conflict challenges

The research highlights an urgent need for scalable instruction hierarchy methods before widespread agent deployment