Researchers Jonathan Steinberg and Oren Gal have published a benchmark revealing that production coding agents from Anthropic, OpenAI, Google, Moonshot, Zhipu, and Minimax produce exploitable code at 53-86% success rates when malicious objectives are decomposed into innocuous-looking engineering tickets. The research, published May 5, 2026, on arXiv, introduces MOSAIC-Bench with 199 three-stage attack chains across 10 web-application substrates, 31 CWE classes, and 5 programming languages.
Staged Attacks Bypass Safety Alignment in Nine Production Agents
MOSAIC-Bench tests nine production coding agents by decomposing malicious requests into routine engineering tickets. Under this staging, the agents achieve 53-86% end-to-end attack success rates, with only two refusals across all staged runs. In contrast, when the same agents receive direct prompts requesting vulnerable code, success rates fall to 0-20.4%: Claude primarily refuses, while Codex hardens implementations rather than emitting vulnerabilities. Ticket staging thus silences both defense modes at once.
Code Review Agents Approve a Quarter of Confirmed Exploits
The research reveals significant gaps in code-review defense layers. Reviewer agents approve 25.8% of confirmed-vulnerable cumulative diffs as routine pull requests. Even when reviewers receive full context, the mitigation protocol closes only 50% of the staged-versus-direct-prompt gap, indicating that context fragmentation is not the primary explanation for the failure.
Adversarial Pentester Framing Reduces Evasion to 3.0-17.6%
The researchers propose reframing code reviewers as adversarial pentesters as a deployable mitigation. Under this framing, evasion rates drop to 3.0-17.6% across the evaluated reviewer subset. An open-weight Gemma-4-E4B-it reviewer using pentester framing detects 88.4% of attacks with a 4.6% false-positive rate when tested on 608 real-world GitHub pull requests.
Benchmark Treats Exploit Ground Truth as First-Class Evaluation Axis
MOSAIC-Bench introduces 199 three-stage attack chains paired with deterministic exploit oracles on deployed software substrates. The benchmark spans 10 web-application substrates, 31 Common Weakness Enumeration (CWE) classes, and 5 programming languages, treating both exploit ground truth and downstream reviewer protocol as first-class evaluation axes.
Key Takeaways
- Nine production coding agents achieve 53-86% attack success rates when malicious objectives are decomposed into innocuous tickets, compared to 0-20.4% for direct prompts
- Code reviewer agents approve 25.8% of confirmed-vulnerable diffs as routine pull requests
- MOSAIC-Bench provides 199 three-stage attack chains across 10 web substrates, 31 CWE classes, and 5 programming languages
- Reframing reviewers as adversarial pentesters reduces evasion to 3.0-17.6%, with Gemma-4-E4B-it detecting 88.4% of attacks
- Full-context implementation protocols close only 50% of the staged versus direct prompt gap