A comprehensive study published May 5, 2026, reveals that clinical large language models follow fundamentally different scaling laws for safety compared to accuracy—challenging the assumption that higher benchmark performance implies safer medical AI. The research, conducted by Sebastian Wind and colleagues, introduces SaFE-Scale, a framework demonstrating that clinical LLM safety is "not a passive consequence of scaling, but a deployment property shaped by evidence quality, retrieval design, and context construction."
Clean Evidence Produces Dramatic Safety Improvements
The study evaluated 34 locally deployed LLMs using RadSaFE-200, a Radiology Safety-Focused Evaluation benchmark of 200 multiple-choice questions with clinician-defined evidence and option-level labels for high-risk errors. Supplying that clean, clinician-defined evidence directly in the model's context produced dramatic gains on both accuracy and safety metrics:
- Accuracy increase: From 73.5% to 94.1%
- High-risk error reduction: From 12.0% to 2.6%
- Contradiction reduction: From 12.7% to 2.3%
- Dangerous overconfidence reduction: From 8.0% to 1.6%
These improvements far exceeded gains from other interventions, including increased model scale or inference-time compute.
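To make the reported rates concrete, the sketch below shows one way option-level safety labels could be aggregated into metrics like those above. The record fields, the confidence threshold, and the exact definition of dangerous overconfidence are illustrative assumptions, not the paper's scoring code.

```python
from dataclasses import dataclass

@dataclass
class GradedAnswer:
    correct: bool               # model chose the clinician-keyed option
    high_risk: bool             # chosen option carries a high-risk label
    contradicts_evidence: bool  # answer conflicts with the supplied evidence
    confidence: float           # model's stated confidence, 0.0 to 1.0

def safety_metrics(answers, overconfidence_threshold=0.9):
    """Aggregate per-question graded answers into benchmark-style rates."""
    n = len(answers)
    return {
        "accuracy": sum(a.correct for a in answers) / n,
        "high_risk_error_rate": sum(a.high_risk for a in answers) / n,
        "contradiction_rate": sum(a.contradicts_evidence for a in answers) / n,
        # Assumed definition: a wrong, high-risk answer given with high confidence.
        "dangerous_overconfidence_rate": sum(
            (not a.correct) and a.high_risk and a.confidence >= overconfidence_threshold
            for a in answers
        ) / n,
    }
```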
RAG Systems Fail to Reproduce Safety Profile
Despite their widespread use in clinical applications, Retrieval-Augmented Generation (RAG) systems showed critical limitations:
- Standard RAG: Did not reproduce clean evidence's safety profile
- Agentic RAG: Improved accuracy over standard RAG and reduced contradictions, but high-risk errors and dangerous overconfidence remained elevated
- Max-context prompting: Increased latency without closing the safety gap
- Additional inference-time compute: Produced only limited gains
The research tested six deployment conditions: closed-book prompting (zero-shot), clean evidence, conflict evidence, standard RAG, agentic RAG, and max-context prompting.
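A minimal sketch of how context construction might differ across three of those conditions is shown below; the `build_prompt` helper, the prompt template, and the `retriever` callable are hypothetical stand-ins, not the study's actual pipeline. The remaining conditions would presumably swap in conflicting evidence, an iterative retrieval loop, or a much larger retrieved context.

```python
def build_prompt(question, options, condition, clean_evidence=None, retriever=None):
    """Assemble the model context for one multiple-choice question
    under a given deployment condition (illustrative only)."""
    header = "Answer the radiology question by selecting one option.\n"
    choices = "\n".join(f"{label}. {text}" for label, text in options.items())

    if condition == "closed_book":
        context = ""  # zero-shot: no evidence in the context
    elif condition == "clean_evidence":
        context = f"Evidence (clinician-curated):\n{clean_evidence}\n"
    elif condition == "standard_rag":
        passages = retriever(question, k=5)  # top-k retrieved passages
        context = "Retrieved evidence:\n" + "\n".join(passages) + "\n"
    else:
        raise ValueError(f"unsupported condition: {condition}")

    return f"{header}{context}Question: {question}\n{choices}\nAnswer:"
```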
Worst-Case Analysis Reveals Concentrated Clinical Risk
The study's worst-case analysis showed that clinically consequential errors were concentrated in a small subset of questions. In medicine, the researchers note, "a few confident, high-risk, or evidence-contradicting errors can matter more than average benchmark performance." This suggests that mean accuracy metrics fail to capture the tail-risk behavior most relevant to clinical AI deployment.
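One way to quantify that concentration is to ask what share of high-risk errors falls on the worst-scoring questions. The sketch below is an illustrative tail-risk view under that assumption; the function name and the pooled error-log format are hypothetical, not the paper's analysis method.

```python
from collections import Counter

def high_risk_concentration(error_log, top_fraction=0.1):
    """Share of high-risk errors that falls on the worst `top_fraction`
    of questions with at least one such error.

    error_log: iterable of (question_id, is_high_risk_error) pairs,
               pooled across models and deployment conditions.
    """
    per_question = Counter(q for q, is_high_risk in error_log if is_high_risk)
    total = sum(per_question.values())
    if total == 0:
        return 0.0
    k = max(1, int(top_fraction * len(per_question)))
    worst_k = sum(count for _, count in per_question.most_common(k))
    return worst_k / total
```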
Evidence Quality Emerges as Primary Safety Determinant
The core finding challenges current LLM development paradigms: clinical safety is shaped primarily by evidence quality, retrieval design, and context construction rather than model scale alone. The research indicates that deployment decisions—particularly around evidence curation and retrieval strategy—have greater safety implications for medical AI than selecting larger models or allocating more inference-time compute.
The paper, authored by Sebastian Wind, Tri-Thien Nguyen, Jeta Sopa, Mahshad Lotfinia, Sebastian Bickelhaup, Michael Uder, Harald Köstler, Gerhard Wellein, Sven Nebelung, Daniel Truhn, Andreas Maier, and Soroosh Tayebi Arasteh, is available as arXiv:2605.04039.
Key Takeaways
- Clinical LLM safety follows different scaling laws than accuracy, with evidence quality being the primary safety determinant rather than model scale
- Clean evidence increased mean accuracy from 73.5% to 94.1% while reducing high-risk errors from 12.0% to 2.6%
- Standard RAG and agentic RAG failed to reproduce clean evidence's safety profile, with high-risk errors and dangerous overconfidence remaining elevated
- Worst-case analysis showed clinically consequential errors concentrate in a small subset of questions, where a few confident mistakes matter more than average performance
- The study evaluated 34 LLMs using RadSaFE-200, a 200-question radiology benchmark with clinician-defined evidence and option-level safety labels