A new diagnostic framework reveals that LLM-as-judge systems suffer from widespread per-input inconsistencies that deceptively low aggregate statistics conceal. The research, published on arXiv on April 16, 2026, introduces a two-pronged toolkit for probing LLM judge reliability and shows that 33-67% of documents exhibit logical inconsistencies during evaluation.
Transitivity Analysis Reveals Hidden Inconsistencies
The research team analyzed transitivity violations: cases where a judge ranks A above B and B above C, yet also C above A, a logical contradiction known as a directed 3-cycle. While aggregate violation rates appear low at 0.8%-4.1%, that figure masks significant per-document problems. The critical finding: 33-67% of documents exhibit at least one directed 3-cycle, meaning the judge's rankings are internally contradictory for those specific inputs.
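Detecting a directed 3-cycle is a matter of checking every triple of items for a circular preference. The sketch below illustrates the idea on a hypothetical pairwise-preference mapping; the paper's actual data format and counting procedure are not specified here.

```python
from itertools import combinations

def count_3cycles(prefers):
    """Count directed 3-cycles among items with pairwise preferences.

    `prefers` maps an ordered pair (a, b) to True when the judge
    ranked a strictly above b (a hypothetical representation).
    Missing pairs and ties contribute no edge.
    """
    items = sorted({x for pair in prefers for x in pair})
    cycles = 0
    for a, b, c in combinations(items, 3):
        # A 3-cycle exists in either orientation: a>b>c>a or a>c>b>a.
        if (prefers.get((a, b)) and prefers.get((b, c)) and prefers.get((c, a))) or \
           (prefers.get((a, c)) and prefers.get((c, b)) and prefers.get((b, a))):
            cycles += 1
    return cycles

# Toy example: the judge says A > B and B > C, but also C > A.
judgments = {("A", "B"): True, ("B", "C"): True, ("C", "A"): True}
```

On the toy `judgments` above, `count_3cycles` reports one cycle; replacing `("C", "A")` with the transitive `("A", "C")` yields zero.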
The study evaluated 1,918 instances from SummEval, a standard summarization evaluation benchmark, across four different judge models and four evaluation criteria: relevance, coherence, fluency, and consistency.
Conformal Prediction Sets Provide Reliability Guarantees
The second diagnostic approach uses split conformal prediction sets, which carry a finite-sample guarantee of at least (1 − α) coverage over the 1-5 Likert scores. The width of the prediction set serves as a per-instance reliability indicator: wider sets signal lower confidence. Set width correlates strongly with ground-truth disagreement (Spearman's rs = +0.576, N = 1,918, p < 10^-100, pooled across all judges).
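The split conformal recipe is generic: calibrate a nonconformity threshold on held-out labeled data, then include every Likert label that falls within that threshold. A minimal sketch, assuming per-label probabilities as scores (the paper's exact score function may differ):

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction sets over Likert labels 1..5.

    cal_probs / test_probs: (n, 5) arrays of a judge's per-label
    probabilities (a hypothetical score; the paper's choice may differ).
    cal_labels: true calibration labels in {1..5}.
    Nonconformity score: 1 - probability assigned to the true label.
    """
    n = len(cal_labels)
    nonconf = 1.0 - cal_probs[np.arange(n), np.asarray(cal_labels) - 1]
    # Finite-sample-corrected quantile gives coverage >= 1 - alpha.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(nonconf, level, method="higher")
    # Include every label whose nonconformity is within the threshold.
    return [set((np.where(1.0 - row <= q)[0] + 1).tolist())
            for row in test_probs]
```

The length of each returned set is the per-instance width statistic discussed above: a confident, well-calibrated judge yields small sets, while a hard document forces the set to grow.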
Crucially, prediction set width shows consistent cross-judge agreement (r̄ = 0.32-0.38), demonstrating it captures document-level difficulty rather than judge-specific noise. This means the reliability issues are inherent to the documents being judged, not just artifacts of particular judge models.
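Cross-judge agreement of this kind can be summarized as the mean pairwise correlation between judges' width vectors. A sketch assuming widths are stored as a judges × documents array (the paper's exact correlation variant is not specified here; Pearson is used for illustration):

```python
import numpy as np
from itertools import combinations

def mean_pairwise_corr(widths):
    """Mean pairwise Pearson correlation between judges' width vectors.

    `widths` is a (num_judges, num_documents) array of per-document
    prediction-set widths, one row per judge (hypothetical layout).
    """
    corrs = [np.corrcoef(widths[i], widths[j])[0, 1]
             for i, j in combinations(range(len(widths)), 2)]
    return float(np.mean(corrs))
```

A value well above zero, as reported here (r̄ = 0.32-0.38), indicates that different judges find the same documents hard, pointing to document-level difficulty rather than judge-specific noise.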
Evaluation Criterion Matters More Than Judge Choice
Both diagnostic approaches converge on the same conclusion: the evaluation criterion matters more than which judge model is selected. Reliability varies sharply by criterion, as measured by average prediction set size (smaller = more reliable; a size near 5 means the set spans almost the entire 1-5 scale):
- Relevance: Most reliable (avg. set size ≈ 3.0)
- Coherence: Moderately reliable (avg. set size ≈ 3.9)
- Fluency: Unreliable (avg. set size ≈ 4.9)
- Consistency: Unreliable (avg. set size ≈ 4.9)
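Per-criterion averages like those above reduce to a simple group-by over per-instance prediction sets. A minimal sketch, assuming the cached sets are available as (criterion, prediction_set) pairs, a hypothetical format:

```python
from collections import defaultdict
from statistics import mean

def avg_set_size(records):
    """Average prediction-set size per evaluation criterion.

    `records` is an iterable of (criterion, prediction_set) pairs,
    a hypothetical layout for the cached per-instance sets.
    """
    by_criterion = defaultdict(list)
    for criterion, pred_set in records:
        by_criterion[criterion].append(len(pred_set))
    return {c: mean(sizes) for c, sizes in by_criterion.items()}

# Toy records: two relevance instances with 3-label sets,
# one fluency instance where the set covers all 5 labels.
records = [("relevance", {2, 3, 4}), ("relevance", {3, 4, 5}),
           ("fluency", {1, 2, 3, 4, 5})]
```

Here the toy fluency set spans all five labels, mirroring how an average size of ≈ 4.9 means the judge's prediction sets are nearly uninformative for that criterion.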
Implications for Automatic Evaluation Systems
LLM-as-judge frameworks are increasingly used for automatic natural language generation evaluation, yet their per-instance reliability remains poorly understood. This research provides concrete methods for measuring reliability and shows that certain evaluation criteria are fundamentally harder to judge reliably than others.
The researchers, Manan Gupta and Dhruv Kumar, released all code, prompts, and cached results to enable reproducibility. The paper is categorized under cs.AI, cs.CL, and cs.LG on arXiv.
Key Takeaways
- 33-67% of documents exhibit at least one transitivity violation (directed 3-cycle) despite aggregate violation rates appearing low at 0.8%-4.1%
- Prediction set width correlates strongly with ground truth disagreement (rs = +0.576, p < 10^-100) and shows consistent cross-judge agreement
- Evaluation criterion matters more than judge choice, with relevance most reliable and fluency/consistency least reliable
- The reliability issues are inherent to document-level difficulty rather than judge-specific artifacts
- All code, prompts, and cached results are publicly available for reproducibility