State-of-the-art language models achieve 96% accuracy on standard probability problems but plummet to just 59% on counterintuitive ones, according to new research published on arXiv. The study, authored by Luca Avena, Gianmarco Bet, and Bernardo Busoni, concludes that current LLMs are not genuine probabilistic reasoners despite their success on advanced mathematical problems.
Accuracy Drops 37 Percentage Points on Counterintuitive Problems
The researchers constructed two datasets of discrete probability problems to evaluate eight state-of-the-art models: standard exercises and counterintuitive exercises designed to trigger heuristic reasoning. Each model was tested with and without Chain-of-Thought prompting. The gap between 96% and 59% accuracy suggests that LLMs can solve probability problems that match patterns in their training data but fail when the correct answer contradicts common heuristics.
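The headline gap can be expressed either as an absolute difference in percentage points or as a relative drop, and the two readings give different numbers. The short sketch below works through the arithmetic; only the 96 and 59 figures come from the article, and the function name `accuracy_gap` is illustrative.

```python
def accuracy_gap(standard_pct, counterintuitive_pct):
    """Return the gap in percentage points and as a relative drop."""
    # Absolute difference between the two accuracy figures.
    points = standard_pct - counterintuitive_pct
    # The same difference as a fraction of the higher (standard) accuracy.
    relative = points / standard_pct
    return points, relative


points, relative = accuracy_gap(96, 59)
print(f"{points} percentage points, {relative:.1%} relative drop")
```

Keeping the two measures apart matters when comparing this gap with the other drops reported in the article.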
The study found empirical evidence of token bias, meaning sensitivity to the surface wording of a problem rather than its mathematical content: performance dropped by over 20% when canonical formulations were replaced by mathematically equivalent but differently phrased variants. This suggests that models rely heavily on pattern matching rather than on an understanding of the underlying probabilistic principles.
Models Vulnerable to Misleading Suggestions
Embedding misleading suggestions in prompts reduced model performance by up to 34%, and no model proved immune to this manipulation. The researchers note that in naturalistic, context-rich settings LLMs default to heuristic, representativeness-based reasoning: they neglect base rates and show vanishing sensitivity to priors, producing cognitive biases analogous to those of human intuition.
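To make base-rate neglect concrete, here is a minimal sketch of Bayes' rule applied to a screening-test scenario. The numbers (a 1% base rate, a 90% true-positive rate, a 9% false-positive rate) are illustrative assumptions and do not come from the study; the function name `posterior` is likewise hypothetical.

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """P(condition | positive result) via Bayes' rule."""
    # Total probability of a positive result, from both groups.
    evidence = prior * true_positive_rate + (1 - prior) * false_positive_rate
    return prior * true_positive_rate / evidence


# Illustrative numbers only (assumed, not taken from the paper).
p = posterior(prior=0.01, true_positive_rate=0.90, false_positive_rate=0.09)
print(f"{p:.3f}")
```

A representativeness-style answer would sit close to the 90% true-positive rate, while the posterior under these assumed numbers is roughly 9%, because the low prior dominates.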
Training Data Dependence Explains Systematic Failures
Related research suggests that errors are consistently better explained by word frequency in the training data than by the actual mathematical requirements of the problems. Studies using problems structured like the Monty Hall problem provide strong evidence that frontier LLMs such as GPT-4o are biased toward mimicking the data they encountered most often in training, even when doing so does not support sound reasoning.
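For reference, the classic Monty Hall problem has a well-known counterintuitive answer: switching wins with probability 2/3. The sketch below checks this by exact enumeration, assuming the standard rules (three doors, the host always opens a goat door the contestant did not pick, choosing uniformly when two are available). It illustrates the original problem only, not the modified variants used in the studies mentioned above.

```python
from fractions import Fraction
from itertools import product

DOORS = (0, 1, 2)


def monty_hall_win_probabilities():
    """Exact win probabilities for staying and for switching."""
    stay = Fraction(0)
    switch = Fraction(0)
    # Car position and first pick are independent and uniform: 9 cases.
    for car, pick in product(DOORS, repeat=2):
        p_case = Fraction(1, 9)
        # The host may open any door that is neither the pick nor the car.
        host_options = [d for d in DOORS if d != pick and d != car]
        for opened in host_options:
            p = p_case / len(host_options)
            # The only remaining closed door other than the first pick.
            switched_to = next(d for d in DOORS if d != pick and d != opened)
            if pick == car:
                stay += p
            if switched_to == car:
                switch += p
    return stay, switch


stay, switch = monty_hall_win_probabilities()
print(f"stay: {stay}, switch: {switch}")
```

Changing the host's rules changes these probabilities, which is why a model that reproduces the memorized "always switch" answer can be wrong on a modified variant.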
The findings indicate that while LLMs are capable of high-fidelity probabilistic reasoning in controlled settings with explicit, structured cues, they struggle with problems that require reasoning beyond the patterns of their training distribution.
Key Takeaways
- LLMs achieve 96% accuracy on standard probability problems but only 59% on counterintuitive ones requiring genuine probabilistic reasoning
- Performance drops by over 20% due to token bias when mathematically equivalent problems are phrased differently
- Embedding misleading suggestions in prompts reduces model performance by up to 34%, and no tested model was immune
- Models default to heuristic, representativeness-based reasoning that neglects base rates and prior probabilities
- Current LLMs are not yet genuine probabilistic reasoners despite their success on advanced mathematical problems that match training data patterns