LLMs Fail Probabilistic Reasoning Despite Math Prowess, New Study Shows

State-of-the-art language models achieve 96% accuracy on standard probability problems but plummet to just 59% on counterintuitive ones, according to new research posted on arXiv. The study, authored by Luca Avena, Gianmarco Bet, and Bernardo Busoni, concludes that current LLMs are not genuine probabilistic reasoners despite their success on advanced mathematical problems.

Accuracy Drops 37 Percentage Points on Counterintuitive Problems

The researchers constructed two datasets to evaluate eight state-of-the-art models: standard probability exercises and counterintuitive exercises designed to trigger heuristic reasoning. Each model was tested with and without Chain-of-Thought prompting on discrete probability problems. The 37-point gap between 96% and 59% accuracy suggests that LLMs solve probability problems when those problems match patterns in their training data, but fail when a problem requires probabilistic reasoning that contradicts common heuristics.
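The headline figure is an absolute gap in percentage points, which is not the same as a relative drop in percent. A minimal Python sketch of the arithmetic, using only the two accuracy figures reported above (the function name accuracy_gap is illustrative, not taken from the study):

```python
def accuracy_gap(standard_pct: float, counterintuitive_pct: float) -> tuple[float, float]:
    """Return (absolute gap in percentage points, relative drop in percent)."""
    points = standard_pct - counterintuitive_pct
    relative = 100.0 * points / standard_pct
    return points, relative


# The two accuracy figures reported in the article.
points, relative = accuracy_gap(96.0, 59.0)
print(f"{points:.0f} percentage points, {relative:.1f}% relative drop")
# prints "37 percentage points, 38.5% relative drop"
```

The two measures are close here only because the baseline accuracy is near 100%. With a lower baseline they would diverge sharply.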

The study also found empirical evidence of token bias: performance dropped by over 20% when canonical formulations were replaced by mathematically equivalent but differently phrased variants. This suggests that models rely heavily on pattern matching rather than on an understanding of the underlying probabilistic principles.

Models Vulnerable to Misleading Suggestions

Embedding misleading suggestions in prompts reduced model performance by up to 34%, and no model proved immune to this manipulation. The researchers note that in naturalistic or context-rich cases, LLMs default to heuristic, representativeness-based reasoning: they neglect base rates and show vanishing sensitivity to priors, producing cognitive biases analogous to those of human intuition.
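To make base-rate neglect concrete, here is a minimal sketch of the Bayes' rule calculation that a representativeness heuristic skips. The numbers (a 1% prior, 90% sensitivity, 9% false-positive rate) are hypothetical, chosen for illustration, and are not drawn from the study's datasets.

```python
def posterior(prior: float, likelihood: float, false_positive_rate: float) -> float:
    """Bayes' rule for a binary hypothesis H given positive evidence E.

    prior               = P(H)
    likelihood          = P(E | H)
    false_positive_rate = P(E | not H)
    """
    evidence = prior * likelihood + (1.0 - prior) * false_positive_rate
    return prior * likelihood / evidence


# Hypothetical scenario: a condition with a 1% base rate and a test that
# detects it 90% of the time but also fires on 9% of unaffected cases.
p = posterior(prior=0.01, likelihood=0.90, false_positive_rate=0.09)
print(f"P(H | E) = {p:.3f}")  # prints "P(H | E) = 0.092"
```

An answer driven by representativeness would land near 90%, because the evidence "looks like" the hypothesis. Accounting for the low prior yields roughly 9%.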

Training Data Dependence Explains Systematic Failures

Related research suggests that errors are consistently better explained by how often words appear in training data than by the actual mathematical requirements of the problems. Studies built on problems structured like the Monty Hall problem provide strong evidence that frontier LLMs such as GPT-4o are biased to mimic the data most frequently encountered in training, even when doing so does not support sound reasoning.
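For reference, the classic Monty Hall setup can be checked by simulation. The sketch below assumes the standard rules: three doors, one car, and a host who always opens a door that hides no car and was not picked. The article does not say which variants the cited studies used, so this illustrates only the canonical version.

```python
import random


def play(switch: bool, rng: random.Random) -> bool:
    """Play one round of the standard game; return True if the contestant wins."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to the single remaining unopened door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch: bool, trials: int = 100_000, seed: int = 0) -> float:
    """Estimate the win probability over many simulated rounds."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


print(f"stay:   {win_rate(False):.3f}")
print(f"switch: {win_rate(True):.3f}")
```

Under these rules, staying wins about one third of the time and switching about two thirds. Changing the host's behavior changes the answer, which is why a model that pattern-matches to the canonical wording can fail on a modified variant.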

The findings indicate that while LLMs are capable of high-fidelity probabilistic reasoning in controlled settings with explicit, structured cues, they struggle with problems that require reasoning beyond their training distribution patterns.

Key Takeaways

LLMs achieve 96% accuracy on standard probability problems but only 59% on counterintuitive ones requiring genuine probabilistic reasoning
Performance drops by over 20% due to token bias when mathematically equivalent problems are phrased differently
Embedding misleading suggestions in prompts reduces model performance by up to 34%, and no tested model is immune
Models default to heuristic, representativeness-based reasoning that neglects base rates and prior probabilities
Current LLMs are not yet genuine probabilistic reasoners despite their success on advanced mathematical problems that match training data patterns
