A new study from researchers at multiple institutions has uncovered a fundamental flaw in how large language models decide whether to call external tools. Published May 1, 2026 on arXiv, the paper "To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling" reveals that models' perceived need for tool use is frequently misaligned with the actual utility those tools provide.
Models' Internal Confidence Does Not Match Reality
The research team, led by Qinyuan Wu and colleagues, developed a framework analyzing three key factors in tool-use decisions: necessity (whether a tool call is genuinely needed), utility (the actual value gained from the tool response), and affordability (the costs associated with making the call). By comparing what models think they need with what they actually need, the researchers found consistent misalignment across three tasks and six different models.
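Framed this way, the decision reduces to a cost-benefit check: call the tool only when the expected gain, weighted by how likely the call is genuinely needed, exceeds its cost. The sketch below illustrates that idea only; the `ToolCallEstimate` fields, the `should_call` rule, and the threshold are assumptions, not the paper's formulation.

```python
from dataclasses import dataclass

@dataclass
class ToolCallEstimate:
    """Illustrative per-query estimates; field names are assumptions, not the paper's."""
    necessity: float  # how likely the model fails without the tool (0-1)
    utility: float    # expected quality gain if the tool is called
    cost: float       # affordability: price/latency of the call, in the same units as utility

def should_call(est: ToolCallEstimate, threshold: float = 0.0) -> bool:
    """Call only when the necessity-weighted gain outweighs the cost."""
    expected_gain = est.necessity * est.utility
    return expected_gain - est.cost > threshold

# A query the model can probably answer alone: the call is not worth its cost.
print(should_call(ToolCallEstimate(necessity=0.2, utility=0.5, cost=0.3)))  # False
```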
The problem manifests in three ways:
- LLMs frequently initiate tool calls when they would perform better without them
- Models sometimes skip tool calls that would significantly improve their output
- Internal confidence signals about tool necessity fail to predict actual benefit
Framework Introduces Two Complementary Evaluation Perspectives
The researchers introduced two lenses for evaluating tool-use decisions. The normative perspective infers true need and utility from optimal allocation of tool calls—essentially determining what should happen in ideal circumstances. The descriptive perspective examines the model's self-perceived need and utility based on observed behaviors—what actually happens when the model makes decisions.
This dual approach revealed the gap between perception and reality in LLM tool-calling behavior, enabling the researchers to quantify exactly where models go wrong.
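One concrete way to read the two perspectives side by side is to record, for each query, what the model did (descriptive) and whether the tool actually helped (normative), then tally the disagreements. A minimal sketch, assuming that pairing as the input format and with metric names invented for illustration:

```python
from typing import List, Tuple

def misalignment_report(decisions: List[Tuple[bool, bool]]) -> dict:
    """Each pair is (model_called_tool, tool_was_actually_needed)."""
    over = sum(1 for called, needed in decisions if called and not needed)   # wasted calls
    under = sum(1 for called, needed in decisions if not called and needed)  # missed calls
    n = max(len(decisions), 1)
    return {
        "over_call_rate": over / n,
        "under_call_rate": under / n,
        "aligned_rate": (len(decisions) - over - under) / n,
    }

# Five queries: the model over-calls twice and under-calls once.
print(misalignment_report([(True, False), (True, True), (False, True),
                           (False, False), (True, False)]))
```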
Lightweight Controllers Improve Decision Quality
To address the misalignment, the team trained lightweight estimators that analyze models' hidden states—the internal representations formed during processing. These estimators predict genuine tool necessity by examining what's happening inside the model, rather than relying on the model's own expressed confidence. The estimators enable simple controllers that override poor self-perceived decisions, resulting in improved task performance across multiple benchmarks.
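The description suggests something like a linear probe trained on hidden states, whose prediction vetoes the model's own choice only when the probe is confident. A minimal sketch under that assumption, with synthetic stand-ins for hidden states and labels; this is not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins: real usage would extract hidden states from the LLM and
# label each query by whether a tool call actually improved the answer.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 768))        # hidden-state vectors (stand-in)
y = rng.integers(0, 2, size=500)       # "tool was genuinely needed" labels (stand-in)

probe = LogisticRegression(max_iter=1000).fit(X, y)

def controller(hidden_state: np.ndarray, model_wants_tool: bool,
               lo: float = 0.3, hi: float = 0.7) -> bool:
    """Defer to the model unless the probe is confident it is wrong."""
    p = probe.predict_proba(hidden_state.reshape(1, -1))[0, 1]
    if p >= hi:
        return True           # probe: the tool is genuinely needed
    if p <= lo:
        return False          # probe: the call would be wasted
    return model_wants_tool   # uncertain region: keep the model's own choice

print(controller(rng.normal(size=768), model_wants_tool=True))
```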
Implications for AI Agents and Cost Management
As AI agents gain access to dozens or hundreds of external tools including APIs, databases, and specialized services, intelligent tool-use decisions become critical for practical deployment. Poor tool-calling decisions impact multiple dimensions:
- Cost management: API calls consume budget with each invocation
- Latency: Tool calls add delays to response times
- Accuracy: Unnecessary or poorly timed tool responses can degrade output quality
- User experience: Redundant tool calls waste time and resources
The finding that models are "often misaligned" in their tool-use decisions suggests that simple prompting strategies telling models to "decide when to use tools" are insufficient for reliable agentic systems.
Key Takeaways
- LLMs systematically misjudge when tool calls are necessary, with perceived need frequently misaligned with actual utility
- Researchers developed a framework analyzing necessity, utility, and affordability to evaluate tool-calling decisions from normative and descriptive perspectives
- Lightweight estimators trained on models' hidden states can predict genuine tool necessity better than the models' own confidence signals
- Simple controllers using these estimators improved decision quality and task performance across three tasks and six models
- The misalignment has significant implications for cost, latency, accuracy, and user experience in deployed AI agent systems