A new study from researchers at multiple institutions has uncovered a fundamental flaw in how large language models decide whether to call external tools. Published May 1, 2026 on arXiv, the paper "To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling" reveals that models' perceived need for tool use is frequently misaligned with the actual utility those tools provide.
Models' Internal Confidence Does Not Match Reality
The research team, led by Qinyuan Wu and colleagues, developed a framework analyzing three key factors in tool-use decisions: necessity (whether a tool call is genuinely needed), utility (the actual value gained from the tool response), and affordability (the costs associated with making the call). By comparing what models think they need with what they actually need, the researchers found consistent misalignment across three tasks and six different models.
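Framed this way, the decision reduces to a cost-benefit check: call the tool only when the expected gain, weighted by how likely the call is genuinely needed, exceeds its cost. The sketch below illustrates that idea only; the `ToolCallEstimate` fields, the `should_call` rule, and the threshold are assumptions, not the paper's formulation.

```python
from dataclasses import dataclass

@dataclass
class ToolCallEstimate:
    """Illustrative per-query estimates; field names are assumptions, not the paper's."""
    necessity: float  # how likely the model fails without the tool (0-1)
    utility: float    # expected quality gain if the tool is called
    cost: float       # affordability: price/latency of the call, in the same units as utility

def should_call(est: ToolCallEstimate, threshold: float = 0.0) -> bool:
    """Call only when the necessity-weighted gain outweighs the cost."""
    expected_gain = est.necessity * est.utility
    return expected_gain - est.cost > threshold

# A query the model can probably answer alone: the call is not worth its cost.
print(should_call(ToolCallEstimate(necessity=0.2, utility=0.5, cost=0.3)))  # False
```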
The problem manifests in three ways:
- LLMs frequently initiate tool calls when they would perform better without them
- Models sometimes skip tool calls that would significantly improve their output
- Internal confidence signals about tool necessity fail to predict actual benefit
Framework Introduces Two Complementary Evaluation Perspectives
The researchers introduced two lenses for evaluating tool-use decisions. The normative perspective infers true need and utility from optimal allocation of tool calls—essentially determining what should happen in ideal circumstances. The descriptive perspective examines the model's self-perceived need and utility based on observed behaviors—what actually happens when the model makes decisions.
This dual approach revealed the gap between perception and reality in LLM tool-calling behavior, enabling the researchers to quantify exactly where models go wrong.
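One concrete way to read the two perspectives side by side is to record, for each query, what the model did (descriptive) and whether the tool actually helped (normative), then tally the disagreements. A minimal sketch, assuming that pairing as the input format and with metric names invented for illustration:

```python
from typing import List, Tuple

def misalignment_report(decisions: List[Tuple[bool, bool]]) -> dict:
    """Each pair is (model_called_tool, tool_was_actually_needed)."""
    over = sum(1 for called, needed in decisions if called and not needed)   # wasted calls
    under = sum(1 for called, needed in decisions if not called and needed)  # missed calls
    n = max(len(decisions), 1)
    return {
        "over_call_rate": over / n,
        "under_call_rate": under / n,
        "aligned_rate": (len(decisions) - over - under) / n,
    }

# Five queries: the model over-calls twice and under-calls once.
print(misalignment_report([(True, False), (True, True), (False, True),
                           (False, False), (True, False)]))
```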
Lightweight Controllers Improve Decision Quality
To address the misalignment, the team trained lightweight estimators that analyze models' hidden states—the internal representations formed during processing. These estimators predict genuine tool necessity by examining what's happening inside the model, rather than relying on the model's own expressed confidence. The estimators enable simple controllers that override poor self-perceived decisions, resulting in improved task performance across multiple benchmarks.
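The description suggests something like a linear probe trained on hidden states, whose prediction vetoes the model's own choice only when the probe is confident. A minimal sketch under that assumption, with synthetic stand-ins for hidden states and labels; this is not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins: real usage would extract hidden states from the LLM and
# label each query by whether a tool call actually improved the answer.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 768))        # hidden-state vectors (stand-in)
y = rng.integers(0, 2, size=500)       # "tool was genuinely needed" labels (stand-in)

probe = LogisticRegression(max_iter=1000).fit(X, y)

def controller(hidden_state: np.ndarray, model_wants_tool: bool,
               lo: float = 0.3, hi: float = 0.7) -> bool:
    """Defer to the model unless the probe is confident it is wrong."""
    p = probe.predict_proba(hidden_state.reshape(1, -1))[0, 1]
    if p >= hi:
        return True           # probe: the tool is genuinely needed
    if p <= lo:
        return False          # probe: the call would be wasted
    return model_wants_tool   # uncertain region: keep the model's own choice

print(controller(rng.normal(size=768), model_wants_tool=True))
```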
Implications for AI Agents and Cost Management
As AI agents gain access to dozens or hundreds of external tools including APIs, databases, and specialized services, intelligent tool-use decisions become critical for practical deployment. Poor tool-calling decisions impact multiple dimensions:
- Cost management: API calls consume budget with each invocation
- Latency: Tool calls add delays to response times
- Accuracy: Unnecessary or poorly timed tool responses can degrade output quality
- User experience: Redundant tool calls waste time and resources
The finding that models are "often misaligned" in their tool-use decisions suggests that simple prompting strategies telling models to "decide when to use tools" are insufficient for reliable agentic systems.
Key Takeaways
- LLMs systematically misjudge when tool calls are necessary, with perceived need frequently misaligned with actual utility
- Researchers developed a framework analyzing necessity, utility, and affordability to evaluate tool-calling decisions from normative and descriptive perspectives
- Lightweight estimators trained on models' hidden states can predict genuine tool necessity better than the models' own confidence signals
- Simple controllers using these estimators improved decision quality and task performance across three tasks and six models
- The misalignment has significant implications for cost, latency, accuracy, and user experience in deployed AI agent systems