Are Tools All We Need? Unveiling the Tool-Use Tax in LLM Agents


Abstract: Tool-augmented reasoning has become a popular direction for LLM-based agents, and it is widely assumed to improve both reasoning ability and reliability. However, we demonstrate that this consensus does not always hold: in the presence of semantic distractors, tool-augmented reasoning does not necessarily outperform native chain-of-thought (CoT) reasoning.

To explain this performance gap, we propose a Factorized Intervention Framework that isolates the cost of prompt formatting, the overhead of the tool-calling protocol, and the actual gain from executing tools. Our analysis reveals a critical tradeoff: under semantic noise, the gains from tools often fail to offset the “tool-use tax”, which is the performance degradation introduced by the tool-calling protocol itself.
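The factorization can be written schematically. The symbols below are our illustrative notation; the abstract does not state the paper's formal definitions:

```latex
% Illustrative decomposition (our notation, not the paper's): the end-to-end
% accuracy change of tool-augmented reasoning relative to native CoT.
\[
\Delta_{\text{total}}
  = \underbrace{\Delta_{\text{format}}}_{\text{prompt formatting cost}}
  + \underbrace{\Delta_{\text{protocol}}}_{\text{tool-calling overhead}}
  + \underbrace{\Delta_{\text{exec}}}_{\text{execution gain}}
\]
% Tools pay off only when the execution gain outweighs the "tool-use tax":
\[
\Delta_{\text{exec}} > -\left(\Delta_{\text{format}} + \Delta_{\text{protocol}}\right)
\]
```

Under semantic noise, the first two terms grow in magnitude while the execution gain shrinks, which is why the inequality can fail.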

To address this, we introduce G-STEP, a lightweight inference-time gate that mitigates protocol-induced errors. While G-STEP recovers part of the lost performance, our findings suggest that more substantial improvements still require strengthening the model's intrinsic reasoning and tool-interaction capabilities.
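The abstract does not describe G-STEP's mechanism. As a hypothetical sketch of what an inference-time gate of this kind could look like, the following routes a query to native CoT whenever the estimated gain from tool execution does not cover the estimated tool-use tax (all names and thresholds here are our illustrative assumptions):

```python
# Hypothetical sketch of an inference-time gate in the spirit of G-STEP.
# It is NOT the paper's implementation: we simply compare an estimated
# tool-execution gain against an estimated "tool-use tax" and fall back
# to native chain-of-thought (CoT) when tools are not expected to pay off.

from dataclasses import dataclass


@dataclass
class GateDecision:
    use_tools: bool   # True -> invoke the tool-calling protocol
    reason: str       # human-readable justification for logging


def gate(expected_tool_gain: float, tool_use_tax: float) -> GateDecision:
    """Invoke tools only when the estimated gain exceeds the protocol cost."""
    if expected_tool_gain > tool_use_tax:
        return GateDecision(True, "estimated gain exceeds tool-use tax")
    return GateDecision(False, "fall back to native CoT")
```

In practice both estimates would have to come from calibrated signals (e.g. a classifier over the query, or model confidence), which is where the hard part of any such gate lies.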