Are Tools All We Need? Unveiling the Tool-Use Tax in LLM Agents


Abstract: Tool-augmented reasoning has become a popular direction for LLM-based agents, and it is widely assumed to improve both reasoning ability and reliability. However, we demonstrate that this consensus does not always hold: in the presence of semantic distractors, tool-augmented reasoning does not necessarily outperform native chain-of-thought (CoT) reasoning.

To explain this performance gap, we propose a Factorized Intervention Framework that isolates the cost of prompt formatting, the overhead of the tool-calling protocol, and the actual gain from executing tools. Our analysis reveals a critical tradeoff: under semantic noise, the gains from tools often fail to offset the “tool-use tax”, which is the performance degradation introduced by the tool-calling protocol itself.
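The factorization can be written schematically. The symbols below are our illustrative notation; the abstract does not state the paper's formal definitions:

```latex
% Illustrative decomposition (our notation, not the paper's): the end-to-end
% accuracy change of tool-augmented reasoning relative to native CoT.
\[
\Delta_{\text{total}}
  = \underbrace{\Delta_{\text{format}}}_{\text{prompt formatting cost}}
  + \underbrace{\Delta_{\text{protocol}}}_{\text{tool-calling overhead}}
  + \underbrace{\Delta_{\text{exec}}}_{\text{execution gain}}
\]
% Tools pay off only when the execution gain outweighs the "tool-use tax":
\[
\Delta_{\text{exec}} > -\left(\Delta_{\text{format}} + \Delta_{\text{protocol}}\right)
\]
```

Under semantic noise, the first two terms grow in magnitude while the execution gain shrinks, which is why the inequality can fail.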

To address this, we introduce G-STEP, a lightweight inference-time gate that mitigates protocol-induced errors. While G-STEP recovers part of the lost performance, our findings suggest that more substantial improvements still require strengthening the model's intrinsic reasoning and tool-interaction capabilities.
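The abstract does not describe G-STEP's mechanism. As a hypothetical sketch of what an inference-time gate of this kind could look like, the following routes a query to native CoT whenever the estimated gain from tool execution does not cover the estimated tool-use tax (all names and thresholds here are our illustrative assumptions):

```python
# Hypothetical sketch of an inference-time gate in the spirit of G-STEP.
# It is NOT the paper's implementation: we simply compare an estimated
# tool-execution gain against an estimated "tool-use tax" and fall back
# to native chain-of-thought (CoT) when tools are not expected to pay off.

from dataclasses import dataclass


@dataclass
class GateDecision:
    use_tools: bool   # True -> invoke the tool-calling protocol
    reason: str       # human-readable justification for logging


def gate(expected_tool_gain: float, tool_use_tax: float) -> GateDecision:
    """Invoke tools only when the estimated gain exceeds the protocol cost."""
    if expected_tool_gain > tool_use_tax:
        return GateDecision(True, "estimated gain exceeds tool-use tax")
    return GateDecision(False, "fall back to native CoT")
```

In practice both estimates would have to come from calibrated signals (e.g. a classifier over the query, or model confidence), which is where the hard part of any such gate lies.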