ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

Abstract: Large language models deployed as agents over large tool catalogs face a critical tool-retrieval bottleneck. Embedding-based retrieval relies on compact encoders that may under-capture specialized tool semantics. Parametric tool retrieval addresses this by encoding each tool as a virtual token appended to the LLM vocabulary and fine-tuning the model in two stages (memorization, then retrieval SFT) so that the LLM itself acts as the retriever. This approach achieves strong performance on standard ToolBench retrieval benchmarks.
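
To make the virtual-token idea concrete, here is a minimal sketch, assuming one new token per tool and a toy base vocabulary. The function name `extend_vocab`, the `<tool_i>` token format, and the example tool names are hypothetical illustrations, not the scheme of any particular system. In a real model the embedding matrix would also gain one row per new token, which this sketch omits.

```python
def extend_vocab(base_vocab, tool_names):
    """Return (new_vocab, tool_to_id) with one virtual token per tool.

    base_vocab maps token string -> id; ids are assumed to be 0..len-1.
    """
    vocab = dict(base_vocab)  # copy so the base vocabulary is not mutated
    tool_to_id = {}
    for i, name in enumerate(tool_names):
        token = f"<tool_{i}>"      # hypothetical virtual-token format
        token_id = len(vocab)      # next free id after the existing vocabulary
        vocab[token] = token_id
        tool_to_id[name] = token_id
    return vocab, tool_to_id


# Toy example (all names are illustrative assumptions).
base_vocab = {"<pad>": 0, "find": 1, "weather": 2}
vocab, tool_to_id = extend_vocab(base_vocab, ["weather_api", "flight_search"])
```

Under these assumptions, retrieval reduces to the model generating one of the new token ids, which is then mapped back to a tool through `tool_to_id`.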

Yet these benchmarks use verbose, fully specified queries, and their evaluation applies constrained decoding that restricts outputs to valid token paths. Neither setting reveals whether the model actually understands its tools. We introduce ToolSense, an open-source, LLM-powered diagnostic framework that takes any tool catalog as input and automatically generates three benchmarks: a Realistic Retrieval Benchmark (RRB) with queries at three ambiguity tiers, an MCQ probing benchmark, and a QA probing benchmark.
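
The effect of constrained decoding can be illustrated with a small sketch. It assumes tool identifiers are sequences of integer token ids stored in a prefix trie and uses a toy, context-free scoring table in place of a real model; `build_trie`, `allowed_next`, `constrained_greedy`, and all ids and scores are hypothetical. It is not the evaluation code of any benchmark.

```python
END = -1  # sentinel key marking the end of a complete tool identifier


def build_trie(sequences):
    """Build a nested-dict prefix trie over valid token-id sequences."""
    trie = {}
    for seq in sequences:
        node = trie
        for tok in seq:
            node = node.setdefault(tok, {})
        node[END] = {}
    return trie


def allowed_next(trie, prefix):
    """Return the set of token ids that may follow `prefix` (may include END)."""
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return set()  # prefix is not on any valid path
    return set(node.keys())


def constrained_greedy(score_fn, trie, max_len=8):
    """Greedy decoding that only considers tokens on valid trie paths."""
    prefix = []
    for _ in range(max_len):
        allowed = allowed_next(trie, prefix)
        if not allowed:
            break
        best = max(allowed, key=lambda t: score_fn(prefix, t))
        if best == END:
            break
        prefix.append(best)
    return prefix


# Toy setup: three valid tool identifiers and a model whose favorite
# token (id 9) does not start any valid identifier.
trie = build_trie([[5, 7], [5, 8], [6, 7]])
scores = {9: 5.0, 6: 2.0, 5: 1.0, 7: 0.5, 8: 0.1, END: 0.0}
result = constrained_greedy(lambda prefix, t: scores[t], trie)
```

In this toy run the unconstrained top choice (token 9) is invalid, yet the constrained decoder still returns a well-formed identifier. This is the sense in which constrained evaluation can hide what the model would produce on its own.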

Applying ToolSense to ToolBench (~47k tools) and evaluating five parametric model training configurations reveals a knowledge-retrieval dissociation. On RRB queries, several configurations drop by roughly 50 to 64 percentage points relative to the fully specified ToolBench benchmarks, falling below the embedding-model baseline. Additionally, some models score near random on the factual probes despite strong retrieval performance, which further indicates that retrieval accuracy and tool knowledge can come apart. We open-source the ToolSense framework and the ToolBench diagnostic benchmarks at this [https URL].
