A Single Rewrite Suffices: Empirical Lessons from Production Skill Description Optimization

Abstract: Enterprise AI agents route user queries to specialized skills by matching queries against natural language skill descriptions. When two skills share overlapping descriptions, the routing LLM misroutes queries, a failure we term skill collision. As agents scale to dozens of skills, manually tuning descriptions to maintain routing accuracy becomes a significant engineering bottleneck.

We deploy an automated description optimization pipeline on a production enterprise group chat agent (9 skills, 372 regression cases). The pipeline produces descriptions averaging 79.2% F1, matching manually tuned descriptions at 79.4% F1 (average per-skill difference of -0.20%, within the 0.78% multi-seed noise floor), while reducing per-skill engineering effort from 120 minutes to 3.8 minutes (roughly a 32-fold speedup).
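
As an illustration of how a per-skill F1 score and its average could be computed from routing results, here is a minimal sketch. It assumes single-label routing (each query has exactly one expected skill and one predicted skill) and a one-vs-rest F1 per skill. The function names `per_skill_f1` and `macro_f1` are hypothetical, and the evaluation protocol actually used may differ.

```python
from collections import Counter


def per_skill_f1(expected, predicted):
    """Return {skill: F1}, treating each skill as a one-vs-rest routing decision.

    Assumption: one expected and one predicted skill per query.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(expected, predicted):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # query wrongly routed to `pred`
            fn[gold] += 1  # query that `gold` should have received
    scores = {}
    for skill in set(expected) | set(predicted):
        denom = 2 * tp[skill] + fp[skill] + fn[skill]
        scores[skill] = 2 * tp[skill] / denom if denom else 0.0
    return scores


def macro_f1(scores):
    """Unweighted mean of per-skill F1 scores."""
    return sum(scores.values()) / len(scores)
```

Averaging the per-skill scores with equal weight keeps small skills from being drowned out by high-traffic ones, which matters when collisions are concentrated in a few skill pairs.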

We then examine which pipeline components actually drive this match. Systematic ablation on both the production system and ToolBench (16k tools) reveals that a single LLM rewrite using any available false-positive and false-negative cases captures most of the available improvement. Other design choices we tested (iteration budget, feedback signal composition, dual editing of confused pairs, and training set size) each affect final F1 by less than 0.5%.
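
To make the single-rewrite step concrete, the sketch below assembles one rewrite prompt from whatever false-positive and false-negative queries are available for a skill. The function name `build_rewrite_prompt`, the prompt wording, and the example skill are all hypothetical; the prompt used in the actual pipeline is not given here, and the call to the LLM itself is deliberately left out.

```python
def build_rewrite_prompt(skill_name, description, false_positives, false_negatives):
    """Build a one-shot rewrite prompt from misrouted queries (illustrative wording)."""
    fp_lines = "\n".join(f"- {q}" for q in false_positives) or "- (none)"
    fn_lines = "\n".join(f"- {q}" for q in false_negatives) or "- (none)"
    return (
        f"Skill: {skill_name}\n"
        f"Current description: {description}\n\n"
        "Queries wrongly routed to this skill (false positives):\n"
        f"{fp_lines}\n\n"
        "Queries this skill should have handled but missed (false negatives):\n"
        f"{fn_lines}\n\n"
        "Rewrite the description once so it excludes the false positives "
        "and covers the false negatives. Return only the new description."
    )


# Hypothetical usage: the returned string would be sent to an LLM exactly once.
prompt = build_rewrite_prompt(
    "calendar",
    "Manage meetings.",
    false_positives=["book a flight to Berlin"],
    false_negatives=["move my 3pm sync to Friday"],
)
```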

Description optimization addresses skill collisions caused by overlapping descriptions but cannot resolve cases where two skills' intended scopes genuinely overlap. We identify a diagnostic (a large train-validation F1 gap) that flags the latter cases for architectural rather than text-level intervention.
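
The train-validation gap diagnostic can be sketched as a simple per-skill check. The function name `flag_scope_overlap` and the default `gap_threshold` of 0.10 are assumptions made for illustration; no specific cutoff is stated here, and a sensible value would presumably sit well above the multi-seed noise floor.

```python
def flag_scope_overlap(train_f1, val_f1, gap_threshold=0.10):
    """Return {skill: gap} for skills whose train F1 exceeds validation F1 by more
    than `gap_threshold` (a hypothetical cutoff, on a 0-1 F1 scale)."""
    flagged = {}
    for skill, train_score in train_f1.items():
        if skill not in val_f1:
            continue  # no validation score available for this skill
        gap = train_score - val_f1[skill]
        if gap > gap_threshold:
            flagged[skill] = gap
    return flagged
```

Skills returned by such a check would be candidates for architectural changes (for example, merging or re-scoping skills) rather than further description edits.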
