The Scaling Laws of Skills in LLM Agent Systems

Abstract: As agent systems scale, skills accumulate into large reusable libraries, yet their scaling laws remain poorly understood. Across 15 frontier LLMs, 1,141 real-world skills, and over 3M routing or execution decisions, we identify two coupled laws.

Routing law: single-step routing accuracy decays logarithmically with library size ($R^2 > 0.97$ for all models), with errors progressing from local skill competition to cross-family drift and capture by overly general “black-hole skills”.
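
As an illustration of what a logarithmic decay fit looks like, the sketch below fits accuracy ≈ a − b·ln(N) by ordinary least squares and reports $R^2$. The functional form (natural log, intercept `a`), the function name `fit_log_decay`, and the data points are assumptions made for illustration; the data are synthetic and are not results from the paper.

```python
import math

def fit_log_decay(sizes, accuracies):
    """Least-squares fit of accuracy ~ a - b * ln(N); returns (a, b, r2)."""
    xs = [math.log(n) for n in sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(accuracies) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    slope = sxy / sxx              # slope of accuracy against ln(N)
    a = mean_y - slope * mean_x    # intercept
    b = -slope                     # decay slope, positive when accuracy falls
    ss_res = sum((y - (a - b * x)) ** 2 for x, y in zip(xs, accuracies))
    ss_tot = sum((y - mean_y) ** 2 for y in accuracies)
    r2 = 1.0 - ss_res / ss_tot     # coefficient of determination
    return a, b, r2

# Synthetic, noise-free points generated from a = 0.95, b = 0.08 (hypothetical values).
sizes = [10, 30, 100, 300, 1000]
accuracies = [0.95 - 0.08 * math.log(n) for n in sizes]
a, b, r2 = fit_log_decay(sizes, accuracies)
print(f"a={a:.3f}, b={b:.3f}, R^2={r2:.4f}")
```

With measured accuracies in place of the synthetic ones, a larger fitted `b` would indicate a library whose routing accuracy degrades faster as skills are added.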

Execution law: before state realization, joint routing is approximately multiplicative, whereas correct execution can improve accuracy on difficult downstream decisions by about $4\times$. A single parameter, the routing logarithmic decay slope $b$, couples the two laws: routing-side fits predict execution-side rescue across models, showing that the same library property controls both pre-execution collapse and downstream recoverability.
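
To make the multiplicative behavior concrete, the sketch below treats the joint accuracy of several routing decisions as the product of the per-step accuracies, and feeds it a per-step accuracy taken from a log-decay form a − b·ln(N). Both the independence reading of "multiplicative" and the parameter values are assumptions made for illustration; the function names are hypothetical, and the execution-side rescue effect is not modeled here.

```python
import math

def per_step_accuracy(a, b, library_size):
    """Single-step accuracy under an assumed log-decay form, clamped to [0, 1]."""
    return min(1.0, max(0.0, a - b * math.log(library_size)))

def joint_routing_accuracy(step_accuracies):
    """Joint accuracy when per-step successes multiply (assumed independence)."""
    return math.prod(step_accuracies)

# Hypothetical parameters, not values reported in the paper.
a, b = 0.95, 0.08
for steps in (1, 2, 4):
    p = per_step_accuracy(a, b, library_size=100)
    joint = joint_routing_accuracy([p] * steps)
    print(f"steps={steps}: per-step={p:.3f}, joint={joint:.3f}")
```

Under these assumptions the joint accuracy falls geometrically with the number of routing steps, so a larger slope `b` compounds quickly in multi-step plans.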

The laws are actionable: law-guided optimization raises held-out routing accuracy from 71.3% to 91.7%, reduces the hijack rate from 22.4% to 4.1%, and transfers directionally to downstream ClawBench and ClawMark execution settings, improving mean pass rate from 49.3% to 61.6% on ClawBench and from 28.4% to 34.5% on ClawMark. These results show that agent performance depends not only on model capability, but also on the structure, granularity, and exposure policy of the skill library.
