Are Online Skill and Memory Modules Always Worth Their Tokens? A Budget-Constrained Study of Web Agents

Online web agents often augment a base actor with memory, workflow, or skill modules. These modules can improve performance, but they also consume test-time tokens, a cost that is rarely reported alongside the actor’s inference cost.

We study online augmentation, where this overhead is paid on every task, and re-evaluate its benefits under a fixed total inference budget. We compare AWM, ASI, and ReasoningBank with a token-matched vanilla baseline that spends the same budget on additional actor steps.
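The token-matching idea can be illustrated with a minimal sketch. It assumes that a task's cost splits cleanly into actor tokens and module tokens and that an actor step has a roughly constant token cost. The names `TaskUsage`, `matched_extra_steps`, and `tokens_per_step`, as well as all numbers, are hypothetical and are not taken from the study's implementation.

```python
from dataclasses import dataclass


@dataclass
class TaskUsage:
    """Token usage for one task (hypothetical accounting scheme)."""
    actor_tokens: int   # tokens consumed by the base actor
    module_tokens: int  # tokens consumed by memory/workflow/skill modules

    @property
    def total_tokens(self) -> int:
        # Total test-time cost is the actor cost plus the module overhead.
        return self.actor_tokens + self.module_tokens


def matched_extra_steps(augmented: TaskUsage,
                        vanilla_actor_tokens: int,
                        tokens_per_step: int) -> int:
    """Extra actor steps a vanilla agent could afford under the same total budget.

    Assumes a constant per-step token cost, which is a simplification.
    """
    if tokens_per_step <= 0:
        raise ValueError("tokens_per_step must be positive")
    spare = augmented.total_tokens - vanilla_actor_tokens
    return max(0, spare // tokens_per_step)


# Illustrative numbers only: an augmented run spending 12,000 tokens on modules.
usage = TaskUsage(actor_tokens=40_000, module_tokens=12_000)
extra = matched_extra_steps(usage, vanilla_actor_tokens=40_000, tokens_per_step=3_000)
```

Under these made-up numbers, the module overhead would buy the vanilla actor four additional steps; the comparison in the text asks whether those steps are worth more than the module.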

Across three WebArena domains and three models (Gemini 3 Flash, GPT-5.4-mini, and Qwen 3.6-27B), the vanilla baseline matches or surpasses all three augmentation methods in aggregate success rate while often using fewer total tokens. We observe a similar trend on WorkArena-L1 with Qwen 3.6-27B, suggesting that the effect extends to enterprise knowledge-work tasks.

Our results suggest that skills and workflow memory can be useful in specific domains, but their apparent gains often vanish when compared against a budget-matched actor. We further show that run-to-run variance materially affects outcomes and should be reported as a core evaluation criterion for online web agents.
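As a rough illustration of reporting run-to-run variance, the sketch below summarizes repeated runs by mean and sample standard deviation and flags gaps that fall within that spread. The function names, the decision rule, and the success rates are assumptions made for illustration; the rule is a crude heuristic, not a statistical test and not the study's protocol.

```python
from statistics import mean, stdev


def summarize_runs(success_rates):
    """Return (mean, sample standard deviation) over repeated runs."""
    if len(success_rates) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return mean(success_rates), stdev(success_rates)


def gap_within_noise(rates_a, rates_b):
    """Crude check: is the gap in means no larger than the larger run-to-run spread?

    This is an illustrative heuristic, not a significance test.
    """
    mean_a, sd_a = summarize_runs(rates_a)
    mean_b, sd_b = summarize_runs(rates_b)
    return abs(mean_a - mean_b) <= max(sd_a, sd_b)


# Made-up success rates from three runs each of two hypothetical agents.
augmented_runs = [0.40, 0.50, 0.60]
vanilla_runs = [0.45, 0.55, 0.65]
noisy_gap = gap_within_noise(augmented_runs, vanilla_runs)
```

With these invented numbers, the 5-point gap in means is smaller than the 10-point spread across runs, so a single-run comparison could easily point in either direction.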