Beyond Next-Token Prediction: An RLVR Proof of Concept for Tool-Use Agents on Atlassian Workflows

Abstract: Large language models are trained to predict the next token, not to act inside a specific API. In niche enterprise SaaS workflows — where success means hitting the right endpoint with the right nested arguments in the right order — this objective mismatch shows up as silent failures: dropped required fields, hallucinated tools, or early stops after a single read.

We ask whether Reinforcement Learning with Verifiable Rewards (RLVR), applied directly in the target environment, closes the gap. As a proof of concept we build a suite of five synthetic environments that emulate the Jira REST v3 and Confluence v2 APIs with schema-level fidelity; rewards are computed entirely from the tool-call trace, with no live API, no learned judge, and no human label in the loop.
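
To make "rewards computed entirely from the tool-call trace" concrete, here is a minimal sketch of what such a checker could look like for a page-creation scenario. It is not the authors' checker: the tool names, the required argument paths, and the scoring rule (zero for a hallucinated tool or a missing write call, otherwise the fraction of required fields present) are all assumptions made for illustration.

```python
from typing import Any

# Hypothetical tool names and required argument paths (illustrative only).
KNOWN_TOOLS = {"get_spaces", "create_page"}
REQUIRED_TOOL = "create_page"
REQUIRED_ARGS = [
    ("spaceId",),
    ("title",),
    ("body", "representation"),
    ("body", "value"),
]


def get_path(obj: Any, path: tuple[str, ...]) -> Any:
    """Follow a key path through nested dicts; return None if any key is missing."""
    for key in path:
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def trace_reward(trace: list[dict[str, Any]]) -> float:
    """Score a tool-call trace in [0, 1] using only the trace itself."""
    # Hallucinated tool: any call outside the known tool set scores zero.
    if any(call.get("name") not in KNOWN_TOOLS for call in trace):
        return 0.0
    writes = [call for call in trace if call.get("name") == REQUIRED_TOOL]
    # Early stop: the episode ended without the required write call.
    if not writes:
        return 0.0
    # Partial credit: fraction of required (possibly nested) fields present.
    args = writes[-1].get("arguments", {})
    present = sum(get_path(args, path) not in (None, "") for path in REQUIRED_ARGS)
    return present / len(REQUIRED_ARGS)
```

Because the function reads nothing but the trace, the same checker can score a prompted baseline and serve as the training reward, with no live API call, judge model, or human label involved.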

Scoring prompted Qwen3-1.7B and Qwen3.5-4B with the same checkers that drive GRPO training, we find that, on the four scenarios whose rewards are non-degenerate, the RL-trained policy lifts average reward from a 4B-baseline range of 0.35–0.92 to 0.95–1.00, with the largest single gain on Confluence page creation ($0.35 \rightarrow 1.00$).
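
As a rough illustration of why reward shape matters for GRPO, the sketch below computes group-relative advantages in the commonly described form: each rollout's reward minus the group mean, divided by the group standard deviation. The function name, the use of the population standard deviation, and the `eps` value are assumptions for illustration; the paper's exact training configuration is not stated here.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalise each rollout's reward against its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]


# A group with mixed checker scores yields a usable learning signal.
mixed = group_relative_advantages([0.0, 0.5, 1.0, 1.0])

# A group where every rollout already scores the maximum yields all zeros.
saturated = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Under this formulation, a scenario in which every sampled rollout receives the same reward contributes zero advantage and therefore no policy update, which is one plausible reading of why a saturated scenario is set aside from the headline comparison.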

We position this as a preliminary step toward outcome-optimised small models for niche enterprise APIs, and foreground two limitations a workshop reader should weigh: hand-crafting verifiable rewards does not scale beyond the handful of endpoints reported here, and one of our five scenarios (ticket-transition) has a saturating reward shape that the prompted 4B already maxes out.
