Ecom-RLVE: Adaptive Verifiable Environments for E-Commerce Conversational Agents

TL;DR — We extend the RLVE framework from single-turn reasoning puzzles to multi-turn, tool-augmented e-commerce conversations. EcomRLVE-GYM provides 8 verifiable environments — product discovery, substitution, cart building, returns, order tracking, policy QA, bundle planning, and multi-intent journeys — each with procedural problem generation, a 12-axis difficulty curriculum, and algorithmically verifiable rewards. We train a Qwen3-8B model with DAPO over 300 steps and present early results demonstrating that environment scaling and adaptive difficulty transfer to agentic, real-world task completion. This project originated in the PyTorch OpenEnv Hackathon and is still evolving; follow us for updates 🔥

Why RL for shopping agents?

Large language models can hold fluent conversations, yet deploying them as shopping assistants reveals a persistent gap: fluency ≠ task completion. A customer who asks “find me a USB-C charger under $25 that ships in two days” needs an agent that invokes the right catalog search, filters on three hard constraints, avoids hallucinating product IDs it never retrieved, and handles follow-ups when the top result goes out of stock. Supervised fine-tuning can teach surface-level tool use from demonstrations, but it cannot scale to the combinatorial space of constraint configurations, partial-information dialogues, and multi-step transactional workflows that real e-commerce demands. Reinforcement learning with verifiable rewards (RLVR) offers an alternative: the agent optimises for outcomes — did the products satisfy the constraints? Was the cart correct? Was the return initiated for the right order line? The challenge is constructing reward functions that are both verifiable (no LLM-as-a-judge subjectivity) and adaptive (difficulty that grows with the policy’s capability).
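
To make "verifiable" concrete before going further: the charger request above reduces to a predicate that plain code can evaluate against the catalog. A minimal sketch (the Product fields and category string are illustrative, not the actual catalog schema):

```python
from dataclasses import dataclass

# Illustrative catalog record; field names are assumptions, not EcomRLVE-GYM's schema.
@dataclass
class Product:
    product_id: str
    category: str
    price_usd: float
    ships_in_days: int

def satisfies(p: Product) -> bool:
    """The charger request as a machine-checkable predicate:
    right category, under $25, ships within two days."""
    return (p.category == "usb-c-charger"
            and p.price_usd < 25.00
            and p.ships_in_days <= 2)

print(satisfies(Product("B0X1", "usb-c-charger", 19.99, 2)))  # True
print(satisfies(Product("B0X2", "usb-c-charger", 29.99, 1)))  # False: over budget
```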

From RLVE-Gym to EcomRLVE-GYM

RLVE-Gym provides 400 environments for sorting, multiplication, Sudoku, and other algorithmic-reasoning tasks; however, those are all single-turn, text-in / text-out puzzles — extending to agentic domains was left as future work. EcomRLVE-GYM fills that gap: we stay in the verifiable regime (e-commerce outcomes can be checked algorithmically) while extending to multi-turn, tool-augmented, agentic conversations — environments where the agent must act (call tools, modify world state) rather than merely reason (produce a text answer), and must compensate for deficiencies in the underlying search system. EcomRLVE-GYM makes customer-service outcomes structurally verifiable: every reward signal below can be evaluated by a program with access to the hidden ground-truth goal. No human annotation or LLM-as-a-judge is needed.
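
As a toy illustration of that recipe (the task format and names here are illustrative, not the repo's API): a generator samples a hidden goal, and a plain function, not a judge model, scores the outcome.

```python
import random

def generate_order_tracking_task(seed: int) -> dict:
    """Procedurally generate an order-tracking task with a hidden goal."""
    rng = random.Random(seed)
    orders = {f"ORD-{i}": rng.choice(["shipped", "delivered", "delayed"])
              for i in range(3)}
    target = rng.choice(list(orders))
    # The agent only ever sees the simulated user's messages;
    # the verifier alone holds the ground-truth (order, status) pair.
    return {"orders": orders, "hidden_goal": (target, orders[target])}

def verify(hidden_goal: tuple, reported: tuple) -> float:
    return 1.0 if reported == hidden_goal else 0.0

task = generate_order_tracking_task(seed=7)
print(task["hidden_goal"])                               # e.g. ('ORD-2', 'shipped')
print(verify(task["hidden_goal"], task["hidden_goal"]))  # 1.0
```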

What a training episode looks like

Before we explain the framework, here is what a single EcomRLVE episode looks like at difficulty d = 4. The environment generates a hidden goal, a simulated user opens the chat, and the agent must use tools to satisfy the request. Every action is verified algorithmically — no LLM judge required. The reward is fully computed by code: F1 over (product, variant, qty) tuples, an efficiency bonus for finishing in fewer turns, and a hallucination check that every recommended product ID was actually retrieved. If the agent had picked the Lightning variant instead of USB-C, the simulated user would have corrected it mid-dialogue — and the F1 would have dropped.
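
The task-reward part of that computation is small enough to show directly. A sketch of F1 over (product, variant, qty) tuples, mirroring the description above rather than the exact implementation:

```python
def cart_f1(predicted: set[tuple], gold: set[tuple]) -> float:
    """F1 over (product, variant, qty) tuples."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)          # exact tuple matches
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {("P42", "USB-C", 2)}
print(cart_f1({("P42", "USB-C", 2)}, gold))      # 1.0: correct variant
print(cart_f1({("P42", "Lightning", 2)}, gold))  # 0.0: wrong variant, F1 drops
```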

The eight environments

Each environment covers a distinct real-world shopping scenario. The agent must complete the task using tools (catalog search, cart operations, order lookups, policy queries) and is scored by a program — not a human or another LLM.

Environment | What the agent must do
Product Discovery | Find products that satisfy all the user's constraints
Substitution | An item is out of stock — find a similar, compatible alternative
Cart Building | Add the exact products, variants, and quantities the user asked for
Return + Replacement | Identify the right order line, initiate a return, suggest a replacement
Order Tracking | Resolve which order the user means and report its current status
Policy QA | Answer a deterministic question about store policy (return window, shipping rules, etc.)
Bundle Planning | Recommend a complete shopping list for a project within a budget
Multi-Intent Journey | Handle a conversation that chains 2–5 of the above tasks in sequence
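
All eight environments drive the same kind of tool surface. As a rough sketch of what that interface could look like (tool names and signatures are guesses based on the list above, not the actual API):

```python
from typing import Any, Protocol

class EcomTools(Protocol):
    """Hypothetical shared tool surface across the eight environments."""
    def search_catalog(self, query: str, **filters: Any) -> list[dict]: ...
    def add_to_cart(self, product_id: str, variant: str, qty: int) -> dict: ...
    def lookup_order(self, order_id: str) -> dict: ...
    def initiate_return(self, order_id: str, line_id: str) -> dict: ...
    def query_policy(self, topic: str) -> str: ...
```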

Every environment uses the same three-part reward signal; a sketch of how the parts combine in code appears after the list:

  1. Task reward — did the agent actually complete the goal? (e.g., were the right products recommended, was the cart correct, was the right order tracked?)

  2. Efficiency reward — did the agent complete it without wasting turns? Turns the user caused (asking a follow-up, confirming an action) don’t count against the agent — only turns caused by agent mistakes do.

  3. Hallucination penalty — did the agent only recommend products it actually retrieved during the session? Recommending product IDs that were never looked up is penalised, so the agent cannot invent results from memory. Invalid outputs (malformed JSON, illegal tool calls) trigger an immediate failure score, creating a strong incentive for well-formed responses from step one.
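
Put together, the scalar reward for an episode might be assembled like this sketch (the linear combination and the weights are illustrative assumptions, not the exact coefficients):

```python
def total_reward(task_score: float, agent_fault_turns: int,
                 hallucinated_ids: int, output_valid: bool) -> float:
    if not output_valid:                  # malformed JSON or illegal tool call
        return 0.0                        # immediate failure score
    task = task_score                                      # 1. task reward
    efficiency = max(0.0, 1.0 - 0.1 * agent_fault_turns)   # 2. efficiency reward
    hallucination = 0.25 * hallucinated_ids                # 3. hallucination penalty
    return task + 0.2 * efficiency - hallucination

print(total_reward(0.9, agent_fault_turns=1, hallucinated_ids=0,
                   output_valid=True))               # 1.08
print(total_reward(0.9, 0, 0, output_valid=False))   # 0.0
```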

Adaptive difficulty curriculum

A single difficulty number d controls 12 independent aspects of a task simultaneously. This is important because e-commerce conversations are hard in many different ways at once — not just along one dimension. Here are four representative difficulty axes:

What changes | Easy (d = 0) | Medium (d = 6) | Hard (d = 12)
How many constraints the user has | 2 | 5 | 8
How often the user omits a constraint | 5% | 70% | ~80%
Fraction of search results that are distractors | 0% | 12% | 24%
Items that go out of stock mid-conversation | 0% | 30% | 50%
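
One way to realise the single-knob design is piecewise-linear interpolation between the anchor columns above. A sketch over these four axes (parameter names are ours; the real schedule across all 12 axes may differ):

```python
# Anchor values from the table: (Easy d=0, Medium d=6, Hard d=12).
AXES = {
    "num_constraints":       (2,    5,    8),
    "omit_constraint_prob":  (0.05, 0.70, 0.80),
    "distractor_fraction":   (0.00, 0.12, 0.24),
    "midturn_stockout_prob": (0.00, 0.30, 0.50),
}

def axis_value(easy: float, medium: float, hard: float, d: float) -> float:
    """Interpolate linearly within [0, 6] and [6, 12]."""
    if d <= 6:
        return easy + (medium - easy) * d / 6
    return medium + (hard - medium) * (d - 6) / 6

def curriculum(d: float) -> dict:
    """Map the single difficulty knob d to all per-axis task parameters."""
    return {axis: axis_value(*anchors, d) for axis, anchors in AXES.items()}

print(curriculum(4)["omit_constraint_prob"])  # ~0.48, between Easy and Medium
```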

The other eight axes cover…