MindGames Arena Generalization Track: In2AI Solution with Delayed Per-Step Reward Attribution

Training language model agents for multi-agent strategic interaction presents a core difficulty: the quality of any action may depend on future events that never materialize, on moves that violate game rules, or on decisions made by other players. Standard reinforcement learning assumes that rewards can be assigned at each step, but this assumption fails in settings where outcomes are entangled across time and agents.

We introduce delayed per-step reward attribution with eligibility gating, an episode lifecycle and postprocessing pipeline that computes rewards only at episode end, propagates them back to originating steps according to task-specific semantics, and excludes steps that lack valid dependent information from training. Together with asynchronous rollout generation via vLLM’s continuous batching, curriculum-based opponent sampling, and multi-level stratified batch construction, this approach enables stable, sample-efficient RL training in multi-agent environments.
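
The lifecycle described above can be illustrated with a minimal sketch. The abstract does not specify the attribution rules, so everything named here (`Step`, `Episode`, `finalize`, `reply_gated_rule`, the illegal-move penalty of -1.0) is a hypothetical stand-in rather than the authors' implementation. The sketch only shows the general shape: rewards stay unset while the episode runs, a task-specific rule assigns them at episode end, and steps for which the rule cannot produce a reward are gated out of the training batch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    player: int
    action: str
    legal: bool = True
    reward: Optional[float] = None  # stays unset until the episode ends
    eligible: bool = False          # gate: only eligible steps reach training


@dataclass
class Episode:
    steps: List[Step] = field(default_factory=list)

    def record(self, player: int, action: str, legal: bool = True) -> None:
        # During the rollout we only log the step; no reward is computed yet.
        self.steps.append(Step(player, action, legal))


# A rule maps (step index, episode, final outcome) to a reward,
# or to None when the information the step depends on is missing.
Rule = Callable[[int, Episode, Dict[int, float]], Optional[float]]


def reply_gated_rule(i: int, ep: Episode, outcome: Dict[int, float]) -> Optional[float]:
    """Hypothetical task-specific rule, for illustration only."""
    step = ep.steps[i]
    if not step.legal:
        return -1.0  # assumed penalty for a rule-violating move
    # Assumed gate: a legal move is credited only if another player
    # later answered it with a legal move.
    answered = any(s.legal and s.player != step.player for s in ep.steps[i + 1:])
    return outcome[step.player] if answered else None


def finalize(ep: Episode, outcome: Dict[int, float], rule: Rule) -> List[Step]:
    """Run once at episode end: attribute rewards, then apply the gate."""
    for i, step in enumerate(ep.steps):
        step.reward = rule(i, ep, outcome)
        step.eligible = step.reward is not None
    return [s for s in ep.steps if s.eligible]


# Usage: a four-step, two-player episode that ends on an illegal move.
ep = Episode()
ep.record(0, "a")
ep.record(1, "b")
ep.record(0, "c")
ep.record(1, "x", legal=False)
batch = finalize(ep, outcome={0: 1.0, 1: -1.0}, rule=reply_gated_rule)
```

In this toy episode the third step is excluded because no legal reply to it ever occurred, while the other three steps receive rewards only after the outcome is known. Keeping the rule as a pluggable function mirrors the "task-specific semantics" mentioned above; each game would supply its own.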

We evaluate on the MindGames Arena benchmark at NeurIPS 2025. A single 8-billion-parameter open-source model trained with our method matched or surpassed substantially larger proprietary systems, including GPT-5, in head-to-head play and took first place in both the Open (unrestricted) and Efficient (<=8B parameters) tracks.
