vLLM V0 to V1: Correctness Before Corrections in RL
PipelineRL uses vLLM as the inference engine for rollout generation. The inference engine samples tokens and returns token logprobs; the trainer uses those logprobs to compute policy ratios, KL, clip rate, entropy, and reward. Any discrepancy in how those logprobs are computed can change the training dynamics. This is the train-inference mismatch we needed to eliminate during the vLLM V0 to V1 migration.
TL;DR. vLLM V1 matched our vLLM V0 reference after we fixed four things: processed rollout logprobs, V1-specific runtime defaults, the inflight weight-update path, and the fp32 lm_head used for the final projection. We fixed the backend behavior before changing the RL objective. The reference run used vLLM 0.8.5; the V1 runs used vLLM 0.18.1. Figure 1 shows the final result. The red run is the initial V1 attempt, and the green run is the final V1 run after the fixes described below.
Figure 1. Trainer-side metrics for the vLLM V0 reference (blue), the initial vLLM V1 attempt (red), and the final vLLM V1 run (green) after our fixes, including the fp32 lm_head. The final V1 run returns close to the V0 trajectory across clip rate, KL, entropy, and reward.
Migration Objective
vLLM V1 is a substantial rewrite of the V0 engine. Our migration target was therefore deliberately narrow: verify that V1 returned rollout logprobs in the form the trainer expected; rerun the same workload against the V0 reference; evaluate objective-level changes only after backend parity was restored.
The first visible symptoms appeared in clamp_log_ratio_new_old_indicator, KL, entropy, and reward. Those metrics came from a GSPO training run, the objective used for this experiment. The same class of mismatch can surface in PPO, GRPO, or any online RL system that treats rollout-side logprobs as part of the optimization target.
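For concreteness, the sketch below shows how trainer-side statistics of this kind can be derived from the rollout and trainer logprobs. The function name, the clamp threshold, and the exact metric definitions are illustrative assumptions rather than PipelineRL's implementation; the point is only that every one of these signals is a function of the two logprob streams, so a backend change on the rollout side moves all of them.

```python
import torch

def rollout_trainer_stats(trainer_logprobs: torch.Tensor,
                          rollout_logprobs: torch.Tensor,
                          clamp_log_ratio: float = 0.2) -> dict:
    """Illustrative per-token statistics from trainer vs. rollout logprobs."""
    # log pi_trainer(token) - log pi_rollout(token), per sampled token
    log_ratio = trainer_logprobs - rollout_logprobs
    ratio = log_ratio.exp()  # importance / policy ratio
    return {
        # One plausible definition of a clamp indicator: fraction of tokens
        # whose new/old log-ratio exceeds the clamp bound.
        "clamp_indicator": (log_ratio.abs() > clamp_log_ratio).float().mean().item(),
        # Simple sampled-token estimate of KL(rollout || trainer).
        "approx_kl": (-log_ratio).mean().item(),
        # Deviation of the mean policy ratio from 1.0, scaled by 10,000 (as in Figure 4).
        "ratio_dev_x10k": (ratio.mean().item() - 1.0) * 1e4,
    }
```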
The initial V1 run showed the problem clearly. The trainer-side logprobs and reward moved away from the V0 reference early in training.
Figure 2. Current-policy logprobs computed by the trainer during updates (left) and reward (right). The initial vLLM V1 run (red) separates from the vLLM V0 reference (blue).
The same pattern appears in the trainer metrics. Clip rate is the easiest signal to read in the initial comparison.
Figure 3. Trainer-side metrics for the vLLM V0 reference (blue) and the initial vLLM V1 attempt (red). Clip rate tracks the rollout/trainer policy gap; entropy and reward show how that gap propagates into training.
Failure Modes
We separated the possible causes into three layers:
- Semantic mismatch: the backend returns logprobs whose meaning differs from what the trainer expects.
- Inference-path mismatch: the backend uses different runtime defaults for caching, scheduling, or request handling, so the same prompts follow a different execution path.
- Objective mismatch: the RL objective needs correction for the amount of staleness or backend mismatch that remains.
We initially suspected the third category too early. The useful diagnosis came from treating the first two as backend behavior problems and ruling them out first.
V1 Backend Fixes
Logprob Semantics
The first issue was semantic. vLLM V1 returns logprobs from the raw model outputs by default, before logits post-processing such as temperature scaling, penalties, and top-k/top-p filtering. PipelineRL expected logprobs from the processed distribution used by the sampler. The required setting was: logprobs-mode=processed_logprobs.
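To make the distinction concrete, here is a toy sketch of the two semantics, assuming temperature scaling and top-p filtering are the only transforms the sampler applies; it mirrors the idea, not vLLM's actual sampler code.

```python
import torch
import torch.nn.functional as F

def raw_logprob(logits: torch.Tensor, token_id: int) -> float:
    # "Raw" semantics: log-softmax over the unmodified model logits.
    return F.log_softmax(logits, dim=-1)[token_id].item()

def processed_logprob(logits: torch.Tensor, token_id: int,
                      temperature: float = 1.0, top_p: float = 1.0) -> float:
    # "Processed" semantics: apply the sampler's transforms first
    # (temperature, then nucleus filtering), and read the log-probability
    # off the distribution that tokens are actually drawn from.
    scaled = logits / temperature
    probs = F.softmax(scaled, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cum = sorted_probs.cumsum(dim=-1)
    keep = cum - sorted_probs < top_p  # smallest token set covering top_p mass
    filtered = torch.full_like(scaled, float("-inf"))
    filtered[sorted_idx[keep]] = scaled[sorted_idx[keep]]
    return F.log_softmax(filtered, dim=-1)[token_id].item()
```

With temperature at 1.0 and top_p at 1.0 the two numbers coincide; with any other sampling settings they differ for every token, which is exactly the kind of systematic offset that shows up as a mean shift in the rollout logprobs.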
This removed the obvious mean offset in rollout logprobs. The training curves still showed a gap relative to the known-good reference, so the next issue had to be in the inference path. The policy-ratio plot shows this directly. Once processed_logprobs is on for V1, the mean policy ratio stays centered extremely close to 1.0 across all three runs. That establishes the mean-bias fix. The remaining mismatch shows up in clip rate, KL, entropy, and downstream training behavior.
Figure 4. Per-step deviation of the rollout/trainer policy ratio from 1.0, scaled by 10,000, for the vLLM V0 reference (blue), the initial vLLM V1 run (red), and the corrected vLLM V1 run (green).
Runtime Defaults
The early V1 run mixed the engine version with V1 runtime defaults:
- Prefix caching: left unset in the early run, so the vLLM 0.18.1 default applied.
- Async scheduling: left unset in the early run, so the vLLM 0.18.1 default applied.
- An ad-hoc disable-cascade-attn override: set through launch-time kwarg passthrough, outside the parity recipe in the committed config.
For the parity run, we made these choices explicit:
vllm_config:
use_v1: true
vllm_kwargs:
logprobs-mode: processed_logprobs
enable-prefix-caching: false
async-scheduling: false
Prefix caching deserves a separate note. It is normally a correctness-preserving inference optimization for a fixed model state. In this online RL setup, it was a V1-only difference in cache lifetime and reuse relative to the V0 reference path. The actor was also handling repeated prefixes, concurrent requests, async scheduling, and inflight weight updates. A prefix-cache hit can reuse state computed before a weight update when the cache policy ignores the weight-update boundary. Disabling prefix caching removed one V1-only degree of freedom from the parity comparison.
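The hazard is easy to reproduce in miniature. The toy model below is a stand-in for cached attention state, not vLLM's KV cache: it folds a prefix into a recurrent state under one set of weights, applies an "inflight update" to those weights, and compares next-token logprobs computed from the stale cached state against a full recompute under the new weights.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 8, 16
emb = rng.normal(size=(vocab, hidden))

def run_prefix(W, tokens):
    """Fold a token prefix into a cached state under weights W."""
    state = np.zeros(hidden)
    for t in tokens:
        state = np.tanh(W @ emb[t] + state)
    return state

def next_logprobs(W_out, state):
    logits = W_out @ state
    return logits - np.log(np.exp(logits).sum())

W_old = 0.1 * rng.normal(size=(hidden, hidden))
W_new = W_old + 0.05 * rng.normal(size=(hidden, hidden))  # inflight weight update
W_out = rng.normal(size=(vocab, hidden))

prefix = [1, 2, 3, 4]
stale = run_prefix(W_old, prefix)   # prefix-cache hit: state computed pre-update
fresh = run_prefix(W_new, prefix)   # full recompute under the updated weights

# Same prompt, same current weights at the head, different next-token logprobs.
print(np.abs(next_logprobs(W_out, stale) - next_logprobs(W_out, fresh)).max())
```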
Inflight Weight Updates
Weight synchronization also had to match the online-RL update model. One option was to make V1 stricter than V0 by draining requests and clearing caches at every update. That would answer a separate question. We first needed to verify that V1 could match the existing V0 behavior. What V0 effectively did was closer to: block execution at an engine boundary, load the new weights, resume without an explicit cached-state invalidation. The nearest V1 analogue was: await engine.pause_generation.
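A minimal sketch of that update path, assuming the pause call named above and treating both the weight transfer and the resume call as placeholders whose exact names depend on the engine version:

```python
from typing import Awaitable, Callable

async def inflight_weight_update(engine,
                                 load_new_weights: Callable[[], Awaitable[None]]) -> None:
    """Mirror the V0 update model: block generation at an engine boundary,
    load the new weights, and resume without an explicit cached-state
    invalidation."""
    # Pause call named in the text above.
    await engine.pause_generation()
    try:
        # The actual weight transfer (e.g. a broadcast from the trainer) is
        # injected by the caller; it is not part of this sketch.
        await load_new_weights()
    finally:
        # Assumed resume counterpart; placeholder name, check the engine API.
        await engine.resume_generation()
```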