To Intervene or Not: Guiding Inference-time Alignment with Probabilistic Model Blending

Abstract: The wide deployment of LLMs has made model alignment necessary so that newly trained models respond to user instructions safely and effectively. Among alignment methods, inference-time alignment is often cheaper because it intervenes (i.e., offers guidance) only during output generation.

Existing proposals apply guidance extracted from aligned models without properly assessing its reliability. However, our systematic evaluation reveals that guidance effectiveness varies drastically across models. Because ineffective guidance causes further confusion and therefore triggers further interventions, excessive intervention typically signals poor performance.

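The link between ineffective guidance and excessive intervention can be illustrated with a minimal sketch of binary (all-or-nothing) guidance. It assumes a hypothetical rule, not taken from the paper, that replaces the base model's greedy token with the guide model's token whenever their top-1 choices disagree, and it reports the fraction of overridden steps as the intervention rate. The `Model` interface, the toy distributions, and all function names are illustrative assumptions.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: a "model" maps a token prefix to a next-token
# probability vector over a small, fixed toy vocabulary.
Model = Callable[[List[int]], Sequence[float]]


def argmax(probs: Sequence[float]) -> int:
    """Return the index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def decode_with_binary_guidance(
    base: Model, guide: Model, steps: int
) -> Tuple[List[int], float]:
    """Greedy decoding where the guide either fully overrides the base or not at all.

    Returns the generated tokens and the intervention rate, i.e. the fraction
    of steps at which the guide's token replaced the base model's token.
    """
    prefix: List[int] = []
    interventions = 0
    for _ in range(steps):
        base_tok = argmax(base(prefix))
        guide_tok = argmax(guide(prefix))
        if base_tok != guide_tok:
            interventions += 1  # binary decision: intervene with the guide's token
            prefix.append(guide_tok)
        else:
            prefix.append(base_tok)  # models agree, no intervention needed
    rate = interventions / steps if steps > 0 else 0.0
    return prefix, rate


# Toy models over a 3-token vocabulary (illustrative numbers only).
def toy_base(prefix: List[int]) -> List[float]:
    return [0.6, 0.3, 0.1]


def toy_guide(prefix: List[int]) -> List[float]:
    # Disagrees with the base model on every even-length prefix.
    return [0.2, 0.7, 0.1] if len(prefix) % 2 == 0 else [0.5, 0.4, 0.1]


tokens, rate = decode_with_binary_guidance(toy_base, toy_guide, steps=4)
print(tokens, rate)  # [1, 0, 1, 0] 0.5
```

Under this toy rule, a high intervention rate flags a model pair whose guide frequently disagrees with the base model, which is the kind of excessive intervention the evaluation associates with poor performance.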
To make interventions more effective, and thus more efficient, we introduce BlendIn, an inference-time alignment framework that replaces binary intervene-or-not decisions with hybrid distributions integrating the knowledge of both models. BlendIn stabilizes inference-time alignment by performing quality-aware alignment, weighting each model's contribution in proportion to its reliability.

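The shift from a binary decision to a hybrid distribution can be sketched as a convex mixture of the two next-token distributions, p_mix = w · p_guide + (1 − w) · p_base. This is an illustrative assumption rather than BlendIn's actual formulation, which the abstract does not specify; in particular, the reliability weight `w` below comes from a hypothetical heuristic based on the guide's normalized entropy, so a confident guide receives more weight. Setting `w = 1` recovers a full intervention and `w = 0` ignores the guide.

```python
import math
from typing import List, Sequence


def normalized_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy divided by its maximum, log(vocab size); lies in [0, 1]."""
    if len(probs) < 2:
        raise ValueError("need a vocabulary of at least two tokens")
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def reliability_weight(p_guide: Sequence[float]) -> float:
    """Hypothetical reliability heuristic: confident (low-entropy) guidance scores higher."""
    return 1.0 - normalized_entropy(p_guide)


def blend(p_base: Sequence[float], p_guide: Sequence[float], w: float) -> List[float]:
    """Convex mixture w * p_guide + (1 - w) * p_base; remains a valid distribution."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    if len(p_base) != len(p_guide):
        raise ValueError("distributions must share a vocabulary")
    return [w * g + (1.0 - w) * b for b, g in zip(p_base, p_guide)]


p_base = [0.6, 0.3, 0.1]
p_guide = [0.05, 0.9, 0.05]  # a confident guide

w = reliability_weight(p_guide)
mixed = blend(p_base, p_guide, w)
print(round(w, 3), [round(p, 3) for p in mixed])
```

Because `w` varies continuously, uncertain guidance only nudges the base distribution instead of overriding it; with the confident guide above, the guide's preferred token 1 becomes the most probable token in the mixture.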
Compared with existing work, BlendIn preserves beneficial guidance while downweighting unreliable suggestions. It provides both diagnostic signals and mitigation strategies for misaligned guidance, achieving consistent performance improvements of up to 50% on challenging model pairs. Our code is available at: this https URL.