Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents

Building generalist embodied agents capable of solving complex real-world tasks remains a fundamental challenge in AI. Multimodal Large Language Models (MLLMs) have significantly advanced the reasoning capabilities of such agents through strong vision-language knowledge and chain-of-thought (CoT) reasoning, yet remain brittle when faced with challenging out-of-distribution scenarios.

To address this, we propose Verifier-Guided Action Selection (VeGAS), a test-time framework designed to improve the robustness of MLLM-based embodied agents through an explicit verification step. At inference time, rather than committing to a single decoded action, VeGAS samples an ensemble of candidate actions and uses a generative verifier to identify the most reliable choice, without modifying the underlying policy.

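The abstract does not spell out the interfaces involved, but the selection step it describes can be sketched roughly as follows. This is a minimal sketch under assumptions: `policy.sample_action`, `verifier.score`, and the candidate count are hypothetical stand-ins, not the paper's actual API.

```python
# Hypothetical sketch of verifier-guided action selection at test time.
# `policy` (the frozen MLLM agent) and `verifier` (the trained generative
# verifier) are assumed interfaces, not the paper's implementation.
from dataclasses import dataclass


@dataclass
class Candidate:
    action: str   # a decoded action, e.g. "pick up the mug"
    score: float  # verifier's estimate of how reliable the action is


def select_action(policy, verifier, observation, instruction, n_candidates=8):
    """Sample an ensemble of candidate actions and return the one the
    verifier rates as most reliable, leaving the policy itself unchanged."""
    candidates = []
    for _ in range(n_candidates):
        # Each sample is an independent stochastic decode from the frozen policy.
        action = policy.sample_action(observation, instruction)
        # The verifier judges the action against the current observation and
        # instruction, returning a scalar reliability score.
        score = verifier.score(observation, instruction, action)
        candidates.append(Candidate(action, score))
    # Commit to the highest-scoring candidate instead of the first decode.
    return max(candidates, key=lambda c: c.score).action
```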

Crucially, we find that using an off-the-shelf MLLM as a verifier yields no improvement, motivating our LLM-driven data synthesis strategy, which automatically constructs a diverse curriculum of failure cases to expose the verifier to a rich distribution of potential errors at training time.

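As a rough illustration of what such a synthesis loop might look like, the sketch below corrupts known-good actions into labeled failure cases for verifier training. The `llm.generate` call and the error taxonomy are illustrative assumptions, not the paper's actual curriculum design.

```python
# Hypothetical sketch of LLM-driven failure-case synthesis.
# ERROR_TYPES is an illustrative taxonomy; the paper's curriculum is not
# specified in this abstract.
import random

ERROR_TYPES = [
    "wrong object",           # act on a distractor instead of the target
    "wrong receptacle",       # place the object in an incorrect location
    "premature termination",  # stop before the sub-goal is complete
    "redundant action",       # repeat a step that has already succeeded
]


def synthesize_negative(llm, instruction, correct_action):
    """Ask an LLM to corrupt a correct action into a plausible failure,
    producing a labeled negative example for training the verifier."""
    error_type = random.choice(ERROR_TYPES)
    prompt = (
        f"Instruction: {instruction}\n"
        f"Correct next action: {correct_action}\n"
        f"Rewrite the action so that it exhibits the error '{error_type}', "
        f"while remaining superficially plausible."
    )
    bad_action = llm.generate(prompt)  # assumed text-generation interface
    return {"instruction": instruction, "action": bad_action, "label": 0}
```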

Across embodied reasoning benchmarks spanning the Habitat and ALFRED environments, VeGAS consistently improves generalization, achieving up to a 36% relative performance gain over strong CoT baselines on the most challenging multi-object, long-horizon tasks.
