Physically Viable World Models: A Case for Query-Conditioned Embodied AI

Abstract: World models for embodied AI must be physically viable. That is, they must be constructed to answer intervention queries by representing the physical structure that governs action outcomes, rather than merely predicting future observations. Existing observation-predictive world models can produce rollouts that are visually plausible but physically wrong. This failure is structural: distinct physical systems can look identical yet diverge under intervention. We expose the problem with controlled benchmarks that hold the visible scene fixed while varying the latent physics. We show that observation-predictive models may recommend infeasible actions, mispredict interaction outcomes, or certify unsafe behavior.
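
To make the structural failure concrete, the following minimal sketch considers two boxes that produce the same observation but differ in a latent friction coefficient. It is an illustration only, not one of the paper's benchmarks: the `Box` class, the `slides_under_push` helper, the Coulomb static-friction abstraction, and all numeric values are assumptions introduced here.

```python
from dataclasses import dataclass

G = 9.81  # gravitational acceleration in m/s^2


@dataclass(frozen=True)
class Box:
    color: str        # visible to the observer
    size_m: float     # visible to the observer
    mass_kg: float    # latent: not recoverable from appearance
    mu_static: float  # latent: not recoverable from appearance

    def observation(self):
        """Return only what an observation-predictive model gets to see."""
        return (self.color, self.size_m)


def slides_under_push(box: Box, force_n: float) -> bool:
    """Coulomb static friction: the box moves only if the push exceeds mu * m * g."""
    return force_n > box.mu_static * box.mass_kg * G


# Hypothetical scenes: identical appearance, different latent friction.
box_a = Box("red", 0.3, mass_kg=2.0, mu_static=0.2)  # slip threshold ~3.9 N
box_b = Box("red", 0.3, mass_kg=2.0, mu_static=0.8)  # slip threshold ~15.7 N

same_view = box_a.observation() == box_b.observation()
outcomes = (slides_under_push(box_a, 10.0), slides_under_push(box_b, 10.0))
print(same_view, outcomes)  # True (True, False)
```

Any predictor that is a function of the observation alone must give the same answer for both boxes, so it is wrong for at least one of them under the 10 N push.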

We argue that embodied AI requires world models that identify the simplest physical abstraction sufficient to answer an intervention query. Such a model comprises modular components, including environment representation, latent state and parameter estimation, action specification, interventional dynamics, and query-level response. An autonomous orchestrator should identify the relevant abstraction and compose compatible learned and structured components per query. The transition model may be analytic, simulated, learned, or hybrid, with the non-analytic options covering cases where closed-form physics is unavailable, uncertain, or costly; in every case it must preserve the structure that determines interventional outcomes.
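
The modular decomposition above can be sketched as a small pipeline. This is a hypothetical illustration under our own assumptions, not the authors' implementation: the `ViableWorldModel` class, the `orchestrate` function, the registry, the `will_it_slide` query name, and the idea of estimating friction from a probed slip force are all invented for the example.

```python
from dataclasses import dataclass
from typing import Callable

G = 9.81  # gravitational acceleration in m/s^2


@dataclass
class ViableWorldModel:
    represent: Callable  # observation -> scene (environment representation)
    estimate: Callable   # scene -> latent parameters
    specify: Callable    # raw action -> intervention (action specification)
    dynamics: Callable   # (parameters, intervention) -> outcome
    respond: Callable    # outcome -> answer (query-level response)

    def answer(self, observation, action):
        scene = self.represent(observation)
        params = self.estimate(scene)
        intervention = self.specify(action)
        return self.respond(self.dynamics(params, intervention))


def represent(obs):
    # Keep only the quantities this query needs.
    return {"mass_kg": obs["mass_kg"], "probe_slip_force_n": obs["probe_slip_force_n"]}


def estimate(scene):
    # Assumed estimator: recover static friction from a probed slip force.
    mu = scene["probe_slip_force_n"] / (scene["mass_kg"] * G)
    return {"mass_kg": scene["mass_kg"], "mu_static": mu}


def specify(action):
    return {"push_n": float(action["push_n"])}


def dynamics(params, intervention):
    # Analytic transition model; a simulated or learned one could be swapped in.
    threshold = params["mu_static"] * params["mass_kg"] * G
    return {"moves": intervention["push_n"] > threshold}


def respond(outcome):
    return "slides" if outcome["moves"] else "stays"


REGISTRY = {
    "will_it_slide": ViableWorldModel(represent, estimate, specify, dynamics, respond),
}


def orchestrate(query_kind):
    """Pick the composed model registered for this kind of query."""
    if query_kind not in REGISTRY:
        raise ValueError(f"no viable model registered for query {query_kind!r}")
    return REGISTRY[query_kind]


model = orchestrate("will_it_slide")
obs = {"mass_kg": 2.0, "probe_slip_force_n": 15.0}
print(model.answer(obs, {"push_n": 10.0}), model.answer(obs, {"push_n": 20.0}))
```

Because each stage is a separate callable, a component can be checked or replaced on its own, and the intermediate values can be logged against the query.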

This decomposition makes the model interpretable, its components verifiable, and its outputs auditable against the query. It also provides a design principle for new world models and a feasibility test for existing ones: the right abstraction is not the most detailed model of the world, but the simplest model that preserves the distinctions relevant to the query. We demonstrate this approach on queries that existing systems fail to answer correctly, and outline how an orchestrator can dynamically assemble and adapt physically viable models for planning, control, and verification.
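
One way to read "the simplest model that preserves the distinctions relevant to the query" is as a search over subsets of state variables: a subset is sufficient if any two candidate worlds that agree on it also agree on the query answer. The sketch below is a toy under that assumed reading; the variable names, the enumerated worlds, and the friction-based query are hypothetical and not taken from the paper.

```python
from itertools import combinations, product

G = 9.81  # gravitational acceleration in m/s^2
VARIABLES = ("color", "mass_kg", "mu_static")

# Hypothetical hypothesis space: every combination of the values below.
worlds = [
    dict(zip(VARIABLES, values))
    for values in product(("red", "blue"), (1.0, 2.0), (0.2, 0.8))
]


def slides(world, push_n):
    """Query answer in this toy: does the box move under the given push?"""
    return push_n > world["mu_static"] * world["mass_kg"] * G


def is_sufficient(kept, push_n):
    """True if worlds that agree on the kept variables always agree on the answer."""
    answers = {}
    for world in worlds:
        key = tuple(world[v] for v in kept)
        answers.setdefault(key, set()).add(slides(world, push_n))
    return all(len(a) == 1 for a in answers.values())


def simplest_abstraction(push_n):
    """Return the smallest variable subset that still determines the answer."""
    for size in range(len(VARIABLES) + 1):
        for kept in combinations(VARIABLES, size):
            if is_sufficient(kept, push_n):
                return kept
    return VARIABLES  # unreachable here: the full state is always sufficient


print(simplest_abstraction(1.0))   # ()
print(simplest_abstraction(5.0))   # ('mu_static',)
print(simplest_abstraction(10.0))  # ('mass_kg', 'mu_static')
```

The selected abstraction changes with the query: color is never needed, while mass matters only once the push is large enough for it to change the outcome.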
