MolmoMotion: Language-guided 3D motion forecasting

Machines have become remarkably good at perceiving motion. Given a video, modern models can track how objects and points move through a scene with exceptionally high confidence. But perception is inherently retrospective: it explains motion that has already happened. Many of the systems and applications we want to build need to look forward instead. A robot reaching for a cup has to anticipate how the cup will move before it touches it. A video generator has to know what realistic motion comes next if it’s going to produce physically plausible frames. Predicting motion is harder than observing it, but it’s also far more useful in many scenarios.

This idea motivated MolmoMotion, a new motion forecasting model we’re releasing today. Given a video frame, 3D points marked on an object, and written instructions describing the intended action (e.g., “Move and rotate the wooden bowl with fruit on the table”), MolmoMotion predicts where those points will move in 3D space over the next few seconds, and it does so with substantially stronger performance than existing forecasting methods.

Given an RGB observation, a set of query points on an object, and an action description, MolmoMotion predicts the object’s future 3D point trajectory. These predicted trajectories can then drive downstream applications such as robot planning and trajectory-conditioned video generation.

Alongside the model, we’re publishing MolmoMotion-1M, the largest collection of 3D point trajectories paired with action descriptions, drawn from 1.16M videos. We’re also releasing PointMotionBench, a human-validated benchmark of 2.7K video clips designed to measure object-centric 3D motion forecasting accuracy.
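
To give a feel for what "forecasting accuracy" can mean for 3D point trajectories, here is a minimal sketch of two metrics that are common in trajectory forecasting: average displacement error (ADE) and final displacement error (FDE). This post does not say which metrics PointMotionBench reports, so treat the choice of metric, the function name `displacement_errors`, and the `(T, N, 3)` array layout as assumptions made purely for illustration.

```python
import numpy as np

def displacement_errors(pred, gt):
    """Return (ADE, FDE) for trajectories of shape (T, N, 3).

    ADE averages the Euclidean error over all time steps and points;
    FDE averages it over points at the final time step only.
    (Illustrative metrics, not necessarily what PointMotionBench uses.)
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.shape != gt.shape or pred.shape[-1] != 3:
        raise ValueError("pred and gt must both have shape (T, N, 3)")
    dists = np.linalg.norm(pred - gt, axis=-1)  # (T, N) per-point error
    return float(dists.mean()), float(dists[-1].mean())

# Toy data: one point moving 0.1 m per step along x for three steps.
gt = np.array([[[0.1, 0.0, 1.0]], [[0.2, 0.0, 1.0]], [[0.3, 0.0, 1.0]]])
# A forecast whose sideways (y) error grows by 1 cm per step.
drift = np.array([0.01, 0.02, 0.03])
pred = gt.copy()
pred[:, 0, 1] += drift

ade, fde = displacement_errors(pred, gt)
print(f"ADE={ade:.3f} m, FDE={fde:.3f} m")  # ADE=0.020 m, FDE=0.030 m
```

Because the trajectories are metric and live in 3D, errors like these come out directly in meters rather than pixels.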

We find that motion forecasters like MolmoMotion can be useful across a range of downstream tasks, from robot planning to controllable video generation. We’re releasing the model weights, the MolmoMotion-1M dataset, and our PointMotionBench benchmark openly for the community to study, improve, and customize.

MolmoMotion: Under the hood

MolmoMotion represents motion in a deliberate, highly efficient way: as object-attached 3D points in world space, which capture motion without the cost of rendering full video. We chose this representation because we needed a general one with three properties:

- Class-agnostic: not tied to templates for human bodies, hands, rigid objects, or any other fixed category.
- View-stable: the same physical motion should be represented consistently across cameras and viewpoints.
- Directly usable by downstream systems that need to reason about physical motion.

Among the representations we considered, this was the only one that satisfied all three. A sparse set of surface points can describe rigid, articulated, and (within limits) deformable motion without assuming the type of object being moved. Because the points live in a shared world frame, their trajectories remain stable across camera motion and viewpoint change. And because they’re compact, explicit trajectories in 3D space, they can be passed directly to systems such as robot policies or video generation models.
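
The view-stability argument can be made concrete with a small sketch. It assumes a trajectory is stored as a NumPy array of shape `(T, N, 3)` (time steps, points, xyz) and that each camera has a known pose in the world; the array layout, the helper names `world_to_camera` and `camera_to_world`, and the two camera poses are illustrative assumptions, not MolmoMotion's actual data format.

```python
import numpy as np

def world_to_camera(points_world, R_wc, t_wc):
    """Map world-frame points (..., 3) into a camera frame.

    R_wc (3x3) and t_wc (3,) are the camera's pose in the world, i.e.
    X_world = R_wc @ X_cam + t_wc, so X_cam = R_wc.T @ (X_world - t_wc).
    """
    return (points_world - t_wc) @ R_wc

def camera_to_world(points_cam, R_wc, t_wc):
    """Inverse of world_to_camera."""
    return points_cam @ R_wc.T + t_wc

# Hypothetical trajectory: T=4 steps, N=2 points, sliding 0.1 m per step along +x.
T = 4
start = np.array([[0.0, 0.0, 1.0], [0.1, 0.0, 1.0]])             # (N, 3)
steps = np.arange(T)[:, None, None] * np.array([0.1, 0.0, 0.0])  # (T, 1, 3)
traj_world = start[None] + steps                                  # (T, N, 3)

# Camera A sits at the world origin; camera B is rotated 90 degrees about y and shifted.
R_a, t_a = np.eye(3), np.zeros(3)
theta = np.pi / 2
R_b = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                [0.0, 1.0, 0.0],
                [-np.sin(theta), 0.0, np.cos(theta)]])
t_b = np.array([1.0, 0.0, 0.0])

traj_cam_a = world_to_camera(traj_world, R_a, t_a)
traj_cam_b = world_to_camera(traj_world, R_b, t_b)

# The same motion looks different in each camera's own coordinates...
move_a = traj_cam_a[-1, 0] - traj_cam_a[0, 0]   # along +x in camera A
move_b = traj_cam_b[-1, 0] - traj_cam_b[0, 0]   # along +z in camera B
# ...but both views map back to one and the same world-frame trajectory.
recovered = camera_to_world(traj_cam_b, R_b, t_b)
```

In this toy setup the 0.3 m slide appears as motion along x in camera A and along z in camera B, while the world-frame array is identical no matter which camera observed it.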

To forecast those trajectories, MolmoMotion uses Molmo 2 as its backbone, allowing it to connect language instructions to objects and points in an image. Given a short video history, an action description, and a set of query points with their initial 3D positions, the model first identifies the object being referred to, the query points, and the motion the instruction describes. It then predicts the future 3D trajectory of each point.
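
The inputs and outputs described above can be summarized as a small data contract. The sketch below is not the released MolmoMotion API: the `ForecastQuery` class, its field names, the array shapes, and the `constant_position_baseline` placeholder are all hypothetical, chosen only to show what goes in and what comes out.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class ForecastQuery:
    """Hypothetical container for one forecasting request."""
    frames: np.ndarray        # (F, H, W, 3) uint8 RGB video history
    instruction: str          # natural-language action description
    query_points: np.ndarray  # (N, 3) initial 3D positions of the query points

    def __post_init__(self):
        if self.frames.ndim != 4 or self.frames.shape[-1] != 3:
            raise ValueError("frames must have shape (F, H, W, 3)")
        if self.query_points.ndim != 2 or self.query_points.shape[1] != 3:
            raise ValueError("query_points must have shape (N, 3)")

def constant_position_baseline(query, horizon):
    """Placeholder forecaster: every point stays where it started.

    A real model would condition on the frames and the instruction;
    this stand-in only demonstrates the output shape (horizon, N, 3).
    """
    return np.repeat(query.query_points[None], horizon, axis=0)

query = ForecastQuery(
    frames=np.zeros((2, 48, 64, 3), dtype=np.uint8),
    instruction="Move and rotate the wooden bowl with fruit on the table",
    query_points=np.array([[0.10, 0.0, 0.8], [0.15, 0.0, 0.8]]),
)
future = constant_position_baseline(query, horizon=8)
print(future.shape)  # (8, 2, 3)
```

The output has one 3D position per query point per future time step, which is the form a downstream planner or video generator would consume.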

We train two variants of MolmoMotion:

- The autoregressive variant (MolmoMotion-AR) predicts future coordinates step by step. It represents 3D coordinates as structured text, following the coordinate-style prediction used by VLMs, and writes out the future trajectory in temporal order. Because each new coordinate is conditioned on the trajectory already generated, this encourages smooth rollouts and gives the strongest accuracy when the future path is well-defined.
- The flow-matching variant (MolmoMotion-FM) predicts trajectories in continuous 3D space by transforming noise into motion, which makes it better suited for representing uncertainty when an instruction admits multiple plausible futures.
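
The two output styles can be illustrated with a toy sketch. The first half round-trips a trajectory through a made-up text format, in the spirit of the autoregressive variant writing coordinates as structured text; the second half integrates a velocity field from Gaussian noise with Euler steps, which is the generic sampling loop behind flow matching. The text format, the function names, and the hand-written velocity field (a stand-in for a learned network) are all assumptions for illustration and say nothing about MolmoMotion's actual tokenization or architecture.

```python
import re

import numpy as np

def trajectory_to_text(traj, decimals=2):
    """Serialize a (T, N, 3) trajectory, one line per time step (made-up format)."""
    lines = []
    for t, frame in enumerate(traj):
        pts = " ".join(
            "(" + ", ".join(f"{c:.{decimals}f}" for c in p) + ")" for p in frame
        )
        lines.append(f"t={t}: {pts}")
    return "\n".join(lines)

def text_to_trajectory(text):
    """Parse the format produced by trajectory_to_text back into an array."""
    frames = []
    for line in text.splitlines():
        triples = re.findall(r"\(([^)]*)\)", line)
        frames.append([[float(v) for v in tr.split(",")] for tr in triples])
    return np.array(frames)

def sample_flow(velocity_fn, shape, num_steps=20, rng=None):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(shape)  # start from pure Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

def make_toy_velocity(target):
    """Stand-in for a learned network: a straight-line field pulling x toward `target`."""
    def velocity(x, t):
        return (target - x) / (1.0 - t)
    return velocity

# Toy trajectory: T=3 steps, N=2 points.
traj = np.array([
    [[0.10, 0.00, 0.80], [0.20, 0.00, 0.80]],
    [[0.12, 0.01, 0.80], [0.22, 0.01, 0.80]],
    [[0.14, 0.02, 0.80], [0.24, 0.02, 0.80]],
])

text = trajectory_to_text(traj)        # what a text-based decoder might emit
decoded = text_to_trajectory(text)     # back to numbers for downstream use

sample = sample_flow(make_toy_velocity(traj), traj.shape,
                     rng=np.random.default_rng(0))
```

The toy velocity field sends every noise sample to the same trajectory, so `sample` matches `traj`. A learned field conditioned on the image and instruction could instead send different noise samples to different plausible trajectories, which is what makes this style a natural fit for ambiguous instructions.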

Introducing MolmoMotion-1M and PointMotionBench

To train MolmoMotion, we needed data that didn’t yet exist: large-scale videos with 3D point trajectories grounded to specific objects and paired with action descriptions. Existing 3D-track datasets were small and domain-limited, and while internet videos have all the scale and diversity we wanted for a forecaster like MolmoMotion, they don’t come with 3D annotations. So we built an automatic pipeline that extracts object-grounded 3D trajectories from unconstrained video. Given an input video and its action description, the pipeline produces object-grounded 3D point trajectories in metric world coordinates. (The figure below shows each stage.) The challenging part is that raw tracks from unconstrained video are noisy, with depth and tracking errors that leave points jittering.
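
To see why depth errors turn into 3D jitter, it helps to look at how a 2D track is usually lifted into world coordinates. The sketch below uses standard pinhole back-projection with per-frame depth and camera pose. The post does not describe the pipeline's internals here, so the function `backproject_track`, the intrinsics, and the noise values are assumptions for illustration, not the actual annotation code.

```python
import numpy as np

def backproject_track(pixels, depths, K, R_wc, t_wc):
    """Lift one 2D point track to 3D world coordinates (pinhole model).

    pixels: (T, 2) pixel coordinates (u, v)
    depths: (T,)   metric depth along the camera z-axis
    K:      (3, 3) camera intrinsics
    R_wc:   (T, 3, 3) and t_wc: (T, 3) camera-to-world pose per frame,
            i.e. X_world = R_wc @ X_cam + t_wc
    """
    ones = np.ones((pixels.shape[0], 1))
    homog = np.concatenate([pixels, ones], axis=1)   # (T, 3) homogeneous pixels
    rays = homog @ np.linalg.inv(K).T                # (T, 3) rays with z = 1
    points_cam = rays * depths[:, None]              # scale each ray by its depth
    return np.einsum("tij,tj->ti", R_wc, points_cam) + t_wc

# Hypothetical static camera: focal length 500 px, principal point (320, 240).
T = 4
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R_wc = np.tile(np.eye(3), (T, 1, 1))
t_wc = np.zeros((T, 3))

# A stationary point seen at the same pixel in every frame, 2 m away.
pixels = np.tile([420.0, 240.0], (T, 1))
clean = backproject_track(pixels, np.full(T, 2.0), K, R_wc, t_wc)

# The same track with a few centimeters of per-frame depth error.
noisy_depths = 2.0 + np.array([0.0, 0.05, -0.05, 0.02])
noisy = backproject_track(pixels, noisy_depths, K, R_wc, t_wc)

z_jitter = noisy[:, 2].max() - noisy[:, 2].min()   # spread along the viewing direction
x_jitter = noisy[:, 0].max() - noisy[:, 0].min()   # smaller sideways spread
```

With perfect depth the stationary point stays at a single 3D location. With a few centimeters of depth error it slides back and forth along the viewing ray even though nothing moved, which is the kind of jitter a pipeline working on unconstrained video has to deal with.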
