LAST: Bridging Vision-Language and Action Manifolds via Gromov-Wasserstein Alignment

Abstract: We take a Gromov-Wasserstein perspective on Vision-Language-Action (VLA) learning, where the goal is to make the relational geometry of action representations compatible with the semantic geometry of vision-language (VL) embeddings. This alignment is non-trivial because the two domains are mathematically heterogeneous: the VL semantic space is topologically linear and isotropic, whereas the physical manifold of robotic actions is non-Euclidean and anisotropic. Their disjoint metric structures render direct regression ill-posed.
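
For reference, the standard discrete Gromov-Wasserstein discrepancy with squared loss is sketched below. The abstract does not state the exact objective the paper uses, so the notation is an assumption made here for illustration: x_i are VL embeddings with metric d_X, y_j are action representations with metric d_Y, and \mu, \nu are their empirical distributions.

```latex
\mathrm{GW}(\mu, \nu)
  = \min_{\pi \in \Pi(\mu, \nu)}
    \sum_{i,k} \sum_{j,l}
    \bigl| d_X(x_i, x_k) - d_Y(y_j, y_l) \bigr|^{2} \, \pi_{ij} \, \pi_{kl},
\qquad
\Pi(\mu, \nu) = \Bigl\{ \pi \ge 0 : \textstyle\sum_j \pi_{ij} = \mu_i,\ \sum_i \pi_{ij} = \nu_j \Bigr\}
```

The objective compares only pairwise distances within each space, which is why it can relate two domains that share no common coordinate system.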

To resolve this incompatibility, we introduce LAST (Lie-algebraic Action Space Tokenizer), which reconstructs the action space to establish local metric compatibility with the VL modality via a two-stage transformation. (1) Global Topological Linearization: the action manifold is linearized via a Lie-algebraic mapping, converting trajectories into a fixed-length, physically additive representation. (2) Local Metric Discretization: this representation is hierarchically discretized into schemas and whitened residuals, yielding approximately isotropic local charts that are statistically aligned with the semantic metric. By resolving the structural mismatch at both the global and local levels, LAST enables VLA models with superior convergence and generalizability.
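
The two stages can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract gives no algorithmic detail, so everything here is an assumption, including the restriction to rotations in SO(3), the function names (`so3_log`, `linearize`, `whiten_residuals`), the nearest-neighbour codebook lookup standing in for "schemas", and the single global whitening matrix.

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix of a 3-vector (an element of so(3))."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def so3_exp(w):
    """Rodrigues formula: so(3) vector -> rotation matrix."""
    theta = np.linalg.norm(w)
    K = hat(w)
    if theta < 1e-8:
        return np.eye(3) + K  # first-order approximation near identity
    return (np.eye(3) + np.sin(theta) / theta * K
            + (1.0 - np.cos(theta)) / theta**2 * (K @ K))

def so3_log(R):
    """Rotation matrix -> so(3) vector (valid for angles below pi)."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-8:
        return np.zeros(3)
    v = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta / (2.0 * np.sin(theta)) * v

def linearize(rotations):
    """Stage 1 (illustrative): map each relative rotation to so(3)
    and concatenate the increments into one flat vector."""
    steps = [so3_log(Ra.T @ Rb)
             for Ra, Rb in zip(rotations[:-1], rotations[1:])]
    return np.concatenate(steps)

def whiten_residuals(X, codebook):
    """Stage 2 (illustrative): assign each row of X to its nearest
    code, then whiten the residuals so their covariance is ~identity."""
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = d2.argmin(axis=1)
    residuals = X - codebook[codes]
    centered = residuals - residuals.mean(axis=0)
    cov = centered.T @ centered / len(X)
    evals, evecs = np.linalg.eigh(cov)
    # ZCA whitening matrix: cov^(-1/2), with a small floor for stability
    W = evecs @ np.diag(1.0 / np.sqrt(evals + 1e-8)) @ evecs.T
    return codes, centered @ W
```

Here `linearize` takes a list of 3x3 rotation matrices and returns a vector of length 3·(T−1), and `whiten_residuals` takes an (N, D) array plus a (K, D) codebook. A per-code whitening matrix would be closer to the "local charts" the abstract describes; a single global matrix is used only to keep the sketch short.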

Paper Details:

Authors: Huaihai Lyu, Chaofan Chen, Yuheng Ji, Xiansheng Chen, Pengwei Wang, Shanghang Zhang, Changsheng Xu
Subject: Computer Vision and Pattern Recognition (cs.CV)
arXiv ID: 2606.11221
