COMPASS: Grounding Composition-Intent Guidance in Unified Multimodal Models

Abstract: Composition is a high-level visual intent that governs where subjects are placed and how a scene is organized, yet current unified multimodal models remain unreliable at fine-grained composition recognition and struggle to turn such intent into controllable generation.

We present COMPASS, the first unified multimodal framework that grounds composition-intent control in a single system spanning both composition perception and composition-guided generation, with a shared expert token $\tau_c$ as the central intent anchor.

On the perception side, COMPASS injects composition expertise into an MoE backbone in a minimally invasive manner and distills the inferred intent into $\tau_c$.

On the generation side, COMPASS reuses $\tau_c$ as a global conditioning signal that steers the denoising trajectory, effectively converting passive composition analysis into explicit layout control.
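
To make the idea of a global conditioning signal concrete, the following is a minimal toy sketch, not the COMPASS implementation. It assumes a hypothetical linear noise predictor (`toy_denoiser`), a simplified update rule (`denoise`), and randomly initialized weights `W_x` and `W_c`; none of these names or design choices come from the paper. It only illustrates how one vector standing in for $\tau_c$, broadcast to every position, can change where an iterative denoising trajectory ends up.

```python
import numpy as np

def toy_denoiser(x, t, tau_c, W_x, W_c):
    """Hypothetical noise predictor (an assumption, not the paper's model).

    x:     (n_tokens, d) current noisy sample
    tau_c: (d_c,) intent vector standing in for the shared expert token
    """
    shift = tau_c @ W_c          # (d,) global conditioning vector
    return x @ W_x + t * shift   # the same shift is broadcast to every row

def denoise(x_T, tau_c, W_x, W_c, steps=10, step_size=0.1):
    """Simplified iterative update x <- x - step_size * eps_hat (not a real sampler)."""
    x = x_T.copy()
    for i in range(steps, 0, -1):
        t = i / steps            # normalized time, decreasing from 1.0 to 1/steps
        eps_hat = toy_denoiser(x, t, tau_c, W_x, W_c)
        x = x - step_size * eps_hat
    return x

rng = np.random.default_rng(0)
n_tokens, d, d_c = 4, 8, 3
W_x = 0.1 * rng.standard_normal((d, d))
W_c = rng.standard_normal((d_c, d))
x_T = rng.standard_normal((n_tokens, d))   # shared starting noise

tau_a = np.array([1.0, 0.0, 0.0])          # two illustrative intent vectors
tau_b = np.array([0.0, 1.0, 0.0])

out_a = denoise(x_T, tau_a, W_x, W_c)
out_b = denoise(x_T, tau_b, W_x, W_c)
print(np.abs(out_a - out_b).max() > 0)     # same noise, different endpoint
```

Starting from identical noise, the two intent vectors lead to different final samples, which is the sense in which a global signal "steers" the trajectory. How COMPASS actually injects $\tau_c$ into its generator is not specified in this abstract.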

To support systematic, large-scale learning and evaluation of instruction-following composition, we construct Comp-11, a large-scale dataset with an 11-class taxonomy and reasoning-augmented annotations.

Extensive experiments show that COMPASS substantially improves category-level composition understanding and delivers more composition-consistent, prompt-faithful generation than strong baselines.
