COMPASS: Grounding Composition-Intent Guidance in Unified Multimodal Models
Abstract: Composition is a high-level visual intent that governs where subjects are placed and how a scene is organized, yet current unified multimodal models remain unreliable at fine-grained composition recognition and struggle to turn such intent into controllable generation.
We present COMPASS, the first unified multimodal framework that grounds composition-intent control in a single system spanning both composition perception and composition-guided generation, with a shared expert token $\tau_c$ as the central intent anchor.
On the perception side, COMPASS injects composition expertise into an MoE backbone in a minimally invasive manner and distills the inferred intent into $\tau_c$.
On the generation side, COMPASS reuses $\tau_c$ as a global conditioning signal that steers the denoising trajectory, effectively converting passive composition analysis into explicit layout control.
To support systematic instruction-following composition learning and evaluation at scale, we construct Comp-11, a large-scale dataset with an 11-class taxonomy and reasoning-augmented annotations.
Extensive experiments show that COMPASS substantially improves category-level composition understanding and delivers more composition-consistent, prompt-faithful generation than strong baselines.
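To make the role of the shared token $\tau_c$ concrete, the following is a minimal toy sketch, not the COMPASS implementation. The abstract does not specify how $\tau_c$ is computed or injected, so everything here is an assumption made for illustration: $\tau_c$ is modeled as a softmax over 11 composition classes obtained from mean-pooled perception features, and the "denoiser" is a linear update that pulls a latent toward a target projected from $\tau_c$. The names `distill_intent_token`, `denoise_step`, `class_proj`, and `cond_proj` are hypothetical.

```python
import math
import random

NUM_CLASSES = 11  # matches the 11-class taxonomy of Comp-11


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def distill_intent_token(hidden_states, class_proj):
    """Assumed perception step: mean-pool features, then softmax over classes."""
    dim = len(hidden_states[0])
    pooled = [sum(h[i] for h in hidden_states) / len(hidden_states) for i in range(dim)]
    logits = matvec(class_proj, pooled)
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def denoise_step(latent, tau_c, cond_proj, step_size=0.2):
    """Assumed generation step: move the latent toward a tau_c-dependent target."""
    target = matvec(cond_proj, tau_c)  # global conditioning signal
    return [x - step_size * (x - t) for x, t in zip(latent, target)]


# Random stand-ins for learned weights and perception features.
random.seed(0)
hidden_dim, latent_dim = 8, 4
hidden_states = [[random.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(5)]
class_proj = [[random.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(NUM_CLASSES)]
cond_proj = [[random.gauss(0, 1) for _ in range(NUM_CLASSES)] for _ in range(latent_dim)]

tau_c = distill_intent_token(hidden_states, class_proj)
latent = [random.gauss(0, 1) for _ in range(latent_dim)]
for _ in range(50):
    latent = denoise_step(latent, tau_c, cond_proj)
```

The point of the sketch is only the data flow: the same vector produced on the perception side is reused unchanged as the conditioning input on every generation step, which is the sense in which a single token can anchor both tasks.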