Orchestra-o1: Omnimodal Agent Orchestration

Abstract: The recent success of agent swarms has shifted the paradigm of large language model (LLM)-based agents from single-agent workflows to multi-agent systems, highlighting the importance of agent orchestration for task decomposition and collaboration.

However, existing orchestration frameworks are limited to a narrow set of modalities and struggle to generalize to more complex settings where heterogeneous modalities coexist and interact. This limitation becomes particularly pronounced in omnimodal scenarios, where tasks require the unified understanding and coordination of diverse inputs such as text, image, audio, and video.

In this work, we propose Orchestra-o1, an omnimodal agent orchestration framework designed to support efficient agent collaboration across multiple modalities. Orchestra-o1 introduces a unified orchestration mechanism that enables modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution.
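
The three components named above can be illustrated with a minimal sketch. It is not the Orchestra-o1 implementation: every name here (`SubTask`, `decompose`, `specialize`, `orchestrate`) is hypothetical, the sub-agents are stand-in functions rather than LLM calls, and a thread pool is assumed as the parallel executor.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical names throughout; this is not the Orchestra-o1 API.
MODALITIES = ("text", "image", "audio", "video")


@dataclass(frozen=True)
class SubTask:
    modality: str
    payload: str


def decompose(inputs):
    """Modality-aware decomposition: one sub-task per modality present."""
    return [SubTask(m, inputs[m]) for m in MODALITIES if m in inputs]


def specialize(modality):
    """Build a sub-agent on demand for one modality (stand-in for an LLM call)."""
    def agent(task):
        return f"[{modality}-agent] handled {task.payload}"
    return agent


def orchestrate(inputs):
    """Decompose, specialize one sub-agent per sub-task, and run them in parallel."""
    tasks = decompose(inputs)
    if not tasks:
        return []
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(specialize(t.modality), t) for t in tasks]
        # Collect in submission order so results line up with the sub-tasks.
        return [f.result() for f in futures]
```

Calling `orchestrate({"audio": "clip.wav", "text": "question"})` runs a text sub-agent and an audio sub-agent concurrently and returns their outputs in the fixed modality order.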

This scalable design allows agent systems to effectively tackle complex real-world tasks involving heterogeneous information sources. On the OmniGAIA benchmark, Orchestra-o1 surpasses the second-best approach by 10.3% in accuracy.

Furthermore, we introduce decision-aligned group relative policy optimization (DA-GRPO), an efficient agentic reinforcement learning approach for training Orchestra-o1-8B. The resulting model achieves state-of-the-art performance among all existing open-source omnimodal agents.
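
For context, the group-relative advantage used by standard GRPO can be sketched as follows. This shows only the generic baseline step, in which each sampled rollout's reward is normalized against the other rollouts for the same prompt. The abstract does not specify the decision-aligned modification, so it is not represented here. The function name, the `eps` term, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantages: (r_i - group mean) / group std.

    `rewards` holds one scalar reward per rollout sampled for the same prompt.
    `eps` (an assumption) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Rollouts that score above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed to form the baseline.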
