INFRAMIND: Infrastructure-Aware Multi-Agent Orchestration

Abstract: Existing multi-agent LLM orchestration methods, ranging from brute-force ensembles to learned routers, select models and topologies based only on task and model features; they do not consider the runtime state of the serving infrastructure. On shared GPU clusters under concurrent load, this infrastructure blindness causes systematic resource underutilization: preferred models accumulate deep request queues while equally capable alternatives sit idle. In multi-agent pipelines, where each query triggers multiple sequential model calls, the resulting delays compound across every downstream step.

Closing this gap is challenging because the relevant infrastructure signals (queue depths, KV-cache pressure, latencies) are dynamic and noisy, and they must drive three different decisions: planning, per-step routing, and scheduling. We introduce INFRAMIND, a framework that makes the entire multi-agent stack infrastructure-aware. An infra-aware planner conditions topology and role selection on real-time system load and remaining budget, biasing toward simpler graphs under congestion and richer ones at low load.
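
To make the planner's bias concrete, the following is a minimal sketch of load-conditioned topology selection. It is not the learned planner described above: it is a hand-written heuristic, and the `Topology` class, the candidate graphs, their quality scores, and the `1 / (1 - load)` latency inflation are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    name: str
    num_agents: int  # sequential model calls the graph triggers
    quality: float   # hypothetical prior quality score in [0, 1]


# Hypothetical candidate graphs, ordered from simplest to richest.
CANDIDATES = [
    Topology("single_agent", 1, 0.70),
    Topology("solver_plus_verifier", 2, 0.78),
    Topology("debate_of_three", 4, 0.85),
]


def pick_topology(load: float, remaining_budget_s: float,
                  step_latency_s: float) -> Topology:
    """Return the highest-quality graph whose estimated latency fits the budget.

    `load` is a utilization estimate in [0, 1). The per-step latency is
    inflated by 1 / (1 - load), a simple queueing-style heuristic that is
    assumed here purely for illustration.
    """
    if not 0.0 <= load < 1.0:
        raise ValueError("load must be in [0, 1)")
    inflated = step_latency_s / (1.0 - load)
    feasible = [t for t in CANDIDATES
                if t.num_agents * inflated <= remaining_budget_s]
    if not feasible:
        # Nothing fits the budget: fall back to the simplest graph.
        return CANDIDATES[0]
    return max(feasible, key=lambda t: t.quality)


if __name__ == "__main__":
    for load in (0.1, 0.5, 0.75):
        chosen = pick_topology(load, remaining_budget_s=10.0, step_latency_s=2.0)
        print(f"load={load:.2f} -> {chosen.name}")
```

With a fixed budget, rising load shrinks the set of feasible graphs, so the choice moves from the richest topology toward the single-agent one, which is the qualitative behavior the planner is described as learning.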

An infra-aware executor then observes per-model queue depths, cache utilization, and response latencies at each agent step to decide which model to call and how deeply to reason; a budget-aware scheduler further reorders each model's queue so that urgent requests are served first. We cast the problem as a hierarchical constrained MDP and solve it end-to-end via reinforcement learning, so the system learns to balance quality against latency automatically.
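
The two per-step mechanisms can be illustrated with a small sketch: a router that scores each model from its observed queue depth, cache utilization, and latency, and a scheduler that serves the request with the least slack first. Both are stand-in heuristics rather than the learned policies described above; `ModelState`, `route`, `SlackQueue`, the wait estimate, and the `latency_weight` value are hypothetical names and choices.

```python
import heapq
from dataclasses import dataclass


@dataclass
class ModelState:
    name: str
    queue_depth: int   # requests currently waiting
    cache_util: float  # KV-cache utilization in [0, 1]
    latency_s: float   # recent mean response latency per request
    quality: float     # hypothetical quality prior in [0, 1]


def route(models, latency_weight=0.05):
    """Pick the model with the best quality-minus-wait score.

    The wait estimate (queue length times latency, inflated by cache
    pressure) is an assumed heuristic, not a measured model.
    """
    def score(m):
        est_wait = (m.queue_depth + 1) * m.latency_s * (1.0 + m.cache_util)
        return m.quality - latency_weight * est_wait
    return max(models, key=score)


class SlackQueue:
    """Least-slack-first queue: the most urgent request is popped first."""

    def __init__(self):
        self._heap = []
        self._count = 0  # tie-breaker that preserves insertion order

    def push(self, request_id, deadline_s, service_s):
        # Slack is deadline - now - service time; `now` is shared by all
        # entries, so ordering by (deadline - service time) is equivalent.
        heapq.heappush(self._heap,
                       (deadline_s - service_s, self._count, request_id))
        self._count += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    congested = [
        ModelState("large", queue_depth=12, cache_util=0.9, latency_s=1.5, quality=0.90),
        ModelState("medium", queue_depth=0, cache_util=0.2, latency_s=1.0, quality=0.86),
    ]
    print(route(congested).name)

    q = SlackQueue()
    q.push("a", deadline_s=30.0, service_s=2.0)
    q.push("b", deadline_s=5.0, service_s=1.0)
    q.push("c", deadline_s=12.0, service_s=4.0)
    print([q.pop() for _ in range(len(q))])
```

Under congestion the router abandons the nominally stronger but backlogged model for the idle alternative, and the queue releases the request closest to missing its deadline regardless of arrival order.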

Across five benchmarks, INFRAMIND achieves up to 7.6 percentage points (pp) higher accuracy than the prior baseline at low load, with up to 7× lower latency, and sustains up to 99.9% SLO compliance under high load, where every baseline drops below 50%.
