Arbor: Tree Search as a Cognition Layer for Autonomous Agents

Abstract: Arbor is a multi-agent framework that introduces structured tree search as a cognition layer for autonomous agents operating in large, stateful action spaces. Prior autonomous optimization systems typically operate on isolated targets with stateless evaluation. Arbor instead maintains an explicit search tree of scored hypotheses that serves as shared working memory across agents. The tree evolves with every measurement, treats failures as diagnostic signals that reshape subsequent exploration, and expands as prior successes shift the bottleneck distribution.
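
As an illustration only, the search tree of scored hypotheses described above could be sketched as follows. The abstract does not specify Arbor's data model, so the names `Hypothesis`, `open_leaves`, and `best_scored` are hypothetical, and pruning everything below a failed node is an assumed simplification of "failures reshape subsequent exploration".

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Hypothesis:
    """One tree node: a candidate optimization plus the evidence gathered for it."""
    description: str
    score: Optional[float] = None        # measured gain; None until evaluated
    failure_note: Optional[str] = None   # diagnostic recorded when a trial fails
    children: List["Hypothesis"] = field(default_factory=list)

    def expand(self, description: str) -> "Hypothesis":
        """Attach a follow-up hypothesis and return it."""
        child = Hypothesis(description)
        self.children.append(child)
        return child


def open_leaves(root: Hypothesis) -> List[Hypothesis]:
    """Return unevaluated leaves, skipping subtrees under a failed node (assumed policy)."""
    pending, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.failure_note is not None:
            continue
        if not node.children and node.score is None:
            pending.append(node)
        stack.extend(node.children)
    return pending


def best_scored(root: Hypothesis) -> Optional[Hypothesis]:
    """Return the highest-scoring evaluated node, or None if nothing was measured."""
    best, stack = None, [root]
    while stack:
        node = stack.pop()
        if node.score is not None and (best is None or node.score > best.score):
            best = node
        stack.extend(node.children)
    return best
```

Because every agent would read and write the same tree in this sketch, a recorded failure immediately changes which leaves the next agent sees as open.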

We validate Arbor on full-stack LLM inference optimization, a domain where achieving peak performance has historically required coordinated effort from engineering teams across the application, framework, compiler, kernel, and hardware stack. Arbor pairs an Orchestrator agent, which drives optimization by delegating to Domain Specialists across the inference stack, with a Critic agent that safeguards stability through root-cause analysis, introspection, and measurement validation. The result is a checks-and-balances architecture in which neither agent can unilaterally drive the system.
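
A minimal sketch of the checks-and-balances idea, under stated assumptions: the Orchestrator proposes a change, and it is committed only if a Critic-style measurement check passes. The functions `critic_approves` and `commit`, the 2% spread threshold, and the validation rule itself are hypothetical and not taken from Arbor.

```python
from statistics import mean
from typing import List


def critic_approves(baseline: float, samples: List[float], max_spread: float = 0.02) -> bool:
    """Hypothetical check: repeated throughput measurements must agree and beat the baseline."""
    if len(samples) < 2 or min(samples) <= 0:
        return False  # a single or non-positive measurement is not trusted
    avg = mean(samples)
    spread = (max(samples) - min(samples)) / avg  # relative run-to-run spread
    return spread <= max_spread and avg > baseline


def commit(accepted: List[str], proposal: str, baseline: float, samples: List[float]) -> List[str]:
    """Return the accepted-change list, extended only when the critic signs off."""
    if critic_approves(baseline, samples):
        return accepted + [proposal]
    return accepted
```

In this sketch the proposer cannot commit without validation, and the validator has nothing to commit without a proposal, which is one simple way to realize the two-party gate.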

Agent capabilities are decomposed into hard skills (domain expertise) and soft skills (coordination protocols that determine how contributions compose), enabling fully autonomous multi-day campaigns. Arbor achieves up to a 193% throughput-latency Pareto improvement in inference over vendor-optimized baselines, whereas a single agent without the harness plateaus at a 33% throughput improvement and crashes irrecoverably within hours.
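
The abstract does not define how the throughput-latency Pareto improvement is computed. One plausible reading, sketched here purely as an assumption, compares the best candidate throughput reachable at or below each baseline latency. The function names and the data used with them are illustrative and unrelated to the reported results.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (latency, throughput): lower latency and higher throughput are better


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep the points that no other point beats on both latency and throughput."""
    front, best_tput = [], float("-inf")
    for lat, tput in sorted(points, key=lambda p: (p[0], -p[1])):
        if tput > best_tput:
            front.append((lat, tput))
            best_tput = tput
    return front


def max_gain_percent(baseline: List[Point], candidate: List[Point]) -> Optional[float]:
    """Largest throughput gain (in %) at matched-or-better latency (assumed definition)."""
    gains = []
    for base_lat, base_tput in pareto_front(baseline):
        reachable = [tput for lat, tput in candidate if lat <= base_lat]
        if reachable:
            gains.append(100.0 * (max(reachable) / base_tput - 1.0))
    return max(gains) if gains else None
```

Under this reading, "up to" corresponds to taking the maximum over baseline operating points; other definitions, such as an area between frontiers, would give different numbers.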

Arbor generalizes across multiple generations of hardware platforms, and run-to-run variance stays within 2 percentage points, indicating that the method is hardware-agnostic and reproducible.