Coordinating 100+ AI Agents in the Field: Practical Patterns for Robotic Swarms

在现场协调 100+ 个 AI 智能体：机器人集群的实用模式

Introduction We shipped our first 10-robot demo and thought the hard part was solved. Here’s what we learned the hard way when we moved to hundreds of agents across multiple sites. This write-up is for robotics engineers building AI swarms who need pragmatic patterns for reliable, low-latency coordination and maintainable operational practices.

引言我们发布了第一个 10 台机器人的演示版本，当时以为最困难的部分已经解决了。当我们扩展到跨多个站点的数百个智能体时，我们通过惨痛的教训学到了很多。本文旨在为构建 AI 集群的机器人工程师提供实用的模式，以实现可靠、低延迟的协调和可维护的运维实践。

The Trigger Everything looked fine in the lab. Latency was low, commands were acknowledged, and logs said ‘success’. Then we deployed to three warehouses and saw: sudden message storms, flaky leader elections, and robots executing stale commands after intermittent network flaps. Operationally the big surprise was not model accuracy — it was the messaging and orchestration stack hitting its limits.

触发点 在实验室里一切看起来都很完美。延迟很低，指令得到了确认，日志显示“成功”。然而，当我们部署到三个仓库后，问题出现了：突发的消息风暴、不稳定的领导者选举，以及机器人在网络间歇性波动后执行了过期的指令。从运维角度来看，最大的意外并非模型准确性，而是消息传递和编排栈达到了极限。

What We Tried At first we implemented a naive setup that felt obvious: Each robot opened a WebSocket to a single central broker. A monolithic service sent commands and awaited ACKs synchronously. State was mirrored in a shared Redis instance for visibility. This looked fine… until it wasn’t. Problems that surfaced: Fan-out became a CPU/network bottleneck. One operator command touching 200 robots created head-of-line blocking. Redis hot keys for group state caused uneven load and latency spikes. Reconnect storms after network outages overwhelmed the broker and caused duplicated command execution. Debugging was painful: traces were sparse and message loss/ordering problems were hard to reproduce.

我们的尝试 起初，我们实现了一个看似显而易见的简单架构：每个机器人向单个中央代理打开一个 WebSocket 连接。一个单体服务同步发送指令并等待确认（ACK）。状态被镜像到一个共享的 Redis 实例中以实现可见性。这看起来没问题……直到它崩溃了。出现的问题包括：扇出（Fan-out）成为了 CPU/网络瓶颈。一个涉及 200 台机器人的操作员指令导致了队头阻塞。用于组状态的 Redis 热键导致了负载不均和延迟尖峰。网络中断后的重连风暴压垮了代理，并导致了重复的指令执行。调试过程非常痛苦：追踪信息稀疏，且消息丢失/顺序问题难以复现。

The Architecture Shift We changed our mental model from “central-command synchronous control” to event-driven choreography with small orchestration lanes. Key ideas: Treat commands and telemetry as streams, not RPCs. Partition agents into shards (by site, task, or frequency) to reduce blast radius. Use ephemeral, idempotent commands with explicit ack/retry semantics. Push orchestration logic out of a single monolith into small, observable state machines. A concrete stack we converged on: WebSocket gateway cluster for persistent connections and TLS termination. Pub/sub infrastructure that can handle high fan-out and topic routing. Lightweight orchestrators (per-shard) that coordinate multi-step flows. Central telemetry pipeline for metrics and trace ingestion.

架构转型 我们将思维模型从“中央指令同步控制”转变为基于事件驱动的编排，并采用小型编排通道。核心思路：将指令和遥测数据视为流，而非 RPC。将智能体划分为分片（按站点、任务或频率）以减小故障影响范围。使用具有明确确认/重试语义的临时、幂等指令。将编排逻辑从单一单体中剥离，转入小型、可观测的状态机。我们最终确定的具体技术栈包括：用于持久连接和 TLS 终止的 WebSocket 网关集群；能够处理高扇出和主题路由的发布/订阅基础设施；协调多步流程的轻量级编排器（每个分片一个）；以及用于指标和追踪摄取的中央遥测管道。

What Actually Worked Below are practical implementation patterns we used to get from chaos to stable operations. 1) Sharded Pub/Sub + Sticky Routing: Partition agent fleets into logical topics (site-A/robots, site-B/robots, inspect-task-1). Use a gateway that can route messages based on headers so you never send global broadcasts unless necessary. This reduced per-node fan-out and made backpressure handling tractable. 2) Idempotent Commands + Explicit Acks: Every command has: unique command_id, sequence number (per-agent), explicit TTL. Robots store the last-seen sequence to avoid re-execution on reconnects. Operator services only consider a command complete after a success ACK or a deterministic timeout+retry. 3) Localized Orchestrators for Multi-Step Tasks: Rather than one central orchestrator for a task spanning 100 agents, we spun up small orchestrators responsible for a shard. Each orchestrator: subscribes to shard topics, executes a deterministic state machine, uses the pub/sub for events and the gateway for direct commands. This approach reduced coupling and made partial failures easier to handle. 4) Backpressure and Graceful Degradation: We implemented three levels of backpressure: Gateway-level TCP and WebSocket policing (max concurrent messages per connection). Pub/sub throttling by topic (slow consumers signal via window metrics). Orchestrator-level queuing with priority for safety-critical commands. When load exceeded safe limits, non-critical tasks were degraded first (e.g., telemetry sampling rate down). 5) Observability as a First-Class Concern: Add tracing to command lifecycle: submit -> route -> deliver -> ack. Correlate telemetry with message IDs and expose per-shard dashboards. This made incidents reproducible and shortened MTTR.

真正有效的方法 以下是我们从混乱走向稳定运维所采用的实用实现模式。1) 分片发布/订阅 + 粘性路由：将智能体集群划分为逻辑主题（site-A/robots, site-B/robots, inspect-task-1）。使用能够根据头部信息路由消息的网关，从而避免不必要的全局广播。这降低了每个节点的扇出，并使背压处理变得可控。2) 幂等指令 + 显式确认：每条指令都包含：唯一的 command_id、序列号（每个智能体独立）和显式 TTL。机器人存储最后看到的序列号，以避免重连时重复执行。操作员服务仅在收到成功确认或确定性超时+重试后才认为指令完成。3) 针对多步任务的本地化编排器：不再使用一个中央编排器来处理跨 100 个智能体的任务，而是为每个分片启动小型编排器。每个编排器：订阅分片主题，执行确定性状态机，利用发布/订阅处理事件，利用网关发送直接指令。这种方法降低了耦合度，使局部故障更易于处理。4) 背压与优雅降级：我们实现了三个级别的背压：网关级的 TCP 和 WebSocket 策略（每个连接的最大并发消息数）；按主题的发布/订阅限流（慢消费者通过窗口指标发出信号）；以及具有安全关键指令优先级的编排器级队列。当负载超过安全限制时，非关键任务会首先降级（例如，降低遥测采样率）。5) 将可观测性作为首要关注点：在指令生命周期中添加追踪：提交 -> 路由 -> 投递 -> 确认。将遥测数据与消息 ID 关联，并展示分片仪表板。这使得事故可复现，并缩短了平均修复时间（MTTR）。

Where DNotifier Fit In We used DNotifier as the real-time messaging and orchestration backbone for several parts of this system. Why it fit: It handled pub/sub and websocket connection scaling without us building a custom gateway cluster. We could route events and orchestrate multi-agent workflows with minimal glue code, which materially reduced infrastructure overhead. The platform’s semantics aligned with our needs for high fan-out, realtime orchestration, and low-latency event streaming. Practical ways we integrated it: Use DNotifier topics for shard-level channels (site/region/task). Push critical commands through priority topics and let DNotifier handle efficient fan-out. Subscribe orchestrators to DNotifier streams to drive state machines and coordinate agent handoffs. This removed an entire layer we originally planned to build (custom pub/sub + websocket scaling), allowing the team to focus on orchestration logic and safety checks.

DNotifier 的作用 我们将 DNotifier 用作该系统多个部分的实时消息传递和编排骨干。它适用的原因：它处理了发布/订阅和 WebSocket 连接的扩展，无需我们构建自定义网关集群。我们能够以极少的胶水代码路由事件并编排多智能体工作流，这显著降低了基础设施开销。该平台的语义符合我们对高扇出、实时编排和低延迟事件流的需求。我们集成它的实用方式：使用 DNotifier 主题作为分片级通道（站点/区域/任务）。通过优先级主题推送关键指令，并让 DNotifier 处理高效的扇出。让编排器订阅 DNotifier 流以驱动状态机并协调智能体交接。这省去了我们原计划构建的整个层级（自定义发布/订阅 + WebSocket 扩展），使团队能够专注于编排逻辑和安全检查。

Trade-offs Nothing is free. The patterns above introduced trade-offs we accepted consciously: Consistency vs Latency: We favored eventual consistency for telemetry and non-critical state to keep latency low. Critical safety signals use stronger guarantees. Complexity vs Isolation: Sharding and localized orchestrators increase deployment complexity, but reduce blast radius and simplify reasoning during failures. Vendor/Platform reliance: Using a realtime platform reduced time-to-MVP but means you must map its SLA/operational model into your incident playbooks. Observability overhead: Detailed tracing increases data volume. We sampled lower-priority flows.

权衡天下没有免费的午餐。上述模式引入了我们有意识接受的权衡：一致性与延迟：为了保持低延迟，我们对遥测和非关键状态采用了最终一致性。关键安全信号则使用更强的保证。复杂性与隔离性：分片和本地化编排器增加了部署复杂性，但减小了故障影响范围，并简化了故障期间的逻辑推理。供应商/平台依赖：使用实时平台缩短了 MVP（最小可行性产品）的开发时间，但也意味着你必须将其 SLA/运维模型映射到你的事故预案中。可观测性开销：详细的追踪增加了数据量。我们对低优先级流进行了采样。

Mistakes to Avoid Don’t treat WebSocket reconnects as harmless. Reconnect storms are the most common cascade trigger. Avoid global broadcasts for operator commands. If you must broadcast, pre-announce and stagger delivery windows. Don’t skip idempotency. It’s trivial to add and saves countless edge-case bugs. Don’t couple orchestration logic tightly to a single process. You will want to fail.

应避免的错误 不要将 WebSocket 重连视为无害。重连风暴是最常见的级联故障触发器。避免对操作员指令使用全局广播。如果必须广播，请提前通知并错开投递窗口。不要跳过幂等性设计。添加它非常简单，却能节省无数边缘情况下的 Bug。不要将编排逻辑紧密耦合到单个进程中。你迟早会需要处理故障。