Tail Control: The Counterintuitive Engineering of Reliable Agentic Workflows
Behind a customer’s API, a high-quality answer isn’t enough. It has to be usable, which means on time. Delivering that consistently is a problem of variance, not speed, and the fixes are counterintuitive.
Run an LLM workflow inside your own company and almost any failure is cheap: you retry, fall back, or simply ignore it. Put that same workflow behind a customer’s API or MCP server and the grace is gone. Now only one thing matters: did the customer get a correct, usable result? Their process depends on yours delivering one. They, not you, now decide what counts as delivered.
At Databook we process billions of tokens for the world’s largest enterprises; this article is based on real data from production flows at scale. I hope it offers you some useful insights.
Delivering that result is harder than it looks, because LLMs are notoriously unreliable. They fail frequently, in four flavors: an invalid answer (empty, unparseable, or simply wrong), a hard error, no answer at all, or no answer in time. And the whole run succeeds only if every step does, so the more steps you chain together, the more chances there are for one of them to fail. A workflow of individually excellent steps can still come out a coin flip.
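The compounding is easy to check with a few lines of arithmetic. The sketch below assumes every step fails independently and with the same success rate; the 97% figure is purely illustrative and is not a number from the production data mentioned above.

```python
import math


def workflow_success_rate(step_success_rate: float, steps: int) -> float:
    """Chance that every step succeeds, assuming independent, identical steps."""
    return step_success_rate ** steps


def steps_until_coin_flip(step_success_rate: float) -> int:
    """Smallest chain length at which the whole run succeeds at most half the time."""
    return math.ceil(math.log(0.5) / math.log(step_success_rate))


if __name__ == "__main__":
    # Illustrative assumption: each step succeeds 97% of the time.
    for steps in (1, 5, 10, 20):
        rate = workflow_success_rate(0.97, steps)
        print(f"{steps:>2} steps: {rate:.1%} of runs succeed")
    print("coin flip at", steps_until_coin_flip(0.97), "steps")
```

Under these assumptions a step that looks excellent on its own drags the whole run below even odds once a couple of dozen of them are chained.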
Inside your own company you can absorb every one of these, because you have slack on every axis: retry the failed step, wait out the slow one, spend a little more, relax the bar if you must. Put the same workflow behind a customer’s API and the slack vanishes, because the run now has to clear three resource budgets at the same time, none of which you set:
Time: a window that closes whether or not you’re done. It may be a hard gateway timeout (one to three minutes, sometimes five) that severs the connection mid-run, or something softer: an SLA, a caller blocked on the result, a process that can only wait so long. And the work doesn’t resume: when the window closes, the customer just retries, starting the whole run over from zero.
Cost: now a margin, not a pool. Every run carries a price the customer has already paid, so it has to come back profitable, not merely affordable. And the customer, not you, decides how often it runs.
Tokens and rate: a per-minute token budget (TPM) that you share across every customer at once, and customers tend to call in the same bursts. You hit the ceiling exactly when load is heaviest, which is exactly when latency is worst.
Under all three sits a hard floor you never trade beneath: quality. The answer has to be right to count at all. A fast, cheap, on-time answer that’s wrong is still a failure. Quality isn’t a budget you spend down.
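One way to keep the distinction between the budgets and the floor honest is to encode it in the delivery check itself. The following is a minimal sketch, not a description of any real system: the `Budgets` and `RunResult` types, their field names, and the idea of a single boolean quality check are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budgets:
    time_s: float    # window before the caller gives up
    cost_usd: float  # most a run may cost and still be profitable
    tokens: int      # this run's share of the per-minute token quota


@dataclass(frozen=True)
class RunResult:
    elapsed_s: float
    cost_usd: float
    tokens_used: int
    passed_quality_check: bool


def is_delivered(run: RunResult, budgets: Budgets) -> bool:
    """A run counts only if it clears the quality floor and all three budgets."""
    if not run.passed_quality_check:
        return False  # the floor comes first and is never traded away
    return (
        run.elapsed_s <= budgets.time_s
        and run.cost_usd <= budgets.cost_usd
        and run.tokens_used <= budgets.tokens
    )
```

Quality is a precondition here rather than a fourth field in `Budgets`, so nothing downstream can spend it to buy time, cost, or tokens.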
The bind is that the three budgets apply together and pull against each other, so the obvious fix for one spends another. Wait out a slow step and you blow the time window. Race a second copy to beat the clock and you burn cost and quota. Reach for a stronger model to clear the quality floor and you get slower. None of the budgets are yours to loosen, so the only move left is to trade deliberately across all of them at once, without ever dropping below the floor.
That is what makes a customer-facing workflow a genuinely different thing to build, and it sometimes forces a playbook that, from the inside, looks totally backwards:
Kill a call that hasn’t failed.
Fire a duplicate of a call you’re already paying for.
Drop to a weaker model on purpose.
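The three moves can be combined in one small wrapper. This is a sketch under stated assumptions, not the author’s implementation: `call_with_tail_control`, `hedge_after`, and `give_up_after` are hypothetical names, the model calls are passed in as plain async callables, and both thresholds are measured in seconds from the start of the call.

```python
import asyncio
from typing import Awaitable, Callable

ModelCall = Callable[[], Awaitable[str]]


async def call_with_tail_control(
    primary: ModelCall,
    fallback: ModelCall,
    hedge_after: float,
    give_up_after: float,
) -> str:
    """Hedge a slow call, cancel stragglers, then fall back to a weaker model."""
    loop = asyncio.get_running_loop()
    start = loop.time()
    tasks = {asyncio.create_task(primary())}
    try:
        # Give the first attempt a head start before paying for a duplicate.
        done, _ = await asyncio.wait(tasks, timeout=hedge_after)
        if not done:
            # Fire a duplicate of a call that is still running.
            tasks.add(asyncio.create_task(primary()))
            remaining = max(0.0, give_up_after - (loop.time() - start))
            done, _ = await asyncio.wait(
                tasks, timeout=remaining, return_when=asyncio.FIRST_COMPLETED
            )
        for task in done:
            if task.exception() is None:
                return task.result()
    finally:
        # Kill whatever is still in flight, even though it has not failed.
        for task in tasks:
            task.cancel()
    # Nothing usable arrived in time (or an attempt errored):
    # drop to the weaker model on purpose.
    return await fallback()
```

For simplicity the sketch falls back as soon as any finished attempt has errored rather than waiting for the other copy, and it leaves the fallback’s own deadline to the caller. Where to set the two thresholds is a trade against the cost and quota budgets, so they are parameters rather than constants.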
Here’s the thesis, up front, because everything else serves it: once quality clears the bar, reliable delivery is a question of variance, not speed. A predictable completion time beats a fast one with a long tail, because your customers can’t run their infrastructure on your best case; they have to build for your worst.
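A toy comparison makes the thesis concrete. The latency samples below are invented for illustration, and the 120-second window is just one value from the one-to-three-minute range typical of gateway timeouts; `percentile` is a simple nearest-rank helper written for this sketch.

```python
import math
from statistics import mean


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the value that pct% of samples fall at or below."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def share_over_window(samples: list[float], window_s: float) -> float:
    """Fraction of runs that are still going when the window closes."""
    return sum(1 for s in samples if s > window_s) / len(samples)


# Invented completion times in seconds for 100 runs of two designs.
fast_long_tail = [8.0] * 95 + [150.0] * 5  # usually quick, occasionally very slow
steady = [20.0] * 100                      # slower on average, no tail

if __name__ == "__main__":
    for name, runs in (("fast_long_tail", fast_long_tail), ("steady", steady)):
        print(
            name,
            "mean:", mean(runs),
            "p99:", percentile(runs, 99),
            "over 120s:", share_over_window(runs, 120.0),
        )
```

In this made-up data the long-tailed design wins on the mean yet loses one run in twenty to the window, while the steadier design never does. The customer plans around the second number.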
What this is and isn’t: workflows, not free reasoning agents
One distinction up front, because it changes everything. This is about an agentic workflow: a known process flow with LLM-powered steps inside it, run by a deterministic orchestrator. It is not a reasoning agent that decides its own next move at runtime.
For the same task, a workflow is simply faster: it already knows the plan, skips the deliberation, and runs every independent step in parallel, so it reaches the same answer in a fraction of the time and cost a reasoning agent would take. Both have their place (reasoning agents are much more flexible), but they fail differently and you fix them differently. A reasoning agent’s problem is deciding what to do; a workflow’s problem (the one customers feel) is delivering what it already knows how to do, with quality, and in time. This article is about the latter.
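The speed advantage comes from the plan being fixed ahead of time. A minimal sketch of such an orchestrator follows, assuming each step is an async callable that returns text; `run_workflow`, `independent_steps`, and `combine` are hypothetical names introduced only for this example.

```python
import asyncio
from typing import Awaitable, Callable, Dict

Step = Callable[[], Awaitable[str]]
Combine = Callable[[Dict[str, str]], Awaitable[str]]


async def run_workflow(independent_steps: Dict[str, Step], combine: Combine) -> str:
    """Run a fixed plan: fan out the independent steps, then combine the results."""
    names = list(independent_steps)
    # The plan is known up front, so nothing has to decide what runs next:
    # every independent step starts at once and they all run concurrently.
    outputs = await asyncio.gather(*(independent_steps[name]() for name in names))
    return await combine(dict(zip(names, outputs)))
```

With this shape the fan-out stage takes roughly as long as its slowest step rather than the sum of all of them, which is also why a single slow step dominates the run.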