We Built a Routing Layer to Cut Our AI Costs. It Broke the Product.

Cost-optimization routing layers are a Pareto trap. The bill drops. The product breaks. Most teams take three months to notice.

A team I was working with cut their AI inference bill by more than half last quarter. Eight weeks of clean engineering work. It was the win the engineering team had been chasing all year. It was also the wrong optimization.

Three months later, customer satisfaction was dropping, churn was ticking up, and the cost savings were structurally tied to the quality loss. We had not won. We had just moved the cost somewhere we were not measuring.

This is the pattern I expect to see across production AI deployments over the next six months. The 2026 conversation around AI economics has produced a consensus playbook. Route simple queries to cheap models. Keep expensive queries on capable models. Cut the bill, keep the quality. Every CFO has seen the math. Every engineering team has built it or is building it. The math is real. The Pareto trap is also real.

The piece below is what I told the team after we ran the post-mortem. It describes the architecture they built, the failure mode they walked into, the detection methodology that would have caught it earlier, and the architectural pattern they should have built instead. It also covers two other deployments I audited after this one, in which the same pattern appeared in different industries. The combined evidence is that cost-optimization routing layers, in the shape the consensus playbook prescribes, are structurally fragile in production.

What we built

The team operated a customer support AI agent for a SaaS product with roughly 4 million monthly active users. The agent ran on a single capable model, the highest-tier reasoning model in their stack at the time of the build. Inference volume was high enough that the monthly bill from their model provider had grown into six figures and was tracking upward as adoption scaled.

The routing layer was conceptually clean. A small classifier model, custom-trained on roughly 200,000 historical customer-support queries with quality labels, sat in front of the main agent and labeled each incoming query as either “simple” or “complex.” Simple queries were routed to a cheaper model in the same provider family. Complex queries continued to route to the capable model.
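The dispatch logic described above can be sketched in a few lines. This is a minimal illustration, not the team's code: the model identifiers, the `RoutingDecision` type, and the keyword-based `toy_classifier` are all hypothetical stand-ins (the real classifier was a fine-tuned encoder, which is not reproduced here).

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical model identifiers; the article does not name the actual models.
CHEAP_MODEL = "cheap-tier-model"
CAPABLE_MODEL = "capable-tier-model"


@dataclass(frozen=True)
class RoutingDecision:
    label: str  # "simple" or "complex", as emitted by the classifier
    model: str  # which downstream model will answer the query


def route(query: str, classify: Callable[[str], str]) -> RoutingDecision:
    """Label the query with the classifier, then pick the model tier."""
    label = classify(query)
    model = CHEAP_MODEL if label == "simple" else CAPABLE_MODEL
    return RoutingDecision(label=label, model=model)


def toy_classifier(query: str) -> str:
    """Stand-in for the fine-tuned encoder: a keyword rule, for illustration only."""
    simple_markers = ("password reset", "order tracking", "billing status")
    lowered = query.lower()
    return "simple" if any(m in lowered for m in simple_markers) else "complex"


if __name__ == "__main__":
    print(route("I need a password reset", toy_classifier))
    print(route("I want to dispute a refund", toy_classifier))
```

Note that in this sketch anything the classifier does not label "simple" goes to the capable model, so the cheap tier is only reached by an explicit positive label.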

The classifier itself was a fine-tuned encoder, light enough to run in under 30 milliseconds with negligible cost overhead. The classification taxonomy was built from production observation. Simple queries were the ones the team had seen over and over: account lookups, billing status questions, password resets, order tracking, and hours-of-operation questions. Complex queries were the ones that had historically required nuanced, multi-step reasoning: refund disputes, plan-change trade-offs, integration troubleshooting, and billing-cycle anomalies.

The split was about 65 percent simple and 35 percent complex across a representative week of production traffic. The cheaper model the team selected cost about a quarter as much per token as the capable model. For the simple queries the classifier sent to it, side-by-side evaluation against the capable model showed equivalent answer quality on 94 percent of a 5,000-query holdout set. The 6 percent gap was visible, but the team judged it acceptable given the cost reduction.
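A back-of-the-envelope calculation shows what these figures imply. The sketch below assumes that every query consumes the same number of tokens regardless of tier, and that the holdout gap carries over unchanged to production; neither assumption is stated in the article, and the function name is illustrative. The realized bill depends on the actual token mix per tier, so this is an estimate rather than a reconstruction of the team's invoice.

```python
def blended_cost_ratio(simple_share: float, cheap_cost_ratio: float) -> float:
    """Bill after routing, as a fraction of the bill before routing.

    Assumes (hypothetically) equal token volume per query in both tiers.
    """
    complex_share = 1.0 - simple_share
    return simple_share * cheap_cost_ratio + complex_share * 1.0


# 65 percent of queries go to a model at a quarter of the per-token cost.
ratio = blended_cost_ratio(simple_share=0.65, cheap_cost_ratio=0.25)
print(f"bill after routing: {ratio:.1%} of the original")

# Share of all traffic where the cheap model's answer was not judged
# equivalent, if the 6 percent holdout gap held in production.
degraded_share = 0.65 * 0.06
print(f"traffic with a non-equivalent answer: {degraded_share:.1%}")
```

Under these assumptions the routed bill comes to roughly half of the original, and just under 4 percent of all traffic receives an answer that the side-by-side evaluation would not have judged equivalent.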

They monitored the cheaper model’s quality through their existing evaluation pipeline, which sampled production responses for human review at roughly half a percent of traffic. The build took eight weeks, with three engineers and one ML practitioner on partial allocation. They added schema validation between the classifier and the downstream models, instrumentation on the routing decision, and a fallback path in case the classifier itself failed.
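The three safeguards mentioned here (schema validation, instrumentation, and a fallback path) might look something like the following sketch. The article does not say where the fallback routed traffic; sending it to the capable tier is an assumption made here, as are the names `safe_label`, `VALID_LABELS`, and `FALLBACK_LABEL`.

```python
import logging
from typing import Callable

logger = logging.getLogger("routing")

VALID_LABELS = {"simple", "complex"}
# Assumption: on any classifier problem, fall back to the capable tier.
FALLBACK_LABEL = "complex"


def safe_label(query: str, classify: Callable[[str], object]) -> str:
    """Return a validated routing label, falling back if the classifier misbehaves."""
    try:
        raw = classify(query)
    except Exception:
        # Fallback path: the classifier itself failed.
        logger.exception("classifier failed; using fallback label")
        return FALLBACK_LABEL

    # Schema check: downstream routing only understands the two known labels.
    if not isinstance(raw, str) or raw not in VALID_LABELS:
        logger.warning("invalid classifier output %r; using fallback label", raw)
        return FALLBACK_LABEL

    # Instrumentation: record every routing decision for later analysis.
    logger.info("routing decision label=%s", raw)
    return raw
```

Falling back to the more capable tier trades cost for safety: a classifier outage raises the bill temporarily instead of sending unvetted queries to the cheaper model.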

The deployment was gradual. Five percent of traffic for the first week, then ten, then twenty-five, then fifty, then full rollout over six weeks. Each rollout step held quality metrics in the green range. Latency stayed within their existing target. Cost decreased in line with the routing share. By the end of week eight, the monthly inference bill had dropped to roughly 40 percent of its previous level.
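One common way to implement a staged rollout like this is deterministic hash bucketing, sketched below. The article does not say how the team split traffic; bucketing by user ID, and the helper names `bucket` and `in_rollout`, are assumptions made for illustration.

```python
import hashlib

# Rollout stages from the article, as percent of traffic sent through the router.
ROLLOUT_STAGES = [5, 10, 25, 50, 100]


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def in_rollout(user_id: str, percent: int) -> bool:
    """True if this user is routed through the new layer at the given stage."""
    return bucket(user_id) < percent


if __name__ == "__main__":
    for stage in ROLLOUT_STAGES:
        enrolled = sum(in_rollout(f"user-{i}", stage) for i in range(10_000))
        print(f"stage {stage:>3}%: {enrolled} of 10000 users enrolled")
```

Because each user's bucket is fixed, anyone enrolled at the 5 percent stage stays enrolled at every later stage, which keeps per-user experience consistent and makes stage-over-stage comparisons cleaner.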

The engineering team presented the work at the company’s all-hands. The CFO sent a thank-you note to the AI team. Adoption metrics inside the agent stayed flat to slightly positive. The team moved on to the next quarterly priority. The work was solid. The architecture was reasonable. The monitoring was in place. The team had done what every recent piece on AI cost optimization had recommended. Each individual decision was defensible.

The combined system, however, had created a quality gap that the existing measurement architecture could not see. That gap took three months to surface in business metrics and another month to be correctly attributed. By the time they understood what was happening, four months had elapsed, and the customer impact had already landed.

What we measured (and what we did not)

The team’s evaluation architecture before the routing layer was built on the assumption that they were running a single model. The quality signal came from three sources. A daily human-review sample of about 200 responses, scored for accuracy and helpfulness. An offline regression suite of approximately 12,000 labeled queries, run weekly against the production model. And a satisfaction signal from the agent’s in-product feedback widget, where users could rate responses with a thumbs-up or thumbs-down.

When the routing layer went live, the team extended the human-review sample to cover the routed traffic, keeping the same total of about 200 daily reviews, but did not separate it by routing tier. They added the cheaper model to the offline regression suite, where it scored within their acceptance threshold. They left the in-product…