RAG Is Burning Money — I Built a Cost Control Layer to Fix It
Most RAG systems optimize for relevance, not cost. I built a production-ready cost control layer that combines semantic caching, query routing, and budget enforcement. In my benchmark it reduced LLM costs by up to 85% without sacrificing answer quality.
TL;DR: This article shows a full working implementation in pure Python, along with benchmark results from a local setup. RAG systems do not fail only on quality. They can also become cost-inefficient, often in ways that are not immediately visible. Every extra retrieved token has a cost. In my system, requests fetched 3× to 8× more context than the queries actually required.
In many baseline implementations, repeated queries are processed independently, with no reuse of previous results. In single-model setups, a large share of simple queries may be handled by high-cost models, even when lower-cost alternatives would be sufficient. With semantic caching (up to a 98.5% hit rate in a pre-seeded, warmed-cache benchmark), query routing (around 81% of requests shifted to a lower-cost model in the benchmark mix), and a token budget layer with a circuit breaker, the system achieved up to 85.8% cost reduction at 10,000 requests per day while maintaining response quality under the evaluated setup.
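The token budget layer with a circuit breaker can be pictured roughly as follows. This is a minimal sketch, not the implementation that was benchmarked: the `TokenBudget` class, its method names, and the daily limit are all hypothetical, chosen only to illustrate the idea of refusing further LLM calls once a spending ceiling is reached.

```python
class BudgetExceeded(Exception):
    """Raised when a request would push usage past the budget."""


class TokenBudget:
    """Hypothetical daily token budget with a simple circuit breaker."""

    def __init__(self, daily_limit_tokens: int):
        self.daily_limit = daily_limit_tokens
        self.used = 0
        self.open = False  # True means the breaker has tripped

    def charge(self, tokens: int) -> None:
        # Once tripped, reject everything until reset() is called.
        if self.open or self.used + tokens > self.daily_limit:
            self.open = True
            raise BudgetExceeded(
                f"budget of {self.daily_limit} tokens reached ({self.used} used)"
            )
        self.used += tokens

    def reset(self) -> None:
        # Assumed to be called by a scheduler at the start of each day.
        self.used = 0
        self.open = False
```

A caller would invoke `charge()` with the estimated token count before each LLM call and fall back to a cached or canned response when `BudgetExceeded` is raised.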
The System That Was Working Fine — And Quietly Draining Money
I built a RAG system that worked perfectly. I ran the same queries through the same pipeline and got the same outputs every time. In testing, nothing looked wrong: latency was stable and answers were correct. Then I looked at the token logs. In my setup, even simple questions such as “What is RAG?” or “Define semantic search.” were hitting the most expensive model. Every repeated query was billed in full, even when I’d answered the exact same question ten minutes earlier. Every request was retrieving ten chunks when two were doing the actual work. The system wasn’t broken. It was just financially blind. And at scale, that distinction stops mattering.
Why RAG Is Financially Blind by Design
RAG was designed to solve a retrieval quality problem. It was never designed to solve a cost problem. That’s not a criticism; it’s just a different layer of the stack. But in production, the two layers collide, and the collision is expensive. There are three specific failure modes.
Failure Mode 1: Context Window Over-Fetching
Most implementations retrieve the top 10 chunks by default, “just to be safe.” The problem: in practice, 2–3 chunks contain the answer. The other 7–8 are noise: redundant context that adds tokens without adding information. You’re paying for those tokens every time.
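One way to limit over-fetching is to trim the retrieved chunks before they reach the prompt. The sketch below is illustrative only: the `Chunk` type, the `trim_context` function, the score threshold, and the four-characters-per-token estimate are assumptions, not values taken from the benchmarked system.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # retriever similarity, higher is better


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    return max(1, len(text) // 4)


def trim_context(chunks, min_score=0.75, max_chunks=3, max_tokens=1500):
    """Keep only the highest-scoring chunks that fit the limits."""
    kept, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        # Stop at the first chunk that is too weak or once enough are kept.
        if chunk.score < min_score or len(kept) >= max_chunks:
            break
        cost = estimate_tokens(chunk.text)
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept, used
```

With ten retrieved chunks of which only two score above the threshold, `trim_context` passes two chunks to the model instead of ten. The thresholds would need tuning against answer quality on real queries.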
Failure Mode 2: No Caching Layer
Two users ask “What is RAG?” ten minutes apart, and the system produces the same embedding, retrieves the same chunks, and returns the same answer. You pay the full LLM cost twice. There is no semantic memory between requests in a standard RAG pipeline. Every query is treated as if it has never been asked before.
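A semantic cache addresses this by comparing the embedding of each new query against the embeddings of queries already answered. The following is a minimal sketch under stated assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, and `SemanticCache`, `answer_with_cache`, and the 0.9 threshold are hypothetical names and values rather than the benchmarked implementation.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    # Stand-in for a real embedding model (assumption): bag of lowercase words.
    return Counter(text.lower().replace("?", "").split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, embed=toy_embed, threshold=0.9):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def get(self, query):
        # Linear scan for the most similar stored query.
        q = self.embed(query)
        best_score, best_answer = 0.0, None
        for emb, answer in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer))


def answer_with_cache(query, cache, call_llm):
    """Return (answer, cache_hit); call_llm runs only on a miss."""
    cached = cache.get(query)
    if cached is not None:
        return cached, True
    answer = call_llm(query)
    cache.put(query, answer)
    return answer, False
```

The linear scan keeps the sketch short; a production cache would more likely use a vector index and add expiry so that stale answers are not served indefinitely.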
Failure Mode 3: No Model Routing
Some pipelines default to a single high-capability model for all queries, regardless of complexity, even when the query is “What does LLM stand for?” That question doesn’t need GPT-4.5 or Claude Opus. It needs a fast, cheap model, and it needs to finish in 200ms.
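A router can make that decision before any model is called. The sketch below uses simple keyword heuristics purely for illustration: the model names, the patterns, and the word-count cutoff are assumptions, and the router in the benchmarked system may classify queries differently.

```python
import re

CHEAP_MODEL = "small-model"      # hypothetical model names
EXPENSIVE_MODEL = "large-model"

# Short definitional questions are assumed to be safe for the cheap model.
SIMPLE_PATTERNS = re.compile(
    r"^(what is|what does|define|who is|when was)\b", re.IGNORECASE
)
# Words that hint at multi-step reasoning (assumed list).
COMPLEX_HINTS = re.compile(
    r"\b(compare|trade-?offs?|why|explain how|step by step|analy[sz]e)\b",
    re.IGNORECASE,
)


def route(query: str, max_simple_words: int = 12) -> str:
    """Pick a model name for the query; default to the expensive model."""
    q = query.strip()
    if COMPLEX_HINTS.search(q):
        return EXPENSIVE_MODEL
    if len(q.split()) <= max_simple_words and SIMPLE_PATTERNS.match(q):
        return CHEAP_MODEL
    return EXPENSIVE_MODEL
```

The router falls back to the expensive model whenever it is unsure, which trades some savings for a lower risk of degrading answers on queries it misclassifies.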