Anthropic prompt caching cut our RCA cost by 90%
Originally published at theculprit.ai/blog/anthropic-prompt-caching-90-percent.

LLM costs in production scale faster than the demo-phase bill suggests they will. The shape of the problem: you ship a feature that calls Claude on every meaningful event. The first month the bill is a rounding error and nobody looks at it. The second month a customer’s traffic ramps and the line item is suddenly five percent of revenue. The third month your finance person sends a polite Slack about whether this is “a real cost trend or a one-time spike,” and everyone on the engineering team has to defend an architecture decision they made eight weeks ago when the bill was a rounding error.
You can reduce this. Not by being clever about how you call the model — by being clever about what’s constant across your calls. Anthropic’s prompt caching, in our case, takes the per-RCA input cost from full-rate to one-tenth of full-rate on a 90%+ cache-hit rate. That’s not a hypothetical; it’s what we measure in production, and the math is simple enough to walk through here so you can run the numbers on your own pipeline.
The pricing structure
Anthropic publishes four price points per model. For Claude Haiku 4.5, the model we run as the default for incident root-cause analysis, those points are (verified from the Anthropic API docs):
| Token category | Haiku 4.5 |
|---|---|
| Base input | $1.00 per million tokens |
| Cache write (5-minute TTL) | $1.25 per million tokens |
| Cache read | $0.10 per million tokens |
| Output | $5.00 per million tokens |
Two things to read from that table:
- Cache read is 10x cheaper than base input. Same tokens in the request body, ten percent of the cost, if you can get them into the cache.
- Cache write is 25% more expensive than base input. The first time you send a cached segment, you pay a small premium so the next request can pay the discount. The math only pays off if you call the model with the same cached segment more than about 1.3 times on average within the 5-minute TTL window (the exact break-even is worked out below).

That second point is the one most teams miss. If your call pattern is “one-shot, cold cache every time,” prompt caching makes you slightly worse off. The win comes from repeatable structure across calls.
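To see where that ~1.3 comes from, here is the break-even arithmetic with the Haiku 4.5 rates from the table. It's a back-of-the-envelope sketch; the segment size cancels out, so everything is in dollars per million tokens of the cached segment.

```python
# Input cost for one cached segment sent N times within the 5-minute TTL,
# versus sending the same tokens uncached every time. Prices are $/MTok.
BASE_INPUT = 1.00    # uncached input
CACHE_WRITE = 1.25   # first request writes the cache
CACHE_READ = 0.10    # every subsequent request reads it

def cached(n_calls: int) -> float:
    """One cache write, then n_calls - 1 cache reads."""
    return CACHE_WRITE + CACHE_READ * (n_calls - 1)

def uncached(n_calls: int) -> float:
    return BASE_INPUT * n_calls

# Break-even: 1.25 + 0.10 * (N - 1) = 1.00 * N  =>  N = 1.15 / 0.90 ≈ 1.28
for n in (1, 2, 20):
    print(f"{n:>2} calls: cached ${cached(n):.2f} vs uncached ${uncached(n):.2f}")
#  1 calls: cached $1.25 vs uncached $1.00   <- cold cache, caching loses
#  2 calls: cached $1.35 vs uncached $2.00   <- caching already wins
# 20 calls: cached $3.15 vs uncached $20.00
```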
What’s actually cacheable in an RCA call
A typical RCA call has five sources of tokens:
1. System prompt. Defines the role (“you are an SRE analyzing an incident”), the JSON schema for the response, and any guardrails. Identical across every call across every tenant. Maybe 800-1500 tokens depending on how rigorous your schema is.
2. Retrieval context (“here are 3 prior incidents from this same service that resolved similarly”). Static for a few minutes within a Batch run on one tenant + service. Maybe 400-800 tokens depending on how aggressive the retrieval is.
3. Per-incident events (“event 1 at 14:32:01: ConnectionPoolExhausted…; event 2 at 14:32:04: …”). Unique to the incident under analysis. Cannot be cached across incidents. Typically 1500-3000 tokens.
4. Per-incident metadata (incident ID, service ID, severity). Tiny but unique.
5. Output tokens. The model’s response. Cost is fixed at the output rate; caching doesn’t apply.
Sources 1 and 2 are cacheable. Sources 3 and 4 are not. Source 5 is irrelevant. In our distribution, sources 1 and 2 are roughly 70-80% of the input tokens for a typical RCA call. Cache them at $0.10 per million; pay full rate on the remaining 20-30%; total input cost drops by about 60-70% from the naive baseline. The “90%” headline number rounds up because we measure cache hits, not total cost, and within the cached portion the savings really are 90%.
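Plugging in the midpoint of that range, and assuming a steady state where the 90%+ hit rate amortizes cache writes down to noise, the blended input price works out like this (the 75% cacheable share is an illustrative assumption, not a measured constant):

```python
# Blended input cost per million tokens, assuming 75% of input tokens sit in
# cached segments (midpoint of the 70-80% share above) and cache writes are
# negligible at a 90%+ hit rate.
BASE_INPUT = 1.00       # $/MTok
CACHE_READ = 0.10       # $/MTok
cacheable_share = 0.75  # illustrative assumption

blended = cacheable_share * CACHE_READ + (1 - cacheable_share) * BASE_INPUT
print(f"blended input price: ${blended:.3f}/MTok")            # $0.325/MTok
print(f"savings vs naive:    {1 - blended / BASE_INPUT:.1%}")  # 67.5%
```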
The two-segment trick
Anthropic’s API takes a cache_control marker per segment in your system array. Each marker is a cache breakpoint: the cache stores the prefix of the prompt up to that marker. Mark two segments and you get two cached prefixes, one ending at the first marker and one ending at the second, each with its own lifetime.
Why two segments instead of one? Because the cache lifetime for those two pieces is different. The system prompt almost never changes: every RCA call across every tenant hits it, a cache read essentially every time after the first call. The retrieval context (prior similar incidents for this service) changes whenever a new incident on that service resolves and shifts the top-K; within a single Batch run on one tenant + service, repeats hit the cache, but across tenants, never. If you stuff both into one combined segment, the system prompt inherits the retrieval context’s lifetime: the combined segment hashes differently for every tenant and re-hashes whenever any top-K shifts, so the tokens that are literally identical everywhere stop being a shared cache hit. Two segments → independent cache lifetimes → tenant A’s churn doesn’t punish tenant B.
The order matters. Anthropic caches up to each marker, so the more-static segment must come first. If you put per-tenant retrieval first and the static system prompt second, the static prompt’s cache key now includes the per-tenant content above it; you’ve just made the most cacheable segment uncacheable across tenants.
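Here is a minimal sketch of that request shape with Anthropic’s Python SDK. The prompt strings and the model ID are placeholders standing in for the real pipeline pieces (and in practice each cached segment has to clear the API’s minimum cacheable length); what matters is the structure: two system blocks, the static one first, each with its own cache_control breakpoint, and the per-incident material kept out of both.

```python
import anthropic

# Placeholders; in the real pipeline these come from config and retrieval.
SYSTEM_PROMPT = "You are an SRE analyzing an incident. Respond with JSON matching this schema: ..."
retrieval_context = "Prior similar incidents for this service: 1) ... 2) ... 3) ..."
incident_events = "event 1 at 14:32:01: ConnectionPoolExhausted; event 2 at 14:32:04: ..."

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-haiku-4-5",  # assumed model ID; use whatever you actually run
    max_tokens=1024,
    system=[
        # Segment 1: static system prompt. Most stable, so it goes first;
        # its cache prefix must not depend on any tenant-specific content.
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        },
        # Segment 2: per-tenant/per-service retrieval context. Its own breakpoint,
        # so churn here doesn't invalidate the prefix ending at segment 1.
        {
            "type": "text",
            "text": retrieval_context,
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[
        # Per-incident events and metadata are unique per call; leave them uncached.
        {"role": "user", "content": incident_events},
    ],
)

# The usage block shows which rate you paid on this call.
u = response.usage
print(u.cache_creation_input_tokens, u.cache_read_input_tokens, u.input_tokens)
```

Those usage fields are also how you measure your own hit rate: cache_creation_input_tokens counts tokens billed at the write premium, cache_read_input_tokens counts tokens billed at the read discount, and input_tokens is whatever went through at the base rate.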
What kills the cache
In rough order of frequency:
- The 5-minute ephemeral TTL. A cached segment expires 5 minutes after its last write. If your call pattern is bursty, a long quiet period will let every cached segment expire and you’ll pay cache write on the next batch. Spread your calls if you can.
- Whitespace drift. If you concatenate the system prompt with `\n\n` in one place and `\n` in another, you have two distinct cache keys. The cache hashes the literal token sequence, not the semantic meaning. Pick one separator and lint for it.
- Trailing dynamic content. A common bug: someone adds a timestamp…