I Cut My LLM API Bill by 73% — Here's the Exact Optimization Playbook
I Cut My LLM API Bill by 73% — Here’s the Exact Optimization Playbook
我将 LLM API 账单削减了 73% —— 这是我的精确优化指南
Running LLMs in production burns cash. Fast. When your app goes from “prototype” to “actually used by people,” that API bill can go from “whatever” to “wait, that’s a mortgage payment” in about two weeks. I learned this the hard way. My knowledge base platform went from a few hundred requests to thousands per day, and my LLM bill jumped to $4,200/month. After spending three weeks optimizing, I brought it down to $1,130/month — a 73% reduction — without anyone noticing a drop in quality. Here’s the exact playbook.
在生产环境中运行大语言模型(LLM)非常烧钱,而且速度极快。当你的应用从“原型”变成“真实用户使用”时,API 账单可能会在两周内从“无所谓”变成“等等,这简直是一笔房贷”。我对此深有体会。我的知识库平台从每天几百次请求增长到数千次,LLM 账单随之飙升至每月 4,200 美元。经过三周的优化,我将其降至每月 1,130 美元——削减了 73%——且没有任何用户察觉到质量下降。以下是我的具体操作指南。
1. The Routing Layer: Right Model for the Right Job
1. 路由层:为合适的任务选择合适的模型
Most developers send everything to the biggest model. That’s like using a sledgehammer to crack a nut. The strategy: Classify requests by complexity and route accordingly.
大多数开发者习惯将所有请求都发送给最强大的模型。这就像是用大锤去砸核桃。策略是:根据复杂程度对请求进行分类,并进行相应的路由。
// Simple classification layer
function routeByComplexity(userInput: string): LLMModel {
const tokens = userInput.split(/\s+/).length;
if (tokens < 15 && !containsTechnicalTerms(userInput)) {
return 'cheap-fast-model'; // $0.15/M tokens
}
if (tokens < 100 && isStructuredQuery(userInput)) {
return 'mid-tier-model'; // $0.50/M tokens
}
return 'premium-model'; // $3.00/M tokens — only when needed
}
The impact: ~40% of our requests are simple (formatting, classification, short answers). Routing those to cheaper models saved ~$900/month alone. How to classify: Start with heuristics (token count, keyword matching). Once you have data, train a tiny classifier that costs pennies to run.
影响:我们约 40% 的请求属于简单任务(格式化、分类、简短回答)。仅将这些请求路由到更便宜的模型,每月就节省了约 900 美元。如何分类:从启发式方法(Token 计数、关键词匹配)开始。一旦有了数据,就可以训练一个运行成本极低的小型分类器。
2. Response Caching: The $600/Month Win
2. 响应缓存:每月节省 600 美元的秘诀
If a user asks “What is RAG?” and another user asks “What is RAG?” three hours later — that’s the same answer. Don’t pay twice.
如果一个用户问“什么是 RAG?”,三小时后另一个用户也问“什么是 RAG?”——答案是一样的。不要付两次钱。
import hashlib
import redis
class LLMCache:
def __init__(self):
self.redis = redis.Redis()
def get_cache_key(self, prompt: str, model: str) -> str:
raw = f"{model}:{prompt.strip().lower()}"
return f"llm:{hashlib.sha256(raw.encode()).hexdigest()[:16]}"
def get(self, prompt: str, model: str):
key = self.get_cache_key(prompt, model)
return self.redis.get(key)
def set(self, prompt: str, model: str, response: str, ttl: int = 86400):
key = self.get_cache_key(prompt, model)
self.redis.setex(key, ttl, response)
Key decisions: TTL: 24 hours for general knowledge, 1 hour for time-sensitive queries. Cache scope: Cache at the prompt level, not the response — normalize whitespace, lowercase, strip trailing punctuation. Hit rate: We achieved 35% cache hit rate on FAQ-style content. The catch: Don’t cache creative tasks (writing, brainstorming). Those need fresh outputs every time.
关键决策:TTL(生存时间):通用知识设为 24 小时,时效性查询设为 1 小时。缓存范围:在 Prompt 层面缓存,而非响应层面——对空格进行标准化、转为小写、去除末尾标点。命中率:我们在 FAQ 类内容上实现了 35% 的缓存命中率。注意:不要缓存创意类任务(写作、头脑风暴),这些任务每次都需要新鲜的输出。
3. Token Budgeting: The Silent Killer Is Output Length
3. Token 预算:隐形杀手是输出长度
Most LLM pricing charges per output token. A model that outputs 2,000 tokens when 300 would do is burning your money.
大多数 LLM 的定价是按输出 Token 收费的。如果一个模型本可以用 300 个 Token 完成任务,却输出了 2,000 个,那就是在烧你的钱。
-
Before: User: “Summarize this article” -> Model: generates 1,800 token essay -> Cost: $0.054/request
-
After: User: “Summarize this article in 3 bullet points, max 50 words each.” -> Model: generates 120 tokens -> Cost: $0.0036/request
-
优化前: 用户:“总结这篇文章” -> 模型:生成 1,800 个 Token 的文章 -> 成本:$0.054/次
-
优化后: 用户:“用 3 个要点总结这篇文章,每点不超过 50 字。” -> 模型:生成 120 个 Token -> 成本:$0.0036/次
Tactics that work: Explicit token budgets in prompts (“Answer in under 100 words”), max_tokens parameter (set hard limits), output format constraints (JSON schemas force conciseness), and temperature tuning (lower temperature 0.1-0.3 reduces rambling). This alone cut our output token count by 60%.
有效的策略:在 Prompt 中明确 Token 预算(“用 100 字以内回答”)、设置 max_tokens 参数(设定硬性限制)、约束输出格式(JSON Schema 强制简洁)、以及调整 Temperature(较低的 0.1-0.3 可以减少废话)。仅此一项就将我们的输出 Token 数量减少了 60%。
4. Prompt Compression: Shrink the Input, Shrink the Cost
4. Prompt 压缩:缩减输入,缩减成本
Your prompt tokens cost money too. If you’re sending a 5,000-token system prompt with every request, you’re paying $0.015 per call just for setup.
你的 Prompt Token 也要钱。如果你每次请求都发送 5,000 个 Token 的系统提示词,那么仅初始化成本每次就要 0.015 美元。
What I compressed: System prompts (3,200 → 800 tokens), Few-shot examples (6 → 2 examples), Context windows (only include relevant sections). The technique: Run your prompt through a cheap model first: “Condense these instructions to the minimum needed for correct execution.” Test output quality, and A/B test for a week. We cut input tokens by 55% with zero quality loss.
我的压缩方案:系统提示词(3,200 → 800 Token)、少样本示例(6 → 2 个)、上下文窗口(仅包含相关部分)。技巧:先用廉价模型处理你的 Prompt:“将这些指令压缩到正确执行所需的最低限度。”测试输出质量,并进行一周的 A/B 测试。我们在零质量损失的情况下将输入 Token 减少了 55%。
5. Batch Processing: The Async Advantage
5. 批处理:异步的优势
If your app doesn’t need real-time responses, batching is your best friend. Most providers offer significant discounts for batch API calls. Our use case: Article processing pipeline (tagging, summarizing, extracting entities).
如果你的应用不需要实时响应,批处理就是你最好的朋友。大多数供应商为批量 API 调用提供大幅折扣。我们的用例:文章处理流水线(打标签、总结、提取实体)。
# Batch API pattern (OpenAI example)
batch_input = [
{"custom_id": "article-001", "method": "POST", "url": "/v1/chat/completions", "body": {...}},
# ... up to 50,000 requests per batch
]
batch_job = openai.batches.create(input_file_id=batch_file.id, endpoint="/v1/chat/completions")
Batch processing costs 50% less than real-time API calls. For our content pipeline, this saved $400/month.
批处理的成本比实时 API 调用低 50%。对于我们的内容流水线,这每月节省了 400 美元。
6. Model Distillation: The Advanced Play
6. 模型蒸馏:进阶玩法
This is the most effort but the biggest payoff. For tasks you run thousands of times per day (classification, tagging), fine-tune a smaller model on outputs from the big model.
这是最费力但回报最大的方法。对于每天运行数千次的任务(分类、打标签),使用大模型的输出来微调一个小模型。
The process: Run 1,000 examples through GPT-4 to get “gold standard” outputs, then fine-tune GPT-4o-mini on those examples. The small model now produces ~90% of the quality at ~10% of the cost. Our results: Classification accuracy went from 94% (premium) to 89% (fine-tuned small), but cost dropped from $0.03 to $0.003 per request.
流程:用 GPT-4 运行 1,000 个示例获取“黄金标准”输出,然后用这些示例微调 GPT-4o-mini。小模型现在能以约 10% 的成本实现约 90% 的质量。我们的结果:分类准确率从 94%(高级模型)降至 89%(微调后的小模型),但单次请求成本从 0.03 美元降至 0.003 美元。
The Numbers, Laid Bare
账单明细
| Tactic | Monthly Savings | Effort |
|---|---|---|
| Smart routing | ~$900 | 2 days |
| Response caching | ~$600 | 1 day |
| Token budgeting | ~$800 | 3 hours |
| Prompt compression | ~$400 | 1 day |
| Batch processing | ~$400 | 2 hours |
| Model distillation | ~$170 | 1 week |
| Total | ~$3,070 |
| 策略 | 每月节省 | 工作量 |
|---|---|---|
| 智能路由 | ~$900 | 2 天 |
| 响应缓存 | ~$600 | 1 天 |
| Token 预算 | ~$800 | 3 小时 |
| Prompt 压缩 | ~$400 | 1 天 |
| 批处理 | ~$400 | 2 小时 |
| 模型蒸馏 | ~$170 | 1 周 |
| 总计 | ~$3,070 |
From $4,200 to… 从 4,200 美元降至……