I Cut My LLM API Bill by 73% — Here's the Exact Optimization Playbook

I Cut My LLM API Bill by 73% — Here’s the Exact Optimization Playbook

我将 LLM API 账单削减了 73% —— 这是我的精确优化指南

Running LLMs in production burns cash. Fast. When your app goes from “prototype” to “actually used by people,” that API bill can go from “whatever” to “wait, that’s a mortgage payment” in about two weeks. I learned this the hard way. My knowledge base platform went from a few hundred requests to thousands per day, and my LLM bill jumped to $4,200/month. After spending three weeks optimizing, I brought it down to $1,130/month — a 73% reduction — without anyone noticing a drop in quality. Here’s the exact playbook.

在生产环境中运行大语言模型（LLM）非常烧钱，而且速度极快。当你的应用从“原型”变成“真实用户使用”时，API 账单可能会在两周内从“无所谓”变成“等等，这简直是一笔房贷”。我对此深有体会。我的知识库平台从每天几百次请求增长到数千次，LLM 账单随之飙升至每月 4,200 美元。经过三周的优化，我将其降至每月 1,130 美元——削减了 73%——且没有任何用户察觉到质量下降。以下是我的具体操作指南。

1. The Routing Layer: Right Model for the Right Job

1. 路由层：为合适的任务选择合适的模型

Most developers send everything to the biggest model. That’s like using a sledgehammer to crack a nut. The strategy: Classify requests by complexity and route accordingly.

大多数开发者习惯将所有请求都发送给最强大的模型。这就像是用大锤去砸核桃。策略是：根据复杂程度对请求进行分类，并进行相应的路由。

// Simple classification layer
function routeByComplexity(userInput: string): LLMModel {
  const tokens = userInput.split(/\s+/).length;
  if (tokens < 15 && !containsTechnicalTerms(userInput)) {
    return 'cheap-fast-model'; // $0.15/M tokens
  }
  if (tokens < 100 && isStructuredQuery(userInput)) {
    return 'mid-tier-model'; // $0.50/M tokens
  }
  return 'premium-model'; // $3.00/M tokens — only when needed
}

The impact: ~40% of our requests are simple (formatting, classification, short answers). Routing those to cheaper models saved ~$900/month alone. How to classify: Start with heuristics (token count, keyword matching). Once you have data, train a tiny classifier that costs pennies to run.

影响：我们约 40% 的请求属于简单任务（格式化、分类、简短回答）。仅将这些请求路由到更便宜的模型，每月就节省了约 900 美元。如何分类：从启发式方法（Token 计数、关键词匹配）开始。一旦有了数据，就可以训练一个运行成本极低的小型分类器。

2. Response Caching: The $600/Month Win

2. 响应缓存：每月节省 600 美元的秘诀

If a user asks “What is RAG?” and another user asks “What is RAG?” three hours later — that’s the same answer. Don’t pay twice.

如果一个用户问“什么是 RAG？”，三小时后另一个用户也问“什么是 RAG？”——答案是一样的。不要付两次钱。

import hashlib
import redis

class LLMCache:
    def __init__(self):
        self.redis = redis.Redis()

    def get_cache_key(self, prompt: str, model: str) -> str:
        raw = f"{model}:{prompt.strip().lower()}"
        return f"llm:{hashlib.sha256(raw.encode()).hexdigest()[:16]}"

    def get(self, prompt: str, model: str):
        key = self.get_cache_key(prompt, model)
        return self.redis.get(key)

    def set(self, prompt: str, model: str, response: str, ttl: int = 86400):
        key = self.get_cache_key(prompt, model)
        self.redis.setex(key, ttl, response)

Key decisions: TTL: 24 hours for general knowledge, 1 hour for time-sensitive queries. Cache scope: Cache at the prompt level, not the response — normalize whitespace, lowercase, strip trailing punctuation. Hit rate: We achieved 35% cache hit rate on FAQ-style content. The catch: Don’t cache creative tasks (writing, brainstorming). Those need fresh outputs every time.

关键决策：TTL（生存时间）：通用知识设为 24 小时，时效性查询设为 1 小时。缓存范围：在 Prompt 层面缓存，而非响应层面——对空格进行标准化、转为小写、去除末尾标点。命中率：我们在 FAQ 类内容上实现了 35% 的缓存命中率。注意：不要缓存创意类任务（写作、头脑风暴），这些任务每次都需要新鲜的输出。

3. Token Budgeting: The Silent Killer Is Output Length

3. Token 预算：隐形杀手是输出长度

Most LLM pricing charges per output token. A model that outputs 2,000 tokens when 300 would do is burning your money.

大多数 LLM 的定价是按输出 Token 收费的。如果一个模型本可以用 300 个 Token 完成任务，却输出了 2,000 个，那就是在烧你的钱。

Before: User: “Summarize this article” -> Model: generates 1,800 token essay -> Cost: $0.054/request
After: User: “Summarize this article in 3 bullet points, max 50 words each.” -> Model: generates 120 tokens -> Cost: $0.0036/request
优化前： 用户：“总结这篇文章” -> 模型：生成 1,800 个 Token 的文章 -> 成本：$0.054/次
优化后： 用户：“用 3 个要点总结这篇文章，每点不超过 50 字。” -> 模型：生成 120 个 Token -> 成本：$0.0036/次

Tactics that work: Explicit token budgets in prompts (“Answer in under 100 words”), max_tokens parameter (set hard limits), output format constraints (JSON schemas force conciseness), and temperature tuning (lower temperature 0.1-0.3 reduces rambling). This alone cut our output token count by 60%.

有效的策略：在 Prompt 中明确 Token 预算（“用 100 字以内回答”）、设置 max_tokens 参数（设定硬性限制）、约束输出格式（JSON Schema 强制简洁）、以及调整 Temperature（较低的 0.1-0.3 可以减少废话）。仅此一项就将我们的输出 Token 数量减少了 60%。

4. Prompt Compression: Shrink the Input, Shrink the Cost

4. Prompt 压缩：缩减输入，缩减成本

Your prompt tokens cost money too. If you’re sending a 5,000-token system prompt with every request, you’re paying $0.015 per call just for setup.

你的 Prompt Token 也要钱。如果你每次请求都发送 5,000 个 Token 的系统提示词，那么仅初始化成本每次就要 0.015 美元。

What I compressed: System prompts (3,200 → 800 tokens), Few-shot examples (6 → 2 examples), Context windows (only include relevant sections). The technique: Run your prompt through a cheap model first: “Condense these instructions to the minimum needed for correct execution.” Test output quality, and A/B test for a week. We cut input tokens by 55% with zero quality loss.

我的压缩方案：系统提示词（3,200 → 800 Token）、少样本示例（6 → 2 个）、上下文窗口（仅包含相关部分）。技巧：先用廉价模型处理你的 Prompt：“将这些指令压缩到正确执行所需的最低限度。”测试输出质量，并进行一周的 A/B 测试。我们在零质量损失的情况下将输入 Token 减少了 55%。

5. Batch Processing: The Async Advantage

5. 批处理：异步的优势

If your app doesn’t need real-time responses, batching is your best friend. Most providers offer significant discounts for batch API calls. Our use case: Article processing pipeline (tagging, summarizing, extracting entities).

如果你的应用不需要实时响应，批处理就是你最好的朋友。大多数供应商为批量 API 调用提供大幅折扣。我们的用例：文章处理流水线（打标签、总结、提取实体）。

# Batch API pattern (OpenAI example)
batch_input = [
    {"custom_id": "article-001", "method": "POST", "url": "/v1/chat/completions", "body": {...}},
    # ... up to 50,000 requests per batch
]
batch_job = openai.batches.create(input_file_id=batch_file.id, endpoint="/v1/chat/completions")

Batch processing costs 50% less than real-time API calls. For our content pipeline, this saved $400/month.

批处理的成本比实时 API 调用低 50%。对于我们的内容流水线，这每月节省了 400 美元。

6. Model Distillation: The Advanced Play

6. 模型蒸馏：进阶玩法

This is the most effort but the biggest payoff. For tasks you run thousands of times per day (classification, tagging), fine-tune a smaller model on outputs from the big model.

这是最费力但回报最大的方法。对于每天运行数千次的任务（分类、打标签），使用大模型的输出来微调一个小模型。

The process: Run 1,000 examples through GPT-4 to get “gold standard” outputs, then fine-tune GPT-4o-mini on those examples. The small model now produces ~90% of the quality at ~10% of the cost. Our results: Classification accuracy went from 94% (premium) to 89% (fine-tuned small), but cost dropped from $0.03 to $0.003 per request.

流程：用 GPT-4 运行 1,000 个示例获取“黄金标准”输出，然后用这些示例微调 GPT-4o-mini。小模型现在能以约 10% 的成本实现约 90% 的质量。我们的结果：分类准确率从 94%（高级模型）降至 89%（微调后的小模型），但单次请求成本从 0.03 美元降至 0.003 美元。

The Numbers, Laid Bare

账单明细

Tactic	Monthly Savings	Effort
Smart routing	~$900	2 days
Response caching	~$600	1 day
Token budgeting	~$800	3 hours
Prompt compression	~$400	1 day
Batch processing	~$400	2 hours
Model distillation	~$170	1 week
Total	~$3,070

策略	每月节省	工作量
智能路由	~$900	2 天
响应缓存	~$600	1 天
Token 预算	~$800	3 小时
Prompt 压缩	~$400	1 天
批处理	~$400	2 小时
模型蒸馏	~$170	1 周
总计	~$3,070

From $4,200 to… 从 4,200 美元降至……