I Cut My LLM API Bill by 73% — Here's the Exact Optimization Playbook

I Cut My LLM API Bill by 73% — Here’s the Exact Optimization Playbook

我将 LLM API 账单削减了 73% —— 这是我的精确优化指南

Running LLMs in production burns cash. Fast. When your app goes from “prototype” to “actually used by people,” that API bill can go from “whatever” to “wait, that’s a mortgage payment” in about two weeks. I learned this the hard way. My knowledge base platform went from a few hundred requests to thousands per day, and my LLM bill jumped to $4,200/month. After spending three weeks optimizing, I brought it down to $1,130/month — a 73% reduction — without anyone noticing a drop in quality. Here’s the exact playbook.

在生产环境中运行大语言模型(LLM)非常烧钱,而且速度极快。当你的应用从“原型”变成“真实用户使用”时,API 账单可能会在两周内从“无所谓”变成“等等,这简直是一笔房贷”。我对此深有体会。我的知识库平台从每天几百次请求增长到数千次,LLM 账单随之飙升至每月 4,200 美元。经过三周的优化,我将其降至每月 1,130 美元——削减了 73%——且没有任何用户察觉到质量下降。以下是我的具体操作指南。

1. The Routing Layer: Right Model for the Right Job

1. 路由层:为合适的任务选择合适的模型

Most developers send everything to the biggest model. That’s like using a sledgehammer to crack a nut. The strategy: Classify requests by complexity and route accordingly.

大多数开发者习惯将所有请求都发送给最强大的模型。这就像是用大锤去砸核桃。策略是:根据复杂程度对请求进行分类,并进行相应的路由。

// Simple classification layer
function routeByComplexity(userInput: string): LLMModel {
  const tokens = userInput.split(/\s+/).length;
  if (tokens < 15 && !containsTechnicalTerms(userInput)) {
    return 'cheap-fast-model'; // $0.15/M tokens
  }
  if (tokens < 100 && isStructuredQuery(userInput)) {
    return 'mid-tier-model'; // $0.50/M tokens
  }
  return 'premium-model'; // $3.00/M tokens — only when needed
}

The impact: ~40% of our requests are simple (formatting, classification, short answers). Routing those to cheaper models saved ~$900/month alone. How to classify: Start with heuristics (token count, keyword matching). Once you have data, train a tiny classifier that costs pennies to run.

影响:我们约 40% 的请求属于简单任务(格式化、分类、简短回答)。仅将这些请求路由到更便宜的模型,每月就节省了约 900 美元。如何分类:从启发式方法(Token 计数、关键词匹配)开始。一旦有了数据,就可以训练一个运行成本极低的小型分类器。

2. Response Caching: The $600/Month Win

2. 响应缓存:每月节省 600 美元的秘诀

If a user asks “What is RAG?” and another user asks “What is RAG?” three hours later — that’s the same answer. Don’t pay twice.

如果一个用户问“什么是 RAG?”,三小时后另一个用户也问“什么是 RAG?”——答案是一样的。不要付两次钱。

import hashlib
import redis

class LLMCache:
    def __init__(self):
        self.redis = redis.Redis()

    def get_cache_key(self, prompt: str, model: str) -> str:
        raw = f"{model}:{prompt.strip().lower()}"
        return f"llm:{hashlib.sha256(raw.encode()).hexdigest()[:16]}"

    def get(self, prompt: str, model: str):
        key = self.get_cache_key(prompt, model)
        return self.redis.get(key)

    def set(self, prompt: str, model: str, response: str, ttl: int = 86400):
        key = self.get_cache_key(prompt, model)
        self.redis.setex(key, ttl, response)

Key decisions: TTL: 24 hours for general knowledge, 1 hour for time-sensitive queries. Cache scope: Cache at the prompt level, not the response — normalize whitespace, lowercase, strip trailing punctuation. Hit rate: We achieved 35% cache hit rate on FAQ-style content. The catch: Don’t cache creative tasks (writing, brainstorming). Those need fresh outputs every time.

关键决策:TTL(生存时间):通用知识设为 24 小时,时效性查询设为 1 小时。缓存范围:在 Prompt 层面缓存,而非响应层面——对空格进行标准化、转为小写、去除末尾标点。命中率:我们在 FAQ 类内容上实现了 35% 的缓存命中率。注意:不要缓存创意类任务(写作、头脑风暴),这些任务每次都需要新鲜的输出。

3. Token Budgeting: The Silent Killer Is Output Length

3. Token 预算:隐形杀手是输出长度

Most LLM pricing charges per output token. A model that outputs 2,000 tokens when 300 would do is burning your money.

大多数 LLM 的定价是按输出 Token 收费的。如果一个模型本可以用 300 个 Token 完成任务,却输出了 2,000 个,那就是在烧你的钱。

  • Before: User: “Summarize this article” -> Model: generates 1,800 token essay -> Cost: $0.054/request

  • After: User: “Summarize this article in 3 bullet points, max 50 words each.” -> Model: generates 120 tokens -> Cost: $0.0036/request

  • 优化前: 用户:“总结这篇文章” -> 模型:生成 1,800 个 Token 的文章 -> 成本:$0.054/次

  • 优化后: 用户:“用 3 个要点总结这篇文章,每点不超过 50 字。” -> 模型:生成 120 个 Token -> 成本:$0.0036/次

Tactics that work: Explicit token budgets in prompts (“Answer in under 100 words”), max_tokens parameter (set hard limits), output format constraints (JSON schemas force conciseness), and temperature tuning (lower temperature 0.1-0.3 reduces rambling). This alone cut our output token count by 60%.

有效的策略:在 Prompt 中明确 Token 预算(“用 100 字以内回答”)、设置 max_tokens 参数(设定硬性限制)、约束输出格式(JSON Schema 强制简洁)、以及调整 Temperature(较低的 0.1-0.3 可以减少废话)。仅此一项就将我们的输出 Token 数量减少了 60%。

4. Prompt Compression: Shrink the Input, Shrink the Cost

4. Prompt 压缩:缩减输入,缩减成本

Your prompt tokens cost money too. If you’re sending a 5,000-token system prompt with every request, you’re paying $0.015 per call just for setup.

你的 Prompt Token 也要钱。如果你每次请求都发送 5,000 个 Token 的系统提示词,那么仅初始化成本每次就要 0.015 美元。

What I compressed: System prompts (3,200 → 800 tokens), Few-shot examples (6 → 2 examples), Context windows (only include relevant sections). The technique: Run your prompt through a cheap model first: “Condense these instructions to the minimum needed for correct execution.” Test output quality, and A/B test for a week. We cut input tokens by 55% with zero quality loss.

我的压缩方案:系统提示词(3,200 → 800 Token)、少样本示例(6 → 2 个)、上下文窗口(仅包含相关部分)。技巧:先用廉价模型处理你的 Prompt:“将这些指令压缩到正确执行所需的最低限度。”测试输出质量,并进行一周的 A/B 测试。我们在零质量损失的情况下将输入 Token 减少了 55%。

5. Batch Processing: The Async Advantage

5. 批处理:异步的优势

If your app doesn’t need real-time responses, batching is your best friend. Most providers offer significant discounts for batch API calls. Our use case: Article processing pipeline (tagging, summarizing, extracting entities).

如果你的应用不需要实时响应,批处理就是你最好的朋友。大多数供应商为批量 API 调用提供大幅折扣。我们的用例:文章处理流水线(打标签、总结、提取实体)。

# Batch API pattern (OpenAI example)
batch_input = [
    {"custom_id": "article-001", "method": "POST", "url": "/v1/chat/completions", "body": {...}},
    # ... up to 50,000 requests per batch
]
batch_job = openai.batches.create(input_file_id=batch_file.id, endpoint="/v1/chat/completions")

Batch processing costs 50% less than real-time API calls. For our content pipeline, this saved $400/month.

批处理的成本比实时 API 调用低 50%。对于我们的内容流水线,这每月节省了 400 美元。

6. Model Distillation: The Advanced Play

6. 模型蒸馏:进阶玩法

This is the most effort but the biggest payoff. For tasks you run thousands of times per day (classification, tagging), fine-tune a smaller model on outputs from the big model.

这是最费力但回报最大的方法。对于每天运行数千次的任务(分类、打标签),使用大模型的输出来微调一个小模型。

The process: Run 1,000 examples through GPT-4 to get “gold standard” outputs, then fine-tune GPT-4o-mini on those examples. The small model now produces ~90% of the quality at ~10% of the cost. Our results: Classification accuracy went from 94% (premium) to 89% (fine-tuned small), but cost dropped from $0.03 to $0.003 per request.

流程:用 GPT-4 运行 1,000 个示例获取“黄金标准”输出,然后用这些示例微调 GPT-4o-mini。小模型现在能以约 10% 的成本实现约 90% 的质量。我们的结果:分类准确率从 94%(高级模型)降至 89%(微调后的小模型),但单次请求成本从 0.03 美元降至 0.003 美元。

The Numbers, Laid Bare

账单明细

TacticMonthly SavingsEffort
Smart routing~$9002 days
Response caching~$6001 day
Token budgeting~$8003 hours
Prompt compression~$4001 day
Batch processing~$4002 hours
Model distillation~$1701 week
Total~$3,070
策略每月节省工作量
智能路由~$9002 天
响应缓存~$6001 天
Token 预算~$8003 小时
Prompt 压缩~$4001 天
批处理~$4002 小时
模型蒸馏~$1701 周
总计~$3,070

From $4,200 to… 从 4,200 美元降至……