Tokenminning: How to Get More from Your Chatbot for Less

Tokenmaxxing is out. Here are practical patterns for reducing costs without sacrificing AI effectiveness.

Tokenmaxxing is the latest productivity virus spreading through big tech. Engineers are being judged, directly or indirectly, by how much AI they can consume. More tokens, more output, more compute. Some companies even had leaderboards. It’s the 2026 version of ranking engineers by lines of code.

Less is more

Tokenminning is the antithesis of tokenmaxxing. Token efficiency becomes increasingly important as your usage grows: every unnecessary token adds cost, latency, and complexity. Tokenminning is a pattern that systematically minimizes token use while maintaining, if not improving, the performance of your AI agents.

In this article, I cover the practical tokenminning strategies I use to reduce costs. All of them can be deployed without major refactoring. The result: significantly lower AI costs without sacrificing quality.

The Cost of Tokenmaxxing

Tokenmaxxing and other naïve approaches to AI usage share a common assumption: inputs with more tokens lead to better outputs. This assumption leads to larger-than-necessary prompts, loaded with uncompressed context and RAG bloat. In some cases this can improve performance, but it introduces several significant problems.

1. Financial Cost

Unsurprisingly, costs skyrocket. Every token sent to and generated by a model has a price. Interactive chats have reasonably sized inputs and outputs, so naive cost estimates seem manageable at first. However, real agent token usage violates every assumption you may have had about average token use. Running long-lived agents on frontier models can result in ridiculous costs.
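To make the gap between a chat-style estimate and an agent run concrete, here is a rough back-of-the-envelope sketch. It assumes an agent loop that re-sends its entire accumulated context on every step, with no prompt caching. The prices, token counts, and function names are made-up placeholders for illustration, not any provider's real rates.

```python
# Hypothetical prices in USD per million tokens (placeholders, not real rates).
INPUT_PRICE_PER_M = 5.00
OUTPUT_PRICE_PER_M = 25.00


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single model call under the assumed prices."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


def agent_run_cost(base_context: int, tokens_added_per_step: int,
                   output_per_step: int, steps: int) -> float:
    """Cost of an agent loop that re-sends its growing context on every step."""
    total = 0.0
    context = base_context
    for _ in range(steps):
        total += call_cost(context, output_per_step)
        # Assumption: each step appends the model output plus tool results.
        context += output_per_step + tokens_added_per_step
    return total


chat_cost = call_cost(input_tokens=1_000, output_tokens=500)
agent_cost = agent_run_cost(base_context=5_000, tokens_added_per_step=2_000,
                            output_per_step=500, steps=50)
print(f"single chat turn:  ${chat_cost:.4f}")
print(f"50-step agent run: ${agent_cost:.2f}")
```

Because the context is re-sent on every step, total input tokens grow roughly with the square of the step count. Under these placeholder numbers the agent run costs on the order of a thousand times more than the single chat turn.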

2. Inference Speed

More tokens also mean more latency. Larger prompts take longer to process, increasing both time-to-first-token and overall response time. This can be a serious problem for customer-facing AI or time-sensitive agents.

3. Quality

A big misconception is that more context produces better results. This is simply not the case, especially with very long contexts. Models have limited attention. As prompts grow, important information competes with irrelevant details for the model’s focus. “Context rot” is a real problem: LLMs become less effective as the context grows, and attention degrades unevenly. Models tend to use information at the beginning and end of the context window well, but performance drops for material in the middle.

🛠️ Real strategies for “tokenminning”

Even if you haven’t yet experienced the true cost of using AI firsthand, the problems outlined above should now be evident. AI engineers need to start thinking about how to realistically reduce token use while keeping performance high. Here are a few strategies I use to reduce AI costs.

Strategy #1: Routing

Realistically, most prompts don’t need a frontier model. It’s true that models like Claude Opus or GPT 5.5 excel at complex reasoning, planning, and difficult coding tasks. But simple requests, like tool usage, summarization, and classification, can be handled by smaller, lower-cost models. You may even route these to a quantized local model and skip the API cost altogether.

Here is a high-level summary of how it works:

- A lightweight, self-hosted web service intercepts each prompt request. This service is typically referred to as an “LLM Gateway.”
- Within the web service, you will need the following hooks for each prompt:
  - Process: run any preprocessing required for the prompt.
  - Evaluate: run classification on the processed prompt.
  - Route: based on the evaluation, apply predefined rules to select the model.
  - Execute: make the LLM call with the selected model.
  - Validate (optional, but helpful): run validation rules on the output.
  - Return: …
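
The hooks above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not a production gateway: the model names in `ROUTES`, the keyword-based classifier, the `GatewayResult` type, and the `fake_call_model` stub are all hypothetical. A real gateway would likely classify with a small model or embeddings and call an actual API or local runtime. The source does not spell out the Return step, so this sketch simply hands back the result unchanged.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical model identifiers; substitute whatever your gateway can reach.
ROUTES: Dict[str, str] = {
    "simple": "small-local-model",
    "complex": "frontier-model",
}


@dataclass
class GatewayResult:
    model: str
    output: str
    valid: bool


def process(prompt: str) -> str:
    """Process: collapse whitespace (a stand-in for real preprocessing)."""
    return " ".join(prompt.split())


def evaluate(prompt: str) -> str:
    """Evaluate: toy classifier based on length and keyword substrings."""
    hard_markers = ("refactor", "prove", "design", "debug", "plan")
    lowered = prompt.lower()
    if len(prompt) > 2000 or any(marker in lowered for marker in hard_markers):
        return "complex"
    return "simple"


def route(label: str) -> str:
    """Route: map the classification label to a model via predefined rules."""
    return ROUTES[label]


def validate(output: str) -> bool:
    """Validate: minimal rule that only rejects empty output."""
    return bool(output.strip())


def handle(prompt: str, call_model: Callable[[str, str], str]) -> GatewayResult:
    cleaned = process(prompt)
    label = evaluate(cleaned)
    model = route(label)
    output = call_model(model, cleaned)  # Execute
    ok = validate(output)
    # Return: left unspecified in the text; here the result is returned as-is.
    return GatewayResult(model=model, output=output, valid=ok)


def fake_call_model(model: str, prompt: str) -> str:
    """Stub standing in for a real API call or local inference."""
    return f"[{model}] response to: {prompt}"


print(handle("Summarize   this paragraph.", fake_call_model).model)
print(handle("Debug and refactor this module.", fake_call_model).model)
```

Passing the model call in as a function keeps the routing logic testable without network access. The substring matching is deliberately crude (for example, "plan" also matches "explanation"), so treat the classifier as a placeholder for whatever evaluation you actually trust.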