I thought Mnemara would save tokens for cloud based models, that was wrong.
I thought Mnemara would save tokens for cloud-based models, that was wrong.
我曾以为 Mnemara 能为云端模型节省 Token,但我错了。
Mnemara was built for local models. I built it for Claude too. Only one of those was a good idea. The context management problem felt real, and it was. I was running Gemma 9B locally for parts of Aethon Autopoiesis — the MUD-based AI research project I’ve been pouring time into — and a 16k context window doesn’t last long when you’re trying to hold a coherent session across a real workflow. Tool calls take space. Thinking blocks take space. Read outputs take space. The model can technically still talk to you at turn forty, but its window has filled with the rinds of the last thirty turns and there’s no room left to actually do work.
Mnemara 最初是为本地模型构建的。后来我也把它适配到了 Claude 上。但事实证明,只有前者是个好主意。上下文管理的问题确实存在,而且很棘手。我一直在本地运行 Gemma 9B 来处理 Aethon Autopoiesis(我投入了大量时间研究的基于 MUD 的 AI 项目)的部分任务。当你试图在实际工作流中保持对话连贯性时,16k 的上下文窗口根本撑不了多久。工具调用需要空间,思考块需要空间,读取输出也需要空间。从技术上讲,模型在第四十轮对话时依然能和你交流,但它的窗口已经被前三十轮的“残渣”填满了,根本没有空间去执行实际任务。
The lever was obvious. If the window is the binding constraint, manage the window. Strip thinking blocks once they’ve served their purpose. Stub out file contents you’ve already read. Drop oldest-first when you’re up against the ceiling. Pin what matters so it never gets evicted by accident. Give the operator a TUI that makes all of it visible and editable instead of hidden behind opaque magic. That’s Mnemara. A rolling-context conversation runtime with pinned slots, judgment-driven eviction, transparent turn storage, and a role doc that sits in the system prompt. The whole thing is about making the context window workable — letting a small model punch above its window by aggressively curating what’s in there. It does that job well. I’ve run Gemma sessions for hours that stayed coherent because Mnemara was holding the state and the model didn’t have to.
解决办法显而易见:如果窗口是限制因素,那就管理好窗口。在思考块完成使命后将其剔除;将已读取的文件内容简化为存根;当达到上限时,优先丢弃最旧的内容;将重要信息固定,防止被意外清除。为操作者提供一个 TUI(终端用户界面),让这一切变得可见且可编辑,而不是隐藏在不透明的“魔法”之后。这就是 Mnemara。它是一个具备固定槽位、基于判断的驱逐机制、透明对话存储以及系统提示词中角色文档的滚动上下文对话运行时。它的核心在于让上下文窗口变得可用——通过积极地筛选内容,让小型模型发挥出超越其窗口限制的能力。它在这方面表现出色。我曾运行过数小时的 Gemma 会话,由于 Mnemara 负责维护状态而无需模型自身承担,对话始终保持连贯。
Then I ported the same runtime to Claude. The features still worked. The TUI still rendered. The eviction commands still freed tokens on the turn I ran them. Mechanically, nothing was broken. But something was off, and it took a few real sessions to put my finger on what. Cloud models don’t have the same constraints. Claude Sonnet has a 200k context window. The window is rarely the binding thing — you can fit most of a codebase in there and still have room to think. The constraint isn’t “how much fits.” It’s “how much do you pay to send it.” And that’s where Mnemara’s whole model inverts.
随后,我将同样的运行时移植到了 Claude 上。功能依然有效,TUI 依然能渲染,驱逐命令在执行时也确实释放了 Token。从机制上讲,一切正常。但总觉得哪里不对劲,我经过几次实际会话才意识到问题所在。云端模型没有同样的限制。Claude Sonnet 拥有 200k 的上下文窗口,窗口很少成为瓶颈——你可以把大部分代码库塞进去,依然有空间进行思考。这里的限制不是“能装下多少”,而是“发送这些内容需要支付多少费用”。而这正是 Mnemara 的整个模型逻辑发生反转的地方。
Cloud APIs use prompt caching. You hit the cache by sending the same prefix turn after turn — same system prompt, same early context. Cache hits cost roughly a tenth of fresh reads. So the economic shape of a cloud session is: send a stable prefix, let it cache, ride that cache for as long as the TTL holds. Eviction breaks the cache. Every time Mnemara compresses, drops oldest, strips thinking blocks, or rearranges the window, the prefix changes. The cache invalidates. The next turn isn’t a cached read of the smaller window — it’s a fresh, uncached read of whatever’s left. The tokens you “saved” by evicting come back as a cache miss on the next call, billed at full price. You don’t save tokens. You spend them. Just on a delay.
云端 API 使用提示词缓存(Prompt Caching)。通过在每一轮对话中发送相同的前缀(相同的系统提示词、相同的早期上下文),你就能命中缓存。缓存命中的成本大约是全新读取的十分之一。因此,云端会话的经济逻辑是:发送一个稳定的前缀,让其缓存,并在 TTL(生存时间)有效期内持续利用该缓存。而驱逐机制会破坏缓存。每当 Mnemara 进行压缩、丢弃旧内容、剔除思考块或重排窗口时,前缀就会改变,缓存随之失效。下一轮对话不再是基于较小窗口的缓存读取,而是对剩余内容的全新、未缓存读取。你通过驱逐“节省”下来的 Token,在下一次调用时会以缓存未命中的形式反弹回来,并按全价计费。你并没有节省 Token,你只是延迟了支出。
That’s the inversion, and it’s worth saying out loud because the mechanism is sneaky: the per-turn metric Mnemara reports — “freed 12,400 tokens” — is real. The window genuinely shrank. The bill genuinely got worse anyway, because the next turn had to rebuild a context the cache was about to serve for free. Local: tokens are the wall. Cloud: tokens are the bill, and the bill has a discount you just threw away. There’s a second mismatch underneath the first. Local models, when you run them yourself, have real persistence between calls — the process is yours, the state is yours, “rolling context” maps onto something the model actually lives inside. Cloud models are stateless. Each API call rebuilds the conversation from whatever you send. The “rolling window” abstraction is doing nothing the model can feel. It’s a fiction you’re maintaining for your own convenience, and on the cloud side it’s an expensive fiction.
这就是反转所在,值得大声说出来,因为这个机制非常隐蔽:Mnemara 报告的每轮指标——“释放了 12,400 个 Token”——是真实的。窗口确实缩小了。但账单确实变贵了,因为下一轮对话必须重建原本可以由缓存免费提供的上下文。在本地:Token 是墙(限制);在云端:Token 是账单,而账单里本有一份折扣,却被你丢弃了。在这之下还有第二个错位。当你自己运行本地模型时,调用之间存在真正的持久性——进程是你的,状态是你的,“滚动上下文”映射到模型实际居住的环境中。而云端模型是无状态的。每一次 API 调用都会根据你发送的内容重建对话。“滚动窗口”的抽象对模型来说毫无感知。这只是为了你自己的方便而维持的一种虚构,而在云端,这是一种昂贵的虚构。
Local models are stateless as well, but with the right set of tools, it’s not quite the same. So Mnemara stays. But it stays where it belongs: local model infrastructure. Small windows, real persistence, no caching layer to break. It’s the right tool for that job and I’m going to keep building on it for the parts of Aethon Autopoiesis that run on local backends — Gwen for gameplay, Huginn for code, anything else I end up putting on Ollama. The role-doc-as-system-prompt pattern, pinned slots for stable lore and player state, judgment-driven eviction over mechanical FIFO — all of that earns its keep when the window is genuinely scarce.
本地模型虽然也是无状态的,但配合合适的工具,情况就完全不同了。所以 Mnemara 会保留下来,但它只会待在它该待的地方:本地模型基础设施。小窗口、真正的持久性、没有会被破坏的缓存层。它是完成这项工作的正确工具,我将继续在 Aethon Autopoiesis 的本地后端部分使用它——比如用于游戏逻辑的 Gwen、用于代码的 Huginn,以及任何我最终放在 Ollama 上的东西。“角色文档作为系统提示词”的模式、用于固定背景设定和玩家状态的槽位、基于判断而非机械 FIFO(先进先出)的驱逐机制——当窗口确实稀缺时,所有这些都物有所值。
For cloud, the right approach is roughly the opposite of what Mnemara does. Keep prefixes stable. Don’t rearrange. Append rather than evict. When the conversation is genuinely done, end it and start fresh — don’t try to surgically shrink a live session. Treat the context window as a single send, not a managed state. The model isn’t living inside it between turns. You are. That’s the lesson, and it cost me a few weekends to learn. Worth it. The mistake was assuming “context management” meant the same thing on both sides of the API boundary. It doesn’t. Local models reward you for managing the window. Cloud models reward you for leaving it alone.
对于云端,正确的方法与 Mnemara 所做的几乎相反。保持前缀稳定。不要重排。追加而非驱逐。当对话真正结束时,直接结束并重新开始——不要试图对正在进行的会话进行“外科手术式”的缩减。将上下文窗口视为一次性发送,而不是一种受管理的状态。模型在对话轮次之间并不“居住”在其中,是你自己在其中。这就是我花了好几个周末才学到的教训。值得。我的错误在于假设“上下文管理”在 API 边界的两侧意味着同样的事情。事实并非如此。本地模型奖励你对窗口的管理,而云端模型奖励你对窗口的放任。