Making budget models punch above their weight with a smart Rust harness
June 8, 2026
Dirge is an agentic harness that I’ve been developing for my own use, and it has reached the point where it’s becoming generally useful. In this post, I’ll discuss some of the rationale behind it and the features that differentiate it from other tools in this space.
The first thing to note is its performance and memory footprint. Most existing coding tools, such as OpenCode, are rather memory-intensive, often using around 300 MB of RAM even when idle. Some tools, such as Claude Code, even lag at times while you’re typing. Dirge is written in Rust and compiles to a small, fast binary of about 30 MB. It needs only around 8 MB of RAM while idle, and working on tasks pushes that up to roughly 15 MB. So you could run about twenty busy copies of Dirge at the same time for the memory cost of a single instance of OpenCode.
However, lean size alone is not the main point, and there are other Rust-based harnesses to choose from. What makes Dirge actually interesting is how it helps less capable models perform at their best. The conventional wisdom is that intelligence resides in the model, and the harness is treated as minimal plumbing. Its sole job is to give the model a tool loop along with a system prompt, and then to stay out of the way. In this view, the only way to get a better agent is to get a bigger, and typically more expensive, model.
Spending a lot of time doing agentic coding with models such as DeepSeek and Qwen has changed my mind on the subject. It turns out that much of what makes an agent effective in practice lies in how well the harness meets the expectations of the model. The model usually knows what it wants, and can figure out how to get there and what actions it needs to take. What makes one setup feel cutting-edge and another feel frustrating is everything that surrounds the model. A harness needs to guide the model before it acts, correct mechanical errors, and tell the model exactly what went wrong. It should also remember what has been learned from each attempt and manage the context intelligently.
Frontier labs build much of this into post-training and tune their own harnesses to fit the strengths and weaknesses of their specific model. While Dirge cannot change how a model was trained, it can close the performance gap by meeting the model where it is. Once you invest a bit of work in the harness, a cheaper open model starts to behave like one that costs much more.
The gap appears at three different time scales, and Dirge invests in all three. The first is the single turn. Each time the model makes a tool call, it can either succeed or fail. Maybe the call is malformed. Maybe it edits a file and introduces a syntax error. Or maybe it gets stuck retrying the same failing command over and over. In each case, a failed step consumes time and tokens without advancing the task, and these failures quickly accumulate and fill the context window with noise, leading the model to lose the thread of what it’s doing. The second is the session. The longer a session runs, the worse the model gets at following instructions, because as the window nears its limit, earlier instructions and corrections get truncated or forgotten. So the model continues to repeat mistakes or ignore earlier context. The third is the gap between sessions, where things get even worse, since each new agent starts with no knowledge of what went on before. The model doesn’t remember any past decisions, file structures, or problems you’ve already solved. Every session has to rebuild its understanding of the codebase from scratch.
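To make the "stuck retrying the same failing command" failure concrete, here is a minimal sketch of how a harness could notice it and push back. This is not Dirge’s actual code: the `RetryGuard` type, its `record` method, and the wording of the hint are all hypothetical names chosen for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical guard (not Dirge's real API): counts how often each exact
/// command has failed since it last succeeded.
pub struct RetryGuard {
    failures: HashMap<String, u32>,
    limit: u32,
}

impl RetryGuard {
    pub fn new(limit: u32) -> Self {
        Self { failures: HashMap::new(), limit }
    }

    /// Record the outcome of a command. Once the same command has failed
    /// `limit` times without a success in between, return a hint that the
    /// harness can feed back to the model instead of letting it retry blindly.
    pub fn record(&mut self, command: &str, succeeded: bool) -> Option<String> {
        if succeeded {
            // A success resets the counter for this command.
            self.failures.remove(command);
            return None;
        }
        let count = self.failures.entry(command.to_string()).or_insert(0);
        *count += 1;
        let n = *count;
        if n >= self.limit {
            Some(format!(
                "`{command}` has failed {n} times without succeeding; \
                 try a different approach instead of retrying it."
            ))
        } else {
            None
        }
    }
}
```

With a limit of 3, the first two failures of a command pass silently and the third returns a hint. The point of the design is that the correction arrives as short, explicit text rather than as yet another copy of the same error output in the context window.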
Let’s take a look at what Dirge does with each piece of the puzzle. The pieces come in a particular order, and each connects to its neighbor to form a single process. As is often the case, the whole is greater than the sum of its parts.
How Dirge works
Dirge is essentially a state machine wrapped around the model. It lays down a series of steps: running the model, classifying the reply, verifying and executing any tool calls, and finally verifying that the job has really been done before allowing the model to stop. This loop in itself is the core plumbing. What makes it a real power multiplier is three layers of apparatus wrapped around it, each of which corresponds to one of the time scales we just discussed.
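The loop described above can be sketched as a small transition function. This is only an illustration under assumptions: the state, event, and reply names (`State`, `Event`, `ReplyKind`, `next_state`) are invented here, and the real state machine in Dirge is likely richer than this.

```rust
/// Hypothetical classification of a model reply.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReplyKind {
    ToolCall,
    ClaimsDone,
    Malformed,
}

/// Hypothetical states of the agent loop.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    AwaitingModel,
    RunningTool,
    VerifyingDone,
    Stopped,
}

/// Hypothetical events that move the loop forward.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    ModelReplied(ReplyKind),
    ToolFinished,
    DoneChecked { passed: bool },
}

/// Pure transition function: given the current state and an event,
/// return the next state. Events that make no sense in the current
/// state leave it unchanged.
pub fn next_state(state: State, event: Event) -> State {
    use Event::*;
    use State::*;
    match (state, event) {
        // A well-formed tool call gets executed.
        (AwaitingModel, ModelReplied(ReplyKind::ToolCall)) => RunningTool,
        // The model says it is finished: check before believing it.
        (AwaitingModel, ModelReplied(ReplyKind::ClaimsDone)) => VerifyingDone,
        // A malformed reply sends the loop back to the model with feedback.
        (AwaitingModel, ModelReplied(ReplyKind::Malformed)) => AwaitingModel,
        // Tool output goes back to the model for the next turn.
        (RunningTool, ToolFinished) => AwaitingModel,
        // Only a passed check lets the loop stop.
        (VerifyingDone, DoneChecked { passed: true }) => Stopped,
        (VerifyingDone, DoneChecked { passed: false }) => AwaitingModel,
        (other, _) => other,
    }
}
```

The detail worth noticing is that `Stopped` is reachable only through `VerifyingDone`: in this sketch the model cannot end the session simply by claiming to be done.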
A steering-and-repair layer ensures that each turn lands. A long-horizon layer ensures continuity within a session despite the limited size of the context window. A learning layer carries hard-won knowledge between sessions, storing it in a single SQLite database associated with the project. On top of that sits a plugin system that lets you reach into any part of the agentic loop you want. Let’s take a look at these features in order.
Making each turn land
You might have heard that open models aren’t good at tool calling and that you have to pay for a top-tier model trained on API contracts to get reliable results. All tool calling means is that the model has to output structured data, like JSON, in a specific format. Frontier models, like Claude, are directly trained on thousands of API contracts so that they produce outputs matching function signatures and parameter rules. Open models are typically trained for general text generation rather than structured output tasks, and are less reliable at producing such exact outputs. But that’s precisely an area where the harness can close the gap, through how the model’s output is parsed, formatted, and verified.
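As an illustration of the "verified" part, here is a minimal sketch of checking a tool call against a declared signature and returning an error precise enough for the model to fix on its next turn. The `ToolSpec` type, the `validate_call` function, and the tool names in the usage note are hypothetical, and the sketch assumes the arguments have already been parsed out of the model’s JSON into a string map.

```rust
use std::collections::HashMap;

/// Hypothetical description of a tool: its name and required arguments.
pub struct ToolSpec {
    pub name: &'static str,
    pub required: &'static [&'static str],
}

/// Check a call against the known tools. On failure, the error text says
/// exactly what is wrong so it can be sent back to the model verbatim.
pub fn validate_call(
    specs: &[ToolSpec],
    tool: &str,
    args: &HashMap<String, String>,
) -> Result<(), String> {
    // Unknown tool: list the valid names instead of just saying "error".
    let spec = specs.iter().find(|s| s.name == tool).ok_or_else(|| {
        let known: Vec<&str> = specs.iter().map(|s| s.name).collect();
        format!("unknown tool `{tool}`; available tools: {}", known.join(", "))
    })?;

    // Collect every missing argument so the model can fix them all at once.
    let missing: Vec<&str> = spec
        .required
        .iter()
        .copied()
        .filter(|p| !args.contains_key(*p))
        .collect();

    if missing.is_empty() {
        Ok(())
    } else {
        Err(format!(
            "tool `{tool}` is missing required argument(s): {}",
            missing.join(", ")
        ))
    }
}
```

For example, calling a hypothetical `edit_file` tool that requires `path`, `old`, and `new` with only `path` supplied would yield an error naming `old, new`, which gives a weaker model something concrete to act on.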
The first thing that can be adjusted is steering itself. Dirge includes a set of instructions that are known to work well based on existing literature. These are baked into the system prompt and the loop, so that the model finishes what it starts by checking itself against an explicit definition of done, which builds in self-discipline. The system also prompts…