We built a coding harness that beats frontier models using open ones. It's in open beta.

Here is the bet we made: build software memory-first, not model-first, and it will outperform. Everyone else is racing to wrap the next model. We did the opposite. We built the memory layer first, then routing and tool calling, and now the recursive engine, and we let the model be a swappable part.

Today that bet has a name: Backboard Development Studio. It starts with the R-CLI, a coding harness now in open beta. The headline result? It beats frontier models using open ones. Keep reading: the numbers are below, and there is a promo code at the bottom.

Test it. The beta is open. Run one install command, grab a key, and you are running.

macOS / Linux

curl -fsSL https://app.backboard.io/api/cli | bash

Windows (PowerShell)

irm https://app.backboard.io/api/cli/windows | iex

Get your API key: https://app.backboard.io

Promo code: DEVTOCLI, good for inference credit while you put the CLI through its paces. Enter it in the Promo box in the top-right corner of the billing page.

The hypothesis, stated plainly. Model-first thinking says: pick the smartest model, prompt it well, and hope it remembers. Memory-first thinking says: give the system real persistence, real routing, and real recall, and a “smaller” model will outwork a “smarter” one that forgets everything between turns. We believed the second one, so we built it.

The R-CLI is powered by our memory algorithms (the same ones that rank #1 on LoCoMo and LongMemEval) and runs on Backboard’s unified API: memory, routing across 17,000+ models, RAG, and stateful threads behind one key.

Then we tested it in public, and that part did not go quietly. Here are the numbers from this week’s internal test runs:

  • 92% on Terminal Bench 2.1 running Codex 5.5
  • 70% on Terminal Bench 2.1 running GLM 5.1, an open-source model
  • Up to 30% fewer tokens and up to 90% lower cost than the closed harnesses
  • 0% of your code used to train anyone’s model (please read the terms and conditions of your favorite harness)
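The token and cost bullets measure different things. Here is a back-of-the-envelope sketch that assumes cost is simply tokens times a blended per-token price. That is our simplification, not necessarily how these runs were billed. Under it, 30% fewer tokens on its own buys only 30% lower cost. If both “up to” figures held on the same run, the rest of the saving would have to come from a per-token price roughly 7x lower. That fits with routing work to cheaper open models. The function name `required_price_ratio` is hypothetical.

```python
def required_price_ratio(token_reduction: float, cost_reduction: float) -> float:
    """Return new_price / old_price needed to reach `cost_reduction`
    when token usage already drops by `token_reduction`.

    Assumes cost = tokens * blended_price_per_token (a simplification).
    """
    token_ratio = 1.0 - token_reduction  # e.g. 0.7 of the original tokens
    cost_ratio = 1.0 - cost_reduction    # e.g. 0.1 of the original cost
    return cost_ratio / token_ratio


ratio = required_price_ratio(token_reduction=0.30, cost_reduction=0.90)
# Prints: per-token price must fall to 0.143x, i.e. about 7.0x cheaper
print(f"per-token price must fall to {ratio:.3f}x, i.e. about {1 / ratio:.1f}x cheaper")
```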

Read that second line again: an open model, inside our harness, posting numbers that go toe to toe with Claude Code at a fraction of the cost. To be clear, we are not the cheap open-source alternative; we run the full frontier lineup too. We just happen to beat frontier results with open models like GLM 5.1 and DeepSeek V4. Same harness, your choice of brain.

Then it gets weird: /expert mode. You do not have to pick one model; you can use two in a single task. Plan with Opus 4.7, execute with DeepSeek V4. The expensive model architects, the fast, cheap one ships, and the harness orchestrates the handoff. Frontier reasoning where it counts, frontier-beating cost where it does not. One command. Nobody else is selling that, because nobody else built memory and routing first.
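This post does not describe how /expert mode works internally. What follows is only a sketch of the general plan-then-execute pattern the paragraph describes. One model writes the plan, a second model carries out each step, and every handoff passes along the plan and the results so far. `ExpertRun`, `fake_planner`, and `fake_executor` are hypothetical names, and `paginate()` is a made-up task. The stubs stand in for real API calls; in the paragraph's example, the planner would be Opus 4.7 and the executor DeepSeek V4.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# A "model" is any function from prompt to completion. In a real harness this
# would wrap an API call; the stubs below are hypothetical stand-ins.
Model = Callable[[str], str]


@dataclass
class ExpertRun:
    """Plan with one model, execute each step with another (hypothetical sketch)."""
    planner: Model
    executor: Model
    transcript: List[str] = field(default_factory=list)

    def run(self, task: str) -> List[str]:
        # 1. The expensive model writes a plan, one step per line.
        plan = self.planner(f"Break this task into steps, one per line:\n{task}")
        steps = [line.strip() for line in plan.splitlines() if line.strip()]
        self.transcript.append(f"PLAN: {len(steps)} steps")

        # 2. The cheap model executes the steps in order. Each call sees the
        #    full plan plus the results so far; that context is the handoff.
        results: List[str] = []
        for step in steps:
            done_so_far = "\n".join(results)
            prompt = f"Plan:\n{plan}\nDone so far:\n{done_so_far}\nNow do: {step}"
            results.append(self.executor(prompt))
            self.transcript.append(f"EXEC: {step}")
        return results


def fake_planner(prompt: str) -> str:
    return "1. write failing test\n2. fix the bug\n3. run the test suite"


def fake_executor(prompt: str) -> str:
    # Echo back the step this call was asked to perform.
    step = prompt.rsplit("Now do: ", 1)[-1]
    return f"done: {step}"


expert = ExpertRun(planner=fake_planner, executor=fake_executor)
for line in expert.run("Fix the off-by-one in paginate()"):
    print(line)
```

Keeping the two models behind the same `Model` type means either role can be swapped without touching the orchestration loop.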

A developer tried to take it apart in public. When we launched, a serious builder showed up in the comments and pushed back hard. He had a well-tooled local repo: his own RAG, skills, memory, and a knowledge graph he had clearly invested months in. He ran the CLI and came back with a fair verdict: “kind of specific, not super helpful for a setup like mine.” Serious builder, serious objection, and the strongest one a developer can make: “I already hand-built the thing you are selling.”

Then one fact flipped the whole conversation and ended the argument: the R-CLI is stateful by default. The persistence he was hand-building? The session-priming file he writes and rereads every session? The weekly cron jobs auditing how often his agents drift? The pre-commit hooks keeping them on the rails? All native on our side. Not a layer you bolt on, but the default behavior. That is what memory-first actually means in your terminal.
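Here is a minimal sketch of what “stateful by default” means. It is not the R-CLI’s implementation. Every turn is written to a per-thread store automatically, and a new session on the same thread starts with that history already loaded. That covers the job a hand-maintained session-priming file does. The class name `ThreadMemory`, the JSON file layout, and the keyword-based `recall` are all our assumptions.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


class ThreadMemory:
    """Per-thread conversation memory persisted to a JSON file (hypothetical sketch).

    Every turn is recorded as it happens, and any new session on the same
    thread sees the full history, so there is no priming file to maintain.
    """

    def __init__(self, path: Path, thread_id: str) -> None:
        self.path = path
        self.thread_id = thread_id
        # Load whatever earlier sessions stored; start empty on first use.
        self._data: Dict[str, List[dict]] = (
            json.loads(path.read_text()) if path.exists() else {}
        )

    @property
    def history(self) -> List[dict]:
        return self._data.get(self.thread_id, [])

    def record(self, role: str, content: str) -> None:
        # Persist on every write so a new session or a crash loses nothing.
        self._data.setdefault(self.thread_id, []).append({"role": role, "content": content})
        self.path.write_text(json.dumps(self._data, indent=2))

    def recall(self, keyword: str) -> List[str]:
        # Naive case-insensitive substring match; a real memory layer
        # would rank entries by relevance instead.
        return [m["content"] for m in self.history if keyword.lower() in m["content"].lower()]


store = Path(tempfile.mkdtemp()) / "memory.json"
session_one = ThreadMemory(store, "repo-main")
session_one.record("user", "We use pytest, never unittest.")

session_two = ThreadMemory(store, "repo-main")  # fresh session, same thread
print(session_two.recall("pytest"))  # ['We use pytest, never unittest.']
```

The design choice that matters is that `record` writes through immediately, so persistence does not depend on anyone remembering to update a file.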

So for him it was never “adopt a whole new ecosystem.” It was a harness swap: keep your own RAG, memory, and graph, and drop the maintenance tax. The thread went from “not for me” to “let me talk to your CLI lead,” and a demo call got booked. The objection did not get argued away; it got dissolved by a capability he did not know was there. The lesson we took: the pitch was never “we are better.” It was “you are doing by hand what we do by default.” A developer handed us that line for free.

Four pillars. Miss one and it does not ship.

  1. Best in the world. Performance is the bar, not a tagline. We ran benchmarks internally because we expect to be measured.

  2. Easiest to use. One key. The same key that runs your R-CLI also unlocks memory, routing, multi-agent, and parallel tool calls, all behind one integrated surface. No stitching eight services together and praying the glue holds.

  3. Most accessible. Frontier coding quality, your choice of model to get there. Closed, open, or mixed in one workflow. GLM 5.1 and DeepSeek V4 are the proof, not the promise.

  4. People stay by choice. Any model, your own embeddings, modular layers, your data exportable through real endpoints. No lock-in, no theatrics, no fear-mongering. If you stay, it is because the flexibility is unrivaled.

One more thing: the R-CLI is the first surface of Backboard Development Studio. The IDE is close. Same engine, same performance, plus multi-agent sessions, Pi extension integrations, and pre-built coding-theme skills. The CLI is the foundation. We nail the harness with the community first, so the IDE lands on something already proven.

Come argue with us. The best feedback we have gotten so far came from someone telling us we were wrong. He pushed, we answered, he booked a call, and his team switched. So: paste the command, claim your key, redeem DEVTOCLI, and try to break it. Then drop a comment with what held up, what did not, and what your current setup still does better. Memory-first or model-first? We made our bet. Come test it.

Backboard.io is full-stack, model-agnostic AI infrastructure. Backboard Development Studio is our recursive coding environment, stateful by default, built on the unified API.