Five labs, five minds: building a multi-model finance drama on small models
Five labs, five minds: building a multi-model finance drama on small models
五个实验室,五个大脑:在小模型上构建多模型金融剧
A second Build Small Hackathon field report: what happens when each agent in an emergent economy runs on a different lab’s small model, and the player becomes the financier pulling the strings. 这是“Build Small”黑客松的第二份现场报告:当新兴经济体中的每个智能体都运行在不同实验室的小模型上,而玩家成为幕后操盘的金融家时,会发生什么?
The first version of Thousand Token Wood was a weather-god sandbox: five woodland creatures on one fine-tuned 0.5B model traded goods, and you poked the world with shocks and watched bubbles and crashes emerge. It was a nice toy. It was also something you watched rather than played. 《Thousand Token Wood》的第一个版本是一个“天气之神”沙盒:五个森林生物运行在一个微调过的 0.5B 模型上进行商品交易,你可以通过制造冲击来干扰世界,观察泡沫和崩盘的产生。它是一个有趣的玩具,但你更多是在观看,而不是在游玩。
v2 rebuilt it into a game you operate. You are the Patron of the Wood, a shadow financier: you lend at interest, whisper tips that may be true or planted, short the market, bribe, and broker alliances, while a magistrate hunts you for trading on what you should not know. The creatures remember how you treated them and scheme back. And the biggest change is under the hood: every creature now thinks with a different lab’s small model. This is the engineering report. 第二版将其重构为你能够操作的游戏。你是森林的赞助人,一位影子金融家:你发放高利贷,散布真假难辨的内幕消息,做空市场,行贿并促成联盟,同时还要躲避治安官的追捕,因为你掌握了不该知道的交易信息。生物们会记住你对待它们的方式并进行反击。而最大的变化在于底层逻辑:每个生物现在都使用不同实验室的小模型进行思考。以下是工程报告。
Heterogeneity is the product, not a constraint
异构性是产品,而非限制
The obvious way to run a council of agents is one model, many prompts. v2 runs four: gpt-oss-20b (OpenAI), MiniCPM3-4B (OpenBMB), Nemotron-Mini-4B (NVIDIA), and a fine-tuned Qwen 0.5B of my own. The point is not novelty for its own sake. A market is interesting when the participants genuinely differ, and four labs’ models trained on different data with different post-training are about as different as small models get. The owl hoards differently than the fox speculates. The council is a live argument, not a script. 运行智能体委员会最显而易见的方法是“一个模型,多个提示词”。第二版运行了四个模型:gpt-oss-20b (OpenAI)、MiniCPM3-4B (OpenBMB)、Nemotron-Mini-4B (NVIDIA) 以及我自研微调的 Qwen 0.5B。这样做的目的并非为了标新立异。当参与者存在本质差异时,市场才有趣;四个实验室的模型在不同的数据和训练后处理下训练,其差异性已达到小模型的极限。猫头鹰囤积物资的方式与狐狸投机的方式截然不同。这个委员会是一场实时的辩论,而非预设的剧本。
Standing four distinct models up on one platform surfaced the real lesson: the friction is almost entirely at the serving layer, not the modeling layer. Current vLLM (0.22.1) JIT-compiles kernels at load and needs the CUDA toolkit (nvcc) present. A lean base image does not ship it, so all four models failed identically with “could not find nvcc” until I based them on a CUDA devel image. This was not a gpt-oss quirk; it was universal to the vLLM version. One image fix unblocked everything. 将四个不同的模型部署在同一个平台上,揭示了一个真正的教训:摩擦力几乎完全存在于服务层,而非模型层。当前的 vLLM (0.22.1) 在加载时会即时编译内核,并且需要 CUDA 工具包 (nvcc)。精简的基础镜像不包含它,因此所有四个模型都因为“找不到 nvcc”而报错,直到我将它们迁移到 CUDA 开发镜像上。这不是 gpt-oss 的特有问题,而是 vLLM 该版本普遍存在的问题。修复镜像后,一切问题迎刃而解。
gpt-oss-20b runs in its native MXFP4 quantization and fits a 24GB L4 with room to spare; no high-end GPU needed. It also speaks a channel format that wraps the answer in an analysis preamble, so the consumer has to extract the final channel. MiniCPM3 needed trust_remote_code; Nemotron loaded clean. Per-model footguns, each a one-line config.
gpt-oss-20b 以其原生的 MXFP4 量化运行,在 24GB 的 L4 显卡上绰绰有余,无需高端 GPU。它还使用一种通道格式,将答案包裹在分析前言中,因此消费者必须提取最终的通道内容。MiniCPM3 需要 trust_remote_code;Nemotron 则加载顺利。每个模型都有各自的“坑”,但只需一行配置即可解决。
The thing that made four heterogeneous models tractable was the same primitive that made one model tractable in v1: a tolerant JSON parse-and-repair layer that every model’s output flows through. Different tokenizers and formatting habits produce different malformations; the parser drops what it cannot salvage and the simulation never crashes. Build that layer once and adding a model is a config entry, not a refactor. 让四个异构模型变得可控的关键,与第一版中控制单个模型的方法相同:一个容错的 JSON 解析与修复层,所有模型的输出都必须经过它。不同的分词器和格式习惯会产生不同的畸形输出;解析器会丢弃无法修复的部分,从而确保模拟过程永不崩溃。只需构建一次该层,添加新模型就只是配置项的修改,而非重构。
Information asymmetry needs a firewall
信息不对称需要防火墙
The dramatic core of v2 is the insider tip. You can whisper a tip to a creature that is true (a real forecast of the next market mania the deck will draw, your genuine edge) or false (bait). Acting on a true tip and profiting raises your heat; cross a threshold and the magistrate opens an investigation that ends in a fine, frozen assets, or exile. 第二版的核心戏剧冲突在于内幕消息。你可以向某个生物耳语一个消息,它可以是真的(对下一轮市场狂热的真实预测,你的核心优势),也可以是假的(诱饵)。根据真实消息获利会增加你的“热度”;一旦超过阈值,治安官就会展开调查,导致罚款、资产冻结或流放。
For that to be a real game, the truth of a tip must be hidden from the creatures. They see the rumor text; they must never see the flag. This is a security property, not a UI nicety, and small-model agents make it sharp: everything the model could repeat back is whatever you put in its prompt. So the hidden flag lives off-prompt entirely (on the player’s ledger), it is stripped from the public event record at construction, and the only thing the narrator ever summarizes is public events. A single test scans every creature’s full prompt, every turn, for the banned tokens. That test is the most important one in the suite. When you give an agent secret information, assume it will leak unless a test proves it cannot. 为了让这成为真正的游戏,消息的真实性必须对生物们保密。它们能看到传闻文本,但绝不能看到标记。这是一个安全属性,而非 UI 细节。小模型智能体让这一点变得更加严峻:模型能复述的一切,都是你放入提示词中的内容。因此,隐藏标记完全存在于提示词之外(在玩家的账本上),在构建时会从公共事件记录中剔除,叙述者总结的唯一内容就是公共事件。每一轮,一个测试程序都会扫描每个生物的完整提示词,检查是否存在违禁标记。这是整个套件中最关键的测试。当你给智能体秘密信息时,假设它会泄露,除非测试证明它不会。
Memory is cheap drama if you bound it
如果有边界,记忆就是廉价的戏剧
Creatures carry persistent relationships: a signed sentiment toward the Patron and toward each other, nudged by events (you shorted my crop, you repaid your loan, you allied me with a rival). A creature that turns hostile refuses your loans and quotes you worse; allied creatures stop undercutting each other and behave like a cartel. 生物们拥有持久的关系:对赞助人及彼此之间带有正负号的情感值,这些情感会受到事件的影响(你做空了我的庄稼,你偿还了贷款,你让我与竞争对手结盟)。变得敌对的生物会拒绝你的贷款并给出更差的报价;结盟的生物则会停止互相压价,表现得像一个卡特尔组织。
The trap is prompt inflation. Raw history grows without bound and a small model drowns in it. The fix is to never put history in the prompt: the model sees a one-line bucketed summary (“you feel warmly toward Oona, wary of the Patron”), capped to the few strongest feelings, derived from integer sentiment. Notes are kept for traces but bounded and never shown. The behavioral bias is part emergent (the summary nudges the model) and part mechanical (a strongly hostile creature deterministically refuses), so it is observable and testable rather than a hope. 陷阱在于提示词膨胀。原始历史记录会无限增长,小模型会淹没在其中。解决方法是永远不要将历史记录放入提示词:模型只看到一行分桶总结(“你对 Oona 感觉温暖,对赞助人保持警惕”),仅保留最强烈的几种情感,并由整数情感值推导得出。笔记用于追踪但有边界且从不展示。行为偏差部分是涌现的(总结引导模型),部分是机械的(强烈敌对的生物会确定性地拒绝),因此它是可观察和可测试的,而不是一种期望。
What actually happened
实际运行结果
A representative council run, with the full v2 mechanics live: 一次典型的委员会运行,完整启用了第二版机制:
- Models in the council: 4 labs, all under the 32B cap, served on Modal
- 委员会模型: 4 个实验室,均在 32B 参数以下,部署在 Modal 上
- Fine-tuned 0.5B reliability: 0% self-buys, 100% valid offers (beats its 3B teacher)
- 微调 0.5B 可靠性: 0% 自买,100% 有效报价(击败了它的 3B 老师)
- Truth firewall: 0 leaks of a tip’s hidden flag across every prompt scanned
- 真实性防火墙: 在所有扫描的提示词中,消息隐藏标记泄露次数为 0
- Insider tip edge: a true-tip pre-position settles a positive P&L; a false tip does not
- 内幕消息优势: 基于真实消息的提前布局能带来正向盈亏;虚假消息则不能
- Heat to investigation: two clean suspicious wins cross the magistrate’s line
- 热度至调查: 两次明显的“可疑获利”触及了治安官的底线
- Ruin: a margin call and a loan default banish a creature, who returns a chapter later
- 破产: 一次追加保证金通知和贷款违约导致生物被流放,但它在下一章又回来了
A single seeded run exercising the Patron, the information war, relationships, and leverage end to end. 一次单种子运行,完整演练了赞助人、信息战、关系网和杠杆机制。
Takeaways for building with small models
小模型构建心得
A small model is a reliable format generator and an unreliable reasoner; you close the gap with structure, prompting, and a small fine-tune, not with scale. A heterogeneous council is more interesting than a homogeneous one and costs you only config once the serving layer is built. 小模型是可靠的格式生成器,但不是可靠的推理者;你需要通过结构、提示词工程和微调来弥补差距,而不是通过增加规模。异构委员会比同构委员会更有趣,一旦服务层构建完成,其成本仅在于配置。