Introducing North Mini Code: Cohere’s First Model For Developers
Today, we are releasing North Mini Code, a 30B-parameter Mixture-of-Experts model with 3B active parameters and strong agentic coding capabilities, available on Hugging Face under the Apache 2.0 license. North Mini Code is the first model in Cohere’s new family of models and is designed and trained specifically for agentic software engineering tasks.
North Mini Code is optimized for complex software engineering workflows, terminal-based agentic tasks, and high-quality code generation. On Artificial Analysis’ Coding Index, North Mini Code achieves a score of 33.4, outperforming Qwen3.5 (35B-A3B), Gemma 4 (26B-A4B), Devstral Small 2 (24B Dense), and even substantially larger models such as Nemotron 3 Super (120B-A12B), Mistral Small 4 (119B-A6B), and Devstral 2 (123B). It ranks among the strongest open-source coding models in its size class.
Real-world code agents depend on model quality and robustness across agent harnesses. We trained North Mini Code using multiple scaffolds rather than optimizing for a single one. This approach enables North Mini Code to serve as a reliable foundation for coding agents such as OpenCode.
Architecture
North Mini Code is a decoder-only, Transformer-based sparse Mixture-of-Experts model. Its attention stack interleaves sliding-window attention layers (with RoPE) and global attention layers (with no positional embeddings) in a 3:1 ratio, using our efficient attention implementation. The feed-forward block is an MoE block with 128 experts, of which 8 are activated per token. Each expert is an FFN with SwiGLU activation. The router applies a sigmoid activation to the logits before top-k selection. We also use a single dense layer before the sparse layers.
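To make the layer pattern and the routing step concrete, here is a minimal pure-Python sketch. It is not Cohere's implementation. The function names, the placement of the global layer at the end of each group of four, and the renormalisation of the selected gate weights are all assumptions made for illustration; only the 3:1 ratio, the 128 experts, the 8 active experts, and the sigmoid-before-top-k ordering come from the description above.

```python
import math
import random


def attention_schedule(num_layers, local_per_global=3):
    """Return per-layer attention types with a 3:1 local-to-global interleave.

    Assumption: each group of four layers is three sliding-window layers
    followed by one global layer; the post does not state the ordering.
    """
    period = local_per_global + 1
    return [
        "global_nope" if (i + 1) % period == 0 else "sliding_window_rope"
        for i in range(num_layers)
    ]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def route(logits, k=8):
    """Sigmoid-then-top-k routing for a single token.

    Returns (expert_index, gate_weight) pairs, highest score first.
    Assumption: the selected gates are renormalised to sum to 1.
    """
    scores = [sigmoid(z) for z in logits]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    return [(i, scores[i] / total) for i in top]


# Usage: one token's router logits over 128 experts (random stand-in values).
random.seed(0)
token_logits = [random.gauss(0.0, 1.0) for _ in range(128)]
selected = route(token_logits, k=8)
layers = attention_schedule(8)
```

Because the sigmoid scores each expert independently, an expert's score does not depend on the other logits, unlike a softmax over all 128 experts. Whether and how the selected scores are rescaled afterwards is a design choice the post leaves open.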
Post-Training for Coding Excellence
We post-train North Mini Code with a two-stage cascaded supervised fine-tuning (SFT) process followed by reinforcement learning with verifiable rewards (RLVR), focusing on agentic coding. Our first-stage SFT data focuses on coding capabilities, integrated within a wider mix for robustness and usability. The data mix includes programming, reasoning, and instruction following across a large variety of domains. Code datasets account for 70% of trainable tokens: 43% agentic tool-use data and 27% single-turn competitive or scientific programming data.
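The first-stage percentages can be checked with a few lines of arithmetic. The sketch below assumes that the 43% and 27% figures are both shares of all trainable tokens, so that together they make up the 70% code share; the variable and function names are illustrative only.

```python
# Stage-one SFT mix, as percentages of trainable tokens (figures from the post).
CODE_SHARE = 70
code_breakdown = {
    "agentic_tool_use": 43,
    "single_turn_competitive_or_scientific": 27,
}


def non_code_share(code_share):
    """Percentage left for non-code data (reasoning, instruction following, ...)."""
    return 100 - code_share


def share_within_code(breakdown):
    """Express each code category as a fraction of the code portion alone."""
    total = sum(breakdown.values())
    return {name: pct / total for name, pct in breakdown.items()}


# The two code categories should add up to the stated code share.
assert sum(code_breakdown.values()) == CODE_SHARE

remaining = non_code_share(CODE_SHARE)
within_code = share_within_code(code_breakdown)
```

Under that reading, agentic tool-use data is roughly 61% of the code portion and single-turn programming data roughly 39%, with the remaining 30% of trainable tokens going to non-code data.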
In the second-stage SFT, we use a 4.5-billion-token data mixture drawn only from agentic and reasoning-driven samples, in which code data forms 61% of trainable tokens. This mixture comprises our highest-quality data across coding and wider agentic tasks, where tool calls and completions are verified as executable and correct.
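As a rough sense of scale, the second-stage figures translate into token counts as follows. This is a back-of-the-envelope sketch that assumes the 4.5 billion figure counts trainable tokens, which the post does not state explicitly; the names are illustrative.

```python
STAGE2_TOKENS = 4_500_000_000  # size of the second-stage mixture, from the post
CODE_PERCENT = 61              # code share of trainable tokens, from the post


def split_tokens(total_tokens, code_percent):
    """Split a token budget into (code, non-code) using integer arithmetic."""
    code = total_tokens * code_percent // 100
    return code, total_tokens - code


code_tokens, other_tokens = split_tokens(STAGE2_TOKENS, CODE_PERCENT)
```

Under that assumption, about 2.745 billion tokens are code and about 1.755 billion come from the wider agentic and reasoning-driven data.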
Our internal data pipeline relies heavily on containerised agentic coding environments. We maintain a disjoint subset of these environments for use in synthetic SFT data generation and RLVR. The majority are based on software engineering tasks from real-world repositories, while the rest are terminal-based agentic tasks sourced from open-source and internal datasets. In total, we used over 70k verifiable tasks across ~5k unique repositories.
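A "verifiable task" can be thought of as one whose outcome is decided by running a check rather than by a learned judge. The sketch below illustrates that idea with a binary reward derived from a command's exit code. The function name, the pass/fail reward shape, and the timeout are assumptions for illustration; the post does not describe how its rewards are computed, and a real pipeline would run the command inside the task's container rather than on the host.

```python
import subprocess
import sys


def verifiable_reward(test_command, timeout_s=60):
    """Return 1.0 if the verification command exits with status 0, else 0.0.

    Assumption: a binary pass/fail reward. A timeout counts as a failure.
    """
    try:
        result = subprocess.run(
            test_command, capture_output=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return 0.0
    return 1.0 if result.returncode == 0 else 0.0


# Usage: stand-in "test suites" that pass and fail, run with the current interpreter.
passing = verifiable_reward([sys.executable, "-c", "assert 1 + 1 == 2"])
failing = verifiable_reward([sys.executable, "-c", "assert 1 + 1 == 3"])
```

The same check could in principle serve both uses mentioned above: filtering synthetic SFT trajectories and scoring RLVR rollouts.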
Robustness Across Harnesses
Harness robustness improves model usability in realistic software development settings, where agents encounter diverse and unpredictable tooling environments. These environments differ not just in prompting but in fundamental tool-use modality.
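To illustrate what a difference in tool-use modality can look like, the sketch below normalises two hypothetical styles of tool call into one representation: a structured JSON function call and a raw terminal command. The `ToolCall` type, the JSON schema, and the `bash` tool name are invented for this example and do not describe any particular harness.

```python
import json
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """Harness-independent view of a single tool invocation (hypothetical)."""
    name: str
    arguments: dict


def from_json_call(payload: str) -> ToolCall:
    """Parse a structured call of the form {"name": ..., "arguments": {...}}."""
    data = json.loads(payload)
    return ToolCall(name=data["name"], arguments=dict(data["arguments"]))


def from_shell_line(line: str) -> ToolCall:
    """Treat a raw terminal command as a call to a hypothetical 'bash' tool."""
    argv = shlex.split(line)
    if not argv:
        raise ValueError("empty command")
    return ToolCall(name="bash", arguments={"argv": argv})


# Usage: the same intent (read a file) expressed in two modalities.
structured = from_json_call(
    '{"name": "read_file", "arguments": {"path": "src/main.py"}}'
)
terminal = from_shell_line("cat 'src/main.py'")
```

A model trained against only one of these styles has to get both the intent and the surface format right when moved to the other, which is the kind of gap that training across multiple scaffolds aims to close.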