Introducing Mellum2: A 12B Mixture-of-Experts Model by JetBrains

Mellum2 is a 12B-parameter Mixture-of-Experts (MoE) model trained from scratch on natural language and code. The model activates only 2.5B parameters per token, which makes it efficient for high-throughput, low-latency inference. Mellum2 can be used for routing, retrieval-augmented generation (RAG), summarization, sub-agents, high-throughput coding features, and private deployments. It is released under the Apache 2.0 license. Compared with similarly sized models, Mellum2 delivers competitive benchmark performance while achieving more than 2x faster inference.

Download the model on Hugging Face: https://huggingface.co/collections/JetBrains/mellum-2

For architecture details, training setup, benchmarks, and evaluation methodology, read the full technical report: https://arxiv.org/pdf/2605.31268

Today we’re releasing Mellum2, an open Mixture-of-Experts model optimized for low-latency text-and-code workloads. Mellum originally started as a code completion model. With Mellum2, we extend that foundation to a broader set of natural language and software engineering tasks while keeping the model focused on efficient inference and deployability.

Modern AI systems increasingly rely on multiple model calls: routing, retrieval, summarization, planning, validation, and tool use. Many of these operations are latency-sensitive and do not require the largest available model. Mellum2 targets these workloads.

Benchmark highlights

In our technical report, we evaluate Mellum2 across code generation, reasoning, science, and math benchmarks. Mellum2 is competitive with similarly sized open models while delivering more than 2x faster inference, making it suitable for high-throughput production workloads.

Model architecture

The MoE architecture keeps total model capacity high while activating only a subset of parameters for each token. This makes inference more efficient and helps reduce serving cost for real-time workloads. Mellum2 is intentionally focused on text and code rather than multimodal tasks. This specialization keeps the model compact and efficient for software engineering workloads.
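To make the idea of activating only a subset of parameters concrete, here is a minimal sketch of top-k expert routing in plain Python. It is a toy illustration of the general MoE technique, not Mellum2's actual router: the expert count, `top_k`, the scalar "experts", and the function names `route_token` and `moe_forward` are all assumptions made for this example. The only numbers taken from this post are the 12B total and 2.5B active parameter counts.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k):
    """Pick the top_k experts for one token.

    Returns (expert_index, weight) pairs with weights renormalized to sum to 1.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_forward(x, experts, router_logits, top_k):
    """Weighted sum of the selected experts' outputs; other experts never run."""
    out = 0.0
    for idx, weight in route_token(router_logits, top_k):
        out += weight * experts[idx](x)
    return out

# Toy setup (hypothetical): 8 scalar "experts", 2 active per token.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]

selected = route_token(logits, top_k=2)
y = moe_forward(1.0, experts, logits, top_k=2)

# Share of parameters active per token, using the figures quoted in this post.
active_fraction = 2.5e9 / 12e9
print(selected, y, f"{active_fraction:.1%}")
```

In this toy run only two of the eight experts are evaluated for the token, which is where the compute saving comes from. The roughly 21% active share computed at the end should not be read as an expert ratio, since active parameters in a real model also include components every token passes through, such as attention layers.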

Key use cases

Routing and orchestration: Mellum2 works well as a lightweight routing and orchestration model in multi-model systems, including prompt classification, tool selection, and intermediate control-flow steps.

RAG pipelines: The model is well suited for latency-sensitive retrieval pipelines, including context compression, summarization, and retrieval post-processing.

Sub-agents: Mellum2 can be used for agent subtasks such as planning, validation, transformation, and context preparation, reducing the need to invoke larger models for intermediate operations.

Private deployment: Because Mellum2 is open and efficient to serve, it can be deployed in self-hosted environments involving proprietary code or internal data.
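As an illustration of the routing use case above, the following sketch shows one way a small model could sit in front of larger components as a prompt classifier. Everything here is hypothetical: the route labels, the prompt wording, the `route` function, and the handlers are invented for the example, and `fake_complete` is a keyword stub standing in for a real model call so that the code runs without any model or network access. In practice, `complete` would wrap whatever inference client you use to serve Mellum2.

```python
from typing import Callable, Dict

ROUTES = ("code", "search", "chat")  # hypothetical route labels

def build_routing_prompt(user_message: str) -> str:
    """Build a classification prompt that asks for a single label."""
    labels = ", ".join(ROUTES)
    return (
        f"Classify the request into one of: {labels}.\n"
        f"Request: {user_message}\n"
        "Answer with the label only."
    )

def route(
    user_message: str,
    complete: Callable[[str], str],
    handlers: Dict[str, Callable[[str], str]],
    fallback: str = "chat",
) -> str:
    """Ask the router model for a label, then dispatch to the matching handler.

    Unknown or malformed labels fall back to a default route instead of failing.
    """
    label = complete(build_routing_prompt(user_message)).strip().lower()
    if label not in handlers:
        label = fallback
    return handlers[label](user_message)

def fake_complete(prompt: str) -> str:
    """Stand-in for a model call (assumption): a trivial keyword heuristic."""
    request = prompt.split("Request: ", 1)[1].splitlines()[0].lower()
    if "function" in request or "bug" in request:
        return "code"
    if "find" in request:
        return "search"
    return "unsure"  # deliberately invalid, to exercise the fallback

handlers = {
    "code": lambda m: f"[code model] {m}",
    "search": lambda m: f"[retriever] {m}",
    "chat": lambda m: f"[chat model] {m}",
}

print(route("Fix the bug in parse()", fake_complete, handlers))
print(route("Find docs on MoE", fake_complete, handlers))
print(route("Hello", fake_complete, handlers))
```

Keeping the model behind a plain `Callable[[str], str]` means the router logic can be tested with a stub and later pointed at a self-hosted endpoint without changes. Validating the returned label before dispatching matters because a generative model may return text outside the expected label set.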

Why well-scoped models matter

As AI systems mature, the most effective architectures are becoming less monolithic. A single frontier model can be powerful, but production systems often need several specialized components working together: retrievers, routers, code-aware models, validators, tool callers, and larger reasoning models.

We think of Mellum2 as a “focal” model: a fast, well-scoped model optimized for high-frequency tasks inside larger AI systems. The goal is not to replace every model in the stack. The goal is to make the stack faster, cheaper, and easier to control.

Getting started with Mellum2

If you are building AI systems for software engineering – inside an IDE, in a RAG pipeline, as part of an agent workflow, or on private infrastructure – Mellum2 is ready to try.