Granite Embedding Multilingual R2: Open Apache 2.0 Multilingual Embeddings with 32K Context — Best Sub-100M Retrieval Quality
TL;DR: Two new Apache 2.0 multilingual embedding models built on ModernBERT — a 97M-parameter compact model that beats every open sub-100M multilingual embedder on MTEB Multilingual Retrieval (60.3), and a 311M full-size model that scores 65.2 on MTEB Multilingual Retrieval (#2 among open models under 500M parameters) with Matryoshka support. Both cover 200+ languages, are tuned on 52 languages, handle 32K-token context (64x R1), and add code retrieval across 9 programming languages.
Multilingual embedding models face a persistent tension: broad language coverage usually comes at the cost of model size, and small models usually sacrifice languages. If you work across languages — retrieval-augmented generation over multilingual corpora, cross-lingual search, code retrieval in international teams — you’ve likely had to choose between a model that’s fast enough and one that’s good enough. The Granite Embedding Multilingual R2 release narrows that gap considerably.
We’re releasing two new multilingual embedding models:
- granite-embedding-311m-multilingual-r2 — A 311M-parameter full-size model with 768-dimensional embeddings, Matryoshka dimension support (see the sketch after this list), and top-tier multilingual retrieval quality.
- granite-embedding-97m-multilingual-r2 — A 97M-parameter compact model with 384-dimensional embeddings that delivers strong retrieval quality for its size.
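As a quick illustration of the Matryoshka support listed above, here is a minimal sketch using sentence-transformers. The Hugging Face repo id is an assumption based on the model names in this post; `truncate_dim` keeps only a leading slice of each 768-dimensional vector.

```python
from sentence_transformers import SentenceTransformer

# Assumed repo id, following the model naming used in this post.
MODEL_ID = "ibm-granite/granite-embedding-311m-multilingual-r2"

full = SentenceTransformer(MODEL_ID)                         # 768-dim embeddings
truncated = SentenceTransformer(MODEL_ID, truncate_dim=256)  # Matryoshka slice

text = "Ein kurzer Satz auf Deutsch."
print(full.encode(text).shape)       # (768,)
print(truncated.encode(text).shape)  # (256,)
```

Truncated vectors trade a little retrieval quality for smaller indexes and faster similarity search.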
Both models support 200+ languages with enhanced retrieval quality for 52 languages and programming code, handle context lengths up to 32,768 tokens (a 64x increase over their R1 predecessors), and are released under the Apache 2.0 license. They work out of the box with sentence-transformers and transformers, require no task-specific instructions, and are compatible as drop-in replacements in LangChain, LlamaIndex, Haystack, and Milvus with a one-line model name change.
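A minimal retrieval sketch with sentence-transformers follows; the repo id is again an assumption. Queries and documents are passed as plain text, with no instruction prefixes:

```python
from sentence_transformers import SentenceTransformer

# Assumed repo id for the compact model.
model = SentenceTransformer("ibm-granite/granite-embedding-97m-multilingual-r2")

query = ["Which planet is known as the Red Planet?"]
docs = [
    "Mars is often called the Red Planet because of its iron-oxide surface.",
    "Venus is the second planet from the Sun.",
]

q_emb = model.encode(query)  # shape (1, 384) for the 97M model
d_emb = model.encode(docs)

# Cosine similarity between the query and each document.
print(model.similarity(q_emb, d_emb))
```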
For frameworks currently using an English-only default, that one line gives every user in your community support for 200+ languages — no API changes, no new dependencies, no code changes required on their end. Both models ship with ONNX and OpenVINO weights for CPU-optimized inference.
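In LangChain, for example, the swap really is a single line, assuming the langchain-huggingface integration and the repo id above:

```python
from langchain_huggingface import HuggingFaceEmbeddings

# Before: HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
embeddings = HuggingFaceEmbeddings(
    model_name="ibm-granite/granite-embedding-97m-multilingual-r2"  # assumed repo id
)
vector = embeddings.embed_query("una consulta en español")
```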
The underlying encoder was pretrained on text from 200+ languages, producing general-purpose embeddings for any of them. The following 52 languages receive explicit retrieval-pair and cross-lingual training for higher-quality retrieval: Albanian (sq), Arabic (ar), Azerbaijani (az), Bengali (bn), Bulgarian (bg), Catalan (ca), Chinese (zh), Croatian (hr), Czech (cs), Danish (da), Dutch (nl), English (en), Estonian (et), Finnish (fi), French (fr), Georgian (ka), German (de), Greek (el), Hebrew (he), Hindi (hi), Hungarian (hu), Icelandic (is), Indonesian (id), Italian (it), Japanese (ja), Kazakh (kk), Khmer (km), Korean (ko), Latvian (lv), Lithuanian (lt), Malay (ms), Marathi (mr), Norwegian (no), Persian (fa), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Serbian (sr), Slovak (sk), Slovenian (sl), Spanish (es), Swahili (sw), Swedish (sv), Tagalog (tl), Telugu (te), Thai (th), Turkish (tr), Ukrainian (uk), Urdu (ur), Uzbek (uz), Vietnamese (vi). Additionally, the models are trained on programming code (Python, Go, Java, JavaScript, PHP, Ruby, SQL, C, C++) and support cross-lingual code retrieval.
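To sketch what that cross-lingual behavior looks like in practice (same assumed repo id), a single index can mix tuned languages and code, and a query in any supported language is scored against all of it:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ibm-granite/granite-embedding-97m-multilingual-r2")  # assumed repo id

corpus = [
    "def binary_search(arr, target): ...",                # Python code
    "La recherche binaire divise l'intervalle en deux.",  # French
    "二分探索は探索範囲を半分ずつ絞り込む。",             # Japanese
]
query = "How does binary search work?"  # English query over a mixed corpus

scores = model.similarity(model.encode([query]), model.encode(corpus))
print(scores)  # one row of similarity scores, one column per corpus entry
```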
Enterprise-Ready by Design
Both embedding models are trained on a mixture of IBM‑curated datasets, publicly available data, and internally generated or synthetic data. Public web‑derived data used in training is selected and filtered using IBM‑developed quality, deduplication, and governance processes intended to reduce risk in downstream commercial use. We intentionally avoid the use of the MS‑MARCO training dataset and datasets with explicit non‑commercial licensing restrictions.
The models are pretrained using GneissWeb, an IBM-curated dataset derived from publicly available web content and processed using IBM's data preparation and governance tooling, along with additional IBM-curated and other publicly available sources. Datasets undergo IBM governance review to assess licensing considerations, ownership signals, and personal data risks. These processes are designed to support responsible use and enterprise deployment.
A Strong Sub-100M Multilingual Model
The standout of this release is granite-embedding-97m-multilingual-r2. At 97 million parameters, it scores 60.3 on MTEB Multilingual Retrieval across 18 languages — the highest retrieval score we’ve found for any open multilingual embedding model under 100M parameters. The next-best model in that size class, multilingual-e5-small, scores 50.9 — a +9.4 point gap on a mature benchmark.
At roughly one-third the size of the 311M full-size model, the 97M model retains most of the larger model's retrieval quality across multilingual, code, and long-document benchmarks. It also posts a +12.2 point gain on MTEB Multilingual Retrieval over its direct R1 predecessor, driven by a new architecture, better training data, and a novel pruning methodology (more on that below). The full-size granite-embedding-311m-multilingual-r2 scores 65.2 on the same benchmark, a +13.0 point gain over its R1 predecessor.
What Changed from R1
The Granite Embedding Multilingual R1 models were built on XLM-RoBERTa encoders with a 512-token context window. The R2 models replace that backbone with a ModernBERT-based encoder, extending the context window 64x to 32,768 tokens.
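Under the same repo-id assumption, the context change is directly visible on the loaded model, and documents far beyond R1's window can be embedded in one pass:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ibm-granite/granite-embedding-311m-multilingual-r2")  # assumed repo id
print(model.max_seq_length)  # expected: 32768 for R2; R1 topped out at 512

# A synthetic document far longer than 512 tokens, embedded without chunking.
long_doc = " ".join(["token"] * 10_000)
print(model.encode(long_doc).shape)  # (768,)
```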