Granite 4.1 LLMs: How They’re Built

An in-depth technical walkthrough of data engineering, pre-training, supervised fine-tuning, and reinforcement learning behind the Granite 4.1 LLMs.

TL;DR — Granite 4.1 is a family of dense, decoder‑only LLMs (3B, 8B, and 30B) trained on ~15T tokens using a multi‑stage pre‑training pipeline, including long‑context extension of up to 512K tokens. The models are further refined with supervised fine‑tuning on ~4.1M high‑quality curated samples and reinforcement learning via on‑policy GRPO with DAPO loss (Yu et al., 2025). Notably, the 8B instruct model matches or surpasses the previous Granite 4.0‑H‑Small (32B‑A9B MoE) despite using a simpler dense architecture with fewer parameters. All Granite 4.1 models are released under the Apache 2.0 license.

Building high‑quality small language models goes beyond simply scaling compute—it requires rigorous data curation throughout training. For Granite 4.1, we prioritized data quality over quantity, progressively refining the data mixture across five pre‑training stages. We further curated supervised fine‑tuning data using an LLM‑as‑Judge framework and applied a multi‑stage reinforcement learning pipeline to systematically strengthen performance in math, coding, instruction following, and general chat.
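To make the curation step concrete, here is a minimal sketch of how an LLM-as-Judge filter can work. The prompt wording, the 1-to-10 scale, the score threshold, and the `judge` callable are illustrative assumptions, not the actual Granite 4.1 pipeline.

```python
# Minimal, illustrative sketch of LLM-as-Judge filtering for SFT data.
# The template, scale, threshold, and judge interface are assumptions.
from dataclasses import dataclass


@dataclass
class SFTSample:
    prompt: str
    response: str


# Judge prompt template; wording is illustrative only.
JUDGE_TEMPLATE = (
    "Rate the following response for correctness, helpfulness, and clarity "
    "on a scale of 1 to 10. Reply with only the number.\n\n"
    "Prompt: {prompt}\n\nResponse: {response}"
)


def judge_score(sample: SFTSample, judge) -> float:
    """Score one sample. `judge` is any callable mapping a prompt string
    to a completion string (hypothetical interface)."""
    reply = judge(JUDGE_TEMPLATE.format(prompt=sample.prompt, response=sample.response))
    try:
        return float(reply.strip())
    except ValueError:
        return 0.0  # treat unparsable judgments as rejects


def curate(samples: list[SFTSample], judge, threshold: float = 7.0) -> list[SFTSample]:
    """Keep only the samples the judge rates at or above the threshold."""
    return [s for s in samples if judge_score(s, judge) >= threshold]
```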

Model Architecture

Granite 4.1 models use a decoder-only dense transformer architecture. The core design choices include Grouped Query Attention (GQA), Rotary Position Embeddings (RoPE), SwiGLU activations, RMSNorm, and shared input/output embeddings.

| Component | 3B Dense | 8B Dense | 30B Dense |
|---|---|---|---|
| Embedding size | 2560 | 4096 | 4096 |
| Number of layers | 40 | 40 | 64 |
| Attention head size | 64 | 128 | 128 |
| Number of attention heads | 40 | 32 | 32 |
| Number of KV heads | 8 | 8 | 8 |
| MLP hidden size | 8192 | 12800 | 32768 |
| MLP activation | SwiGLU | SwiGLU | SwiGLU |
| Position embedding | RoPE | RoPE | RoPE |

All three model sizes share the same training pipeline and data strategy, differing only in architecture dimensions.
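To make the table concrete, here is a minimal configuration sketch. The class and field names are illustrative; only the numeric values come from the table above.

```python
# Architecture hyperparameters from the table above, expressed as a config.
# Class, field, and variable names are illustrative.
from dataclasses import dataclass


@dataclass(frozen=True)
class GraniteDenseConfig:
    hidden_size: int           # embedding size
    num_layers: int
    head_dim: int              # attention head size
    num_attention_heads: int
    num_kv_heads: int          # grouped-query attention: fewer KV heads than query heads
    mlp_hidden_size: int       # SwiGLU intermediate size


GRANITE_3B = GraniteDenseConfig(2560, 40, 64, 40, 8, 8192)
GRANITE_8B = GraniteDenseConfig(4096, 40, 128, 32, 8, 12800)
GRANITE_30B = GraniteDenseConfig(4096, 64, 128, 32, 8, 32768)

# Sanity check: head size x number of attention heads equals the embedding size.
for cfg in (GRANITE_3B, GRANITE_8B, GRANITE_30B):
    assert cfg.head_dim * cfg.num_attention_heads == cfg.hidden_size
```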

Pre-Training

Granite 4.1 is trained from scratch on approximately 15 trillion tokens using a five‑phase training strategy. Phases 1–2 focus on foundational pre‑training, phases 3–4 perform mid‑training with progressively higher‑quality data annealing, and phase 5 introduces long‑context training, extending the context window to 512K tokens. Each phase employs a distinct data mixture and learning‑rate schedule, gradually shifting from broad web‑scale data to more curated, domain‑specific content.
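The following sketch only illustrates the structure of such a schedule. Every context length and mixture weight below is a placeholder; the actual per-phase token budgets, learning rates, and mixtures are not given here.

```python
# Structural sketch of a five-phase pre-training schedule like the one
# described above. All numbers are placeholders, not disclosed values.
from dataclasses import dataclass


@dataclass
class Phase:
    name: str
    purpose: str
    context_length: int        # placeholder, except the 512K final phase
    mixture: dict[str, float]  # data source -> sampling weight


SCHEDULE = [
    Phase("phase-1", "foundational pre-training", 8_192, {"web": 0.8, "code": 0.2}),
    Phase("phase-2", "foundational pre-training", 8_192, {"web": 0.7, "code": 0.3}),
    Phase("phase-3", "mid-training, higher-quality annealing", 8_192, {"curated": 0.6, "web": 0.4}),
    Phase("phase-4", "mid-training, higher-quality annealing", 8_192, {"curated": 0.8, "web": 0.2}),
    Phase("phase-5", "long-context extension", 524_288, {"long_docs": 0.5, "curated": 0.5}),
]

# Each phase's mixture should be a valid sampling distribution.
for phase in SCHEDULE:
    assert abs(sum(phase.mixture.values()) - 1.0) < 1e-9
```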

(Note: The article continues with detailed breakdowns of the five pre-training phases.)