Open Reproduction of DeepSeek-R1

Open R1: a fully open reproduction of DeepSeek-R1. This repo is a work in progress, so let’s build it together!

Table of Contents

Overview, Plan of attack, Installation, Training models, SFT, GRPO, Evaluating models, Reproducing DeepSeek’s evaluation results, Data generation, Generate data from a smol distilled R1 model, Generate data from DeepSeek-R1, Contributing.

Overview

The goal of this repo is to build the missing pieces of the R1 pipeline so that everybody can reproduce it and build on top of it. The project is simple by design and mostly consists of:

src/open_r1: contains the scripts to train models as well as generate synthetic data:
- grpo.py: trains a model with GRPO on a given dataset.
- sft.py: performs a simple SFT of a model on a dataset.
- generate.py: generates synthetic data from a model using Distilabel.
Makefile: contains easy-to-run commands for each step in the R1 pipeline leveraging the scripts above.

Plan of attack

We will use the DeepSeek-R1 tech report as a guide; the work can roughly be broken down into three main steps:

Step 1: replicate the R1-Distill models by distilling a high-quality corpus from DeepSeek-R1.
Step 2: replicate the pure RL pipeline that DeepSeek used to create R1-Zero. This will likely involve curating new, large-scale datasets for math, reasoning, and code.
Step 3: show we can go from base model to RL-tuned via multi-stage training.

News 🗞️

[2025/05/26] (Step 1 completed!): We release Mixture-of-Thoughts—a curated reasoning dataset of 350k verified traces distilled from R1. The dataset spans tasks in mathematics, coding, and science, and is designed to teach language models to reason step-by-step. We also provide a recipe to train OpenR1-Distill-7B, which replicates the reasoning capabilities of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B and marks the completion of step 1 in the Open R1 project.
⚡️ [2025/03/11] (update #3): We release the CodeForces-CoTs dataset of 10k competitive programming problems and 100k solutions distilled from R1. We also release IOI24: a new benchmark of very hard problems from international olympiads. A 7B Qwen model trained on CodeForces-CoTs can outperform Claude 3.7 Sonnet on IOI24, while a 32B model can outperform R1 itself.
∞ [2025/02/10] (update #2): We release the OpenR1-Math-220k dataset of 220k traces distilled from R1 on a new version of NuminaMath. Models trained on this dataset match the performance of DeepSeek’s distilled ones.
🔥 [2025/02/02] (update #1): We implement the first parts of the training, inference, and evaluation pipelines. Let’s go!

Installation

Caution: The libraries rely on CUDA 12.4. If you see errors related to segmentation faults, double-check the version your system is running with nvcc --version.
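
The version check can also be scripted. The following is a minimal Python sketch, assuming that nvcc --version prints a line containing "release 12.4" (the format used by recent CUDA toolkits). The helper names parse_nvcc_release and installed_cuda_version are hypothetical and are not part of this repo.

```python
import re
import shutil
import subprocess

REQUIRED_CUDA = (12, 4)  # version named in the caution above


def parse_nvcc_release(output):
    """Extract (major, minor) from `nvcc --version` output, or None if not found."""
    match = re.search(r"release (\d+)\.(\d+)", output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def installed_cuda_version():
    """Run `nvcc --version` if nvcc is on PATH; return (major, minor) or None."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=False
    )
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    found = installed_cuda_version()
    if found is None:
        print("nvcc not found or its output was not recognized")
    elif found != REQUIRED_CUDA:
        print(f"CUDA {found[0]}.{found[1]} detected, expected 12.4")
    else:
        print("CUDA 12.4 detected")
```

The script only reports what it finds; it does not change the environment.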

To run the code in this project, first create a Python virtual environment, for example with uv. To install uv, follow the uv Installation Guide.

Note: As a shortcut, run make install to set up the development libraries. Afterwards, if everything is set up correctly, you can try out the Open-R1 models.

uv venv openr1 --python 3.11 && source openr1/bin/activate && uv pip install --upgrade pip

Tip: For Hugging Face cluster users, add export UV_LINK_MODE=copy to your .bashrc to suppress cache warnings from uv.

Next, install vLLM and FlashAttention:

uv pip install vllm==0.8.5.post1
uv pip install setuptools && uv pip install flash-attn --no-build-isolation

This also installs PyTorch v2.6.0. It is very important to use this exact version, since the vLLM binaries are compiled for it.
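
To confirm which PyTorch version ended up in the environment, a small check like the one below may help. It is a sketch that reads the installed package metadata with the standard library instead of importing torch; the helper names base_version and torch_matches are hypothetical. It assumes that a local build tag such as +cu124 can be ignored when comparing versions.

```python
from importlib import metadata

EXPECTED_TORCH = "2.6.0"  # version stated above


def base_version(version):
    """Drop a local build tag such as '+cu124' from a version string."""
    return version.split("+", 1)[0]


def torch_matches(expected=EXPECTED_TORCH):
    """Return True/False for a version match, or None if torch is not installed."""
    try:
        installed = metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None
    return base_version(installed) == expected


if __name__ == "__main__":
    status = torch_matches()
    if status is None:
        print("torch is not installed in this environment")
    elif status:
        print(f"torch {EXPECTED_TORCH} is installed")
    else:
        print(f"torch {metadata.version('torch')} is installed, expected {EXPECTED_TORCH}")
```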

Training models

Note: The training commands below are configured for a node of 8 x H100s (80GB). For different hardware and topologies, you may need to tune the batch size and number of gradient accumulation steps.
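
As a rough guide to that tuning, the global batch size is the per-device batch size multiplied by the number of GPUs and by the number of gradient accumulation steps. The Python sketch below shows how one might keep the global batch size constant when moving to a smaller node. The function names and the example numbers are illustrative assumptions, not values taken from this project's recipes.

```python
def effective_batch_size(per_device, num_devices, grad_accum_steps=1):
    """Global batch size seen by the optimizer per update step."""
    return per_device * num_devices * grad_accum_steps


def grad_accum_steps_for(target_global, per_device, num_devices):
    """Gradient accumulation steps needed to reach a target global batch size."""
    per_step = per_device * num_devices
    if target_global % per_step != 0:
        raise ValueError(
            f"target {target_global} is not a multiple of {per_step} "
            "(per-device batch size x number of devices)"
        )
    return target_global // per_step


if __name__ == "__main__":
    # Illustrative reference setup: 8 GPUs, per-device batch 2, no accumulation.
    reference = effective_batch_size(per_device=2, num_devices=8)
    # Hypothetical smaller node: 4 GPUs that only fit a per-device batch of 1.
    steps = grad_accum_steps_for(reference, per_device=1, num_devices=4)
    print(f"reference global batch size: {reference}")
    print(f"gradient accumulation steps on 4 GPUs: {steps}")
```

Keeping the global batch size fixed in this way means the learning rate from the reference setup is more likely to carry over, though memory limits may still force other changes.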

We support training models with either DDP or DeepSpeed (ZeRO-2 and ZeRO-3). For example, to perform SFT on a dataset distilled from DeepSeek-R1 with reasoning traces, such as open-r1/Mixture-of-Thoughts, run:

# Train via command line
accelerate launch --config_file=recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --model_name_or_path open-r1/Qwen2.5-Math-7B-RoPE-300k \
    --dataset_name open-r1/Mixture-of-Thoughts \
    --dataset_config all \
    --eos_token '<|im_end|>' \
    --learning_rate 4.0e-5 \
    --num_train_epochs 5 \
    --max_seq_length 32768 \
    --per_device_train_batch_size 2 \
    --gradient_checkpointing \
    --bf16 \
    --use_liger_kernel \
    --output_dir data/OpenR1-Distill-7B