Train Your Own LLM from Scratch

A hands-on workshop where you write every piece of a GPT training pipeline yourself, understanding what each component does and why. Andrej Karpathy’s nanoGPT was my first real exposure to LLMs and transformers. Seeing how a working language model could be built in a few hundred lines of PyTorch completely changed how I thought about AI and inspired me to go deeper into the space. This workshop is my attempt to give others that same experience.

nanoGPT targets reproducing GPT-2 (124M params) and covers a lot of ground. This project strips it down to the essentials and scales it down to a ~10M param model that trains on a laptop in under an hour — designed to be completed in a single workshop session.

What You’ll Build

A working GPT model trained from scratch on your own laptop, capable of generating Shakespeare-like text. You’ll write:

  • Tokenizer — turning text into numbers the model can process
  • Model architecture — the transformer: embeddings, attention, feed-forward layers
  • Training loop — forward pass, loss, backprop, optimizer, learning rate scheduling
  • Text generation — sampling from your trained model (see the sketch just below)
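
As a taste of where this ends up, here is a minimal sampling sketch with temperature and top-k, assuming model(idx) returns logits of shape (B, T, vocab_size). The names and the toy stand-in model are illustrative, not the workshop’s exact code:

import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size, temperature=0.8, top_k=40):
    # idx is a (B, T) tensor of token IDs; grow it one token at a time.
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]               # crop to the context window
        logits = model(idx_cond)[:, -1, :]            # logits for the last position only
        logits = logits / temperature                 # <1 sharpens, >1 flattens the distribution
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = float("-inf")  # mask everything outside the top k
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)        # append and repeat (autoregressive)
    return idx

# Toy stand-in so the sketch runs; your trained GPT replaces it in Part 4.
toy = nn.Sequential(nn.Embedding(65, 32), nn.Linear(32, 65))
print(generate(toy, torch.zeros(1, 1, dtype=torch.long), 20, block_size=256))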

Prerequisites

Any laptop or desktop (Mac, Linux, or Windows). Python 3.12+. Comfort reading Python code (no ML experience needed). Training automatically uses an Apple Silicon GPU (MPS), an NVIDIA GPU (CUDA), or the CPU. Also works on Google Colab — upload the files and run with !python train.py.
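
For the curious, the device auto-selection amounts to a few lines of PyTorch. A minimal sketch (the function name is illustrative):

import torch

def pick_device() -> torch.device:
    # Prefer Apple Silicon's Metal backend, then CUDA, then fall back to CPU.
    if torch.backends.mps.is_available():
        return torch.device("mps")
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

print(pick_device())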

Getting Started

Local (recommended)

Install uv if you don’t have it:

# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Then set up the project:

# install project dependencies
uv sync

# create a working directory for the code you’ll write
mkdir scratchpad && cd scratchpad

Google Colab

If you don’t have a local setup, upload the repo to Colab and install dependencies:

!pip install torch numpy tqdm tiktoken

Upload data/shakespeare.txt to your Colab files, then write your code in notebook cells, or upload .py files and run them with !python train.py.

Work through the docs in order. Each part walks you through writing a piece of the pipeline, explaining what each component does and why. By the end, you’ll have a working model.py, train.py, and generate.py that you wrote yourself.

Curriculum

Part | What You’ll Write | Concepts
Part 1: Tokenization | Character-level tokenizer | Character encoding, vocabulary size, why BPE fails on small data
Part 2: The Transformer | Full GPT model architecture | Embeddings, self-attention, layer norm, MLP blocks
Part 3: The Training Loop | Complete training pipeline | Loss functions, AdamW, gradient clipping, LR scheduling
Part 4: Text Generation | Inference and sampling | Temperature, top-k, autoregressive decoding
Part 5: Putting It All Together | Train on real data, experiment | Loss curves, scaling experiments, next steps
Part 6: Competition | Train the best AI poet | Find datasets, scale up, submit your best poem
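
To make Part 3’s concept list concrete, here is a minimal sketch of a training step with AdamW, cross-entropy loss, gradient clipping, and a cosine LR decay (one simple way to schedule the learning rate). The tiny model and fake data below are throwaway stand-ins so the sketch runs on its own; the real pieces come from the parts above:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative stand-ins; in the workshop these come from model.py and your data loader.
vocab_size, block_size, batch_size, max_iters = 65, 256, 32, 5000
data = torch.randint(0, vocab_size, (10_000,))                 # fake token stream
model = nn.Sequential(nn.Embedding(vocab_size, 128), nn.Linear(128, vocab_size))

def get_batch():
    # Sample random windows; targets are inputs shifted one token to the right.
    ix = torch.randint(0, len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_iters)

for step in range(max_iters):
    x, y = get_batch()
    logits = model(x)                                          # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)    # gradient clipping
    optimizer.step()
    scheduler.step()                                           # cosine LR decay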

Architecture: GPT at a Glance

    Input Text
        │
        ▼
┌─────────────────┐
│ Tokenizer       │  "hello" → [20, 43, 50, 50, 53]  (character-level)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Token Embed +   │  token IDs → vectors (n_embd dimensions)
│ Position Embed  │  + positional information
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Transformer     │  × n_layer
│  Block:         │
│ ┌────────────┐  │
│ │ LayerNorm  │  │
│ │ Self-Attn  │  │  n_head parallel attention heads
│ │ + Residual │  │
│ ├────────────┤  │
│ │ LayerNorm  │  │
│ │ MLP (FFN)  │  │  expand 4x, GELU, project back
│ │ + Residual │  │
│ └────────────┘  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ LayerNorm       │
│ Linear → logits │  vocab_size outputs (probability over next token)
└─────────────────┘
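
The repeated block in the diagram maps onto a small PyTorch module. A minimal pre-norm sketch, using torch.nn.MultiheadAttention for brevity (in the workshop you write the attention heads yourself; names here are illustrative):

import torch
import torch.nn as nn

class Block(nn.Module):
    # One pre-norm transformer block, matching the diagram above.
    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),   # expand 4x
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),   # project back
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        # Causal mask: True marks positions a token may NOT attend to (the future).
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out                     # residual around attention
        x = x + self.mlp(self.ln2(x))        # residual around the MLP
        return x

x = torch.randn(1, 16, 384)                  # (batch, time, n_embd)
print(Block(384, 6)(x).shape)                # torch.Size([1, 16, 384])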

Model Configs for This Workshop

Config | Params | n_layer | n_head | n_embd | Train Time (M3 Pro)
Tiny | ~0.5M | 2 | 2 | 128 | ~5 min
Small | ~4M | 4 | 4 | 256 | ~20 min
Medium (default) | ~10M | 6 | 6 | 384 | ~45 min

All configs use character-level tokenization (vocab_size=65) and block_size=256.
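
A sketch of what the default config might look like as a dataclass (field names are illustrative), with a back-of-the-envelope parameter count that roughly matches the Medium row:

from dataclasses import dataclass

@dataclass
class GPTConfig:
    # Values from the Medium (default) row above.
    vocab_size: int = 65      # character-level Shakespeare vocabulary
    block_size: int = 256     # context window (in characters)
    n_layer: int = 6
    n_head: int = 6
    n_embd: int = 384

cfg = GPTConfig()
# Rough estimate: ~12 * n_layer * n_embd^2 weights in the transformer blocks,
# plus (vocab_size + block_size) * n_embd for the embeddings.
approx = 12 * cfg.n_layer * cfg.n_embd**2 + (cfg.vocab_size + cfg.block_size) * cfg.n_embd
print(f"~{approx / 1e6:.1f}M parameters")   # ~10.7M, in line with the Medium row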

Tokenization: Characters vs BPE

This workshop uses character-level tokenization on Shakespeare. BPE tokenization (GPT-2’s 50k vocab) doesn’t work on small datasets — most token bigrams are too rare for the model to learn patterns from.

Tokenizer | Vocab Size | Dataset Size Needed
Character-level | ~65 | Small (Shakespeare, ~1MB)
BPE (tiktoken) | 50,257 | Large (TinyStories+, 100MB+)

Part 5 covers switching to BPE for larger datasets.
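
A character-level tokenizer is only a few lines. A minimal sketch over the workshop’s data/shakespeare.txt:

# Build the vocabulary from the corpus itself: one token per unique character.
with open("data/shakespeare.txt", encoding="utf-8") as f:
    text = f.read()

chars = sorted(set(text))                     # ~65 unique characters in Shakespeare
stoi = {ch: i for i, ch in enumerate(chars)}  # char → ID
itos = {i: ch for ch, i in stoi.items()}      # ID → char

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

assert decode(encode("hello")) == "hello"     # round-trips losslessly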

Key References
