CS336: Language Modeling from Scratch

Stanford / Spring 2026

Course Staff

Instructors: Tatsunori Hashimoto, Percy Liang

Course Assistants (CA): Herman Brunborg, Marcel Rød, Steven Cao

Logistics

Lectures: Monday/Wednesday 3:00-4:20pm in Skilling Auditorium

Recordings: YouTube playlist

Contact: Students should ask all course-related questions in public Slack channels. All announcements will also be made in Slack. For personal matters, email cs336-spr2526-staff@lists.stanford.edu.

What is this course about?

Language models serve as the cornerstone of modern natural language processing (NLP) applications and open up a new paradigm in which a single general-purpose system addresses a range of downstream tasks. As the fields of artificial intelligence (AI), machine learning (ML), and NLP continue to grow, a deep understanding of language models becomes essential for scientists and engineers alike.

This course is designed to provide students with a comprehensive understanding of language models by walking them through the entire process of developing their own. Drawing inspiration from operating systems courses that create an entire operating system from scratch, we will lead students through every aspect of language model creation, including data collection and cleaning for pre-training, Transformer model construction, model training, and evaluation before deployment.

Prerequisites

Proficiency in Python: The majority of class assignments will be in Python. Unlike most other AI classes, students will be given minimal scaffolding. The amount of code you will write will be at least an order of magnitude greater than for other classes. Therefore, proficiency in Python and software engineering is paramount.

Experience with deep learning and systems optimization: A significant part of the course will involve making neural language models run quickly and efficiently on GPUs across multiple machines. We expect students to have strong familiarity with PyTorch and to know basic systems concepts such as the memory hierarchy.

College Calculus, Linear Algebra, Basic Probability and Statistics, Machine Learning: You should be comfortable with matrix/vector notation and operations, basic probability (Gaussian distributions, mean, standard deviation), and the basics of machine learning and deep learning.

Note: This is a 5-unit class, and it is very implementation-heavy, so please allocate enough time for it.

Coursework Assignments

Assignment 1: Basics: Implement all components (tokenizer, model architecture, optimizer) to train a standard Transformer language model.
Assignment 2: Systems: Profile and benchmark the model from Assignment 1, optimize attention with your own Triton implementation of FlashAttention2, and build a memory-efficient, distributed version of the training code.
Assignment 3: Scaling: Understand Transformer components and query a training API to fit a scaling law for projecting model scaling.
Assignment 4: Data: Convert raw Common Crawl dumps into usable pretraining data by performing filtering and deduplication.
Assignment 5: Alignment and Reasoning: Apply supervised finetuning and reinforcement learning (RL) to train LMs to reason when solving math problems.
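
To make "fit a scaling law" in Assignment 3 concrete, here is a minimal sketch of one common approach: fitting a power law by least squares in log-log space and extrapolating it. This is not the assignment's actual method or API. The function names `fit_power_law` and `predict`, the power-law form loss = a * compute^b, and the data points (synthetic, generated from a made-up law) are all assumptions chosen for illustration.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space. Returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope and intercept of the ordinary least-squares line
    # log(y) = log(a) + b * log(x).
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def predict(a, b, x):
    """Evaluate the fitted power law at x."""
    return a * x ** b


# Synthetic (hypothetical) measurements generated from loss = 50 * C**-0.05.
# Real runs would be noisy; these points are exact by construction.
compute = [1e15, 1e16, 1e17, 1e18]
loss = [50.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)
projected = predict(a, b, 1e20)  # extrapolate beyond the measured range
print(f"a={a:.3f}, b={b:.4f}, projected loss at 1e20: {projected:.3f}")
```

Fitting in log space turns the power law into a straight line, so ordinary linear regression suffices. With noisy real measurements, the extrapolated value is only as trustworthy as the assumed functional form.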

GPU compute for self-study

If you are following along at home, you can access GPU compute from cloud providers such as Modal, Lambda Labs, RunPod, Nebius, and Together. For convenience and to save money, we recommend debugging for correctness on CPU first, then using GPU(s) for training runs or benchmarking.

Honor code

Collaboration: Study groups are allowed, but students must understand and complete their own assignments.

AI tools: Prompting LLMs for low-level programming or high-level conceptual questions is permitted, but using them directly to solve the assignment problems is prohibited. We strongly encourage disabling AI autocomplete in your IDE.