LLM Study Diary #2: Tokenization
Background: I did some research online and found a great course that teaches how to build an LLM from scratch. The course is shared publicly, and all of the assignment resources are here: https://cs336.stanford.edu/. In this series, I will post summaries and notes starting from lesson 1.
Tokenization: Tokenization is the very first step in an LLM pipeline. There are many different tokenization algorithms, such as Character-based Tokenization, Byte-based Tokenization, Word-based Tokenization, and Byte Pair Encoding (BPE).
Character-based Tokenization: Pros: Simple to define by mapping each character to its Unicode code point. Cons: Highly inefficient use of the vocabulary, since many characters are rare, and the compression ratio is suboptimal compared to more advanced methods.
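A minimal sketch of the idea in Python, mapping each character to its Unicode code point (the function names here are mine, for illustration, not from the course):

```python
# Character-based tokenization: each character becomes its Unicode
# code point, so the "vocabulary" is effectively all of Unicode.
def char_tokenize(text: str) -> list[int]:
    return [ord(c) for c in text]

def char_detokenize(ids: list[int]) -> str:
    return "".join(chr(i) for i in ids)

ids = char_tokenize("hello 你好")
print(ids)                   # [104, 101, 108, 108, 111, 32, 20320, 22909]
print(char_detokenize(ids))  # hello 你好
```

Note how the CJK characters land on large, rarely-used indices (20320, 22909), which is exactly the vocabulary-sparsity problem described above.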
Byte-based Tokenization: Pros: Uses a very small, fixed vocabulary of 256 byte values (indices 0-255), avoiding sparsity issues. Cons: Leads to very long sequences, because the compression ratio is effectively 1:1 (one token per byte), which makes model training computationally expensive given the quadratic cost of attention.
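The byte-level version is just as short; again an illustrative sketch, not reference code. Note how the same string now becomes a longer token sequence:

```python
# Byte-based tokenization: UTF-8 encode the string and treat each
# byte (0-255) as a token. Tiny fixed vocabulary, no compression.
def byte_tokenize(text: str) -> list[int]:
    return list(text.encode("utf-8"))

def byte_detokenize(ids: list[int]) -> str:
    return bytes(ids).decode("utf-8")

ids = byte_tokenize("hello 你好")
print(len("hello 你好"), len(ids))  # 8 characters -> 12 tokens (each CJK char takes 3 bytes)
print(byte_detokenize(ids))          # hello 你好
```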
Word-based Tokenization: Pros: Captures semantic units by splitting strings on whitespace or with a regex. Cons: The vocabulary size is unbounded, and rare or unseen words are a problem, usually forcing an "UNK" (unknown) token that creates significant challenges for model training and evaluation.
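A toy sketch of that failure mode, with a made-up four-word vocabulary (both the vocabulary and the names are mine, purely for illustration):

```python
import re

# Word-based tokenization with a tiny fixed vocabulary. Any word
# outside the vocabulary falls back to the <UNK> id, which is exactly
# the failure mode described above.
vocab = {"<UNK>": 0, "the": 1, "cat": 2, "sat": 3}

def word_tokenize(text: str) -> list[int]:
    words = re.findall(r"\w+", text.lower())
    return [vocab.get(w, vocab["<UNK>"]) for w in words]

print(word_tokenize("The cat sat"))     # [1, 2, 3]
print(word_tokenize("The cat yawned"))  # [1, 2, 0]  ("yawned" maps to <UNK>)
```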
BPE: BPE strikes the best balance of all of these, and it is the approach modern LLMs actually use. Here is how training works:
1. Convert to bytes: First, represent the input string as a sequence of bytes (integers). This ensures every character, even a rare one, can be represented.
2. Count frequencies: Scan the entire corpus and count the frequency of all adjacent pairs of bytes (or, in later iterations, existing tokens).
3. Merge the most frequent pair: Identify the pair that appears most often and merge it into a new, single token. Add this new token to the vocabulary.
4. Repeat: Repeat the counting and merging for a set number of iterations, or until the desired vocabulary size is reached.
This process lets the model adaptively represent common sequences as single tokens and rare ones as multiple smaller components; a sketch of the training loop follows this list.
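Here is a rough, naive implementation of that loop in Python, just to make the steps concrete. This is my own sketch, not the course's reference code; real tokenizers (and the CS336 assignment) add pre-tokenization and much faster merge bookkeeping:

```python
from collections import Counter

def train_bpe(text: str, num_merges: int):
    ids = list(text.encode("utf-8"))        # step 1: raw bytes as initial tokens
    merges = {}                             # (token, token) -> new token id
    next_id = 256                           # ids 0-255 are reserved for raw bytes
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))  # step 2: count adjacent pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)    # step 3: most frequent pair
        merges[best] = next_id
        # Replace every occurrence of the best pair with the new token.
        merged, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                merged.append(next_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids = merged
        next_id += 1                        # step 4: repeat with the grown vocabulary
    return merges, ids

merges, ids = train_bpe("the cat sat on the mat, the cat sat", num_merges=10)
print(merges)  # learned merge rules, e.g. (97, 116) -> 256 merging "a" and "t"
print(ids)     # the compressed token sequence
```

Encoding new text then replays these merges in the order they were learned, and decoding simply concatenates the underlying bytes back and UTF-8 decodes them.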
Key Takeaways:
- Efficiency: BPE is effective because it learns the statistics of your specific dataset, rather than relying on predefined word boundaries.
- Robustness: Unlike word-based tokenization, BPE handles unknown or rare words gracefully, because it can always fall back to individual characters or bytes, avoiding the need for "UNK" tokens.
- Historical context: Originally a data-compression algorithm from 1994, BPE was adopted in NLP to improve neural machine translation and eventually became a standard backbone for models like GPT-2 and beyond.