LLM Study Diary #2: Tokenization
Background: I did some research online and found a great course that teaches how to build an LLM from scratch. The course is shared publicly, and all of the assignment resources are here: https://cs336.stanford.edu/. In this series, I will post summaries and notes starting from lesson 1.
Tokenization: Tokenization is the very first step in an LLM pipeline. There are many different tokenization algorithms, such as Character-based Tokenization, Byte-based Tokenization, Word-based Tokenization, and Byte Pair Encoding (BPE).
Character-based Tokenization: Pros: Simple to define by mapping each character to its Unicode code point. Cons: Highly inefficient use of the vocabulary, since many characters are rare, and the compression ratio is suboptimal compared to more advanced methods.
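A minimal sketch of the idea in Python, mapping each character to its Unicode code point (the function names here are mine, for illustration, not from the course):

```python
# Character-based tokenization: each character becomes its Unicode
# code point, so the "vocabulary" is effectively all of Unicode.
def char_tokenize(text: str) -> list[int]:
    return [ord(c) for c in text]

def char_detokenize(ids: list[int]) -> str:
    return "".join(chr(i) for i in ids)

ids = char_tokenize("hello 你好")
print(ids)                   # [104, 101, 108, 108, 111, 32, 20320, 22909]
print(char_detokenize(ids))  # hello 你好
```

Note how the CJK characters land on large, rarely-used indices (20320, 22909), which is exactly the vocabulary-sparsity problem described above.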
Byte-based Tokenization: Pros: Uses a very small, fixed vocabulary of 256 byte values (indices 0-255), avoiding sparsity issues. Cons: Leads to very long sequences, because the compression ratio is effectively 1:1 (one token per byte), which makes model training computationally expensive given the quadratic cost of attention.
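The byte-level version is just as short; again an illustrative sketch, not reference code. Note how the same string now becomes a longer token sequence:

```python
# Byte-based tokenization: UTF-8 encode the string and treat each
# byte (0-255) as a token. Tiny fixed vocabulary, no compression.
def byte_tokenize(text: str) -> list[int]:
    return list(text.encode("utf-8"))

def byte_detokenize(ids: list[int]) -> str:
    return bytes(ids).decode("utf-8")

ids = byte_tokenize("hello 你好")
print(len("hello 你好"), len(ids))  # 8 characters -> 12 tokens (each CJK char takes 3 bytes)
print(byte_detokenize(ids))          # hello 你好
```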
Word-based Tokenization: Pros: Captures semantic units by splitting strings on whitespace or with a regex. Cons: The vocabulary size is unbounded, and rare or unseen words are a problem, usually forcing an "UNK" (unknown) token that creates significant challenges for model training and evaluation.
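A toy sketch of that failure mode, with a made-up four-word vocabulary (both the vocabulary and the names are mine, purely for illustration):

```python
import re

# Word-based tokenization with a tiny fixed vocabulary. Any word
# outside the vocabulary falls back to the <UNK> id, which is exactly
# the failure mode described above.
vocab = {"<UNK>": 0, "the": 1, "cat": 2, "sat": 3}

def word_tokenize(text: str) -> list[int]:
    words = re.findall(r"\w+", text.lower())
    return [vocab.get(w, vocab["<UNK>"]) for w in words]

print(word_tokenize("The cat sat"))     # [1, 2, 3]
print(word_tokenize("The cat yawned"))  # [1, 2, 0]  ("yawned" maps to <UNK>)
```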
BPE: BPE strikes the best balance of all of these, and it is the approach modern LLMs actually use. Here is how training works:
1. Convert to bytes: First, represent the input string as a sequence of bytes (integers). This ensures every character, even a rare one, can be represented.
2. Count frequencies: Scan the entire corpus and count the frequency of all adjacent pairs of bytes (or, in later iterations, existing tokens).
3. Merge the most frequent pair: Identify the pair that appears most often and merge it into a new, single token. Add this new token to the vocabulary.
4. Repeat: Repeat the counting and merging for a set number of iterations, or until the desired vocabulary size is reached.
This process lets the model adaptively represent common sequences as single tokens and rare ones as multiple smaller components; a sketch of the training loop follows this list.
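Here is a rough, naive implementation of that loop in Python, just to make the steps concrete. This is my own sketch, not the course's reference code; real tokenizers (and the CS336 assignment) add pre-tokenization and much faster merge bookkeeping:

```python
from collections import Counter

def train_bpe(text: str, num_merges: int):
    ids = list(text.encode("utf-8"))        # step 1: raw bytes as initial tokens
    merges = {}                             # (token, token) -> new token id
    next_id = 256                           # ids 0-255 are reserved for raw bytes
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))  # step 2: count adjacent pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)    # step 3: most frequent pair
        merges[best] = next_id
        # Replace every occurrence of the best pair with the new token.
        merged, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                merged.append(next_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids = merged
        next_id += 1                        # step 4: repeat with the grown vocabulary
    return merges, ids

merges, ids = train_bpe("the cat sat on the mat, the cat sat", num_merges=10)
print(merges)  # learned merge rules, e.g. (97, 116) -> 256 merging "a" and "t"
print(ids)     # the compressed token sequence
```

Encoding new text then replays these merges in the order they were learned, and decoding simply concatenates the underlying bytes back and UTF-8 decodes them.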
Key Takeaways:
- Efficiency: BPE is effective because it learns the statistics of your specific dataset, rather than relying on predefined word boundaries.
- Robustness: Unlike word-based tokenization, BPE handles unknown or rare words gracefully, because it can always fall back to individual characters or bytes, avoiding the need for "UNK" tokens.
- Historical context: Originally a data-compression algorithm from 1994, BPE was adopted in NLP to improve neural machine translation and eventually became a standard backbone for models like GPT-2 and beyond.