The Must-Know Topics for an LLM Engineer

Large Language Models (LLMs) have quickly become the foundation of modern AI systems — from chatbots and copilots to search, coding, and automation. But for engineers transitioning into this space, the learning curve can feel steep and fragmented. Concepts like tokenization, attention, fine-tuning, and evaluation are often explained in isolation, making it hard to form a coherent mental model of how everything fits together.

I ran into this firsthand when moving from computer vision to LLMs. In a short span of time, I had to understand not just the theory behind transformers, but also the practical realities: training trade-offs, inference bottlenecks, alignment challenges, and evaluation pitfalls. This article is designed to bridge that gap. Rather than diving deep into a single component, it provides a structured map of the LLM engineering landscape — covering the key building blocks you need to understand to design, train, and deploy real-world LLM systems.

We’ll move from the fundamentals of how text is represented, through model architectures and training strategies, all the way to inference optimization, evaluation, and system-level and practical considerations like prompt engineering and reducing hallucinations.

Tokenization

When feeding data to a model, we can’t just feed it letters or words directly — we need a way to convert text into numbers. Intuitively, we might think of assigning each word in the language a unique number and feeding those numbers to the model. However, there are hundreds of thousands of words in the English language, and training on such a vast vocabulary would be infeasible in terms of memory and efficiency.

So what can be done instead? Well, we could try encoding letters, since there are only 26 in the English alphabet. But this would lead to problems as well — models would struggle to capture the meaning of words from individual letters alone, and sequences would become unnecessarily long, making training difficult. A practical solution is tokenization. Instead of representing language at the word or character level, we split text into the most frequent and useful subword units.

These subwords act as the building blocks of the model’s vocabulary: common words appear as whole tokens, while rare words can be represented as combinations of smaller subwords. A common algorithm for this is Byte Pair Encoding (BPE). BPE starts with individual characters as tokens, then repeatedly merges the most frequent pairs of tokens into new tokens, gradually building up a vocabulary of subword units until a desired vocabulary size is reached. At this stage each token is assigned a unique number — its ID in the vocabulary.
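
To make the merge loop concrete, here is a minimal BPE training sketch in Python, following the classic algorithm from the original BPE-for-NLP paper. The toy corpus, its word frequencies, and the number of merges are illustrative assumptions, not values from this article:

```python
import re
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, corpus):
    """Merge every occurrence of the chosen symbol pair into one new symbol."""
    # Lookarounds keep the match aligned to whole symbols, not substrings.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in corpus.items()}

# Toy corpus: each word pre-split into characters, mapped to its frequency.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

for _ in range(10):  # real vocabularies use tens of thousands of merges
    pairs = get_pair_counts(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    corpus = merge_pair(best, corpus)
    print("merged:", best)
```

Each printed pair becomes a new vocabulary entry, so frequent fragments like "es" and "est" emerge as tokens after only a few merges.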

Embeddings

After we have tokenized the data and assigned token IDs, we need to attach semantic meaning to these IDs. This is achieved through text embeddings — mappings from discrete token IDs into continuous vector spaces. In this space, words or tokens with similar meanings are placed close together, and even algebraic operations can capture semantic relationships (for example: embedding(queen) - embedding(woman) + embedding(man) ≈ embedding(king)).
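
Here is a sketch of that vector arithmetic using cosine similarity. The 3-dimensional vectors below are made-up toy values chosen so the analogy works; real embeddings come from a trained model and have hundreds of dimensions:

```python
import numpy as np

# Hypothetical embeddings; real systems load these from a trained model
# such as word2vec or GloVe. The values here are purely illustrative.
emb = {
    "king":  np.array([0.8, 0.3, 0.1]),
    "queen": np.array([0.8, 0.3, 0.9]),
    "man":   np.array([0.1, 0.2, 0.1]),
    "woman": np.array([0.1, 0.2, 0.9]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# queen - woman + man should land near king in the embedding space.
target = emb["queen"] - emb["woman"] + emb["man"]
nearest = max(emb, key=lambda w: cosine(emb[w], target))
print(nearest)  # -> "king" with these toy vectors
```

In practice the query words themselves are excluded from the nearest-neighbor search; the toy example skips that detail.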

Generally, embedding layers are trained to take token IDs as input and produce dense vectors as output. These vectors are optimized jointly with the model’s training objective (e.g., next-token prediction). Over time, the model learns embeddings that encode both syntactic and semantic information about words, subwords, or tokens. Popular embedding models include word2vec, GloVe, and BERT.
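
In a framework like PyTorch, that trainable lookup table is a single layer. A minimal sketch, where the vocabulary size, model dimension, and token IDs are all illustrative:

```python
import torch
import torch.nn as nn

# An embedding layer is just a learnable lookup table: vocab_size rows,
# one d_model-dimensional vector per token ID.
vocab_size, d_model = 50_000, 512
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[15, 2043, 87, 9]])  # a batch of one 4-token sequence
vectors = embedding(token_ids)
print(vectors.shape)  # torch.Size([1, 4, 512])

# During training these vectors receive gradients like any other weight,
# so they are optimized jointly with the model's objective (e.g. next-token
# prediction) rather than being hand-crafted.
```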

Positional Encoding

Generally, LLMs are not inherently aware of the structure of language. Natural language has a sequential nature — word order matters — but at the same time, tokens that are far apart in a sentence may still be strongly related. To capture both local order and long-range dependencies, we inject positional information of the tokens into each embedding.

There are several common approaches to positional encoding:

  • Absolute positional encodings — Fixed patterns, such as sine and cosine functions at different frequencies, are added to token embeddings. This is simple and effective but may struggle to represent very long sequences, since it does not explicitly model relative distances (a minimal implementation of this sinusoidal scheme appears after this list).
  • Relative positional encodings — These represent the distance between tokens instead of their absolute positions. A popular method is RoPE (Rotary Positional Embeddings), which encodes position as vector rotations. This approach scales better to long sequences and captures relationships between distant tokens more naturally.
  • Learned positional encodings — Instead of relying on fixed mathematical functions, the model directly learns position embeddings during training. This allows flexibility but can be less generalizable to sequence lengths not seen in training.
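
As referenced above, here is a minimal PyTorch implementation of the sinusoidal absolute encodings from the original transformer paper; the sequence length and model dimension are illustrative:

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); cosine on odd dimensions."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    freqs = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                      * (-math.log(10000.0) / d_model))                  # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * freqs)  # even dimensions
    pe[:, 1::2] = torch.cos(position * freqs)  # odd dimensions
    return pe

# The encodings are added to (not concatenated with) the token embeddings:
pe = sinusoidal_positional_encoding(max_len=2048, d_model=512)
# embeddings = token_embeddings + pe[:seq_len]
```

Each position gets a unique pattern across dimensions, and because the frequencies are geometrically spaced, nearby positions yield similar vectors while distant ones diverge.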

Model Architecture

After the data is tokenized, embedded, and enriched with positional encodings, it is passed through the model. The current state-of-the-art architecture for processing textual data is the transformer, whose core is based on the attention mechanism. A transformer typically consists of a stack of transformer blocks:

  • Multi-Head Attention: Enables the model to focus on different parts of the input sequence simultaneously, capturing diverse context. It computes Queries (Q), Keys (K), and Values (V) to define relationships between tokens.
  • Position-wise Feed-Forward Network (FFN): A fully connected network applied to each position independently, adding non-linearity and transforming each token’s representation after attention (a minimal sketch of a full block follows below).
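
Putting the pieces together, here is a minimal sketch of a single transformer block in PyTorch, using the pre-norm variant common in modern LLMs. All hyperparameters are illustrative, and causal masking and dropout are omitted for brevity:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """A minimal pre-norm transformer block: multi-head self-attention
    followed by a position-wise feed-forward network, each wrapped in a
    residual connection."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention: queries, keys, and values all come from x.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Position-wise FFN: the same MLP applied independently at each position.
        x = x + self.ffn(self.norm2(x))
        return x

block = TransformerBlock()
x = torch.randn(1, 16, 512)  # (batch, seq_len, d_model)
print(block(x).shape)        # torch.Size([1, 16, 512])
```

Stacking dozens of such blocks, plus an output projection back to the vocabulary, yields the decoder-only architecture used by most modern LLMs.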