92. BERT: The Model That Reads in Both Directions

GPT generates text by predicting the next word, reading left to right. BERT does something different: it masks random words in a sentence and tries to predict what they are. To do that well, it has to relate every word to every other word at once, so context on both the left and the right matters. That bidirectional understanding is why BERT topped NLP benchmarks such as GLUE and SQuAD when it came out in 2018, and why encoder-only transformers are still a common choice for understanding tasks.

What You’ll Learn Here

What makes BERT different from GPT
Masked Language Modeling: how BERT learns
Next Sentence Prediction: the second pretraining task
The [CLS] and [SEP] tokens and what they do
Fine-tuning BERT for text classification
Fine-tuning for Named Entity Recognition
Fine-tuning for Question Answering
Using HuggingFace to do all of this in under 20 lines

BERT vs GPT: The Key Difference

Both are transformer-based and built from the same attention and feed-forward blocks. The difference is in how they’re pretrained and which part of the transformer they use.

GPT (decoder-only):

Reads left to right with causal masking
Trained to predict the next token
Great at generation
Context: only left side available

BERT (encoder-only):

Reads all tokens simultaneously
Trained to predict masked tokens + next sentence
Great at understanding
Context: both left and right sides available

For classification and other understanding tasks, BERT generally wins. For generation tasks, GPT wins. For many NLP applications you actually want to build, such as classifying, tagging, or extracting answers from text, BERT is a sensible starting point.
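The "left side only" versus "both sides" distinction comes down to which positions each token is allowed to attend to. The following is a minimal sketch in plain Python, not code from either model: the function names `causal_mask` and `bidirectional_mask` and the four-word example are assumptions made for illustration.

```python
def causal_mask(seq_len):
    """GPT-style mask: position i may attend to positions 0..i only."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def bidirectional_mask(seq_len):
    """BERT-style mask: every position may attend to every position."""
    return [[True] * seq_len for _ in range(seq_len)]


tokens = ["the", "cat", "sat", "down"]
gpt = causal_mask(len(tokens))
bert = bidirectional_mask(len(tokens))

# Which tokens can "cat" (index 1) see under each scheme?
visible_gpt = [t for t, ok in zip(tokens, gpt[1]) if ok]
visible_bert = [t for t, ok in zip(tokens, bert[1]) if ok]
print(visible_gpt)   # ['the', 'cat']
print(visible_bert)  # ['the', 'cat', 'sat', 'down']
```

In a real transformer these boolean grids would be applied to the attention scores before the softmax; the sketch only shows which positions are visible.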

How BERT Was Pretrained

BERT was pretrained on two tasks at the same time, using a large corpus of about 3.3 billion words (BooksCorpus plus English Wikipedia).

Task 1: Masked Language Modeling (MLM)

15% of tokens are randomly selected for prediction. The model predicts the original token at each selected position from the surrounding context.

Input: “The cat [MASK] on the [MASK]”
Target: “The cat sat on the mat”

Of the 15% selected tokens:

80% replaced with [MASK]
10% replaced with a random token
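10% left unchanged

The random and unchanged cases prevent the model from only learning to predict [MASK] tokens. [MASK] never appears during fine-tuning, so mixing in these cases reduces the mismatch between pretraining and fine-tuning and pushes the model to keep a useful representation of every input token.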
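The selection and 80/10/10 replacement rule described above can be sketched in a few lines. This is an illustrative sketch, not the original BERT preprocessing code: the function name `mask_tokens`, the toy vocabulary, and the choice to select each token independently with probability 0.15 are assumptions (the original pipeline works on WordPiece ids and caps the number of predictions per sequence).

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]", "[PAD]"}  # never selected for masking


def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels) following the 80/10/10 rule.

    labels[i] holds the original token at selected positions and None
    elsewhere, so a loss would only be computed at selected positions.
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        # Skip special tokens and the ~85% of tokens that are not selected.
        if tok in SPECIAL or rng.random() >= select_prob:
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)
        roll = rng.random()
        if roll < 0.8:        # 80%: replace with [MASK]
            corrupted.append(MASK)
        elif roll < 0.9:      # 10%: replace with a random vocabulary token
            corrupted.append(rng.choice(vocab))
        else:                 # 10%: keep the original token
            corrupted.append(tok)
    return corrupted, labels


# Hypothetical toy vocabulary and sentence, for illustration only.
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
corrupted, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```

Returning `None` for unselected positions mirrors the idea that only the selected 15% contribute to the MLM loss, including the tokens that were left unchanged.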

Task 2: Next Sentence Prediction (NSP)

Two sentences are given. The model predicts whether sentence B actually follows sentence A in the original text. During pretraining, half of the pairs are true consecutive sentences and half pair sentence A with a random sentence from the corpus.

Input: [CLS] The cat sat on the mat. [SEP] It was a lazy afternoon. [SEP]
Label: IsNext (1)

Input: [CLS] The cat sat on the mat. [SEP] The stock market crashed. [SEP]
Label: NotNext (0)
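Building such pairs is mostly a coin flip between the true next sentence and a random one. Here is a minimal sketch under stated assumptions: the helper `make_nsp_example` and its arguments are hypothetical, sentences are plain strings rather than WordPiece tokens, and the label numbering (1 for IsNext, 0 for NotNext) follows the examples above. Some libraries number the labels the other way round, so check before reusing them.

```python
import random


def make_nsp_example(sentences, index, other_sentences, rng):
    """Build one NSP example from sentence `index` of a document.

    With probability 0.5, B is the true next sentence (label 1, IsNext);
    otherwise B is drawn from `other_sentences` (label 0, NotNext).
    """
    sent_a = sentences[index]
    if rng.random() < 0.5:
        sent_b, label = sentences[index + 1], 1
    else:
        sent_b, label = rng.choice(other_sentences), 0
    text = f"[CLS] {sent_a} [SEP] {sent_b} [SEP]"
    return text, label


doc = ["The cat sat on the mat.", "It was a lazy afternoon."]
other = ["The stock market crashed."]  # sentences from unrelated documents
rng = random.Random(0)
for _ in range(4):
    print(make_nsp_example(doc, 0, other, rng))
```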

NSP was later found to be less useful than MLM: RoBERTa dropped it and matched or improved downstream results. But it’s part of the original BERT.

Special Tokens in BERT

BERT uses three special tokens you need to know:

[CLS]: Classification token. Always the first token. Its final hidden state is used as the sentence-level representation for classification tasks.
[SEP]: Separator token. Closes every segment: a single sentence ends with one [SEP], and a sentence pair has one after each sentence.
[PAD]: Padding token. Used to make all sequences in a batch the same length; the attention mask tells the model to ignore it.
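To see how the three tokens fit together, here is a rough sketch of packing one or two sentences into a fixed-length input. It is an illustration only: `build_inputs` is a hypothetical helper, the tokens are whitespace-split words rather than WordPiece pieces, and a real pipeline would use a library tokenizer and map tokens to ids.

```python
def build_inputs(tokens_a, tokens_b=None, max_len=12):
    """Pack one or two token lists into BERT-style inputs.

    Returns (tokens, segment_ids, attention_mask), each of length max_len.
    """
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)            # segment A, incl. [CLS] and [SEP]
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # segment B, incl. final [SEP]
    if len(tokens) > max_len:
        raise ValueError("sequence longer than max_len; truncate first")
    pad = max_len - len(tokens)
    attention_mask = [1] * len(tokens) + [0] * pad  # 0 marks padding
    tokens += ["[PAD]"] * pad
    segment_ids += [0] * pad
    return tokens, segment_ids, attention_mask


toks, segs, mask = build_inputs(["the", "cat", "sat"], ["it", "slept"], max_len=10)
print(toks)
print(segs)
print(mask)
```

The attention mask is what lets the model ignore [PAD] positions, while the segment ids tell it which sentence each token belongs to.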