Which tokens does a hybrid model predict better?

Which kinds of tokens does a model predict well, and which does it predict poorly? That question is especially intriguing for hybrids, a language model architecture that’s begun to challenge the standard transformer and that we’ve been investigating with Olmo Hybrid.

Hybrids can match or beat transformers on standard benchmarks, but the headline numbers don’t reveal much about what specific advantages hybrid models have over transformers. To shed light on how the two behave at the token level, we recently ran experiments comparing our own strongest 7B transformer, Olmo 3, and hybrid model, Olmo Hybrid, head-to-head.

Specifically, we compare the two models’ predictions in a fine-grained way across different types of tokens, the units of information that appear as input to an LLM. Because Olmo 3 and Olmo Hybrid were built to be as alike as possible outside their architectures (closely matched in data, tokenizer, and training recipe), any difference in their predictions mostly reflects the architecture itself. Viewing these differences at the token level lets us glean insights about the specific strengths of hybrid models over transformers.

Our results show that the hybrid’s advantage is real across many tokens, but not all. Olmo Hybrid is strongest on tokens that carry meaning, such as nouns, verbs, and adjectives, and on tokens that can only be predicted by following what’s going on, like which person a pronoun refers to. But the hybrid’s advantage almost disappears on tokens that simply repeat something already in the input (a word or phrase reproduced verbatim from earlier), where the answer is sitting right there to be looked up. That’s where the transformer’s strength lies.

Attention versus recurrence, and measuring the difference

A language model is built from a stack of repeated layers, each one refining its representation of every token using the tokens around it. A transformer uses attention in every layer, so the model can draw directly on every earlier token at once, weighing how relevant each is to the current prediction. That makes attention good at recalling a specific earlier token exactly, even when that token appeared far back in the input. The catch is that every token is compared against all the earlier ones, so attention’s cost climbs steeply as the input grows.

Additionally, while attention is strong at recalling and aggregating information, it struggles to represent information that evolves sequentially over time. A hybrid model keeps a few attention layers but swaps the rest for recurrent layers. Unlike an attention layer, a recurrent layer reads tokens left to right and carries a fixed-size memory, folding each new token into that memory as it goes, so the cost of processing each token stays flat however long the input gets. That memory is compressed and lossy, so a recurrent layer can’t reach back for an exact earlier token the way attention can. But it is well suited to keeping a running account of anything that changes as the model reads tokens, which gives it a strength complementary to attention.
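The contrast between the two layer types can be made concrete with a toy sketch. The code below is illustrative only: `attention_step` is plain softmax attention over all earlier tokens, and `recurrent_step` is a simple exponential moving average standing in for a recurrent layer. Neither is the actual layer used in Olmo 3 or Olmo Hybrid; the function names, the `decay` parameter, and the one-hot toy tokens are all assumptions made for the example.

```python
import math

def attention_step(query, keys, values):
    """Softmax attention for one position: compares the query with every earlier token."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def recurrent_step(state, token, decay=0.9):
    """Fold one token into a fixed-size state (a toy stand-in for a recurrent layer)."""
    return [decay * s + (1.0 - decay) * x for s, x in zip(state, token)]

# Hypothetical one-hot "tokens"; the first one is the token we later want to recall.
tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

# Attention: a sharp query retrieves the first token almost exactly,
# but the work per position grows with the number of earlier tokens.
recalled = attention_step([10.0, 0.0, 0.0], tokens, tokens)
comparisons_per_position = [t + 1 for t in range(len(tokens))]

# Recurrence: constant work and constant memory per token,
# but the first token survives only as a faded trace in the state.
state = [0.0, 0.0, 0.0]
for token in tokens:
    state = recurrent_step(state, token)
```

In this toy setup, `recalled` is almost exactly the first token, while `state` keeps the same three numbers no matter how many tokens are read and holds only a blend of everything seen so far.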

To isolate the areas of strength and weakness for attention and recurrent layers, we fed Olmo 3 and Olmo Hybrid passages of text: articles, Wikipedia entries, books, and scientific papers, as well as structured text like Python, HTML, and LaTeX. We scored each model on how well it predicted each token from the tokens before it in a given sample. Both models saw the same earlier tokens and assigned a probability to every possible next token. We recorded the probability each gave to the token that actually followed. We then summarized the difference between the two models token by token by computing the loss gap: Olmo 3’s loss minus Olmo Hybrid’s loss on that token. A positive gap means the hybrid predicted the real next token better. A negative gap means the transformer did.
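The loss gap can be sketched in a few lines. This is a minimal illustration, assuming the per-token loss is the negative log of the probability assigned to the true next token (the usual cross-entropy choice, which the text does not state explicitly). The function names and the probabilities are hypothetical, not measured values.

```python
import math

def token_loss(prob_of_true_token):
    """Negative log-likelihood (in nats) of the token that actually followed."""
    return -math.log(prob_of_true_token)

def loss_gaps(p_transformer, p_hybrid):
    """Per-token loss gap: transformer loss minus hybrid loss.

    A positive value means the hybrid gave the true token a higher probability.
    """
    if len(p_transformer) != len(p_hybrid):
        raise ValueError("both models must be scored on the same tokens")
    return [token_loss(pt) - token_loss(ph) for pt, ph in zip(p_transformer, p_hybrid)]

# Made-up probabilities that each model assigned to the true next token.
p_transformer = [0.50, 0.10, 0.90]
p_hybrid = [0.60, 0.10, 0.85]
gaps = loss_gaps(p_transformer, p_hybrid)
```

With these made-up inputs, the first gap is positive (the hybrid did better), the second is zero, and the third is negative (the transformer did better).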

To find where the loss gaps concentrate, we ran several analyses. First, we sorted each token into a category and averaged the loss gap within each category. Because a raw average can be skewed by other factors, such as a category’s rarity or how often tokens repeat in a sample of text, we re-checked each pattern with a regression that estimates the category’s own effect while holding other factors constant.
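The difference between a raw category average and a regression-adjusted effect can be shown on toy data. The sketch below is an assumption-laden illustration, not the analysis actually run: it uses ordinary least squares with a single control (`is_repeat`), whereas the text does not specify the regression model or its full set of controls. All names and numbers are made up.

```python
from collections import defaultdict

def mean_gap_by_category(categories, gaps):
    """Raw average loss gap within each category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, gap in zip(categories, gaps):
        sums[category] += gap
        counts[category] += 1
    return {category: sums[category] / counts[category] for category in sums}

def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x

def ols(rows, y):
    """Ordinary least squares via the normal equations (X^T X) beta = X^T y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve_linear(xtx, xty)

# Made-up tokens: (category, is_repeat, gap). Content words repeat less often here,
# so the raw averages mix the category effect with the repetition effect.
tokens = (
    [("content", 0, 0.20)] * 3 + [("content", 1, 0.10)]
    + [("function", 0, 0.10)] + [("function", 1, 0.00)] * 3
)
categories = [c for c, _, _ in tokens]
gaps = [g for _, _, g in tokens]
raw_means = mean_gap_by_category(categories, gaps)

# Design matrix columns: intercept, is_content, is_repeat.
rows = [[1.0, 1.0 if c == "content" else 0.0, float(r)] for c, r, _ in tokens]
intercept, content_effect, repeat_effect = ols(rows, gaps)
```

In this toy data the raw means differ by 0.15 between the two categories, while the regression attributes only 0.10 to the category itself and the rest to repetition, which is the kind of skew the regression check is meant to remove.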

What real text shows

We find that Olmo Hybrid has lower loss than Olmo 3 on most kinds of tokens, though not by the same amount on each. In prose, the clearest divide is between content words (meaning-bearing nouns, verbs, and adjectives) and function words like “the,” “of,” and “is.” The hybrid predicts content words better than the transformer, with a loss gap around 0.04, whereas the gap is closer to 0.02 on function words. The hybrid’s advantage is especially pronounced on content-word categories like adverbs and adjectives, though some function-word categories, such as existentials like “there,” also show a large advantage for hybrid models.

In short, the hybrid’s edge is biggest on the words that say what a sentence is about and smallest on the grammatical words any model can nearly guess from syntax.

There are, however, some specific contexts where the advantage of hybrid models over transformers disappears. The first is closing, but not opening, braces, a pattern that is robust across brackets in language, code, and markup. Why? It’s known that attention suffices for representing bracket matching, which suggests attention alone suffices for predicting a closing brace.

The second place where the hybrid’s advantage all but disappears is when the next token simply repeats something already in the passage. We spot these cases by looking for repeated n-grams: runs of text where the token that completes a sequence has appeared, verba…