The Context-Ready Transformer
Abstract: We introduce the context-ready transformer, a new recurrent neural network architecture in which each token is pre-contextualized before it enters a D-layer transformer block. During left-to-right generation, a correction network combines the previous position's block output, which serves as a cached summary of past context, with the current token embedding, so the token enters the block already contextualized rather than as a raw embedding.
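The generation step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the additive form `x_t = e_t + correction(h_{t-1}, e_t)`, the two-layer ReLU correction network, the all-zero state before the first token, and the helper names (`init_correction`, `toy_block`, `generate_states`) are all hypothetical, and `toy_block` is a simple causal stand-in rather than a real D-layer transformer block.

```python
import math
import random


def linear(W, x):
    """Matrix-vector product for W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def init_correction(d, hidden, seed=0, zero_output=False):
    """Hypothetical helper: random weights for a two-layer correction FFN."""
    rng = random.Random(seed)
    W_in = [[rng.uniform(-0.5, 0.5) for _ in range(2 * d)] for _ in range(hidden)]
    W_out = [[0.0 if zero_output else rng.uniform(-0.5, 0.5) for _ in range(hidden)]
             for _ in range(d)]
    return W_in, W_out


def correction(W_in, W_out, h_prev, e):
    """Assumed correction network: FFN applied to the concatenation [h_prev; e]."""
    hidden = [max(0.0, z) for z in linear(W_in, h_prev + e)]  # ReLU
    return linear(W_out, hidden)


def toy_block(xs):
    """Stand-in for the causal block: output at the last position of xs.

    It only sees xs[0..t]; a real model would run D transformer layers
    with a KV cache instead of this running mean followed by tanh.
    """
    d = len(xs[0])
    return [math.tanh(sum(x[i] for x in xs) / len(xs)) for i in range(d)]


def generate_states(embeds, W_in, W_out):
    """Left-to-right pass: each token is corrected before entering the block."""
    d = len(embeds[0])
    h_prev = [0.0] * d  # assumption: no context before the first token
    inputs, outputs = [], []
    for e in embeds:
        delta = correction(W_in, W_out, h_prev, e)
        x = [a + b for a, b in zip(e, delta)]  # contextualized block input
        inputs.append(x)
        h_prev = toy_block(inputs)  # cached summary used by the next position
        outputs.append(h_prev)
    return inputs, outputs


if __name__ == "__main__":
    rng = random.Random(0)
    embeds = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(5)]
    W_in, W_out = init_correction(d=4, hidden=8)
    inputs, outputs = generate_states(embeds, W_in, W_out)
    print(len(inputs), len(outputs))
```

Because the block input at position t depends on the block output at position t-1, the loop cannot be parallelized across positions as written, which is what makes the sequential form recurrent.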
During sequential inference, the chain of corrections from one position to the next makes the architecture a recurrent neural network. For training, we unroll the correction process K times over the full sequence, processing all positions in parallel at each step. A pretrained transformer can also be converted to a context-ready model by adding a zero-initialized correction feed-forward network (FFN) and fine-tuning.
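The relationship between the K-step parallel unrolling and the sequential recurrence can be illustrated with a toy model. The sketch below is an assumption-laden illustration, not the paper's training code: the additive correction, the all-zero starting state, the shift-by-one wiring between passes, and every function name are hypothetical, and `toy_block_all` is a simple causal stand-in for the transformer block.

```python
import math
import random


def linear(W, x):
    """Matrix-vector product for W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def init_correction(d, hidden, seed=0, zero_output=False):
    """Hypothetical helper: weights for a two-layer correction FFN.

    zero_output=True mimics a zero-initialized correction FFN.
    """
    rng = random.Random(seed)
    W_in = [[rng.uniform(-0.5, 0.5) for _ in range(2 * d)] for _ in range(hidden)]
    W_out = [[0.0 if zero_output else rng.uniform(-0.5, 0.5) for _ in range(hidden)]
             for _ in range(d)]
    return W_in, W_out


def correct(W_in, W_out, h_prev, e):
    """Assumed form: block input = embedding + FFN([h_prev; e])."""
    hidden = [max(0.0, z) for z in linear(W_in, h_prev + e)]  # ReLU
    return [a + b for a, b in zip(e, linear(W_out, hidden))]


def toy_block_all(xs):
    """Causal stand-in block over all positions: position t sees xs[0..t]."""
    running, out = [0.0] * len(xs[0]), []
    for t, x in enumerate(xs):
        running = [r + v for r, v in zip(running, x)]
        out.append([math.tanh(r / (t + 1)) for r in running])
    return out


def unrolled_forward(embeds, W_in, W_out, K):
    """K parallel passes; each pass feeds outputs shifted right by one position."""
    d = len(embeds[0])
    hs = [[0.0] * d for _ in embeds]  # assumption: start with no context
    for _ in range(K):
        shifted = [[0.0] * d] + hs[:-1]  # position t reads output of t-1
        xs = [correct(W_in, W_out, hp, e) for e, hp in zip(embeds, shifted)]
        hs = toy_block_all(xs)  # all positions processed together
    return hs


def sequential_forward(embeds, W_in, W_out):
    """Reference recurrence: one position at a time, left to right."""
    h_prev, xs, hs = [0.0] * len(embeds[0]), [], []
    for e in embeds:
        xs.append(correct(W_in, W_out, h_prev, e))
        h_prev = toy_block_all(xs)[-1]
        hs.append(h_prev)
    return hs


def max_gap(a, b):
    """Largest absolute elementwise difference between two state lists."""
    return max(abs(u - v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))


if __name__ == "__main__":
    rng = random.Random(0)
    embeds = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(6)]
    W_in, W_out = init_correction(d=4, hidden=8)
    seq = sequential_forward(embeds, W_in, W_out)
    for K in range(1, len(embeds) + 1):
        print(K, max_gap(seq, unrolled_forward(embeds, W_in, W_out, K)))
```

In this toy setting, pass k reproduces the first k positions of the sequential result exactly, so K equal to the sequence length recovers it in full. The near agreement the abstract reports at K=10 concerns real models on longer sequences and is not something this sketch demonstrates. With `zero_output=True` the correction vanishes and every pass reduces to the plain block applied to raw embeddings, which is one way to read why a zero-initialized correction FFN leaves a pretrained transformer unchanged at the start of fine-tuning.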
We evaluate across model widths, depths, and block sizes on two datasets, comparing against standard transformers, variants, and ablations. A D=5 model outperforms a 12-layer transformer while generating 1.7x faster on an A100. With K=10, a single-layer model (D=1) outperforms a 6-layer transformer with a 2.6x inference speedup, and sequential inference matches parallel K=10 to within 0.01 perplexity (PPL).
The architecture benefits most from wide representations and long contexts. On a pointer-chasing task, a D=1 model trained with backpropagation through time (BPTT) solves all 10 composition levels, while standard transformers exhibit staircase-like depth dependence.