The Context-Ready Transformer

Abstract: We introduce the context-ready transformer, a new recurrent neural network architecture built around a D-layer transformer block, in which each token is pre-contextualized before it enters the block. During left-to-right generation, a correction network combines the previous position’s block output, which serves as a cached summary of past context, with the current token embedding, so the token enters the block already contextualized rather than as a raw embedding.
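
As a minimal sketch of the left-to-right recurrence described above, the following toy replaces the D-layer transformer block with a causal running mean and the correction network with a single scaled tanh. The names block, correct, generate_sequential, and scale are hypothetical, as is the zero vector used as context before the first token; none of them come from the paper. The sketch only illustrates the data flow, in which each token is corrected with the previous position's block output before it enters the block.

```python
import math

def block(inputs):
    """Toy causal stand-in for the D-layer transformer block (hypothetical).

    Output t is tanh of the running mean of inputs[0..t], so it depends
    only on the current and earlier positions.
    """
    outputs, running = [], [0.0] * len(inputs[0])
    for t, x in enumerate(inputs):
        running = [r + v for r, v in zip(running, x)]
        outputs.append([math.tanh(r / (t + 1)) for r in running])
    return outputs

def correct(prev_out, emb, scale=0.5):
    """Toy correction network: fold the previous block output into the embedding."""
    return [e + scale * math.tanh(p) for e, p in zip(emb, prev_out)]

def generate_sequential(embs, scale=0.5):
    """Left-to-right pass: each token is corrected before it enters the block."""
    prev_out = [0.0] * len(embs[0])   # assumption: empty context before token 0
    inputs, outputs = [], []
    for emb in embs:
        inputs.append(correct(prev_out, emb, scale))
        prev_out = block(inputs)[-1]  # a real model would reuse cached state here
        outputs.append(prev_out)
    return outputs

embs = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
outs = generate_sequential(embs)
```

Because the input at position t depends on the block output at position t-1, the loop cannot be parallelized across positions as written, which is what makes this toy model recurrent at inference time.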

During sequential inference, this chain of corrections makes the architecture a recurrent neural network. For training, we unroll the correction process K times over the full sequence, processing all positions in parallel at each step. A pretrained transformer can also be converted to a context-ready model by adding a zero-initialized correction feed-forward network (FFN) and fine-tuning.
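
One possible reading of the K-step unrolling can be sketched with a toy model. The assumptions here are loud ones: the block is a causal running mean, the correction network is a scaled tanh, each unrolled step feeds position t the previous step's output at position t-1, and the first step starts from zero outputs. The names block, correct, sequential, unrolled, max_gap, and scale are hypothetical. Setting scale to 0 stands in for a zero-initialized correction FFN.

```python
import math

def block(inputs):
    """Toy causal stand-in for the transformer block (hypothetical)."""
    outputs, running = [], [0.0] * len(inputs[0])
    for t, x in enumerate(inputs):
        running = [r + v for r, v in zip(running, x)]
        outputs.append([math.tanh(r / (t + 1)) for r in running])
    return outputs

def correct(prev_out, emb, scale):
    """Toy correction network; scale=0.0 mimics zero initialization."""
    return [e + scale * math.tanh(p) for e, p in zip(emb, prev_out)]

def sequential(embs, scale):
    """Recurrent reference: one position at a time, left to right."""
    prev_out, inputs, outputs = [0.0] * len(embs[0]), [], []
    for emb in embs:
        inputs.append(correct(prev_out, emb, scale))
        prev_out = block(inputs)[-1]
        outputs.append(prev_out)
    return outputs

def unrolled(embs, scale, K):
    """K correction steps, each processing every position at once."""
    dim = len(embs[0])
    outputs = [[0.0] * dim for _ in embs]       # assumed starting point
    for _ in range(K):
        shifted = [[0.0] * dim] + outputs[:-1]  # position t sees output t-1
        inputs = [correct(p, e, scale) for p, e in zip(shifted, embs)]
        outputs = block(inputs)
    return outputs

def max_gap(a, b):
    """Largest absolute elementwise difference between two sequences."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

embs = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2], [0.3, 0.1]]
seq = sequential(embs, 0.5)
gap_one_step = max_gap(unrolled(embs, 0.5, 1), seq)
gap_full = max_gap(unrolled(embs, 0.5, len(embs)), seq)
gap_zero_init = max_gap(unrolled(embs, 0.0, 1), block(embs))
```

In this toy, each unrolled step makes one more leading position agree with the recurrent pass, so K equal to the sequence length reproduces it, while a single step does not. With scale set to 0 the corrected inputs equal the raw embeddings and the model reduces to the plain block, which is presumably why a zero-initialized correction network is a convenient starting point for fine-tuning a pretrained transformer.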

We evaluate across widths, depths, block sizes, and two datasets, comparing against standard transformers, variants, and ablations. A D=5 model outperforms a 12-layer transformer while generating 1.7x faster on an A100. With K=10, a single-layer model (D=1) outperforms a 6-layer transformer with a 2.6x inference speedup, and sequential inference matches parallel K=10 to within 0.01 perplexity (PPL).

The architecture benefits most from wide representations and long contexts. On a pointer-chasing task, a D=1 model trained with backpropagation through time (BPTT) solves all 10 composition levels, while standard transformers exhibit staircase-like depth dependence.
