ADE: Adaptive Dictionary Embeddings -- Scaling Multi-Anchor Representations to Large Language Models

Abstract: Word embeddings are fundamental to natural language processing, yet traditional approaches represent each word with a single vector, creating representational bottlenecks for polysemous words and limiting semantic expressiveness. While multi-anchor representations have shown promise by representing words as combinations of multiple vectors, they have been limited to small-scale models due to computational inefficiency and lack of integration with modern transformer architectures.

We introduce Adaptive Dictionary Embeddings (ADE), a framework that successfully scales multi-anchor word representations to large language models. ADE makes three key contributions: (1) Vocabulary Projection (VP), which transforms the costly two-stage anchor lookup into a single efficient matrix operation; (2) Grouped Positional Encoding (GPE), a novel positional encoding scheme where anchors of the same word share positional information, preserving semantic coherence while enabling anchor-level variation; and (3) context-aware anchor reweighting, which leverages self-attention to dynamically compose anchor contributions based on sequence context.
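
To make the first two contributions concrete, here is a minimal, dependency-free sketch of one possible reading of them. The abstract gives no implementation details, so everything here is an assumption: the toy sizes, the names `ANCHORS`, `WORD_ANCHORS`, `build_projection`, and `grouped_position_ids`, and the choice of a dense word-by-anchor weight matrix are hypothetical. The sketch only shows that a two-stage lookup (word to anchor ids, then anchor ids to vectors) can be folded into one matrix product, and that anchors of the same word can share a position index.

```python
# Hypothetical toy sizes; none of these values come from the paper.
ANCHORS = [            # K = 3 anchor vectors, each of dimension d = 2
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]
WORD_ANCHORS = {       # word id -> [(anchor id, mixing weight), ...]
    0: [(0, 0.5), (2, 0.5)],
    1: [(1, 1.0)],
}


def two_stage_lookup(word_id):
    """Stage 1: fetch the word's anchor ids and weights. Stage 2: fetch and mix the anchors."""
    dim = len(ANCHORS[0])
    mixed = [0.0] * dim
    for anchor_id, weight in WORD_ANCHORS[word_id]:
        for j in range(dim):
            mixed[j] += weight * ANCHORS[anchor_id][j]
    return mixed


def build_projection():
    """Dense V x K weight matrix (assumed form), so all lookups become one matrix product."""
    num_anchors = len(ANCHORS)
    projection = [[0.0] * num_anchors for _ in range(len(WORD_ANCHORS))]
    for word_id, pairs in WORD_ANCHORS.items():
        for anchor_id, weight in pairs:
            projection[word_id][anchor_id] = weight
    return projection


def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def grouped_position_ids(anchors_per_word):
    """Every anchor token inherits the position index of the word it belongs to."""
    return [pos for pos, count in enumerate(anchors_per_word) for _ in range(count)]


projection = build_projection()
embeddings = matmul(projection, ANCHORS)      # one matrix operation for all words
position_ids = grouped_position_ids([2, 1])   # word 0 has 2 anchors, word 1 has 1
```

In this toy setup, row `i` of `embeddings` matches `two_stage_lookup(i)`, and `position_ids` assigns the two anchors of the first word the same index. How ADE actually parameterizes the projection and the per-anchor variation within a group is not stated in the abstract.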

We integrate these components into the Segment-Aware Transformer (SAT), which provides context-aware reweighting of anchor contributions at inference time. We evaluate ADE on the AG News and DBpedia-14 text classification benchmarks. With 98.7% fewer trainable parameters than DeBERTa-v3-base, ADE surpasses DeBERTa on DBpedia-14 (98.06% vs. 97.80%) and trails it by 3.86 points on AG News (90.64% vs. 94.50%), while compressing the embedding layer by more than 40x. These results indicate that multi-anchor representations are a practical and parameter-efficient alternative to single-vector embeddings in modern transformer architectures.
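
The idea of context-aware reweighting can be illustrated with a small sketch, assuming a single context vector that scores one word's anchors by scaled dot-product attention. This is not the SAT architecture, whose details the abstract does not give; the function `reweight_anchors`, the two-dimensional vectors, and the "bank" example are hypothetical and chosen only to show how the same anchors can receive different weights in different contexts.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reweight_anchors(context, anchors):
    """Mix one word's anchor vectors using scaled dot-product scores against a context vector."""
    scale = math.sqrt(len(context))
    scores = [sum(c * a for c, a in zip(context, anchor)) / scale for anchor in anchors]
    weights = softmax(scores)
    dim = len(anchors[0])
    mixed = [sum(w * anchor[j] for w, anchor in zip(weights, anchors)) for j in range(dim)]
    return weights, mixed


# Hypothetical two-sense word: anchor 0 stands for a "finance" sense, anchor 1 for a "river" sense.
bank_anchors = [[1.0, 0.0], [0.0, 1.0]]
finance_weights, finance_vector = reweight_anchors([2.0, 0.0], bank_anchors)
river_weights, river_vector = reweight_anchors([0.0, 2.0], bank_anchors)
```

With a finance-like context the first anchor gets the larger weight, and with a river-like context the second one does, so the composed vector shifts with the surrounding sequence. In a full transformer the context signal would come from the other tokens through self-attention rather than from a hand-picked vector.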