ADE: Adaptive Dictionary Embeddings -- Scaling Multi-Anchor Representations to Large Language Models
Abstract: Word embeddings are fundamental to natural language processing, yet traditional approaches represent each word with a single vector, creating representational bottlenecks for polysemous words and limiting semantic expressiveness. While multi-anchor representations have shown promise by representing words as combinations of multiple vectors, they have been limited to small-scale models due to computational inefficiency and lack of integration with modern transformer architectures.
We introduce Adaptive Dictionary Embeddings (ADE), a framework that successfully scales multi-anchor word representations to large language models. ADE makes three key contributions: (1) Vocabulary Projection (VP), which transforms the costly two-stage anchor lookup into a single efficient matrix operation; (2) Grouped Positional Encoding (GPE), a novel positional encoding scheme where anchors of the same word share positional information, preserving semantic coherence while enabling anchor-level variation; and (3) context-aware anchor reweighting, which leverages self-attention to dynamically compose anchor contributions based on sequence context.
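The first two contributions can be illustrated with a small NumPy sketch. It is only one plausible reading of the abstract, not the authors' implementation: the anchor dictionary, the per-word anchor ids and weights, the sizes `V`, `A`, `K`, `d`, the sinusoidal encoding, and every function name are assumptions introduced for illustration. The sketch checks that a two-stage lookup (word to anchor ids, then anchor ids to vectors) can be replaced by a single matrix product that is precomputed once, and it builds grouped positions in which all anchors of a word share that word's position.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): vocabulary, dictionary anchors,
# anchors per word, model width.
V, A, K, d = 6, 4, 2, 8

# Hypothetical parameters: a shared anchor dictionary plus, for each word,
# K anchor ids and K mixing weights.
anchor_vectors = rng.normal(size=(A, d))
anchor_ids = rng.integers(0, A, size=(V, K))
anchor_weights = rng.random(size=(V, K))


def two_stage_lookup(token_ids):
    """Stage 1: word -> anchor ids and weights. Stage 2: anchor ids -> vectors."""
    ids = anchor_ids[token_ids]                  # (T, K)
    w = anchor_weights[token_ids]                # (T, K)
    return w[..., None] * anchor_vectors[ids]    # (T, K, d)


def build_projection():
    """Build a (V*K, A) matrix; each row holds one weight at its anchor's column."""
    P = np.zeros((V * K, A))
    rows = np.arange(V * K)
    P[rows, anchor_ids.reshape(-1)] = anchor_weights.reshape(-1)
    return P


def grouped_positions(num_words, k):
    """Give every anchor of word t the position t."""
    return np.repeat(np.arange(num_words), k)


def sinusoidal(positions, dim):
    """Standard sinusoidal encoding evaluated at the given integer positions."""
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    angles = positions[:, None] * freqs[None, :]
    enc = np.zeros((len(positions), dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc


tokens = np.array([3, 1, 5])

# One matrix product replaces the two gathers; the result is a plain table.
table = (build_projection() @ anchor_vectors).reshape(V, K, d)
anchors = table[tokens]                          # (3, K, d)

pos = grouped_positions(len(tokens), K)          # [0 0 1 1 2 2]
inputs = anchors.reshape(-1, d) + sinusoidal(pos, d)   # (3*K, d)
```

In this sketch the anchors of one word differ only through their own vectors, since they receive identical positional terms; whether ADE adds a further anchor-specific signal is not stated in the abstract. The resulting anchor-level sequence is what a self-attention layer would consume to reweight anchors by context.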
We integrate these components into the Segment-Aware Transformer (SAT), which provides context-aware reweighting of anchor contributions at inference time. We evaluate ADE on the AG News and DBpedia-14 text classification benchmarks. With 98.7% fewer trainable parameters than DeBERTa-v3-base, ADE surpasses DeBERTa on DBpedia-14 (98.06% vs. 97.80%) and comes within four percentage points of it on AG News (90.64% vs. 94.50%), while compressing the embedding layer by more than 40x. These results indicate that multi-anchor representations are a practical and parameter-efficient alternative to single-vector embeddings in modern transformer architectures.