Language Acquisition Device in Large Language Models
Abstract: Large Language Models (LLMs) remain substantially less data-efficient than humans. Pre-pretraining (PPT) on synthetic languages has been proposed to close this gap, with prior work emphasizing highly expressive formal languages such as $k$-Shuffle Dyck.
Inspired by the Language Acquisition Device (LAD) hypothesis, which posits that innate constraints preemptively restrict the learner’s hypothesis space to natural-language-like structure, we propose LAD-inspired PPT: pre-pretraining on MP-STRUCT, a formal language whose strings encode hierarchical composition, feature-based dependencies, and long-distance displacement via MERGE, AGREE, and MOVE.
A brief 500-step PPT with MP-STRUCT matches strong formal-language baselines in token efficiency while additionally imparting a human-like resistance to structurally implausible languages (e.g., REVERSE).
Analyzing simplified variants, we find that MP-STRUCT CORE outperforms $k$-Shuffle Dyck despite not being definable in C-RASP (a formal bound on transformer expressivity), challenging the prior hypothesis that effective PPT languages must be both hierarchically expressive and circuit-theoretically learnable.
We show that functional landmarks, which reduce ambiguity in dependency resolution, are a key driver of this advantage, suggesting that effective PPT design depends not only on expressivity but also on the accessibility of dependency resolution.
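For readers unfamiliar with the $k$-Shuffle Dyck baseline named in the abstract, the following is a minimal sketch of that language: $k$ bracket types, each balanced on its own, with different types free to interleave (so crossing dependencies are allowed). This is an illustration only, not the paper's data generator. The token encoding as `(bracket_type, is_open)` pairs, the function names, and the sampling scheme are all assumptions made here.

```python
import random

def is_k_shuffle_dyck(tokens, k):
    """Check membership in k-Shuffle Dyck.

    tokens: list of (bracket_type, is_open) pairs (hypothetical encoding).
    Each bracket type must be balanced independently; types may cross.
    """
    depth = [0] * k
    for bracket_type, is_open in tokens:
        if not 0 <= bracket_type < k:
            return False
        depth[bracket_type] += 1 if is_open else -1
        if depth[bracket_type] < 0:  # a close with no matching open
            return False
    return all(d == 0 for d in depth)

def sample_k_shuffle_dyck(k, length, seed=None):
    """Sample one k-Shuffle Dyck string of the given even length.

    Assumed scheme: open and close with equal probability whenever both
    moves are legal, and force closes when the remaining budget requires it.
    """
    if length % 2 != 0:
        raise ValueError("length must be even")
    rng = random.Random(seed)
    depth = [0] * k
    tokens = []
    for step in range(length):
        remaining = length - step
        total_open = sum(depth)
        can_open = total_open < remaining   # still room to close it later
        can_close = total_open > 0
        if can_open and (not can_close or rng.random() < 0.5):
            bracket_type = rng.randrange(k)
            depth[bracket_type] += 1
            tokens.append((bracket_type, True))
        else:
            # Close any currently open type, not necessarily the most recent
            # one; this is what permits crossing dependencies.
            open_types = [t for t in range(k) if depth[t] > 0]
            bracket_type = rng.choice(open_types)
            depth[bracket_type] -= 1
            tokens.append((bracket_type, False))
    return tokens
```

For example, the crossing sequence `( [ ) ]`, encoded as `[(0, True), (1, True), (0, False), (1, False)]`, is accepted by `is_k_shuffle_dyck(..., k=2)` even though it is not a well-nested Dyck string.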