Sampling More, Getting Less: Calibration is the Diversity Bottleneck in LLMs
Abstract: Diversity is essential for language-model applications ranging from creative generation to scientific discovery, yet modern LLMs often collapse into a narrow subset of plausible outputs. While prior work has developed benchmarks for measuring this lack of diversity, less is known about how the token-level probability distributions produced at each decoding step give rise to the problem.
We introduce a validity-diversity framework that attributes diversity collapse to how an LLM allocates probability mass across valid and invalid continuations during decoding. This framework decomposes the bottleneck into two complementary forms of miscalibration.
First, order calibration: valid tokens are not reliably ranked above invalid tokens, so rank-based cutoff rules must trade off between recovering valid continuations and admitting invalid ones. Second, shape calibration: probability mass is overly concentrated on only a few valid continuations, leaving a heavy tail in which valid and invalid tokens are mixed, so any cutoff that maintains high validity also limits diversity.
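To make the two failure modes concrete, here is a minimal toy sketch of a single decoding step. The distribution, the valid set, and the token names are illustrative assumptions, not data from the paper:

```python
# Toy next-token distribution at one decoding step.
# Tokens a, b, c are valid continuations; x, y, z are invalid.
probs = {"a": 0.55, "x": 0.20, "b": 0.10, "y": 0.08, "c": 0.05, "z": 0.02}
valid = {"a", "b", "c"}

ranked = sorted(probs, key=probs.get, reverse=True)

# Order miscalibration: the invalid token "x" outranks the valid "b" and "c",
# so a top-k cutoff that keeps all three valid tokens must also admit
# invalid mass, while a safe cutoff (k = 1) discards most valid diversity.
for k in range(1, len(ranked) + 1):
    kept = ranked[:k]
    n_valid = sum(t in valid for t in kept)                 # diversity recovered
    mass_valid = sum(probs[t] for t in kept if t in valid)
    validity = mass_valid / sum(probs[t] for t in kept)     # precision of the cutoff
    print(f"k={k}: valid tokens kept={n_valid}/3, validity={validity:.2f}")

# Shape miscalibration: 0.55 of the mass sits on the single valid token "a",
# so even a perfectly ordered distribution would mostly resample "a".
```

In this toy example no value of k recovers all three valid tokens without also admitting invalid ones, which is exactly the rank-based trade-off the order-calibration argument describes.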
We formalize both mechanisms and show that local failures compound across decoding steps, producing severe sequence-level losses in diversity. Empirically, we develop controlled diagnostics for probing these bottlenecks, including tasks with exactly known valid sets and oracle cutoff baselines.
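The compounding claim can be illustrated with a back-of-the-envelope sketch; the per-step numbers below are assumptions chosen for illustration, not measurements from the paper:

```python
# If each decoding step keeps only a fraction of the valid continuations
# (in order to preserve validity), both quantities multiply across a
# T-step sequence, so mild per-step losses become severe at the sequence level.
per_step_validity = 0.95   # assumed prob. that a sampled token is valid
per_step_recall = 0.60     # assumed fraction of valid continuations kept

for T in (1, 10, 50):
    seq_validity = per_step_validity ** T   # chance the whole sequence stays valid
    seq_recall = per_step_recall ** T       # fraction of valid sequences still reachable
    print(f"T={T:3d}: sequence validity={seq_validity:.3f}, "
          f"reachable valid sequences={seq_recall:.2e}")
```

Under these assumed numbers, a 50-step sequence retains under a tenth of its validity and only a vanishing fraction of the valid sequence space, which is the sequence-level collapse the framework predicts.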
Across 14 language models spanning multiple families and scales, we find that diversity collapse is not merely a limitation of particular sampling heuristics, but a consequence of order and shape miscalibration in the LLM distribution.