UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

Abstract: We introduce UnpredictaBench, an evaluation that tests the ability of large language models (LLMs) to capture true underlying distributions. As LLMs are increasingly used as substitutes for other entities (e.g., for humans in economic simulations), the tendency of many models to collapse toward a single plausible answer amounts to a failure to capture the unpredictability of real systems. Recent work on improving output diversity is insufficient for this setting: simulation requires samples that are calibrated to a target distribution, not merely varied outputs.

UnpredictaBench isolates a simplified but fundamental version of this problem: sampling outcomes from individual target distributions, including canonical statistical distributions, distributions induced by stochastic programs, and natural-language scenarios that describe random processes. We introduce 448 such problems together with KS@N, a general-purpose evaluation metric that uses the Kolmogorov-Smirnov statistical test to quantify how well a model's outputs approximate black-box target distributions. KS@N is the rate at which the test fails to reject model samples of size N against ground-truth samples; larger N indicates greater difficulty, since a larger sample gives the test more power to detect a mismatch.
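
To make the metric concrete, the following is a minimal sketch of how a KS@N-style score could be computed. It is not the benchmark's actual implementation. It assumes a two-sample Kolmogorov-Smirnov test at significance level 0.05 with the standard large-sample critical value, and it assumes the rate is taken over problems. The names ks_statistic, fails_to_reject, ks_at_n, and the sampler interface are hypothetical.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(xs, ys):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    d = 0.0
    # The supremum is attained at one of the observed points.
    for v in xs + ys:
        gap = abs(bisect_right(xs, v) / n - bisect_right(ys, v) / m)
        d = max(d, gap)
    return d


def fails_to_reject(model_samples, truth_samples, alpha=0.05):
    """True if the KS test does not reject at level alpha.

    Uses the asymptotic critical value (an assumption of this sketch):
    reject when D > c(alpha) * sqrt((n + m) / (n * m)).
    """
    n, m = len(model_samples), len(truth_samples)
    c = math.sqrt(-0.5 * math.log(alpha / 2))
    threshold = c * math.sqrt((n + m) / (n * m))
    return ks_statistic(model_samples, truth_samples) <= threshold


def ks_at_n(problems, n, alpha=0.05):
    """Fraction of problems on which a size-n model sample is not rejected.

    problems: list of (sampler, truth_samples) pairs, where sampler(n)
    returns n numeric model outputs (hypothetical interface).
    """
    passed = sum(
        fails_to_reject(sampler(n), truth, alpha) for sampler, truth in problems
    )
    return passed / len(problems)


# Toy illustration: a "collapsed" model that always answers 0.0 versus a
# sampler that actually draws from the target distribution N(0, 1).
rng = random.Random(0)
truth = [rng.gauss(0, 1) for _ in range(1000)]
collapsed = lambda n: [0.0] * n
calibrated = lambda n: [rng.gauss(0, 1) for _ in range(n)]

print("collapsed :", ks_at_n([(collapsed, truth)], 100))
print("calibrated:", ks_at_n([(calibrated, truth)], 100))
```

In this toy setup the collapsed sampler is always rejected, because its empirical CDF jumps from 0 to 1 at a single point, whereas a sampler that matches the target is rejected only at roughly the chosen significance level.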

Testing both open and proprietary models, we find a large spread in distributional capabilities. For instance, when models generate samples of size 100 (KS@100, our standard metric), scores range from near 0 to over 20%. No model achieves over 40% at KS@100, leaving significant headroom in distributional sampling as a capability. Although adding reasoning can increase scores somewhat, we find no immediate solution to this issue. UnpredictaBench shows that even simple distributional simulation remains challenging, making it a necessary first step toward using LLMs as stand-ins for complex systems.
