BayesBench: Evaluating LLM Belief Trajectories Under Multi-Turn Evidence Accumulation

Large language models (LLMs) are typically deployed in multi-turn conversations, where each turn provides new evidence that should reduce epistemic uncertainty about their environment. Acting rationally then requires inferring the unobserved quantities that govern that environment and updating beliefs about them as evidence accumulates.

Yet most evaluations score only the model’s final-turn answer in a single-turn format, leaving this process unexamined. We ask how closely LLMs’ belief updates match those of a rational Bayesian reasoner in multi-turn settings, and introduce BayesBench, a suite of simulation environments that probe this question across three progressively complex tasks: (i) Bayesian estimation, where the model infers an unknown parameter from sequential evidence; (ii) Bayesian prediction, where the model turns inferred beliefs about a latent variable into outcome forecasts; and (iii) latent-framed Bayesian prediction, where observations are filtered through a user-persona framing, requiring joint inference over the latent state and the persona.
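As a rough illustration of what a rational Bayesian reference looks like in the estimation task, the sketch below tracks the posterior mean of an unknown Bernoulli parameter turn by turn and measures how far a model's stated beliefs drift from it. The Beta-Bernoulli model, the uniform prior, the mean-absolute-gap score, and the names `bayes_trajectory` and `mean_abs_gap` are all assumptions made for illustration; the abstract does not specify the environments' likelihoods or the metric used.

```python
def bayes_trajectory(observations, alpha=1.0, beta=1.0):
    """Posterior mean of a Bernoulli parameter after each turn.

    Assumes a Beta(alpha, beta) prior (uniform by default) and binary
    observations; this is an illustrative choice, not the benchmark's spec.
    """
    means = []
    for x in observations:
        alpha += x        # count of successes seen so far
        beta += 1 - x     # count of failures seen so far
        means.append(alpha / (alpha + beta))
    return means


def mean_abs_gap(model_beliefs, reference):
    """Average per-turn distance between model beliefs and the reference."""
    if len(model_beliefs) != len(reference):
        raise ValueError("trajectories must have the same length")
    return sum(abs(m - r) for m, r in zip(model_beliefs, reference)) / len(reference)


# One piece of binary evidence per conversation turn.
evidence = [1, 1, 0, 1]
reference = bayes_trajectory(evidence)

# Hypothetical per-turn estimates elicited from an LLM (made-up numbers).
llm_beliefs = [0.6, 0.7, 0.65, 0.7]
gap = mean_abs_gap(llm_beliefs, reference)
```

Scoring the whole trajectory rather than only the last entry is what separates this kind of evaluation from a final-answer check: two models can agree at the final turn while one of them updated erratically along the way.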

Across seven LLMs (3B to 70B parameters), scaling improves latent inference and evidence accumulation, with updates occasionally matching the Bayesian posterior. However, these gains do not reliably carry over to downstream prediction, exposing a gap between inferring latent structure and using it to rationally update beliefs about the target outcome.