ForecastBench-Sim: A Simulated-World Forecasting Benchmark

Abstract: Forecasting benchmarks for general-purpose AI systems usually inherit the constraints of the real world: outcomes resolve slowly, tail events are rare, and counterfactual questions are difficult to score.

We introduce ForecastBench-Sim, a simulated-world forecasting benchmark built on game rollouts from Freeciv, a turn-based strategy game modelled on the Civilization series.

Forecasters receive a fixed world report (a structured snapshot of the current game state) and answer questions about hidden future states; the benchmark then continues the simulation and scores forecasts.
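
The forecast-then-resolve loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the benchmark's implementation: the `WorldReport` fields, the toy `step` dynamics (which stand in for a game turn and are not Freeciv), and the choice of the Brier score as the scoring rule are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class WorldReport:
    """Hypothetical snapshot of the state shown to the forecaster."""
    turn: int
    gold: int


def step(gold: int, rng: random.Random) -> int:
    """Toy dynamics standing in for one game turn (not Freeciv)."""
    return gold + rng.randint(-3, 5)


def rollout(report: WorldReport, horizon: int, seed: int) -> int:
    """Continue the simulation for `horizon` turns and return the final gold."""
    rng = random.Random(seed)
    gold = report.gold
    for _ in range(horizon):
        gold = step(gold, rng)
    return gold


def brier_score(probability: float, outcome: bool) -> float:
    """Squared error between a probability forecast and a 0/1 outcome."""
    return (probability - float(outcome)) ** 2


report = WorldReport(turn=40, gold=100)
forecast = 0.7  # forecaster's probability that gold exceeds 110 ten turns later
final_gold = rollout(report, horizon=10, seed=0)
outcome = final_gold > 110  # the question resolves once the rollout finishes
print(brier_score(forecast, outcome))
```

The forecaster sees only `report`; the seed and the continued rollout stay hidden until the question resolves, which is what makes the forecast scoreable immediately.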

Because the world is simulated, the same setup can generate continuous or binary forecasting questions at arbitrary time horizons, paired intervention worlds for conditional or causal questions, and resolved examples of rare or disruptive outcomes.
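
One way to picture a paired intervention world is to fork a single state, share the random seed between the two branches, and apply the intervention in only one of them. The sketch below is a toy under stated assumptions: the `rollout` dynamics, the `bonus_per_turn` intervention, and the `paired_effect` helper are hypothetical and do not describe how the benchmark builds its pairs.

```python
import random


def rollout(gold: int, horizon: int, seed: int, bonus_per_turn: int = 0) -> int:
    """Toy rollout: random income each turn plus an optional intervention bonus."""
    rng = random.Random(seed)
    for _ in range(horizon):
        gold += rng.randint(-3, 5) + bonus_per_turn
    return gold


def paired_effect(gold: int, horizon: int, seed: int, bonus_per_turn: int) -> int:
    """Difference between the intervened and baseline worlds under a shared seed."""
    baseline = rollout(gold, horizon, seed)
    intervened = rollout(gold, horizon, seed, bonus_per_turn)
    return intervened - baseline


print(paired_effect(gold=100, horizon=10, seed=7, bonus_per_turn=2))
```

In this toy the two branches consume identical random draws, so the noise cancels and the difference equals the bonus accumulated over the horizon. In a real game an intervention would also change downstream dynamics, so the paired difference would not reduce to a fixed constant.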

We describe the benchmark pipeline, question families, scoring protocol, and release artifacts, and report validation slices from model evaluations and an anonymized human pilot.

ForecastBench-Sim is intended to complement real-world forecasting benchmarks by providing controlled, immediately resolvable tasks for studying probabilistic reasoning under dynamic world states.
