CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions

Large language models have made substantial progress on mathematical reasoning, but existing benchmarks typically evaluate well-specified problems with final answers, step-by-step solutions, or complete proofs. They do not capture collaborative open-problem solving: a setting in which participants propose partial arguments, identify gaps or errors in prior steps, repair flawed reasoning, and gradually synthesize incremental contributions into a proof.

We introduce CrowdMath, a dataset of 164 expert-annotated progress chains from the MIT PRIMES—Art of Problem Solving (AoPS) CrowdMath program (2016-2025), a collaborative research initiative whose discussions have led to peer-reviewed publications. Each chain traces a multi-participant forum discussion from an open-problem statement to a completed proof. Posts are labeled by their functional roles in the evolving solution process, including partial progress, proof completion, erroneous reasoning, and error identification.

We define evaluation tasks and benchmark six frontier models. Models achieve 83-88% accuracy on next-post prediction, suggesting that they can follow the local flow of mathematical discussion. However, they struggle to identify the functional significance of individual contributions: the best model achieves only 0.42 macro-F1 on post-role classification. CrowdMath thus exposes a gap between solving well-specified mathematical problems and understanding collaborative mathematical progress as it unfolds.
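To make the contrast between the two metrics concrete, the following is a minimal sketch of how accuracy and macro-F1 behave differently on imbalanced labels. The role names, the toy label distribution, and the choice to score a class with no predictions and no gold instances as 0 are all assumptions made for illustration; they are not taken from the dataset or its evaluation code.

```python
from collections import Counter

# Hypothetical label names, modeled on the four roles named in the abstract.
# The actual label set in the dataset may differ or contain more roles.
ROLES = [
    "partial_progress",
    "proof_completion",
    "erroneous_reasoning",
    "error_identification",
]


def accuracy(gold, pred):
    """Fraction of posts whose predicted role matches the gold role."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels=ROLES):
    """Unweighted mean of per-class F1, so rare roles count as much as common ones."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted role p, but it was wrong
            fn[g] += 1  # missed the gold role g
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # Assumption: a class with no true or predicted instances scores 0.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (invented): most posts are partial progress, other roles are rare.
gold = ["partial_progress"] * 7 + [
    "proof_completion",
    "erroneous_reasoning",
    "error_identification",
]
# A degenerate classifier that always predicts the majority role.
pred = ["partial_progress"] * 10

print(f"accuracy = {accuracy(gold, pred):.2f}")
print(f"macro-F1 = {macro_f1(gold, pred):.2f}")
```

In this toy setting the always-majority classifier scores well on accuracy but poorly on macro-F1, because three of the four per-class F1 scores are zero. This is one reason a macro-averaged score is a demanding measure when rare roles such as error identification matter.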
