GeoSym127K: Scalable Symbolically-verifiable Synthesis for Multimodal Geometric Reasoning

Abstract: Large Multimodal Models (LMMs) often struggle with geometric reasoning because of visual hallucinations and a lack of mathematically precise Chain-of-Thought (CoT) data. To address this, we propose the GeoSym Engine, an automated and scalable neuro-symbolic framework. Using a type-conditional grammar and an analytic SymGT Solver, the engine derives exact symbolic ground truths and integrates with a rendering pipeline to produce high-precision geometric diagrams.
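
As a rough illustration of what "exact symbolic ground truth" and "type-conditional" could mean in practice, the sketch below computes answers with exact rational arithmetic and only allows question types that are valid for a given figure type. It is not the GeoSym Engine or the SymGT Solver: the names `QUESTION_TYPES`, `ground_truth`, `polygon_area`, and `squared_length` are hypothetical, and the two question types are assumptions chosen for brevity.

```python
from fractions import Fraction


def polygon_area(vertices):
    """Exact area of a simple polygon via the shoelace formula."""
    twice_area = Fraction(0)
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        twice_area += Fraction(x1) * Fraction(y2) - Fraction(x2) * Fraction(y1)
    return abs(twice_area) / 2


def squared_length(endpoints):
    """Exact squared length of a segment (avoids an inexact square root)."""
    (x1, y1), (x2, y2) = endpoints
    dx = Fraction(x2) - Fraction(x1)
    dy = Fraction(y2) - Fraction(y1)
    return dx * dx + dy * dy


# Hypothetical "type-conditional" table: each figure type admits only
# the questions that make sense for it.
QUESTION_TYPES = {
    "polygon": {"area": polygon_area},
    "segment": {"squared_length": squared_length},
}


def ground_truth(figure_type, question, points):
    """Return an exact answer, or reject a question invalid for this type."""
    solvers = QUESTION_TYPES[figure_type]
    if question not in solvers:
        raise ValueError(f"{question!r} is not defined for {figure_type!r}")
    return solvers[question](points)


triangle = [(0, 0), (4, 0), (0, 3)]
area = ground_truth("polygon", "area", triangle)
hyp_sq = ground_truth("segment", "squared_length", [(4, 0), (0, 3)])
print(area, hyp_sq)
```

Because every value is a `Fraction`, the answer is a single exact quantity that can be compared by equality rather than a float that needs a tolerance. That property is what makes exact-match verification straightforward.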

Using this engine, we construct GeoSym127K, a difficulty-stratified dataset featuring 51K high-resolution images, 127K questions with symbolic ground truths, and 55K answer-verified CoT QA pairs. We also introduce GeoSym-Bench, an expert-curated suite of 511 complex samples for rigorous evaluation.

Through extensive supervised fine-tuning (SFT), we demonstrate that GeoSym yields improvements concentrated on diagram-dependent and multi-step geometry tasks. Our Qwen3-VL-8B model gains 22.21 absolute percentage points on the MathVerse Vision-Only subset and reaches 61.52% on WeMath (an improvement of 6.19%), mitigating long-horizon logic fragmentation and outperforming advanced closed-source models such as Doubao-1.8.

Furthermore, applying Reinforcement Learning with Verifiable Rewards (RLVR) via GRPO reveals that initializing from structural SFT checkpoints substantially raises the performance ceiling over zero-shot RL. Because the rewards are deterministic exact-match signals, this result points to the robust scaling potential of our verifiable reasoning synthesis. Datasets and code are publicly available.
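
To make the reward signal concrete, here is a minimal sketch of a deterministic exact-match reward combined with the group-relative advantage normalization commonly used in GRPO. It is not the authors' training code: the whitespace-and-case normalization in `normalize`, the use of the population standard deviation, and the `eps` constant are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def normalize(answer: str) -> str:
    """Strip all whitespace and lowercase (an assumed, minimal normalization)."""
    return "".join(answer.split()).lower()


def exact_match_reward(predicted: str, ground_truth: str) -> float:
    """Deterministic reward: 1.0 on an exact match after normalization, else 0.0."""
    return 1.0 if normalize(predicted) == normalize(ground_truth) else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical sampled answers to one question whose ground truth is 3*sqrt(2).
samples = ["3*sqrt(2)", "3 * sqrt(2)", "4.24", "6"]
rewards = [exact_match_reward(s, "3*sqrt(2)") for s in samples]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Note that the decimal approximation `4.24` earns no reward under exact matching. When every sample in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal.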
