VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark

Abstract: Multimodal large language models are increasingly capable of complex reasoning, yet their performance often degrades when they must externalize a problem through a tool and then reason over the tool’s output, particularly when that output is a visual aid such as a plot. This gap matters because real engineering and scientific workflows often rely on visualization tools for analysis, validation, and decision-making.

To study this gap, we introduce VAMPS (Visual-Assisted Mathematical Problem Solving), a benchmark for graph-assisted mathematics. VAMPS contains 1,168 multimodal, bilingual multiple-choice question-answer pairs drawn from algebra and calculus problems in the Iranian University Entrance Exam and expanded with human-reviewed, LLM-generated synthetic variants. All problems are selected so that plotting offers a natural solution strategy, revealing features such as intersections, extrema, and asymptotes.
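As a rough illustration of what "plotting as a solution strategy" can mean, the sketch below samples a curve on a grid, as a plotting tool would, and reads off intersections and a local maximum from the samples. The problem (counting solutions of x^3 - 3x = 1) and the helper names `sample`, `crossings`, and `local_maxima` are hypothetical and are not taken from VAMPS; this is a minimal sketch, not the benchmark's tooling.

```python
# Illustrative only: the problem and helper names are assumptions, not VAMPS items.

def sample(f, lo, hi, n=2001):
    """Evaluate f on an evenly spaced grid, as a plotting tool would."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return xs, [f(x) for x in xs]

def crossings(xs, ys):
    """Approximate x-values where the sampled curve crosses zero."""
    roots = []
    for i in range(len(xs) - 1):
        y0, y1 = ys[i], ys[i + 1]
        if y0 == 0.0:
            roots.append(xs[i])
        elif y0 * y1 < 0:
            # Linear interpolation between the two bracketing samples.
            t = y0 / (y0 - y1)
            roots.append(xs[i] + t * (xs[i + 1] - xs[i]))
    if ys[-1] == 0.0:
        roots.append(xs[-1])
    return roots

def local_maxima(xs, ys):
    """Grid points that are strictly higher than both neighbours."""
    return [xs[i] for i in range(1, len(xs) - 1)
            if ys[i] > ys[i - 1] and ys[i] > ys[i + 1]]

# Hypothetical question: how many real solutions does x^3 - 3x = 1 have?
f = lambda x: x**3 - 3 * x      # left-hand curve
g = lambda x: 1.0               # right-hand curve (a horizontal line)

# Intersections of f and g are zero crossings of their difference.
xs, diff = sample(lambda x: f(x) - g(x), -3.0, 3.0)
roots = crossings(xs, diff)

# Extrema of f itself, read off the same kind of sampled curve.
xs_f, ys_f = sample(f, -3.0, 3.0)
peaks = local_maxima(xs_f, ys_f)

print("number of intersections:", len(roots))
print("approximate intersections:", [round(r, 3) for r in roots])
print("approximate local maxima at x =", [round(p, 3) for p in peaks])
```

The sampled values are exactly what a plotting library would draw, so a model that inspects the rendered graph is, in effect, reading the same intersections and turning points visually. The grid range and resolution are choices made for this example; a poorly chosen window can hide features, which is one reason reasoning over a plot can fail.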

Designed for both benchmarking and diagnosis, VAMPS goes beyond prior multimodal benchmarks, which primarily evaluate reasoning over fixed visual inputs: it tests whether a model can benefit from constructing a useful graph and grounding its answer in the resulting visualization. Across a diverse set of models, we find that direct analytical solving surprisingly outperforms tool-enabled visual solving, even on problems where plotting is a natural strategy.
