Investigating LLM's Problem Solving Capability -- a Study on Statics Questions

Large Language Models (LLMs) have rapidly influenced many aspects of society, particularly education, due to their demonstrated ability to complete assignments and examinations across a wide range of subjects.

Although prior studies have examined the educational impact of LLMs, much of the existing work relies on public or open problem datasets and lacks topic-specific analysis.

In engineering education, especially within mechanical engineering, systematic investigations of LLM performance on specific problem types remain limited.

Rather than following the traditional approach of posing textbook questions directly to an LLM tool, our study adopts a model distillation process to evaluate LLM capabilities in solving statics problems.

By distilling ChatGPT, we extracted 25 text-only statics questions and then constructed two additional datasets by adding diagrams and modifying the numerical values of the questions.

Experimental results show that while LLMs perform well on text-only statics problems, their accuracy decreases when diagrams are introduced and the problems require multi-step reasoning.

Further analysis suggests that this performance drop is not primarily caused by limitations in image recognition, but rather by difficulties in multi-step reasoning and in consistently applying extracted visual information across successive solution stages.