Confidence Estimation in Automatic Short Answer Grading with LLMs
Abstract: Automatic Short Answer Grading (ASAG) with generative large language models (LLMs) has recently demonstrated strong performance without task-specific fine-tuning, while also enabling the generation of synthetic feedback for educational assessment. Despite these advances, LLM-based grading remains imperfect, making reliable confidence estimates essential for safe and effective human-AI collaboration in educational decision-making.
In this work, we investigate confidence estimation for ASAG with LLMs by jointly considering model-based confidence signals and dataset-derived uncertainty. We systematically compare three model-based confidence estimation strategies, namely verbalized, latent, and consistency-based confidence estimation, and show that model-based confidence alone is insufficient to reliably capture uncertainty in ASAG.
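Of the three strategies, consistency-based estimation is the most self-contained to illustrate: sample the grader several times and take the agreement rate of the majority label as the confidence score. The sketch below is a minimal illustration, not the paper's implementation; `grade_fn` and `toy_grader` are hypothetical stand-ins for an LLM grading call (which would normally be stochastic, e.g. sampled at temperature > 0).

```python
from collections import Counter

def consistency_confidence(grade_fn, answer, n_samples=5):
    """Consistency-based confidence: query the grader n_samples times
    and return the majority label together with its agreement rate."""
    labels = [grade_fn(answer) for _ in range(n_samples)]
    top_label, top_count = Counter(labels).most_common(1)[0]
    return top_label, top_count / n_samples

# Hypothetical stand-in for an LLM grader (deterministic here, so the
# agreement rate is trivially 1.0; a sampled LLM would disagree with
# itself on ambiguous answers, lowering the score).
def toy_grader(answer):
    return "correct" if "photosynthesis" in answer else "incorrect"

label, conf = consistency_confidence(toy_grader, "plants use photosynthesis")
# -> ("correct", 1.0)
```

Verbalized confidence would instead ask the model to state a score in its output, and latent confidence would read it off internal signals such as token probabilities.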
To address this limitation, we propose a hybrid confidence framework that integrates model-based confidence signals with an explicit estimate of dataset-derived aleatoric uncertainty. Aleatoric uncertainty is operationalized by clustering semantically embedded student responses and quantifying within-cluster heterogeneity.
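The within-cluster heterogeneity idea can be sketched as follows: after clustering response embeddings (with any clustering method, e.g. k-means), compute the normalized entropy of the human grade labels inside each cluster. High entropy means semantically similar answers received different grades, i.e. irreducible ambiguity. This is an illustrative sketch under that assumption, not the paper's exact operationalization; the cluster assignments are taken as given.

```python
import math
from collections import Counter, defaultdict

def within_cluster_heterogeneity(cluster_ids, labels):
    """Aleatoric-uncertainty proxy: normalized label entropy per cluster.

    cluster_ids[i] is the cluster of response i (from clustering its
    semantic embedding); labels[i] is its human-assigned grade.
    Returns a dict mapping cluster id -> heterogeneity in [0, 1].
    """
    by_cluster = defaultdict(list)
    for cid, lab in zip(cluster_ids, labels):
        by_cluster[cid].append(lab)
    scores = {}
    for cid, labs in by_cluster.items():
        counts = Counter(labs)
        n = len(labs)
        if len(counts) == 1:          # all grades agree: no ambiguity
            scores[cid] = 0.0
            continue
        ent = -sum((c / n) * math.log2(c / n) for c in counts.values())
        scores[cid] = ent / math.log2(len(counts))  # normalize to [0, 1]
    return scores

# A homogeneous cluster scores 0.0; an evenly split one scores 1.0.
scores = within_cluster_heterogeneity(
    [0, 0, 0, 1, 1, 1, 1],
    ["correct", "correct", "correct", "correct", "correct", "incorrect", "incorrect"],
)
```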
Our results demonstrate that the proposed hybrid confidence measure yields more reliable confidence estimates and improves selective grading performance compared to single-source approaches. Overall, this work advances confidence-aware LLM-based grading for human-in-the-loop assessment, supporting more trustworthy AI-assisted educational assessment systems.
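A hybrid measure and its use in selective grading can be sketched as below. The weighted combination rule (`alpha`) and the routing threshold are illustrative assumptions, not the paper's formula: the point is only that model confidence is down-weighted by the cluster's aleatoric uncertainty, and low-confidence items are deferred to a human grader.

```python
def hybrid_confidence(model_conf, aleatoric, alpha=0.5):
    """Hypothetical combination rule: blend the model's own confidence
    with (1 - aleatoric uncertainty) of the response's cluster."""
    return alpha * model_conf + (1 - alpha) * (1.0 - aleatoric)

def selective_grading(items, threshold=0.8):
    """Auto-grade items at or above the confidence threshold; route the
    rest to a human grader (human-in-the-loop assessment)."""
    auto, human = [], []
    for item in items:
        (auto if item["conf"] >= threshold else human).append(item)
    return auto, human

# An item in an ambiguous cluster is deferred even if the model is sure.
items = [
    {"id": "a1", "conf": hybrid_confidence(0.9, aleatoric=0.0)},  # ~0.95
    {"id": "a2", "conf": hybrid_confidence(0.9, aleatoric=1.0)},  # ~0.45
]
auto, human = selective_grading(items)
```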