From Scoring to Explanations: Evaluating SHAP and LLM Rationales for Rubric-based Teaching Quality Assessment


Abstract: Automated scoring models are increasingly used to assign rubric-based quality ratings to complex language performances, including classroom transcripts, yet they typically provide little insight into why a particular score is produced.

We propose a general framework for sentence-level interpretability of rubric-based scoring that combines model-agnostic Shapley-value attributions with rationales generated by large language models (LLMs).
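To make the idea of a sentence-level Shapley attribution concrete, the following is a minimal sketch, not the paper's implementation. It computes exact Shapley values by enumerating every subset of sentences, which is only feasible for very short segments; the SHAP library relies on approximations instead. The scorer `toy_score`, the cue words, and the example segment are hypothetical stand-ins for a trained scoring model and real transcript data.

```python
from itertools import combinations
from math import factorial


def toy_score(sentences):
    """Hypothetical stand-in for a scoring model: 1 plus 2 per sentence with a feedback cue."""
    cues = ("why", "how", "explain")
    hits = sum(any(cue in s.lower() for cue in cues) for s in sentences)
    return 1 + 2 * hits


def sentence_shapley(sentences, score_fn):
    """Exact Shapley value of each sentence, by enumerating all subsets of the others."""
    n = len(sentences)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # k = size of the coalition that sentence i joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                without_i = [sentences[j] for j in subset]
                # Keep the original sentence order when sentence i is added back.
                with_i = [sentences[j] for j in sorted(subset + (i,))]
                values[i] += weight * (score_fn(with_i) - score_fn(without_i))
    return values


# Hypothetical three-sentence segment, invented for illustration only.
segment = [
    "Open your books.",
    "Why do you think that works?",
    "Explain how you got 12.",
]
attributions = sentence_shapley(segment, toy_score)
```

Because the attribution only calls `score_fn` on subsets of sentences, the same routine applies to any scorer that maps a list of sentences to a number, which is what "model-agnostic" means here. The attributions sum to the difference between the score of the full segment and the score of the empty segment.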

Instantiated on the Quality of Feedback dimension of the CLASS framework using the NCTE corpus, the framework enables systematic comparison of fine-tuned pretrained language models (PLMs) and prompted LLMs on both scoring performance and explanation faithfulness.

Across 6k annotated transcript segments, fine-tuned PLMs outperform LLMs in prediction accuracy but exhibit label compression toward mid-scale scores.

Deletion-based tests show that SHAP identifies sentences that reliably drive model predictions, typically producing larger and more coherent prediction shifts than LLM-generated rationales.
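A deletion-based test of this kind can be sketched as follows, under stated assumptions: remove the sentences an explanation points to, rescore the remaining text, and record how far the prediction moves. The scorer `toy_score`, the example segment, and the two hand-picked index lists are hypothetical and say nothing about the paper's actual results or protocol.

```python
def toy_score(sentences):
    """Hypothetical stand-in for a scoring model: 1 plus 2 per sentence with a feedback cue."""
    cues = ("why", "how", "explain")
    hits = sum(any(cue in s.lower() for cue in cues) for s in sentences)
    return 1 + 2 * hits


def deletion_shift(sentences, selected, score_fn):
    """Change in predicted score when the sentences at the selected indices are removed."""
    drop = set(selected)
    kept = [s for i, s in enumerate(sentences) if i not in drop]
    return score_fn(sentences) - score_fn(kept)


# Hypothetical three-sentence segment, invented for illustration only.
segment = [
    "Open your books.",
    "Why do you think that works?",
    "Explain how you got 12.",
]

# Hand-picked indices standing in for two explanation sources (assumptions, not data).
attribution_pick = [1]  # e.g. the sentence ranked highest by an attribution method
rationale_pick = [0]    # e.g. the sentence cited in a generated rationale

shift_attribution = deletion_shift(segment, attribution_pick, toy_score)
shift_rationale = deletion_shift(segment, rationale_pick, toy_score)
```

A larger shift indicates that the removed sentences mattered more to the scorer, so comparing shifts across explanation sources gives one way to read faithfulness. In this toy setup, removing the sentence with a feedback cue moves the score by 2, while removing the cue-free sentence leaves it unchanged.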

Cross-model analyses further reveal that SHAP attributions transfer robustly across architectures, whereas LLM rationales exert limited and inconsistent influence.

Overall, the findings demonstrate that SHAP provides more faithful and transferable explanations for rubric-based scoring, and that the proposed framework offers a principled basis for evaluating both scoring models and their explanations in high-stakes educational settings and other rubric-based language assessment tasks.