LLM Performance on a Real, Double-Marked GCSE Benchmark

Abstract: We introduce a dataset of 32,534 double-marked real student responses to GCSE mock exams (GCSEs are the UK’s national exams, taken at around age 16), spanning 328 questions across five subjects and including handwritten work.

We test whether off-the-shelf large language models agree with examiners as closely as the two examiners agree with each other.
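As a rough illustration of this kind of comparison, the sketch below measures how far apart two examiners' marks are on the same responses and checks whether a model's marks fall within that gap. The metric (mean absolute difference in marks), the function names, and the toy marks are all assumptions made for illustration; the abstract does not state which agreement measure the study uses.

```python
from statistics import mean


def mean_abs_diff(marks_a, marks_b):
    """Mean absolute difference between two equal-length lists of marks."""
    if len(marks_a) != len(marks_b):
        raise ValueError("mark lists must have the same length")
    return mean(abs(a - b) for a, b in zip(marks_a, marks_b))


def compare_to_examiner_baseline(examiner_1, examiner_2, model):
    """Compare model-examiner disagreement with examiner-examiner disagreement.

    Hypothetical helper: the choice of metric is an assumption,
    not the study's stated method.
    """
    baseline = mean_abs_diff(examiner_1, examiner_2)
    # Average the model's disagreement with each examiner in turn.
    model_gap = mean([
        mean_abs_diff(model, examiner_1),
        mean_abs_diff(model, examiner_2),
    ])
    return {
        "examiner_vs_examiner": baseline,
        "model_vs_examiners": model_gap,
        "model_within_baseline": model_gap <= baseline,
    }


# Toy marks for four responses (invented for illustration only).
examiner_1 = [3, 5, 2, 4]
examiner_2 = [4, 5, 1, 4]
model = [3, 5, 2, 4]

result = compare_to_examiner_baseline(examiner_1, examiner_2, model)
print(result)
```

A chance-corrected statistic such as a weighted kappa could be substituted for `mean_abs_diff` without changing the structure of the comparison.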

We find that most models agree closely with the examiner consensus across subjects, and that the top-performing models agree more closely with examiners than the examiners agree with each other.

Models score highly on subjective tasks such as English essay marking, and they also handle complex, messy handwritten Maths scripts.

Agreement is consistently close to the examiner-examiner line and does not vary greatly with model size, which points to cost-effective automated marking solutions.