Domain-level metacognitive monitoring in frontier LLMs: A 33-model atlas

Abstract: Aggregate metacognitive quality scores mask within-model variation across MMLU benchmark domains. We administered 1,500 MMLU items (250 per domain, under an a priori six-domain grouping) to 33 frontier LLMs from eight model families and computed Type-2 AUROC per model-domain cell using verbalized confidence (0-100). Total observations: 47,151.
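The per-cell metric can be sketched as follows. This is a minimal illustration, not the authors' code: Type-2 AUROC is the probability that a correctly answered item receives higher verbalized confidence than an incorrectly answered one, with ties counted as 0.5.

```python
# Illustrative sketch (not the paper's pipeline): Type-2 AUROC from
# verbalized confidence (0-100) and per-item correctness.
from itertools import product

def type2_auroc(confidences, correct):
    """confidences: 0-100 scores; correct: booleans, one per item."""
    pos = [c for c, ok in zip(confidences, correct) if ok]      # correct items
    neg = [c for c, ok in zip(confidences, correct) if not ok]  # errors
    if not pos or not neg:
        return float("nan")  # undefined when one class is empty
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Toy example: confidence tracks correctness perfectly -> AUROC = 1.0.
print(type2_auroc([90, 80, 40, 30], [True, True, False, False]))  # -> 1.0
```

In the study's design, each model-domain cell applies this computation to that domain's 250 items (minus any unscorable responses), yielding the 198-cell atlas.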

Every model with above-chance aggregate monitoring showed non-trivial domain-level variation. Applied/Professional knowledge was reliably the easiest benchmark domain to monitor (mean AUROC = .742, ranked in the top two for 21 of 33 models); Formal Reasoning and Natural Science were reliably the hardest (at least one of the two ranked in the bottom two for 27 of 33 models). The three middle domains were statistically indistinguishable (Kendall's W = .164).

A subject-level coherence analysis (within-domain similarity ratio = 0.95) confirms that the six-domain grouping is a pragmatic benchmark taxonomy, not a validated latent construct. Within-family profile-shape clustering is significant for Anthropic, Google-Gemini, and Qwen (permutation p < .0001) but not for DeepSeek, Google-Gemma, or OpenAI. Gemma 4 31B showed a +.202 AUROC improvement over Gemma 3 27B.
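The within-family clustering test can be sketched as below, under assumptions: the statistic is taken as the mean pairwise correlation of domain-AUROC profiles within a family, and the null distribution comes from shuffling family labels. The profiles and family labels are toy data, not the paper's measurements.

```python
# Hypothetical sketch of a permutation test for within-family
# profile-shape clustering; toy data, not the paper's results.
import random
from statistics import mean, pstdev

def pearson(x, y):
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) * sx * sy)

def within_family_stat(profiles, families):
    """Mean correlation over all pairs of models sharing a family."""
    sims = [pearson(profiles[i], profiles[j])
            for i in range(len(profiles))
            for j in range(i + 1, len(profiles))
            if families[i] == families[j]]
    return mean(sims)

def permutation_p(profiles, families, n_perm=2000, seed=0):
    rng = random.Random(seed)
    observed = within_family_stat(profiles, families)
    hits = 0
    for _ in range(n_perm):
        shuffled = families[:]
        rng.shuffle(shuffled)                      # break family structure
        hits += within_family_stat(profiles, shuffled) >= observed
    return (hits + 1) / (n_perm + 1)               # add-one smoothing

# Toy data: 6-domain profiles; family "A" ascending, family "B" descending.
fam_a = [[0.60 + 0.02 * d + 0.01 * m for d in range(6)] for m in range(4)]
fam_b = [[0.80 - 0.02 * d + 0.01 * m for d in range(6)] for m in range(4)]
profiles = fam_a + fam_b
families = ["A"] * 4 + ["B"] * 4
print(permutation_p(profiles, families))  # small p: shapes cluster by family
```

Centering each profile before correlating (as `pearson` does) makes the test sensitive to profile *shape* rather than overall monitoring level, matching the "profile-shape" framing above.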

Three models classified as Invalid on binary KEEP/WITHDRAW probes produced normal profiles under verbalized confidence, confirming probe-format specificity. Bootstrap 95% CIs across the 198 model-domain cells have a median width of .199. Split-half aggregate stability is r = .893; profile-level split-half reliability is weaker (grand median r = .184). These results show stable benchmark-domain variation that aggregate metrics obscure, and support benchmark-stage domain screening as a step before deployment in specific application areas.
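The per-cell uncertainty quantification can be sketched as a percentile bootstrap: resample a domain's items with replacement, recompute Type-2 AUROC, and take the 2.5th/97.5th percentiles. This is an assumed minimal implementation, not the authors' pipeline; the toy cell below is simulated.

```python
# Minimal sketch of a percentile-bootstrap 95% CI for one model-domain
# cell's Type-2 AUROC; toy data, not the paper's measurements.
import random

def auroc(conf, correct):
    pos = [c for c, ok in zip(conf, correct) if ok]
    neg = [c for c, ok in zip(conf, correct) if not ok]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(conf, correct, n_boot=1000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n, stats = len(conf), []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        k = [correct[i] for i in idx]
        if any(k) and not all(k):          # skip all-correct/all-wrong draws
            stats.append(auroc([conf[i] for i in idx], k))
    stats.sort()
    return (stats[int(len(stats) * alpha / 2)],
            stats[int(len(stats) * (1 - alpha / 2)) - 1])

# Toy cell: 250 items whose confidence noisily tracks correctness.
rng = random.Random(1)
correct = [rng.random() < 0.7 for _ in range(250)]
conf = [rng.gauss(70 if ok else 50, 15) for ok in correct]
lo, hi = bootstrap_ci(conf, correct)
print(round(hi - lo, 3))  # CI width for this one cell
```

With 250 items per cell, widths on the order of the reported median (.199) are plausible whenever one outcome class is small, since the effective sample size for AUROC is bounded by the rarer class.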
