Domain-level metacognitive monitoring in frontier LLMs: A 33-model atlas

Abstract: Aggregate metacognitive quality scores mask within-model variation across MMLU benchmark domains. We administered 1,500 MMLU items (250 per domain, under an a priori six-domain grouping) to 33 frontier LLMs from eight model families and computed Type-2 AUROC per model-domain cell using verbalized confidence (0-100). Total observations: 47,151.
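The per-cell metric can be sketched as follows. This is a minimal illustration, not the authors' code: Type-2 AUROC is the probability that a correctly answered item receives higher verbalized confidence than an incorrectly answered one, with ties counted as 0.5.

```python
# Illustrative sketch (not the paper's pipeline): Type-2 AUROC from
# verbalized confidence (0-100) and per-item correctness.
from itertools import product

def type2_auroc(confidences, correct):
    """confidences: 0-100 scores; correct: booleans, one per item."""
    pos = [c for c, ok in zip(confidences, correct) if ok]      # correct items
    neg = [c for c, ok in zip(confidences, correct) if not ok]  # errors
    if not pos or not neg:
        return float("nan")  # undefined when one class is empty
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Toy example: confidence tracks correctness perfectly -> AUROC = 1.0.
print(type2_auroc([90, 80, 40, 30], [True, True, False, False]))  # -> 1.0
```

In the study's design, each model-domain cell applies this computation to that domain's 250 items (minus any unscorable responses), yielding the 198-cell atlas.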

Every model with above-chance aggregate monitoring showed non-trivial domain-level variation. Applied/Professional knowledge was reliably the easiest benchmark domain to monitor (mean AUROC = .742, ranked in the top two for 21 of 33 models); Formal Reasoning and Natural Science were reliably the hardest (at least one of the two ranked in the bottom two for 27 of 33 models). The three middle domains were statistically indistinguishable (Kendall's W = .164).

A subject-level coherence analysis (within-domain similarity ratio = 0.95) confirms that the six-domain grouping is a pragmatic benchmark taxonomy, not a validated latent construct. Within-family profile-shape clustering is significant for Anthropic, Google-Gemini, and Qwen (permutation p < .0001) but not for DeepSeek, Google-Gemma, or OpenAI. Gemma 4 31B showed a +.202 AUROC improvement over Gemma 3 27B.
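The within-family clustering test can be sketched as below, under assumptions: the statistic is taken as the mean pairwise correlation of domain-AUROC profiles within a family, and the null distribution comes from shuffling family labels. The profiles and family labels are toy data, not the paper's measurements.

```python
# Hypothetical sketch of a permutation test for within-family
# profile-shape clustering; toy data, not the paper's results.
import random
from statistics import mean, pstdev

def pearson(x, y):
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) * sx * sy)

def within_family_stat(profiles, families):
    """Mean correlation over all pairs of models sharing a family."""
    sims = [pearson(profiles[i], profiles[j])
            for i in range(len(profiles))
            for j in range(i + 1, len(profiles))
            if families[i] == families[j]]
    return mean(sims)

def permutation_p(profiles, families, n_perm=2000, seed=0):
    rng = random.Random(seed)
    observed = within_family_stat(profiles, families)
    hits = 0
    for _ in range(n_perm):
        shuffled = families[:]
        rng.shuffle(shuffled)                      # break family structure
        hits += within_family_stat(profiles, shuffled) >= observed
    return (hits + 1) / (n_perm + 1)               # add-one smoothing

# Toy data: 6-domain profiles; family "A" ascending, family "B" descending.
fam_a = [[0.60 + 0.02 * d + 0.01 * m for d in range(6)] for m in range(4)]
fam_b = [[0.80 - 0.02 * d + 0.01 * m for d in range(6)] for m in range(4)]
profiles = fam_a + fam_b
families = ["A"] * 4 + ["B"] * 4
print(permutation_p(profiles, families))  # small p: shapes cluster by family
```

Centering each profile before correlating (as `pearson` does) makes the test sensitive to profile *shape* rather than overall monitoring level, matching the "profile-shape" framing above.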

Three models classified as Invalid on binary KEEP/WITHDRAW probes produced normal profiles under verbalized confidence, confirming probe-format specificity. Bootstrap 95% CIs across the 198 model-domain cells have a median width of .199. Split-half aggregate stability is r = .893; profile-level split-half reliability is weaker (grand median r = .184). These results show stable benchmark-domain variation that aggregate metrics obscure, and support benchmark-stage domain screening as a step before deployment in specific application areas.
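The per-cell uncertainty quantification can be sketched as a percentile bootstrap: resample a domain's items with replacement, recompute Type-2 AUROC, and take the 2.5th/97.5th percentiles. This is an assumed minimal implementation, not the authors' pipeline; the toy cell below is simulated.

```python
# Minimal sketch of a percentile-bootstrap 95% CI for one model-domain
# cell's Type-2 AUROC; toy data, not the paper's measurements.
import random

def auroc(conf, correct):
    pos = [c for c, ok in zip(conf, correct) if ok]
    neg = [c for c, ok in zip(conf, correct) if not ok]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(conf, correct, n_boot=1000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n, stats = len(conf), []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        k = [correct[i] for i in idx]
        if any(k) and not all(k):          # skip all-correct/all-wrong draws
            stats.append(auroc([conf[i] for i in idx], k))
    stats.sort()
    return (stats[int(len(stats) * alpha / 2)],
            stats[int(len(stats) * (1 - alpha / 2)) - 1])

# Toy cell: 250 items whose confidence noisily tracks correctness.
rng = random.Random(1)
correct = [rng.random() < 0.7 for _ in range(250)]
conf = [rng.gauss(70 if ok else 50, 15) for ok in correct]
lo, hi = bootstrap_ci(conf, correct)
print(round(hi - lo, 3))  # CI width for this one cell
```

With 250 items per cell, widths on the order of the reported median (.199) are plausible whenever one outcome class is small, since the effective sample size for AUROC is bounded by the rarer class.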
