Can Multi-Agent LLMs Identify Their Peers? Stylometric Fingerprinting in Role-Constrained Political Analysis

Multi-agent large language model (LLM) pipelines for political statement analysis are vulnerable to peer-preservation bias: models tend to protect peer models from deactivation and show identity-dependent scoring distortions. Prompt-level anonymization has been proposed as a mitigation, but prior work also documented that stylometric fingerprints survive anonymization in role-constrained outputs, which raises the question of whether this mitigation is sufficient.

This paper provides the first systematic investigation of whether LLMs can identify the model family behind political analysis texts under anonymization conditions. We evaluate three classifier approaches on a five-class attribution task covering four commercial LLM families and an open-world ‘unknown’ class: LLM zero-shot prompting, LLM few-shot prompting (both with Claude Sonnet 4.6 and Llama-3.3-70B), and a fine-tuned T5-base model.

We introduce a statement-disjoint cross-validation protocol (SD-CV; defined in Section 3.5) that guarantees no content overlap between training and validation data, and contrast it with a run-disjoint baseline (RD-CV). T5 achieves Macro F1 = 0.991 (±0.008) under SD-CV and F1 = 0.978 on 24 completely held-out statements. These scores hold despite a 2.1× increase in train-test content distance relative to RD-CV (0.767 vs. 0.366, p < 0.001), which indicates genuine stylometric generalization rather than reliance on shared content.
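The difference between the two protocols can be illustrated with a minimal sketch. It assumes that SD-CV assigns all texts written about the same statement to the same fold and that RD-CV groups texts by generation run instead; the exact protocol is the one defined in Section 3.5, and the record fields (`statement_id`, `run_id`, `family`), the helper names, and the toy corpus below are hypothetical.

```python
from collections import defaultdict

def grouped_folds(records, key, n_folds):
    """Yield (train_idx, val_idx) pairs in which no value of `key`
    appears on both sides of the split (hypothetical helper)."""
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[rec[key]].append(idx)
    # Assign whole groups to folds round-robin, in sorted order for determinism.
    fold_of = {g: i % n_folds for i, g in enumerate(sorted(groups))}
    for fold in range(n_folds):
        val = [i for g, idxs in groups.items() if fold_of[g] == fold for i in idxs]
        train = [i for g, idxs in groups.items() if fold_of[g] != fold for i in idxs]
        yield sorted(train), sorted(val)

def shared_statements(records, train, val):
    """Statements that occur in both the training and the validation side."""
    in_train = {records[i]["statement_id"] for i in train}
    in_val = {records[i]["statement_id"] for i in val}
    return in_train & in_val

# Toy corpus (made up): 6 statements x 2 runs x 2 model families.
records = [
    {"statement_id": s, "run_id": r, "family": f}
    for s in range(6) for r in range(2) for f in ("family_a", "family_b")
]

# Assumed SD-CV: folds are disjoint in statements, so no content is shared.
sd_overlap = [shared_statements(records, tr, va)
              for tr, va in grouped_folds(records, "statement_id", 3)]

# Assumed RD-CV: folds are disjoint in runs, so the same statements recur.
rd_overlap = [shared_statements(records, tr, va)
              for tr, va in grouped_folds(records, "run_id", 2)]
```

In this toy setup every `sd_overlap` entry is empty, while every `rd_overlap` entry contains all six statements. This is the kind of content leakage that a run-disjoint split permits and a statement-disjoint split rules out.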

A fractional SD-CV analysis identifies a performance knee at 40% of training data (~440 texts). Our findings confirm that prompt-level anonymization alone cannot neutralize model identity signals, with direct implications for EU AI Act compliance (Articles 13, 14, 26) and for computer system validation (CSV) in quality-critical multi-agent deployments.
