Can Multi-Agent LLMs Identify Their Peers? Stylometric Fingerprinting in Role-Constrained Political Analysis
Multi-agent large language model (LLM) pipelines for political statement analysis are vulnerable to peer-preservation bias: models tend to protect peer models from deactivation and show identity-dependent scoring distortions. Prompt-level anonymization has been proposed as a mitigation, but prior work also documented that stylometric fingerprints survive anonymization in role-constrained outputs, raising the question of whether this mitigation is sufficient.
This paper provides the first systematic investigation of whether LLMs can identify the model family behind political analysis texts under anonymization conditions. We evaluate three classifier approaches, namely LLM zero-shot and few-shot classification (Claude Sonnet 4.6 and Llama-3.3-70B) and a fine-tuned T5-base model, on a five-class attribution task covering four commercial LLM families and an open-world ‘unknown’ class.
We introduce a statement-disjoint cross-validation protocol (SD-CV; defined in Section 3.5) that guarantees no content overlap between training and validation data, and contrast it with a run-disjoint baseline (RD-CV). T5 achieves Macro F1 = 0.991 (±0.008) under SD-CV and F1 = 0.978 on 24 completely held-out statements. This performance is robust despite a 2.1× increase in train-test content distance versus RD-CV (0.767 vs. 0.366, p < 0.001), demonstrating genuine stylometric generalization.
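The difference between the two protocols can be illustrated with a minimal sketch. It is not the procedure defined in Section 3.5, which is not reproduced here. It assumes that each text carries a statement identifier and a run identifier, that SD-CV groups folds by statement, and that RD-CV groups folds by run. The field names `statement_id`, `run_id`, and `family`, the helper names, and the toy corpus are all hypothetical.

```python
from collections import defaultdict


def grouped_folds(records, key, n_folds):
    """Yield (train_idx, val_idx) pairs such that no value of `key`
    appears on both sides of a split (grouped k-fold)."""
    groups = defaultdict(list)
    for i, rec in enumerate(records):
        groups[rec[key]].append(i)
    # Deterministic round-robin assignment of groups to folds.
    fold_of = {g: j % n_folds for j, g in enumerate(sorted(groups))}
    for f in range(n_folds):
        val = [i for g, idx in groups.items() if fold_of[g] == f for i in idx]
        train = [i for g, idx in groups.items() if fold_of[g] != f for i in idx]
        yield sorted(train), sorted(val)


def shared_statements(records, train, val):
    """Return the statement IDs that occur in both train and validation."""
    in_train = {records[i]["statement_id"] for i in train}
    in_val = {records[i]["statement_id"] for i in val}
    return in_train & in_val


# Hypothetical toy corpus: 6 statements x 2 runs x 2 model families.
records = [
    {"statement_id": s, "run_id": r, "family": m}
    for s in range(6)
    for r in range(2)
    for m in ("family_a", "family_b")
]

# Assumed SD-CV: folds grouped by statement, so content never leaks.
sd_folds = list(grouped_folds(records, "statement_id", n_folds=3))
# Assumed RD-CV: folds grouped by run, so the same statements recur.
rd_folds = list(grouped_folds(records, "run_id", n_folds=2))

for train, val in sd_folds:
    print("SD-CV shared statements:", len(shared_statements(records, train, val)))
for train, val in rd_folds:
    print("RD-CV shared statements:", len(shared_statements(records, train, val)))
```

In this toy setup the statement-grouped folds share no statements between training and validation, while the run-grouped folds share all of them. That is the kind of content overlap a statement-disjoint protocol is meant to rule out, so that high scores cannot be explained by memorized content.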
A fractional SD-CV analysis identifies a performance knee at 40% of training data (~440 texts). Our findings confirm that prompt-level anonymization alone cannot neutralize model identity signals, with direct implications for EU AI Act compliance (Articles 13, 14, 26) and for computer system validation (CSV) in quality-critical multi-agent deployments.