AEyeDE: An Attention-Based Attribution Framework for AI-Generated Text Detection
AEyeDE: An Attention-Based Attribution Framework for AI-Generated Text Detection
AEyeDE:一种用于 AI 生成文本检测的基于注意力归因框架
Abstract: Detecting AI-generated text is becoming increasingly challenging as modern language models approach human-level fluency and can evade detectors that rely on surface statistics or likelihood-based signals. 摘要: 随着现代语言模型在流畅度上日益接近人类水平,并能够规避那些依赖表面统计特征或基于似然信号的检测器,检测 AI 生成文本正变得越来越具有挑战性。
We propose \textsc{AEyeDE}, an attribution-driven approach to human-AI authorship detection that leverages model attention as a discriminative signal. Specifically, we extract attention-based attribution matrices for both human- and AI-generated text using a \emph{proxy} Transformer model with white-box access and train a lightweight Convolutional Neural Network to learn representations from these attribution maps. 我们提出了 \textsc{AEyeDE},这是一种基于归因的人机文本作者身份检测方法,它利用模型注意力作为判别信号。具体而言,我们使用具有白盒访问权限的“代理”(proxy)Transformer 模型,提取人类和 AI 生成文本的基于注意力的归因矩阵,并训练一个轻量级卷积神经网络(CNN)来从这些归因图中学习特征表示。
Across encoder-decoder translation settings, our method consistently outperforms a text-only baseline. In decoder-only settings, it performs strongly in generator-specific detection, remains competitive on standard benchmarks, and shows robustness under cross-dataset transfer and alternative-spelling perturbations. 在编码器-解码器(encoder-decoder)翻译设置中,我们的方法始终优于仅基于文本的基准模型。在仅解码器(decoder-only)设置中,它在特定生成器检测方面表现强劲,在标准基准测试中保持竞争力,并展现出在跨数据集迁移和拼写变体扰动下的鲁棒性。
We further show that attention maps exhibit recurring local structures whose relative frequencies differ consistently between human- and AI-generated text across datasets and proxy models. These findings suggest that attention-based attribution maps provide a complementary and interpretable signal for AI-generated text detection. We will make the code publicly available to support future research. 我们进一步证明,注意力图表现出重复的局部结构,这些结构在不同数据集和代理模型中,其相对频率在人类和 AI 生成文本之间存在显著差异。这些发现表明,基于注意力的归因图为 AI 生成文本检测提供了一种互补且可解释的信号。我们将公开代码以支持未来的研究。