Small edits, large models: How Wikipedia advocacy shapes LLM values
Abstract: Can a small group of volunteers shape how AI systems discuss animal welfare, just by editing Wikipedia? We show that they can. Wikipedia appears in nearly every major language model training dataset and is weighted more heavily than web-crawled text. The Pro-Animal Wikipedians (PAW), a group of advocates who add sourced animal welfare content to relevant articles, have made 125 edits across 115 pages.
Using gradient-based data attribution (Bergson; MAGIC), we traced how these edits influence language model behavior. TrackStar retrieval attribution on Llama 3.1 8B found that PAW-edited sections made up 68 percent of the highest-attributed documents for animal welfare queries (p < 0.0001) but only 52 percent for unrelated queries about the same companies (p = 0.53, not statistically significant). The model therefore links PAW content specifically to animal welfare topics, not to the entities in general.
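To make the comparison of the two shares concrete, the sketch below shows one way a PAW share among top-attributed documents could be tested against chance. It is an illustration only: the abstract does not state which test was used, the 50 percent chance baseline assumes a balanced pool of PAW and control documents, and the count `n_top = 100` is hypothetical, so the resulting p-values are not expected to reproduce the reported ones.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float = 0.5) -> float:
    """One-sided exact binomial p-value: P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical sample size; the source reports shares, not counts.
n_top = 100

# Shares taken from the abstract (68% and 52%), applied to the assumed n_top.
p_welfare = binomial_tail(68, n_top)  # animal welfare queries
p_general = binomial_tail(52, n_top)  # unrelated queries about the same companies

print(f"welfare queries: p = {p_welfare:.5f}")
print(f"general queries: p = {p_general:.3f}")
```

Under these assumptions a 68 percent share is very unlikely to arise by chance, while a 52 percent share is not, which is the qualitative contrast the attribution result describes.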
MAGIC counterfactual influence estimation on Llama-3.2-1B, run across five random training-order seeds, showed the same pattern more sharply. In every seed, the 10 most influential documents on animal welfare queries were all PAW edits (10 of 10 in all 5 seeds), while on general queries the PAW share of the top 10 sat at chance (4 to 6 of 10). Mean PAW influence exceeded mean control influence on animal welfare queries with p < 0.0001 in every seed, an effect 6 to 30 times larger than on general queries. Leave-subset-out validation gave Spearman rho = 1.00 for all 10 runs.
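The two summary statistics in this paragraph, the PAW share of the top-10 documents and the Spearman rank correlation used in leave-subset-out validation, can be sketched as follows. This is a minimal illustration, not the MAGIC implementation: the helper names `top_k_share` and `spearman_rho` and all numeric inputs are hypothetical, and the validation is assumed to compare predicted against measured loss changes when subsets of training data are removed.

```python
def top_k_share(scores, is_paw, k=10):
    """Fraction of the k highest-scoring documents that are PAW edits."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(1 for i in top if is_paw[i]) / k


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            result[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values: predicted vs. measured loss change for four removed subsets.
predicted = [0.02, 0.10, 0.05, 0.30]
measured = [0.01, 0.20, 0.04, 0.90]
rho = spearman_rho(predicted, measured)

# Hypothetical influence scores for four documents, two of them PAW edits.
share = top_k_share([0.9, 0.1, 0.8, 0.2], [True, False, True, False], k=2)
```

A rho of 1.00 means the predicted and measured effects are ordered identically, even if their magnitudes differ, which is why rank correlation is a natural check for an influence estimator.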
When we fine-tuned separate models on PAW content versus control content, each model performed better specifically on the type of text it was trained on: the PAW-trained model cut perplexity on animal welfare text from 12.4 to 8.4, while the control-trained model cut perplexity on control text from 16.1 to 11.4. A small, coordinated Wikipedia editing campaign therefore measurably shapes how language models handle the topics those edits address.
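For readers less familiar with the metric, the sketch below shows how perplexity follows from per-token negative log-likelihoods and how the reported drops translate into relative reductions. The function names are illustrative, and the per-token losses are assumed to be natural-log values from a standard language modeling loss; only the four perplexity figures come from the abstract.

```python
import math


def perplexity(token_nlls):
    """Perplexity is the exponential of the mean per-token negative log-likelihood."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_reduction(before, after):
    """Fractional drop in perplexity after fine-tuning."""
    return (before - after) / before


# Perplexities reported in the abstract.
paw_drop = relative_reduction(12.4, 8.4)       # PAW-trained model on animal welfare text
control_drop = relative_reduction(16.1, 11.4)  # control-trained model on control text

print(f"PAW model: {paw_drop:.1%} lower perplexity")
print(f"Control model: {control_drop:.1%} lower perplexity")
```

Both reductions are of similar relative size, roughly 32 percent and 29 percent, which is consistent with each model improving mainly on the type of text it was trained on.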