Book publishers sue Meta over AI’s ‘word-for-word’ copying

Macmillan, McGraw Hill, Cengage, and others claim Meta carried out ‘one of the most massive infringements of copyrighted materials in history.’

Meta is facing a class action lawsuit filed by five major book publishers and one author over claims the company “engaged in one of the most massive infringements of copyrighted materials in history” when training its Llama AI models, as reported earlier by The New York Times. In their suit, Macmillan, McGraw Hill, Elsevier, Hachette, Cengage, and author Scott Turow allege that Meta “repeatedly copied” their books and journal articles without permission.

The lawsuit accuses Meta of knowingly ripping copyrighted work from “notorious pirate sites,” such as LibGen, Anna’s Archive, Sci-Hub, Sci-Mag, and others, and then feeding that material into its AI model. It also claims that Meta trained Llama with information inside the Common Crawl dataset, which is allegedly “full of unauthorized copies of copyrighted works.” As a result, Llama “outputs verbatim and near-verbatim substitutes” of copyrighted material:

For example, when prompted with two brief sentences from Cengage’s best-selling textbook, Calculus: Early Transcendentals, 9th edition, by James Stewart, Llama begins reproducing word-for-word the continuation of the section.

Several authors have already sued Meta for alleged copyright infringement, which brought to light the company’s internal discussions about how to handle “media coverage suggesting we have used a dataset we know to be pirated.” Last year, a federal judge ruled in favor of Meta in one of these lawsuits, though he pointed out that his ruling “does not stand for the proposition that Meta’s use of copyrighted materials to train its language models is lawful.”

A group of authors also sued Anthropic over copyright infringement. While a federal judge ruled that training AI models on legally purchased books without permission is considered fair use, he allowed the authors to move forward with a class action lawsuit over the “millions” of works Anthropic allegedly pirated. Anthropic agreed to pay writers $1.5 billion last year to settle the class action lawsuit.

Turow and the group of publishers are suing Meta for damages and asking the court to order the company to stop its allegedly unlawful activities. They are also asking the court to require the company to provide a list of the books, journal articles, and other copyrighted works that it used to train its Llama AI models.

“AI is powering transformative innovations, productivity and creativity for individuals and companies, and courts have rightly found that training AI on copyrighted material can qualify as fair use,” Meta spokesperson Dave Arnold said in an emailed statement to The Verge. “We will fight this lawsuit aggressively.”