The Atlantic created a searchable database of the music used to train AI
Millions of tracks are freely available in datasets, even if they’re not supposed to be.
Atlantic reporter Alex Reisner recently uncovered four datasets of music being used to train AI models and made them fully searchable for the public. Two of the sets are absolutely enormous at 12 million and 9 million tracks. The other two are much smaller, but still represent a significant amount of training data at over 100,000 songs each.
According to Reisner, the sets have been downloaded thousands of times and, while it’s impossible to know exactly who has used them, Google and Stability have both confirmed in research papers that they have. Some of the sources, like the Free Music Archive dataset, are free to stream for personal use but require licensing for commercial applications.
While the datasets are freely available on the internet in theory, using them as training data is not as simple as downloading a ZIP file and feeding it to an AI model. As Reisner explains:
“Three of the datasets I found are distributed as a list of links to songs on YouTube or Spotify. AI developers download the actual audio using tools that automate the job, some of which allow developers to bypass logins, advertisements, and mechanisms that might earn money or subscribers for creators. Such tools violate the terms of service of these platforms.”
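To make that description concrete: a dataset distributed as a list of links contains no audio at all, only pointers to where the audio lives. The Python sketch below is a rough illustration, not anything taken from Reisner's reporting. It tallies which platform each link points to and downloads nothing. The sample URLs are made-up placeholders, and the hostname-to-platform mapping and the function names `platform_of` and `count_platforms` are assumptions chosen for the example.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical sample of a link-list dataset. These URLs are placeholders,
# not real entries from any of the datasets described in the article.
SAMPLE_LINKS = [
    "https://www.youtube.com/watch?v=EXAMPLE_ID_1",
    "https://youtu.be/EXAMPLE_ID_2",
    "https://open.spotify.com/track/EXAMPLE_ID_3",
    "https://example.org/audio/track4.mp3",
]

# Assumed mapping from hostname to platform label (illustrative, not exhaustive).
PLATFORM_BY_HOST = {
    "youtube.com": "YouTube",
    "youtu.be": "YouTube",
    "open.spotify.com": "Spotify",
}


def platform_of(url: str) -> str:
    """Return a platform label for a URL, or "other" if the host is unknown."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # treat www.youtube.com the same as youtube.com
    return PLATFORM_BY_HOST.get(host, "other")


def count_platforms(links):
    """Count how many links in the list point to each platform."""
    return Counter(platform_of(link) for link in links)


if __name__ == "__main__":
    for platform, n in count_platforms(SAMPLE_LINKS).most_common():
        print(f"{platform}: {n}")
```

The point of the sketch is only that such a list is metadata: turning it into training audio is a separate step, and that step is where the tools Reisner describes come in.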
Names that pop up in the datasets range from pop stars like Lady Gaga and Fred Again.. to Radiohead, Aphex Twin, Wu-Tang Clan, Bruce Springsteen, and experimental composer Hainbach. You can hop over to the Atlantic’s AI Watchdog site and search through the songs, books, and other media being used to train the world’s AI models yourself.