Aggressive AI scrapers are making it kinda suck to run wikis

Bots are currently scraping the internet for LLM training data at unprecedented rates, driving up costs and destabilizing public-facing websites. I want to talk about how this has been particularly difficult for wikis, and how it has gotten much worse in the last few months.

I run Weird Gloop, which hosts some of the biggest video game wikis ever, like the Minecraft, Old School RuneScape (OSRS) and League of Legends wikis. Over the last 3 years, we’ve had to spend more and more of our time fighting bot traffic that is spiky, disproportionately expensive, and getting harder to distinguish from humans.

If we weren’t constantly mitigating the bots, they would use ~10x more of our compute resources than everything else put together - even though that “everything else” includes tens of millions of (human) pageviews and tens of thousands of edits a day.

Everyone who runs wikis is dealing with the exact same problem. The Wikimedia Foundation has a post about how it is impacting their operations, every major wiki farm has had some degree of service outage, and some smaller independent wikis have been knocked completely offline. Overall, I’d guess that about 95% of all server issues in the wiki ecosystem this year have been caused by bad scrapers.

Every wiki sysadmin I’ve talked to is dealing with these specific problems:

The scrapers are pretending to be human visitors, and getting pretty good at it

Most of the discussion I’ve seen about scrapers has focused on bots operated by the major AI companies (GPTBot, ClaudeBot, PerplexityBot, etc). Although these “official” bots have at times struggled to respect robots.txt, at least they usually identify themselves properly as bots in their User Agent string, which makes it really easy for a website operator to block them with Cloudflare, nginx, or any number of other techniques.
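To make the User Agent approach concrete, here is a minimal Python sketch of that kind of check. The token list and the function name `is_declared_ai_bot` are assumptions made up for illustration, not anyone's actual configuration, and in practice this logic usually lives in a Cloudflare rule or an nginx config rather than in application code.

```python
# Hypothetical, illustrative list of tokens that self-identifying AI crawlers
# include in their User-Agent header. A real deployment would maintain a
# longer list.
DECLARED_AI_BOT_TOKENS = ("gptbot", "claudebot", "perplexitybot")


def is_declared_ai_bot(user_agent: str) -> bool:
    """Return True if the User-Agent contains a known AI crawler token."""
    ua = user_agent.lower()
    return any(token in ua for token in DECLARED_AI_BOT_TOKENS)
```

A check like this only catches bots that announce themselves. A request that sends a stock Chrome User Agent passes straight through.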

The problem is that when webmasters started blocking AI scrapers based on User Agent, it created a massive incentive for bots to pretend to be human traffic, so as to avoid getting blocked. This game of cat and mouse has played out over the last few years, and the bots have gotten pretty darn good at imitating human requests.

Now the majority of AI scraper traffic that hits our wikis is made up of carefully crafted requests that send the right headers to pass as recent versions of Google Chrome, which eliminates the obvious “bot or real person” signals that we previously could use to block them.

They’re using tens of millions of IP addresses

Before 2023, if we had a problem with how someone was scraping the wiki, 95% of the time they would only be using a single IP address, or a single datacenter with a small subnet of IPs. So it was mostly effective to block bad actors based on IP or ISP characteristics.
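For reference, blocking by IP or subnet is about as simple as it sounds. Here is a rough sketch using Python's standard `ipaddress` module. The blocklist entries are reserved documentation ranges standing in for a real offender, and `is_blocked` is a hypothetical helper name.

```python
import ipaddress

# Hypothetical blocklist. These are reserved documentation ranges (RFC 5737)
# standing in for a real offender's address or datacenter subnet.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.7/32"),   # a single abusive IP
    ipaddress.ip_network("198.51.100.0/24"),  # a small datacenter subnet
]


def is_blocked(client_ip: str) -> bool:
    """Return True if the client IP falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in network for network in BLOCKED_NETWORKS)
```

A list like this stays short and manageable as long as the offender sits on one address or one subnet.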

…Enter residential proxies, where anyone with a credit card can get all of their scraping requests “laundered” through a network of millions of IP addresses. The wikis sometimes get hit by scraper runs that cycle through a million IPs a day, and they look like they’re coming from legit places: mostly residential ISPs (Comcast, AT&T, Charter, etc) where the customer probably doesn’t even know their IP is being used as an exit node for a residential proxy.

Beyond residential proxies, a lot of the scraping is happening on IPs that belong to Facebook and Google. Bad actors are able to use facebookexternalhit link previews or Google Translate to make the requests happen on Google/Facebook servers, which completely obscures the source of the requests. At times we’ve had to break Google Translate’s URL tool for all our wikis, because 99.99% of the requests coming through it are abusive.

They’re mostly crawling stupid URLs

Most of these AI scrapers seem to select their targets in the dumbest way possible: visit the homepage of the wiki, visit all the links on that page, visit all the links on THOSE pages… repeat until all links are visited. They don’t seem to have any awareness that there’s a robots.txt and a sitemap that tell them which URLs are worth scraping.
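For contrast, here is roughly what a polite crawler could do before following a single link, using Python's standard `urllib.robotparser`. The robots.txt contents, the `wiki.example.org` host, the URL layout and the `ExampleCrawler` name are all made up for this sketch. A real crawler would fetch the site's actual robots.txt instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example wiki. The paths and the sitemap URL
# are illustrative, not taken from any real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /w/index.php
Sitemap: https://wiki.example.org/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(url: str, user_agent: str = "ExampleCrawler") -> bool:
    """Return True if robots.txt allows this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


# The sitemap lists the URLs the site actually wants crawled (Python 3.8+).
sitemap_urls = parser.site_maps()
```

With rules like these, article URLs are allowed, script URLs with query strings are not, and the sitemap gives the crawler a finite list to work through.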

There’s a reason this is an especially dumb strategy for wikis. OSRS Wiki has about 40,000 “articles”, so that’s 40,000 URLs that make up the vast majority of the useful information on the site. But once you account for all the old revisions, edit screens and special pages that are used by people editing the wiki, there’s at least a billion navigable URLs. That means two things for scrapers hitting wikis: (1) this naive scraping process is never going to finish, and (2) the vast majority of the requests are not doing anything useful.
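To give a feel for what separates the article URLs from everything else, here is a small sketch that classifies a URL using standard MediaWiki conventions (`oldid=` and `diff=` for revisions, `action=` for edit and history screens, and the `Special:` namespace). The function name `is_editor_url` and the example host and paths are assumptions for illustration, and the rules are deliberately incomplete.

```python
from urllib.parse import parse_qs, unquote, urlsplit


def is_editor_url(url: str) -> bool:
    """Guess whether a MediaWiki URL is an editor-facing view, not an article."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)

    # Old revisions and diffs are addressed with oldid= and diff=.
    if "oldid" in query or "diff" in query:
        return True

    # Anything other than the default "view" action (edit, history, ...) is
    # an editor-facing screen.
    if query.get("action", ["view"])[0] != "view":
        return True

    # Special pages live in the "Special:" namespace, either in the path or
    # in the title= parameter.
    title = query.get("title", [""])[0]
    return "Special:" in unquote(parts.path) or title.startswith("Special:")
```

Going by the figures above, the 40,000 articles are at most about 0.004% of a billion navigable URLs, so a crawler without some filter along these lines spends nearly all of its requests on the editor-facing side.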

Most of these URLs can’t possibly be useful data for training an LLM, but it seems like that’s what they’re spending most of their resources on. These weird requests are also unusually expensive for us to serve, since they bypass the various layers of caching that most requests from real users hit. Cache hits usually take less than 20 milliseconds of processing time, but these weird old diffs can frequently take 1-2 seconds.

This means that top-line metrics (“8 million bot requests a day”, “bots are using 65% of my bandwidth”, etc) seriously undersell the scope of the problem, because CPU capacity is usually the important bottleneck, and the bot requests with all the weird query parameters are often 50-100x as expensive to serve.
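A bit of back-of-the-envelope arithmetic shows how much a request count can hide. The sketch below assumes, as a simplification, that every human request costs one unit (a cache hit) and every bot request costs `cost_ratio` units. The 10% request share in the example is an invented number, not a measurement.

```python
def bot_cpu_share(bot_request_share: float, cost_ratio: float) -> float:
    """Fraction of CPU time spent on bots, given their share of requests.

    Simplifying assumption: each human request costs 1 unit and each bot
    request costs `cost_ratio` units.
    """
    bot_cost = bot_request_share * cost_ratio
    human_cost = (1.0 - bot_request_share) * 1.0
    return bot_cost / (bot_cost + human_cost)


# Hypothetical example: bots are only 10% of requests, at 50x and 100x cost.
share_at_50x = bot_cpu_share(0.10, 50)    # 5 / 5.9, roughly 85% of CPU
share_at_100x = bot_cpu_share(0.10, 100)  # 10 / 10.9, roughly 92% of CPU
```

Under those assumptions, a dashboard that says “10% of requests are bots” is describing traffic that eats most of the CPU.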

The worst bot traffic is very spiky, so aggregate metrics undersell the problem

I said earlier that we get about 250 million bot requests per month (about 100 per second), but that’s just the long-term average: these scrapers frequently operate in short bursts of 1000+ requests a second, almost indistinguishable from a good old-fashioned DDoS attack. So even though the bots might only be ~50% of our total CPU usage long-term, their abusive traffic spikes are responsible for ~95% of the slowness and outages that wikis have been dealing with.
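Here is a quick sanity check of that conversion, plus a toy illustration of how an average hides a burst. The 30-day month and the one-minute synthetic trace (a quiet 20 requests per second with a 5-second burst at 1000 per second) are assumptions invented for this sketch, not real traffic data.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def average_rate(requests_per_month: float) -> float:
    """Convert a monthly request count to average requests per second."""
    return requests_per_month / SECONDS_PER_MONTH


def peak_to_mean(per_second_counts: list[int]) -> float:
    """Ratio of the busiest second to the average second in a trace."""
    mean = sum(per_second_counts) / len(per_second_counts)
    return max(per_second_counts) / mean


monthly_average = average_rate(250_000_000)  # roughly 96 requests per second

# Synthetic one-minute trace: 55 quiet seconds, then a 5-second burst.
trace = [20] * 55 + [1000] * 5
burst_ratio = peak_to_mean(trace)  # the peak is nearly 10x the average
```

The synthetic minute averages about 100 requests per second, the same as the long-term figure, yet the servers have to survive the seconds that run at 1000.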

It’s not clear who’s doing it

I keep calling this bad traffic “AI scrapers”, but because…