Building a Resilient Meta Ads Scraper: What Breaks (and What I Learned Fixing It)

When I set out to build a tool for pulling ad data from Meta’s platforms, the brief I gave myself was deceptively simple: let someone search for ads by keyword and country, and get clean, structured data out the other end. The actual problem turned out to be everything in between. Meta’s official API doesn’t always cover what you need; the alternative, scraping the ad library directly, breaks every time the frontend changes; and the “ad data” coming out of either path is messier than it looks. Here’s how I approached it, and the decisions that mattered most.

The core problem: pick one access method, and you’ve already lost

My first instinct was to build against the Meta Graph API and stop there: it’s official, structured, and well-documented. But the Graph API has real limits. Certain queries need access tiers you don’t always have, and once you hit those walls, there’s no fallback. So instead of committing to one approach, I built the extraction layer around the Strategy pattern, with two interchangeable backends: the Graph API for high-volume structured access, and a Playwright-based browser path for everything the API won’t give you. The caller doesn’t need to know which one is running underneath. It just asks for ads, and the tool picks the right strategy.
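
To make the shape of that layer concrete, here is a minimal sketch of how two interchangeable backends could sit behind one entry point. Every name in it (AdQuery, ExtractionStrategy, GraphApiStrategy, BrowserStrategy, AdExtractor, can_handle) is hypothetical and chosen for illustration, and both backends return canned records instead of talking to Meta.

```python
from dataclasses import dataclass
from typing import Iterator, Protocol


@dataclass(frozen=True)
class AdQuery:
    """What the caller asks for (hypothetical shape)."""
    keyword: str
    country: str


class ExtractionStrategy(Protocol):
    """Anything that can turn a query into raw ad records."""

    def can_handle(self, query: AdQuery) -> bool: ...

    def fetch(self, query: AdQuery) -> Iterator[dict]: ...


class GraphApiStrategy:
    """Stand-in for the official API path; makes no network calls here."""

    def __init__(self, has_access: bool) -> None:
        self.has_access = has_access  # assumption: access tier is known up front

    def can_handle(self, query: AdQuery) -> bool:
        return self.has_access

    def fetch(self, query: AdQuery) -> Iterator[dict]:
        yield {"source": "graph_api", "keyword": query.keyword, "country": query.country}


class BrowserStrategy:
    """Stand-in for the Playwright path; always available as the fallback."""

    def can_handle(self, query: AdQuery) -> bool:
        return True

    def fetch(self, query: AdQuery) -> Iterator[dict]:
        yield {"source": "browser", "keyword": query.keyword, "country": query.country}


class AdExtractor:
    """Single entry point: callers never see which backend ran."""

    def __init__(self, strategies: list[ExtractionStrategy]) -> None:
        self.strategies = strategies  # ordered by preference

    def fetch(self, query: AdQuery) -> Iterator[dict]:
        # Use the first strategy that says it can serve this query.
        for strategy in self.strategies:
            if strategy.can_handle(query):
                return strategy.fetch(query)
        raise RuntimeError("no extraction strategy can handle this query")
```

With this layout, adding a third access path means writing one more class with the same two methods and placing it in the list. The selection rule shown here (first strategy that accepts the query) is only one possible policy.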

Why I scrape JSON, not HTML

The browser-based path was the harder design decision. Most scrapers I’d seen parse rendered HTML, which means every time Meta tweaks a class name or restructures a component, the scraper breaks. Instead, I had the browser intercept the raw XHR traffic: the JSON responses the frontend itself depends on to render the page. Meta’s design team can change the layout as often as they want; the underlying data contract the frontend consumes is far more stable, because breaking it would break their own product. That one decision made the scraper meaningfully more durable against UI changes than a typical HTML-parsing approach.
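
As a rough illustration of the interception idea, the sketch below registers a handler for Playwright’s "response" event and keeps only XHR/fetch responses that declare a JSON body. The helper names (is_json_xhr, make_collector, capture) and the fixed wait are my assumptions for the example; which responses actually carry ad data, and how they are shaped, is not something this sketch knows.

```python
import json
from typing import Any, Callable


def is_json_xhr(resource_type: str, content_type: str) -> bool:
    """True for XHR/fetch responses whose Content-Type mentions JSON."""
    return resource_type in {"xhr", "fetch"} and "json" in content_type.lower()


def make_collector(sink: list) -> Callable[[Any], None]:
    """Build a 'response' event handler that appends parsed JSON bodies to sink."""

    def on_response(response: Any) -> None:
        content_type = response.headers.get("content-type", "")
        if not is_json_xhr(response.request.resource_type, content_type):
            return
        try:
            sink.append(json.loads(response.text()))
        except Exception:
            # Body unavailable or not valid JSON: skip it rather than abort the run.
            pass

    return on_response


def capture(url: str, wait_ms: int = 5000) -> list:
    """Load url in headless Chromium and return every JSON payload seen."""
    # Imported lazily so the pure helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    payloads: list = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.on("response", make_collector(payloads))
        page.goto(url)
        page.wait_for_timeout(wait_ms)  # crude wait; assumes data loads within wait_ms
        browser.close()
    return payloads
```

The handler never touches the DOM, so a renamed CSS class or a reshuffled component has nothing to break. A real version would still need to filter payloads down to the ones that contain ads.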

Treating “scraped data” as untrusted input

Whether the data comes from the Graph API or the browser path, it arrives messy: inconsistent currency formatting, locale quirks, occasional malformed records. I didn’t want any of that reaching the user silently, so I put a validation layer in front of everything using Pydantic v2 models. Every record gets normalized and checked at the boundary. If something doesn’t conform, it’s filtered out rather than passed through to quietly corrupt someone’s downstream analysis. It’s a small architectural choice, but it’s the difference between a tool people can trust and one they have to double-check by hand.
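
Here is a small sketch of what such a boundary could look like with Pydantic v2. The AdRecord fields, the normalization rules, and the optional rejected list are assumptions made for the example, not the tool’s real schema. In particular, treating a comma as a thousands separator is only correct for some locales.

```python
from decimal import Decimal
from typing import Iterable, Iterator, Optional

from pydantic import BaseModel, ValidationError, field_validator


class AdRecord(BaseModel):
    """Hypothetical normalized record; real field names will differ."""
    ad_id: str
    country: str
    currency: str
    spend: Decimal

    @field_validator("country", "currency", mode="before")
    @classmethod
    def upper_code(cls, value):
        # " us " -> "US"; non-strings are left for Pydantic to reject.
        return value.strip().upper() if isinstance(value, str) else value

    @field_validator("spend", mode="before")
    @classmethod
    def strip_grouping(cls, value):
        # "1,234.50" -> "1234.50"; assumes the comma is a thousands separator.
        return value.replace(",", "").strip() if isinstance(value, str) else value


def validate_records(
    raw: Iterable[dict], rejected: Optional[list] = None
) -> Iterator[AdRecord]:
    """Yield only records that pass validation; optionally log the rest."""
    for item in raw:
        try:
            yield AdRecord.model_validate(item)
        except ValidationError as exc:
            if rejected is not None:
                # Keep the raw item and the error count for later inspection.
                rejected.append((item, len(exc.errors())))
```

Passing a list as rejected keeps the filtering visible: bad records never reach the output, but they are not lost without a trace either.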

Memory matters more than you think at scale

A search across multiple countries and keywords can return a lot of records, and an early version of this tool held everything in memory before writing it out, which works fine until a search returns more records than fit comfortably in memory. I rebuilt the persistence layer as a streaming exporter for both CSV and JSON, writing each record as it arrives instead of batching everything first. Memory usage stays flat regardless of how many ads come back, which matters a lot more once a search returns thousands of results instead of a handful.
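
The streaming idea can be sketched with the standard library alone. The function names below are hypothetical; both take any iterable of records (a generator works) and write one record at a time, so nothing accumulates beyond the record currently being written.

```python
import csv
import json
from typing import IO, Iterable, Mapping, Sequence


def stream_csv(records: Iterable[Mapping], out: IO[str], fields: Sequence[str]) -> int:
    """Write records to out as CSV, one row per record; return the row count."""
    # extrasaction="ignore" drops keys that are not listed in fields.
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for record in records:
        writer.writerow(record)
        count += 1
    return count


def stream_json_array(records: Iterable[Mapping], out: IO[str]) -> int:
    """Write records to out as one JSON array without building a list first."""
    out.write("[")
    count = 0
    for record in records:
        if count:
            out.write(",")  # separator goes before every record except the first
        out.write("\n  " + json.dumps(record, ensure_ascii=False))
        count += 1
    out.write("\n]\n")
    return count
```

Writing the array brackets and separators by hand is what lets the JSON output stay a single valid document while still being produced incrementally. A file opened for CSV output should use newline="" as the csv module documentation recommends.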

Rate limits aren’t an edge case, they’re the default

Anyone who’s worked with Meta’s APIs knows HTTP 429 responses aren’t rare. They’re expected behavior once you’re making any real volume of requests. A scraper that crashes on the first rate limit isn’t really a scraper, it’s a demo. I integrated tenacity for exponential-backoff retries on both the API and browser paths, so transient errors and rate limiting get absorbed instead of taking the whole run down.

The CLI is part of the product, not an afterthought

It’s easy to treat the command line as a throwaway wrapper around the “real” logic. But I didn’t want to run a tool that gave me no feedback during a multi-minute scrape, so I built the CLI with click and rich: progress spinners, formatted logs, and a search summary at the end. None of that changes what the tool does under the hood, but it changes whether you actually want to run it.
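
For a sense of how little code that feedback takes, here is a stripped-down sketch using click for options and rich for the spinner and summary table. The command name, its options, and run_search are placeholders I made up; run_search returns a canned record instead of doing any scraping.

```python
import click
from rich.console import Console
from rich.table import Table


def run_search(keyword: str, country: str) -> list:
    """Stand-in for the real extraction call; returns canned records."""
    return [{"ad_id": "1", "keyword": keyword, "country": country}]


@click.command()
@click.option("--keyword", required=True, help="Search term.")
@click.option("--country", default="US", show_default=True, help="Country code.")
def search(keyword: str, country: str) -> None:
    """Search for ads and print a short summary."""
    console = Console()

    # Shows a spinner on an interactive terminal while the search runs.
    with console.status(f"Searching {country} for '{keyword}'..."):
        records = run_search(keyword, country)

    table = Table(title="Search summary")
    table.add_column("Keyword")
    table.add_column("Country")
    table.add_column("Ads found", justify="right")
    table.add_row(keyword, country, str(len(records)))
    console.print(table)
```

To run it as a script, call search() under an if __name__ == "__main__": guard. The extraction logic stays in run_search, so the presentation layer can change without touching it.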

What I’d do differently

If I rebuilt this today, I’d push validation even earlier, catching malformed records right at the point of interception rather than after they’ve already been parsed into intermediate objects. I’d also want a caching layer so repeated searches for the same keyword/country pair don’t re-hit Meta unnecessarily. Scraping tools age fast; the parts that survive UI redesigns and API changes are usually the parts you over-invested in early: validation, retries, and a clean separation between “how we get the data” and “what we do with it.”

That’s roughly the shape of it: two interchangeable extraction strategies, a validation boundary that doesn’t trust anything coming in, streaming I/O so memory never becomes the bottleneck, and resilience built in from the start rather than bolted on after the first crash. If you’re building anything that talks to a platform you don’t control, that combination of multiple access paths, strict validation, and assuming failure is normal will save you more time than any single clever trick.