I wrapped a backlink API in an MCP server so I could do SEO gap analysis from inside Claude

I do a fair amount of competitor backlink research, and the workflow always annoyed me: open a dashboard, run a query, export a CSV, eyeball it, copy domains into a doc, switch to email. That’s a lot of tab-hopping for what is fundamentally a data-filtering problem, the kind an agent should handle.

So I wrapped the backlink API I’d been using in an MCP server. Now I stay in Claude Code (or Cursor, Cline, Zed, Windsurf) and just describe the goal. This post covers the build: the architecture, the four tools, and the one design decision I’m still not sure about.

The data source

The server runs on the Common Crawl hyperlink webgraph: about 4.4 billion edges across 120 million domains, published quarterly as Parquet. That matters for an MCP tool specifically: the data is open, so there’s no scraped-proprietary-index liability in handing it to an agent, and anyone can reproduce the same query.

The HTTP API in front of it (CrawlGraph) does the heavy DuckDB work; the MCP server is a thin TypeScript stdio client over it. Keeping the server thin was deliberate: all the query cost, caching, and quota logic lives server-side, so the MCP package stays a ~300-line wrapper that’s easy to audit before you hand it your API key.

The four tools

  • backlinks → referring domains for a target, with authority scores

  • gap_analysis → domains linking to your competitors but not to you

  • gap_outreach_targets → the composite play (below)

  • releases → list the Common Crawl snapshots

backlinks and gap_analysis map 1:1 to API endpoints. gap_analysis is the interesting primitive: submit your domain plus 2-5 competitors, and it returns every domain that links to at least one competitor but not to you, each tagged with a found_on array listing which competitors it links to.
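
To make the gap semantics concrete, here is a minimal in-memory sketch of the same set logic. It is not how the API computes it (that happens server-side in DuckDB); the Edge type, the computeGap name, and the sample domains are all assumptions for illustration. Only linking_domain and found_on come from the actual result shape.

```typescript
// Hypothetical shapes: only `linking_domain` and `found_on` appear in the post.
interface Edge { src: string; dst: string }            // "src links to dst"
interface GapRow { linking_domain: string; found_on: string[] }

// Return every domain that links to at least one competitor but not to `you`.
function computeGap(edges: Edge[], you: string, competitors: string[]): GapRow[] {
  const competitorSet = new Set(competitors);
  const linksToYou = new Set<string>();
  const foundOn = new Map<string, Set<string>>();

  for (const { src, dst } of edges) {
    if (dst === you) {
      linksToYou.add(src);
    } else if (competitorSet.has(dst)) {
      if (!foundOn.has(src)) foundOn.set(src, new Set<string>());
      foundOn.get(src)!.add(dst);
    }
  }

  const rows: GapRow[] = [];
  foundOn.forEach((hits, domain) => {
    if (linksToYou.has(domain)) return;                // already links to you: not a gap
    rows.push({ linking_domain: domain, found_on: Array.from(hits).sort() });
  });
  return rows;
}

// Made-up sample data: news.example links to you, so only blog.example is a gap.
const sampleEdges: Edge[] = [
  { src: "blog.example", dst: "competitor-a.com" },
  { src: "blog.example", dst: "competitor-b.com" },
  { src: "news.example", dst: "competitor-a.com" },
  { src: "news.example", dst: "mydomain.com" },
];
const sampleGap = computeGap(sampleEdges, "mydomain.com", ["competitor-a.com", "competitor-b.com"]);
```

In this sample, sampleGap holds a single row for blog.example with both competitors in found_on.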

The composite tool, and the decision I’m unsure about

Most API-wrapper MCP servers are pure 1:1 mappings. I added one opinionated composite tool, gap_outreach_targets, because the raw gap output isn’t the thing you actually want; it’s the raw material for the thing you want.

What it does on top of gap_analysis:

  1. Filters to total overlap. Keep only domains whose found_on covers every competitor you listed. A site linking to one competitor might be a fluke or a paid placement. A site linking to all of them (say, all three) is a publisher who covers your whole niche and has simply never heard of you. That overlap is the qualifier.
  2. Strips platform noise. amazonaws.com, github.io, facebook.com, CDNs, URL shorteners: they show up in every backlink profile and are never outreach targets. There’s a denylist with suffix matching so subdomains get caught too.
  3. Ranks by authority. For the top N survivors it makes a cheap per-domain authority lookup and sorts, so the highest-value warm targets surface first. Enrichment is on by default and you can opt out (enrich_top: 0), because each lookup costs one API call against quota.
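// the core filter, roughly
const priority = gaps
  // drop platforms, CDNs, and shorteners via the denylist
  .filter(g => !isPlatformNoise(g.linking_domain))
  // keep only domains that link to every listed competitor
  .filter(g => g.found_on.length === competitors.length)
  // highest authority first; domains without a score sort last
  .sort((a, b) => (b.cg_authority ?? -1) - (a.cg_authority ?? -1));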
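
The isPlatformNoise helper isn’t shown above. A rough sketch of what denylist suffix matching might look like follows; this is an assumption, not the code from the package. The first three denylist entries are the ones named in step 2, bit.ly is an assumed example of a URL shortener, and the real list is presumably longer.

```typescript
// Illustrative denylist only. The first three entries are named in the post;
// "bit.ly" is an assumed example of a URL shortener.
const PLATFORM_DENYLIST: readonly string[] = [
  "amazonaws.com",
  "github.io",
  "facebook.com",
  "bit.ly",
];

// True if `domain` is a denylisted entry or any subdomain of one.
function isPlatformNoise(
  domain: string,
  denylist: readonly string[] = PLATFORM_DENYLIST,
): boolean {
  // Normalize case and drop a trailing root dot ("example.com." -> "example.com").
  const d = domain.trim().toLowerCase().replace(/\.$/, "");
  // Match on a label boundary so "notfacebook.com" is not caught by "facebook.com".
  return denylist.some(entry => d === entry || d.endsWith("." + entry));
}
```

The detail worth getting right is the dot in the suffix check: a bare endsWith("facebook.com") would also flag unrelated domains that happen to end in the same characters.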

The decision I keep going back and forth on: is a composite, opinionated tool the right call for an MCP server, or should it stay a pure API mirror and let the agent do the filtering/ranking in its own reasoning?

  • Arguments for the composite tool: It encodes a workflow the model would otherwise have to reconstruct each time, costing tokens and inviting mistakes (I watched an agent forget to filter platforms more than once). It returns a small, ranked, decision-ready list instead of a 1,000-row dump the model has to chew through.
  • Arguments against: It’s a leaky abstraction. The moment someone wants a slightly different filter (2-of-3 overlap, a different noise list), they’re fighting my opinion instead of composing primitives. It hides the platform denylist, which is a judgment call that should arguably be visible.
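
One possible middle ground is to keep the composite but expose its opinions as parameters. The sketch below is hypothetical: minOverlap, isNoise, and selectTargets are names invented here, not options the published tool has.

```typescript
interface GapRow { linking_domain: string; found_on: string[] }

// Hypothetical knobs; neither exists in the published tool.
interface OutreachOptions {
  minOverlap?: number;                     // default: every listed competitor
  isNoise?: (domain: string) => boolean;   // default: nothing is treated as noise
}

function selectTargets(
  gaps: GapRow[],
  competitors: string[],
  opts: OutreachOptions = {},
): GapRow[] {
  // Clamp so a too-large minOverlap means "all competitors" rather than "no results".
  const minOverlap = Math.min(opts.minOverlap ?? competitors.length, competitors.length);
  const isNoise = opts.isNoise ?? (() => false);
  return gaps.filter(
    g => !isNoise(g.linking_domain) && g.found_on.length >= minOverlap,
  );
}
```

With three competitors, selectTargets(gaps, competitors) keeps the strict total-overlap behaviour, while selectTargets(gaps, competitors, { minOverlap: 2 }) gives the 2-of-3 variant. Whether that is better than letting the agent compose primitives is exactly the open question.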

I landed on “ship both” (the raw gap_analysis primitive and the composite), but I’m genuinely unsure that’s not just indecision dressed up as flexibility. If you’ve built MCP servers, I’d like to hear where you draw the primitive-vs-composite line.

Using it

{
  "mcpServers": {
    "crawlgraph": {
      "command": "npx",
      "args": ["-y", "crawlgraph-mcp"],
      "env": {
        "CRAWLGRAPH_API_KEY": "cg_live_..."
      }
    }
  }
}

Then the whole workflow collapses to one sentence: “Use gap_outreach_targets for mydomain.com against competitor-a.com and competitor-b.com, then draft a short outreach email to each priority target.” The agent submits the gap job, polls it, filters and ranks, and writes the emails, all in one turn.
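
The submit-then-poll step could be sketched as a small generic helper. Everything here is an assumption: the JobStatus shape, the default interval, and the attempt cap are invented for illustration, and the status check is injected as a function so no real endpoint is implied.

```typescript
// Hypothetical job states; the real API's status payload may differ.
type JobStatus<T> =
  | { state: "pending" }
  | { state: "done"; result: T }
  | { state: "failed"; error: string };

// Call `check` until the job finishes, fails, or the attempt budget runs out.
async function pollUntilDone<T>(
  check: () => Promise<JobStatus<T>>,
  { intervalMs = 2000, maxAttempts = 30 }: { intervalMs?: number; maxAttempts?: number } = {},
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const status = await check();
    if (status.state === "done") return status.result;
    if (status.state === "failed") throw new Error(`gap job failed: ${status.error}`);
    // Still pending: wait before the next check, but not after the last one.
    if (attempt < maxAttempts) {
      await new Promise<void>(resolve => setTimeout(resolve, intervalMs));
    }
  }
  throw new Error(`gap job still pending after ${maxAttempts} checks`);
}
```

Capping the attempts matters in an agent context: a tool call that never returns blocks the whole turn, so failing loudly after a bounded wait is the safer default.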

Honest limitations

  • Quarterly snapshot. Common Crawl publishes ~4x/year, so this is for one-off prospecting, not live link monitoring. If you need “what changed this week,” it’s the wrong tool.
  • No anchor text in the gap result. The webgraph is (src, dst) edges; anchor text needs a separate WARC pass I didn’t wire into the MCP server.
  • Authority enrichment costs calls. Each scored domain is one API call, hence the cap.
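
To show how the enrichment cap bounds quota use, here is a minimal sketch under stated assumptions: enrichAndRank and the injected lookup function are hypothetical names, and “top N” is taken to mean the first N rows in whatever order the filtered list arrives. Only enrich_top, linking_domain, found_on, and cg_authority come from the tool itself.

```typescript
interface GapRow { linking_domain: string; found_on: string[]; cg_authority?: number }

// Hypothetical lookup: one call per domain, so each call spends one unit of quota.
type AuthorityLookup = (domain: string) => Promise<number | undefined>;

async function enrichAndRank(
  rows: GapRow[],
  enrichTop: number,
  lookup: AuthorityLookup,
): Promise<GapRow[]> {
  // Never spend more lookups than enrichTop; 0 disables enrichment entirely.
  const budget = Math.max(0, Math.min(Math.floor(enrichTop), rows.length));
  const enriched: GapRow[] = [];
  for (const row of rows.slice(0, budget)) {
    const score = await lookup(row.linking_domain);   // sequential, to stay gentle on the API
    enriched.push(score === undefined ? row : { ...row, cg_authority: score });
  }
  // Rows beyond the budget keep no score and therefore sort after scored ones.
  return enriched
    .concat(rows.slice(budget))
    .sort((a, b) => (b.cg_authority ?? -1) - (a.cg_authority ?? -1));
}
```

The trade-off is visible in the sort: anything past the budget is ranked as if it had no authority, so a low cap saves calls at the cost of possibly burying a strong domain.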

Code is MIT, on GitHub and npm (npx -y crawlgraph-mcp). Feedback on the composite-tool question is especially welcome; it’s the part of the design I’m least settled on.