I wrapped a backlink API in an MCP server so I could do SEO gap analysis from inside Claude

I do a fair amount of competitor backlink research, and the workflow always annoyed me: open a dashboard, run a query, export a CSV, eyeball it, copy domains into a doc, switch to email. That is a lot of tab-hopping for what is fundamentally a data-filtering problem that an agent should handle.

So I wrapped the backlink API I’d been using into an MCP server. Now I stay in Claude Code (or Cursor, Cline, Zed, Windsurf) and just describe the goal. This is the build: the architecture, the four tools, and the one design decision I’m still not sure about.

The data source

The server runs on the Common Crawl hyperlink webgraph: about 4.4 billion edges across 120 million domains, published quarterly as Parquet. That matters for an MCP tool specifically: the data is open, so there’s no scraped-proprietary-index liability in handing it to an agent, and anyone can reproduce the same query.

The HTTP API in front of it (CrawlGraph) does the heavy DuckDB work; the MCP server is a thin TypeScript stdio client over it. Keeping the MCP server thin was deliberate: all the query cost, caching, and quota logic lives in the API, so the MCP package stays a ~300-line wrapper that’s easy to audit before you hand it your API key.

The four tools

backlinks → referring domains for a target, with authority scores
gap_analysis → domains linking to your competitors but not to you
gap_outreach_targets → the composite play (below)
releases → list the Common Crawl snapshots

backlinks and gap_analysis map 1:1 to API endpoints. gap_analysis is the interesting primitive: submit your domain plus 2-5 competitors, and it returns every domain that links to at least one competitor but not to you, each tagged with a found_on array listing which competitors it links to.
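
To make the found_on semantics concrete, here is a minimal in-memory sketch of the set logic. It is not the CrawlGraph implementation (that runs in DuckDB over the webgraph); the ReferrerMap type, the computeGap function, and the sample domains are hypothetical and exist only for illustration. Only linking_domain and found_on mirror names used in this post.

```typescript
// Hypothetical in-memory model: target domain -> set of domains linking to it.
type ReferrerMap = Map<string, Set<string>>;

interface GapRow {
  linking_domain: string;
  found_on: string[]; // competitors this domain links to
}

function computeGap(
  referrers: ReferrerMap,
  you: string,
  competitors: string[],
): GapRow[] {
  const mine = referrers.get(you) ?? new Set<string>();
  const foundOn = new Map<string, string[]>();

  for (const competitor of competitors) {
    const linkers = referrers.get(competitor) ?? new Set<string>();
    for (const linker of linkers) {
      if (mine.has(linker)) continue; // already links to you, so not a gap
      const list = foundOn.get(linker) ?? [];
      list.push(competitor);
      foundOn.set(linker, list);
    }
  }

  return Array.from(foundOn, ([linking_domain, found_on]) => ({
    linking_domain,
    found_on,
  }));
}

// Made-up sample data.
const referrers: ReferrerMap = new Map([
  ["you.com", new Set(["blog-a.com"])],
  ["comp1.com", new Set(["blog-a.com", "blog-b.com", "news-c.com"])],
  ["comp2.com", new Set(["blog-b.com"])],
]);

const gap = computeGap(referrers, "you.com", ["comp1.com", "comp2.com"]);
```

In this sample, blog-a.com drops out because it already links to you.com, blog-b.com comes back with both competitors in found_on, and news-c.com comes back with only comp1.com.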

The composite tool, and the decision I’m unsure about

Most API-wrapper MCP servers are pure 1:1 mappings. I added one opinionated composite tool, gap_outreach_targets, because the raw gap output isn’t the thing you actually want; it’s the raw material for the thing you want.

What it does on top of gap_analysis:

Filters to total overlap. Keep only domains whose found_on covers every competitor you listed. A site linking to one competitor might be a fluke or a paid placement. A site linking to every one of, say, three competitors is a publisher who covers your whole niche and has simply never heard of you. That overlap is the qualifier.
Strips platform noise. Domains like amazonaws.com, github.io, and facebook.com, plus CDNs and URL shorteners, show up in every backlink profile and are never outreach targets. There’s a denylist with suffix matching so subdomains get caught too.
Ranks by authority. For the top N survivors it makes a cheap per-domain authority lookup and sorts, so the highest-value warm targets surface first. You can opt out (enrich_top: 0) because each lookup costs one API call against quota.

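// the core filter, roughly
const priority = gaps
  // drop hosting platforms, CDNs, and shorteners via the denylist
  .filter(g => !isPlatformNoise(g.linking_domain))
  // keep only domains that link to every listed competitor
  .filter(g => g.found_on.length === competitors.length)
  // highest authority first; unscored domains sink to the bottom
  .sort((a, b) => (b.cg_authority ?? -1) - (a.cg_authority ?? -1));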
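
The snippet above calls isPlatformNoise without showing it. Below is one way suffix matching against a denylist could look. Treat it as a sketch under assumptions: the list contains only the three platforms named in this post (the packaged list is presumably longer), and the normalization steps are my guess, not necessarily what the published code does.

```typescript
// Illustrative denylist: only the three platforms named in the post.
const PLATFORM_DENYLIST: readonly string[] = [
  "amazonaws.com",
  "github.io",
  "facebook.com",
];

// True when `domain` equals a denylisted entry or is a subdomain of one.
// Matching on "." + entry avoids false positives such as "notfacebook.com".
function isPlatformNoise(domain: string): boolean {
  // Assumed normalization: trim, lowercase, drop a trailing dot.
  const d = domain.trim().toLowerCase().replace(/\.$/, "");
  return PLATFORM_DENYLIST.some(
    (entry) => d === entry || d.endsWith("." + entry),
  );
}
```

The dot-prefixed check is the part that matters: a plain endsWith("facebook.com") would also throw away an unrelated domain that merely ends in the same characters.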

The decision I keep going back and forth on: is a composite, opinionated tool the right call for an MCP server, or should it stay a pure API mirror and let the agent do the filtering/ranking in its own reasoning?

Arguments for the composite tool: It encodes a workflow the model would otherwise have to reconstruct each time, costing tokens and inviting mistakes (I watched an agent forget to filter platforms more than once). It returns a small, ranked, decision-ready list instead of a 1,000-row dump the model has to chew through.
Arguments against: It’s a leaky abstraction. The moment someone wants a slightly different filter (2-of-3 overlap, a different noise list), they’re fighting my opinion instead of composing primitives. It hides the platform denylist, which is a judgment call that should arguably be visible.
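
For a sense of what the “compose primitives” side of the argument would look like in code, here is a sketch where the overlap threshold and the noise check become caller-supplied parameters. The names filterGaps, minOverlap, and isNoise are hypothetical; they are not options the published tool is known to expose.

```typescript
interface GapRow {
  linking_domain: string;
  found_on: string[];
}

interface FilterOptions {
  minOverlap: number; // e.g. 2 for "2-of-3"
  isNoise?: (domain: string) => boolean; // caller-supplied noise check
}

// Keeps rows that pass the noise check and link to at least
// `minOverlap` of the listed competitors.
function filterGaps(gaps: GapRow[], opts: FilterOptions): GapRow[] {
  const isNoise = opts.isNoise ?? (() => false);
  return gaps.filter(
    (g) => !isNoise(g.linking_domain) && g.found_on.length >= opts.minOverlap,
  );
}

// Made-up sample rows for three competitors a.com, b.com, c.com.
const sample: GapRow[] = [
  { linking_domain: "niche-blog.com", found_on: ["a.com", "b.com", "c.com"] },
  { linking_domain: "trade-mag.com", found_on: ["a.com", "c.com"] },
  { linking_domain: "one-off.com", found_on: ["b.com"] },
];

const twoOfThree = filterGaps(sample, { minOverlap: 2 });
const strict = filterGaps(sample, { minOverlap: 3 });
```

With minOverlap set to the number of competitors this reduces to the total-overlap rule; lowering it to 2 also keeps trade-mag.com. The cost is exactly the one described above: the caller, or the agent, now has to pick the threshold.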

I landed on “ship both” (the raw gap_analysis primitive and the composite), but I’m genuinely unsure that’s not just indecision dressed up as flexibility. If you’ve built MCP servers, I’d like to hear where you draw the primitive-vs-composite line.

Using it

{
  "mcpServers": {
    "crawlgraph": {
      "command": "npx",
      "args": ["-y", "crawlgraph-mcp"],
      "env": {
        "CRAWLGRAPH_API_KEY": "cg_live_..."
      }
    }
  }
}

Then the whole workflow collapses to one sentence: “Use gap_outreach_targets for mydomain.com against competitor-a.com and competitor-b.com, then draft a short outreach email to each priority target.” The agent submits the gap job, polls it, filters and ranks, and writes the emails, all in one turn.
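
The submit-then-poll step is the only part of that turn with real waiting in it. A generic polling loop might look like the sketch below. The JobStatus shape, the pollUntilDone name, and the default interval and attempt cap are all assumptions for illustration; the real API’s job fields and timing may differ. The status check is injected as a function, so the sketch runs without touching the network.

```typescript
// Hypothetical job shape; the real API's field names may differ.
type JobStatus<T> =
  | { state: "pending" }
  | { state: "done"; result: T }
  | { state: "failed"; error: string };

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls `check` until the job finishes, fails, or `maxAttempts` is exhausted.
async function pollUntilDone<T>(
  check: () => Promise<JobStatus<T>>,
  intervalMs = 1000,
  maxAttempts = 30,
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await check();
    if (status.state === "done") return status.result;
    if (status.state === "failed") throw new Error(status.error);
    await sleep(intervalMs); // still pending: wait before the next check
  }
  throw new Error(`job still pending after ${maxAttempts} attempts`);
}
```

The attempt cap is what keeps a stuck job from hanging the tool call indefinitely; the caller gets an error it can report back instead.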

Honest limitations

Quarterly snapshot. Common Crawl publishes ~4x/year, so this is for one-off prospecting, not live link monitoring. If you need “what changed this week,” it’s the wrong tool.
No anchor text in the gap result. The webgraph is (src, dst) edges; anchor text needs a separate WARC pass I didn’t wire into the MCP.
Authority enrichment costs calls. Each scored domain is one API call, hence the cap.
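
To show how the cap bounds quota use, here is a sketch of capped enrichment. It assumes the cap applies to the first enrichTop entries in input order, which may not match how the published tool picks its top N. The enrichAndRank name and the injected lookup function are hypothetical; lookup stands in for the per-domain authority call.

```typescript
interface Target {
  linking_domain: string;
  cg_authority?: number;
}

// Scores at most `enrichTop` domains (one lookup each), then sorts so that
// scored domains come first in descending authority. Unscored domains keep
// their relative order at the end.
async function enrichAndRank(
  targets: Target[],
  enrichTop: number,
  lookup: (domain: string) => Promise<number>,
): Promise<Target[]> {
  const head: Target[] = await Promise.all(
    targets.slice(0, enrichTop).map(async (t) => ({
      ...t,
      cg_authority: await lookup(t.linking_domain),
    })),
  );
  const tail = targets.slice(enrichTop); // never looked up, so no quota spent
  return [...head, ...tail].sort(
    (a, b) => (b.cg_authority ?? -1) - (a.cg_authority ?? -1),
  );
}
```

With enrichTop set to 0 the function makes no lookups and returns the targets in their original order, which is the opt-out behavior described earlier.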

Code is MIT, on GitHub and npm (npx -y crawlgraph-mcp). Feedback on the composite-tool question is especially welcome, since it’s the part of the design I’m least settled on.