I wrapped a backlink API in an MCP server so I could do SEO gap analysis from inside Claude
I do a fair amount of competitor backlink research, and the workflow always annoyed me: open a dashboard, run a query, export a CSV, eyeball it, copy domains into a doc, switch to email. Lots of tab-hopping for what is fundamentally a data-filtering problem an agent should handle.
So I wrapped the backlink API I’d been using into an MCP server. Now I stay in Claude Code (or Cursor, Cline, Zed, Windsurf) and just describe the goal. This is the build: the architecture, the four tools, and the one design decision I’m still not sure about.
The data source
The server runs on the Common Crawl hyperlink webgraph, about 4.4 billion edges across 120 million domains, published quarterly as Parquet. That matters for an MCP tool specifically: the data is open, so there’s no scraped-proprietary-index liability in handing it to an agent, and the same query is reproducible by anyone.
The HTTP API in front of it (CrawlGraph) does the heavy DuckDB work; the MCP server is a thin TypeScript stdio client over it. Keeping the MCP server thin was deliberate: all the query cost, caching, and quota logic lives behind the API, so the MCP package stays a ~300-line wrapper that’s easy to audit before you hand it your API key.
The four tools
- backlinks → referring domains for a target, with authority scores
- gap_analysis → domains linking to your competitors but not to you
- gap_outreach_targets → the composite play (below)
- releases → list the Common Crawl snapshots
backlinks and gap_analysis map 1:1 to API endpoints. gap_analysis is the interesting primitive: submit your domain plus 2-5 competitors, and it returns every domain that links to at least one competitor but not to you, each tagged with a found_on array listing which competitors it links to.
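To make the gap semantics concrete, here is a minimal in-memory sketch. It is not how the API computes the result (that runs in DuckDB over the webgraph); the `Gap` shape, the `referrers` map, and the example domains are assumptions for illustration, with only `linking_domain` and `found_on` taken from the post.

```typescript
// Hypothetical shape for illustration; the real API response may carry more fields.
type Gap = { linking_domain: string; found_on: string[] };

// `referrers` maps a domain to the set of domains that link to it.
function gapAnalysis(
  target: string,
  competitors: string[],
  referrers: Map<string, Set<string>>,
): Gap[] {
  const mine = referrers.get(target) ?? new Set<string>();
  const foundOn = new Map<string, string[]>();

  for (const competitor of competitors) {
    for (const linker of referrers.get(competitor) ?? new Set<string>()) {
      if (mine.has(linker)) continue; // already links to you, so not a gap
      const hits = foundOn.get(linker) ?? [];
      hits.push(competitor);
      foundOn.set(linker, hits);
    }
  }

  // One entry per linking domain, tagged with the competitors it links to.
  return Array.from(foundOn, ([linking_domain, found_on]) => ({ linking_domain, found_on }));
}

const referrers = new Map<string, Set<string>>([
  ["mydomain.com", new Set(["blog-a.example"])],
  ["competitor-a.com", new Set(["blog-a.example", "news-b.example", "mag-c.example"])],
  ["competitor-b.com", new Set(["news-b.example"])],
]);

const gaps = gapAnalysis("mydomain.com", ["competitor-a.com", "competitor-b.com"], referrers);
```

Here `blog-a.example` drops out because it already links to the target, while `news-b.example` comes back tagged with both competitors.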
The composite tool, and the decision I’m unsure about
Most API-wrapper MCP servers are pure 1:1 mappings. I added one opinionated composite tool, gap_outreach_targets, because the raw gap output isn’t the thing you actually want — it’s the raw material for the thing you want.
What it does on top of gap_analysis:
- Filters to total overlap. Keep only domains whose found_on covers every competitor you listed. A site linking to one competitor might be a fluke or a paid placement. A site linking to all of them is a publisher who covers your whole niche and has simply never heard of you. That overlap is the qualifier.
- Strips platform noise. amazonaws.com, github.io, facebook.com, CDNs, URL shorteners: they show up in every backlink profile and are never outreach targets. There’s a denylist with suffix matching so subdomains get caught too.
- Ranks by authority. For the top N survivors it makes a cheap per-domain authority lookup and sorts, so the highest-value warm targets surface first. This is on by default and you can opt out (enrich_top: 0), because each lookup costs one API call against quota.
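// the core filter, roughly
const priority = gaps
  // drop hosting platforms, CDNs, and URL shorteners
  .filter(g => !isPlatformNoise(g.linking_domain))
  // keep only domains that link to every listed competitor
  .filter(g => g.found_on.length === competitors.length)
  // highest authority first; domains without a score sort last
  .sort((a, b) => (b.cg_authority ?? -1) - (a.cg_authority ?? -1));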
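The isPlatformNoise helper is not shown in the post. A minimal sketch of denylist suffix matching could look like the following; the three entries come from the examples above, and the rest (the default parameter, the normalization) is an assumption, not the package's actual list or code.

```typescript
// Illustrative entries only; the package's real denylist is not reproduced here.
const PLATFORM_DENYLIST = ["amazonaws.com", "github.io", "facebook.com"];

function isPlatformNoise(domain: string, denylist: string[] = PLATFORM_DENYLIST): boolean {
  // Normalize case and drop a trailing dot from fully qualified names.
  const host = domain.toLowerCase().replace(/\.$/, "");
  // Match the entry itself or any subdomain of it. Requiring the leading dot
  // keeps "notfacebook.com" from matching "facebook.com".
  return denylist.some(entry => host === entry || host.endsWith("." + entry));
}
```

Matching on a dot boundary rather than a bare string suffix is the detail that matters: a plain endsWith check would also discard unrelated domains that merely end in the same characters.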
The decision I keep going back and forth on: is a composite, opinionated tool the right call for an MCP server, or should it stay a pure API mirror and let the agent do the filtering/ranking in its own reasoning?
- Arguments for the composite tool: It encodes a workflow the model would otherwise have to reconstruct each time, costing tokens and inviting mistakes (I watched an agent forget to filter platforms more than once). It returns a small, ranked, decision-ready list instead of a 1,000-row dump the model has to chew through.
- Arguments against: It’s a leaky abstraction. The moment someone wants a slightly different filter (2-of-3 overlap, a different noise list), they’re fighting my opinion instead of composing primitives. It hides the platform denylist, which is a judgment call that should arguably be visible.
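One way to picture the 2-of-3 case is an overlap threshold that defaults to total overlap. This is a sketch only: minOverlap is a hypothetical parameter, not an option the published tool is known to expose, and the `Gap` shape is assumed.

```typescript
type Gap = { linking_domain: string; found_on: string[] };

// `minOverlap` is hypothetical; the default reproduces the "every competitor" rule.
function filterByOverlap(
  gaps: Gap[],
  competitors: string[],
  minOverlap: number = competitors.length,
): Gap[] {
  const listed = new Set(competitors);
  return gaps.filter(g => {
    // Count distinct listed competitors so duplicates or stray entries
    // in found_on cannot inflate the overlap.
    const covered = new Set(g.found_on.filter(c => listed.has(c)));
    return covered.size >= minOverlap;
  });
}
```

Calling it with the default keeps only total-overlap domains; passing 2 against three competitors admits the 2-of-3 case without touching the rest of the pipeline.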
I landed on “ship both” — the raw gap_analysis primitive and the composite — but I’m genuinely unsure that’s not just indecision dressed up as flexibility. If you’ve built MCP servers, I’d like to hear where you draw the primitive-vs-composite line.
Using it
{
  "mcpServers": {
    "crawlgraph": {
      "command": "npx",
      "args": ["-y", "crawlgraph-mcp"],
      "env": {
        "CRAWLGRAPH_API_KEY": "cg_live_..."
      }
    }
  }
}
Then the whole workflow collapses to one sentence: “Use gap_outreach_targets for mydomain.com against competitor-a.com and competitor-b.com, then draft a short outreach email to each priority target.” The agent submits the gap job, polls it, filters and ranks, and writes the emails — all in one turn.
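The submit-then-poll step can be sketched generically. Everything here is an assumption for illustration: the `JobStatus` shape, the interval, and the attempt cap are made up, and the status check is injected rather than tied to any real CrawlGraph endpoint.

```typescript
// Hypothetical job states; the real API's status payload is not shown in this post.
type JobStatus<T> =
  | { state: "pending" }
  | { state: "done"; result: T }
  | { state: "failed"; error: string };

async function pollUntilDone<T>(
  check: () => Promise<JobStatus<T>>,
  opts: { intervalMs?: number; maxAttempts?: number } = {},
): Promise<T> {
  const { intervalMs = 2000, maxAttempts = 30 } = opts;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const status = await check();
    if (status.state === "done") return status.result;
    if (status.state === "failed") throw new Error(status.error);
    // Still pending: wait before the next check, but not after the last one.
    if (attempt < maxAttempts) {
      await new Promise<void>(resolve => setTimeout(resolve, intervalMs));
    }
  }
  throw new Error(`job still pending after ${maxAttempts} checks`);
}
```

Bounding the number of checks matters when each status call may count against quota, which is why the loop gives up instead of polling forever.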
Honest limitations
- Quarterly snapshot. Common Crawl publishes ~4x/year, so this is for one-off prospecting, not live link monitoring. If you need “what changed this week,” it’s the wrong tool.
- No anchor text in the gap result. The webgraph is (src, dst) edges; anchor text needs a separate WARC pass I didn’t wire into the MCP.
- Authority enrichment costs calls. Each scored domain is one API call, hence the cap.
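The enrichment cap can be sketched as follows. The `lookup` function stands in for the per-domain authority call and its signature is an assumption; the sketch also assumes "top N" means the first N entries in the order the gap result arrives, which the post does not specify.

```typescript
type Gap = { linking_domain: string; found_on: string[]; cg_authority?: number };

// `lookup` is a stand-in for the per-domain authority call (one API call each).
async function enrichAndRank(
  gaps: Gap[],
  enrichTop: number,
  lookup: (domain: string) => Promise<number>,
): Promise<Gap[]> {
  // Only the first `enrichTop` entries are scored, so quota use is bounded.
  const head = gaps.slice(0, Math.max(0, enrichTop));
  const tail = gaps.slice(head.length);
  const scored: Gap[] = await Promise.all(
    head.map(async g => ({ ...g, cg_authority: await lookup(g.linking_domain) })),
  );
  // Scored domains sort by authority; unscored ones keep their order at the end.
  return [...scored, ...tail].sort((a, b) => (b.cg_authority ?? -1) - (a.cg_authority ?? -1));
}
```

With enrichTop set to 0 no lookups are made and the list comes back in its original order, which matches the opt-out behaviour described for enrich_top: 0.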
Code is MIT, on GitHub and npm (npx -y crawlgraph-mcp). Feedback on the composite-tool question especially welcome — it’s the part of the design I’m least settled on.