How I built the OSS alternatives directory: GitHub ETL, Turso, and the UPSERT trap I hit
How I built the OSS alternatives directory: GitHub ETL, Turso, and the UPSERT trap I hit
我是如何构建开源替代品目录的:GitHub ETL、Turso 以及我踩过的 UPSERT 陷阱
When I launched three programmatic directory sites in April 2026, the open-source alternatives site had the most interesting data model. The AI tools directory indexes HuggingFace models — that’s a pull from one API. The indie games directory reads Steam. But the OSS alternatives site has to answer a different question: for this SaaS product, which open-source repos actually cover the same use case, and how do they compare? Getting that right required a two-phase ETL approach, a careful UPSERT strategy I initially got wrong, and some deliberate choices about where to use Claude Haiku and where to use a fallback template.
2026 年 4 月,当我推出三个程序化生成的目录网站时,开源替代品网站的数据模型是最有趣的。AI 工具目录索引的是 HuggingFace 模型——那只是从一个 API 获取数据;独立游戏目录读取的是 Steam 数据。但开源替代品网站必须回答一个不同的问题:对于这款 SaaS 产品,哪些开源仓库真正涵盖了相同的用例,它们之间又该如何比较?为了实现这一点,我采用了两阶段的 ETL 方法,制定了谨慎的 UPSERT(插入或更新)策略(尽管我最初弄错了),并针对何时使用 Claude Haiku 以及何时使用回退模板做出了一些深思熟虑的选择。
What the data model looks like
数据模型概览
Three tables in Turso libSQL:
- saas — the SaaS tool being replaced (Datadog, Notion, Figma, etc.)
- alternatives — GitHub repos that serve the same use case, linked by
saas_slug - saas_content — Claude-generated per-entry text: an intro, comparison notes, and migration tips
Turso libSQL 中的三张表:
- saas — 被替代的 SaaS 工具(如 Datadog、Notion、Figma 等)
- alternatives — 提供相同用例的 GitHub 仓库,通过
saas_slug关联 - saas_content — 由 Claude 生成的条目文本:简介、比较说明和迁移建议
The alternatives table stores everything the GitHub API returns that matters for a directory: stars, forks, language, license, last_pushed, description. The saas_content table stores only what Claude adds — the editorial layer that turns raw repo metadata into something useful. The full export lives in a JSON file that Astro reads at build time. No database connection at build. The ETL pipeline and the Astro build are separate processes.
alternatives 表存储了 GitHub API 返回的对目录有意义的所有信息:星标数、分支数、编程语言、许可证、最后推送时间、描述。saas_content 表仅存储 Claude 添加的内容——即赋予原始仓库元数据以实际价值的编辑层。完整的数据导出到一个 JSON 文件中,Astro 在构建时读取该文件。构建过程中不连接数据库。ETL 流水线和 Astro 构建是两个独立的进程。
Phase 1: seeding from JSON
第一阶段:从 JSON 播种
The first time the site runs on a new machine, there’s no database. Rather than block a local build on a live GitHub API pass, I wrote a seed.ts script that bootstraps the database from a hand-curated saas.json file. The JSON contains: SaaS name, slug, homepage, category, and a list of owner/repo strings. Stars, forks, license, and last_pushed are deliberately omitted — they’ll come from the live fetch. What I do include in JSON is pre-polished content for some entries where the Claude default output was weak.
当网站第一次在新的机器上运行时,数据库是空的。为了避免在本地构建时阻塞实时 GitHub API 调用,我编写了一个 seed.ts 脚本,通过人工整理的 saas.json 文件来引导数据库。该 JSON 包含:SaaS 名称、slug、主页、类别以及一系列 owner/repo 字符串。星标数、分支数、许可证和最后推送时间被刻意省略了——它们将通过实时抓取获取。我在 JSON 中包含的是针对某些 Claude 默认输出效果不佳的条目所预先润色的内容。
for (const e of entries) {
await db.execute({
sql: `INSERT INTO saas (slug, name, homepage, category, fetched_at) VALUES (?, ?, ?, ?, ?) ON CONFLICT(slug) DO NOTHING`,
args: [e.slug, e.name, e.homepage, e.category, now],
});
for (const a of e.alternatives) {
await db.execute({
sql: `INSERT INTO alternatives (saas_slug, repo, name, description, ...) VALUES (?, ?, ?, ?, ...) ON CONFLICT(saas_slug, repo) DO NOTHING`,
args: [e.slug, a.repo, a.name, a.description, ...],
});
}
}
DO NOTHING on conflict for alternatives is correct: once GitHub data is live, the seed shouldn’t clobber fresh stars counts with the static values from the JSON. But for saas_content, I initially used the same DO NOTHING — and that was a mistake I’ll get to below.
对于 alternatives 表,冲突时执行 DO NOTHING 是正确的:一旦 GitHub 数据上线,播种脚本就不应该用 JSON 中的静态值覆盖最新的星标数。但对于 saas_content,我最初也使用了同样的 DO NOTHING——这是一个错误,我稍后会提到。
Phase 2: live GitHub data
第二阶段:实时 GitHub 数据获取
fetch-alternatives.ts calls the GitHub REST API for every owner/repo in the database and upserts the live fields. Unlike the seed, this is DO UPDATE — we want fresh data. The sleep interval is 100ms between GitHub API calls. For an authenticated token that rate limit is conservative (GitHub’s REST API allows 5000 requests per hour for authenticated users, so 100ms is well under the minimum gap needed). Unauthenticated would be 60 per hour, which is 60 seconds per call — completely impractical at scale. The monorepo authenticates with a secret in GitHub Actions.
fetch-alternatives.ts 会为数据库中的每个 owner/repo 调用 GitHub REST API 并更新实时字段。与播种不同,这里使用的是 DO UPDATE——我们需要最新的数据。GitHub API 调用之间的休眠间隔为 100 毫秒。对于已认证的令牌,这个速率限制非常保守(GitHub 的 REST API 允许已认证用户每小时请求 5000 次,因此 100 毫秒远低于所需的最小间隔)。未认证的情况下每小时只有 60 次,即每调用一次需要 60 秒——在大规模场景下完全不可行。该 monorepo 通过 GitHub Actions 中的密钥进行认证。
Errors per-repo are caught and logged but don’t abort the batch:
每个仓库的错误都会被捕获并记录,但不会中止整个批处理:
for (const repoFull of s.alternatives) {
const [owner, name] = repoFull.split("/");
try {
const r = await getRepo(owner, name);
await db.execute({
sql: `INSERT INTO alternatives (...) VALUES (...) ON CONFLICT(saas_slug, repo) DO UPDATE SET description = excluded.description, stars = excluded.stars, ...`,
args: [s.slug, repoFull, r.name, r.description, r.stargazers_count, ...],
});
await sleep(100);
} catch (err) {
console.error(` ! Failed ${repoFull}:`, err instanceof Error ? err.message : err);
}
}
One field worth noting: r.license?.spdx_id returns null when GitHub sees a license file but can’t identify the SPDX identifier. That happens more than you’d expect with non-standard licenses. I render those rows with “see repo” instead of a badge so I’m not misleading visitors about the license type.
值得注意的一个字段是:当 GitHub 识别到许可证文件但无法确定 SPDX 标识符时,r.license?.spdx_id 会返回 null。这种情况在非标准许可证中比你预想的要多。我将这些行渲染为“查看仓库”而不是徽章,这样就不会在许可证类型上误导访问者。
Content generation with Claude Haiku
使用 Claude Haiku 生成内容
After the GitHub data is fresh, generate-content.ts queries for SaaS entries that either have no content row or whose model_used column is ‘fallback-template’ or ‘seeded-from-json’. For each, it asks Claude Haiku for:
- intro — 2 sentences on what the SaaS is and why teams seek OSS alternatives
- comparison_notes — 2-3 sentences on actual tradeoffs (self-hosting overhead, feature gaps)
- migration_tips — a 2-4 item array of concrete migration steps
在 GitHub 数据更新后,generate-content.ts 会查询那些没有内容行,或者 model_used 列为 ‘fallback-template’ 或 ‘seeded-from-json’ 的 SaaS 条目。对于每一个条目,它会要求 Claude Haiku 提供:
- 简介 — 用两句话描述该 SaaS 是什么,以及为什么团队会寻求开源替代品
- 比较说明 — 用 2-3 句话描述实际的权衡(自托管开销、功能差距)
- 迁移建议 — 一个包含 2-4 项具体迁移步骤的数组
I use the shared Claude Haiku client with system-prompt caching here. The system prompt is identical for every call in a batch, so caching it saves input tokens on all subsequent calls. On a 50-entry pass, the cost difference is real. The fallback template — which runs when ANTHROPIC_API_KEY is absent — generates deterministic placeholder text. This matters for CI: the Astro build needs a content row for every SaaS entry. Missing content produces a blank page, which would then trigger the noindex gate I use for thin programmatic pages.
我在这里使用了带有系统提示词缓存(system-prompt caching)的共享 Claude Haiku 客户端。系统提示词在批处理的每次调用中都是相同的,因此缓存它可以节省后续所有调用的输入 Token。在处理 50 个条目时,成本差异非常明显。当 ANTHROPIC_API_KEY 不存在时运行的回退模板会生成确定性的占位文本。这对 CI 很重要:Astro 构建需要每个 SaaS 条目都有内容行。缺少内容会产生空白页面,这会触发我为薄内容程序化页面设置的 noindex 门控。
The three-tier content quality ladder I described earlier puts these generated entries at the middle tier — better than the raw repo description, worse than hand-edited content.
我之前描述的三级内容质量阶梯将这些生成的内容放在了中间层——比原始仓库描述更好,但比人工编辑的内容稍逊一筹。
The UPSERT trap
UPSERT 陷阱
Original seed.ts for saas_content:
INSERT INTO saas_content (saas_slug, intro, comparison_notes, migration_tips, generated_at, model_used)
VALUES (?, ?, ?, ?, ?, ?)
ON CONFLICT(saas_slug) DO NOTHING
最初的 saas_content 的 seed.ts:
INSERT INTO saas_content (saas_slug, intro, comparison_notes, migration_tips, generated_at, model_used)
VALUES (?, ?, ?, ?, ?, ?)
ON CONFLICT(saas_slug) DO NOTHING
That looked safe. But the problem was subtle. When I seeded with model_used = null (the original JSON had…
这看起来很安全。但问题很微妙。当我使用 model_used = null 进行播种时(原始 JSON 有……)