How I built a three-tier content quality ladder for programmatic directory ETL

How I built a three-tier content quality ladder for programmatic directory ETL

我是如何为程序化目录 ETL 构建三级内容质量阶梯的

The three directory sites I launched in April — Top AI Tools, Find Games Like, and Open Alternative To — all generate editorial content the same way: fetch metadata from an external API, send it through Claude Haiku 4.5, write the result to Turso. But that description skips the part that actually matters for a programmatic site at scale: what happens when Claude can’t run. The answer is a content quality ladder with three tiers, tracked by a single model_used column.

我在四月份推出的三个目录网站——Top AI Tools、Find Games Like 和 Open Alternative To——都以相同的方式生成编辑内容:从外部 API 获取元数据,通过 Claude Haiku 4.5 进行处理,并将结果写入 Turso。但这种描述忽略了程序化网站在规模化运营中真正关键的部分:当 Claude 无法运行时会发生什么?答案是一个包含三个层级的内容质量阶梯,并通过一个 model_used 列进行追踪。

The three tiers

三个层级

Every content table across all three sites has a model_used column. It takes one of three values:

这三个网站的所有内容表都包含一个 model_used 列。它有以下三个值之一:

ValueOriginQuality
seeded-from-jsonLoaded from a curated JSON file at bootstrapMinimal — structured but thin
fallback-templateClaude unavailable or API key absentAcceptable — technically correct, not editorial
claude-haiku-4-5Generated by Claude Haiku 4.5Target — editorial summaries, named examples, nuanced caveats
来源质量
seeded-from-json引导时从精选 JSON 文件加载最低限度 — 结构化但内容单薄
fallback-templateClaude 不可用或缺少 API Key可接受 — 技术上正确,非编辑内容
claude-haiku-4-5由 Claude Haiku 4.5 生成目标状态 — 编辑摘要、命名示例、细致的注意事项

Seeded content exists because each site ships with a JSON file of curated entries. Those entries have names, descriptions, and metadata from their upstream source (HuggingFace, Steam, GitHub), but no editorial layer yet. The page renders — but it reads like a database dump, not a directory.

“种子内容”(Seeded content)的存在是因为每个网站都附带了一个精选条目的 JSON 文件。这些条目拥有来自上游源(HuggingFace、Steam、GitHub)的名称、描述和元数据,但还没有编辑层。页面可以渲染,但读起来就像数据库转储,而不是一个目录。

Fallback-template content is what you get when the API key isn’t present or when a Claude call fails. For the AI tools site, the fallback for a model named qwen2-7b in the text-generation pipeline looks like this: “qwen2-7b is an open-source text-generation model available on HuggingFace. Details are sourced from the public model registry.” That’s not wrong. It just doesn’t help anyone decide whether to use the model.

“回退模板内容”(Fallback-template content)是在 API Key 不存在或 Claude 调用失败时生成的内容。对于 AI 工具网站,文本生成流水线中名为 qwen2-7b 的模型,其回退内容如下:“qwen2-7b 是 HuggingFace 上可用的开源文本生成模型。详细信息来源于公共模型注册表。”这并没有错,只是它无法帮助用户决定是否应该使用该模型。

Claude Haiku content is the target state. A good generation for the same model says something like: “Qwen2-7B is a 7-billion parameter instruction-tuned model from Alibaba Cloud optimized for multilingual generation, showing strong performance on Chinese and English benchmarks while fitting in 16GB of VRAM.” The difference is editorial voice and specificity — neither of which template-filling can produce.

Claude Haiku 内容是目标状态。对于同一个模型,好的生成结果会这样描述:“Qwen2-7B 是阿里巴巴云推出的 70 亿参数指令微调模型,针对多语言生成进行了优化,在中英文基准测试中表现出色,且能在 16GB 显存中运行。”区别在于编辑语调和具体细节——这两者都是模板填充无法产生的。

The upgrade query

升级查询

The ETL generation step doesn’t blindly regenerate everything on each run. It targets only entries that need work:

ETL 生成步骤不会在每次运行时盲目地重新生成所有内容,它只针对需要处理的条目:

SELECT m.id, m.name, m.pipeline_tag, m.tags 
FROM models m 
LEFT JOIN model_content c ON c.model_id = m.id 
WHERE c.model_id IS NULL 
   OR c.model_used IN ('fallback-template', 'seeded-from-json') 
ORDER BY m.downloads DESC 
LIMIT ?

Three things happen simultaneously here: 这里同时发生了三件事:

  1. LEFT JOIN ... WHERE c.model_id IS NULL catches brand-new entries added by the nightly fetch that have no content row yet. LEFT JOIN ... WHERE c.model_id IS NULL 会捕获每晚抓取新增的、尚无内容行的条目。
  2. OR c.model_used IN ('fallback-template', 'seeded-from-json') catches existing rows that were written with lower-quality content. OR c.model_used IN ('fallback-template', 'seeded-from-json') 会捕获现有的、以较低质量内容写入的行。
  3. ORDER BY m.downloads DESC means when the LIMIT is hit, the most-downloaded (most-visited) entries are upgraded first. ORDER BY m.downloads DESC 意味着当达到 LIMIT 时,下载量最高(访问量最大)的条目会优先被升级。

This identical query pattern appears in all three sites with different table names. The abstraction was a late realization — I wrote it three times before noticing it was the same thing. A shared buildUpgradeQuery(tableName, pkField, contentTable) helper would have been the right call from the start.

这种相同的查询模式出现在所有三个网站中,只是表名不同。这种抽象是我后来才意识到的——我写了三次才发现它们其实是一回事。如果从一开始就使用一个共享的 buildUpgradeQuery(tableName, pkField, contentTable) 辅助函数会是更好的选择。

The fallback chain

回退链

Inside the generation loop, every entry goes through the same decision tree:

在生成循环中,每个条目都会经过相同的决策树:

const hasApiKey = !!process.env.ANTHROPIC_API_KEY;
if (hasApiKey) {
  try {
    const result = await generate({ systemPrompt: SYSTEM_PROMPT, userPrompt, cacheSystem: true, maxTokens: 1024 });
    content = parseOrFallback(result.text, fb);
    modelUsed = "claude-haiku-4-5";
    generated++;
  } catch (err) {
    console.error(`! Claude error for ${id}:`, err instanceof Error ? err.message : err);
    content = fb;
    fallback++;
  }
} else {
  content = fb;
  fallback++;
}

The cacheSystem: true flag marks the system prompt block with cache_control: { type: "ephemeral" }. All three sites have fixed system prompts — the same AI tools instruction across every model generation, the same game critic instruction across every game — so the first call in a batch primes the cache and the remaining ~99 calls read it at the reduced input rate.

cacheSystem: true 标志将系统提示词块标记为 cache_control: { type: "ephemeral" }。所有三个网站都有固定的系统提示词——每个模型生成都使用相同的 AI 工具指令,每个游戏都使用相同的游戏评论家指令——因此批处理中的第一次调用会预热缓存,其余约 99 次调用则以降低后的输入费率读取缓存。

I covered the mechanics in the article on the shared Haiku client. With a ~900-token system prompt and 100 entries per run, the cache saves roughly 90,000 input tokens per nightly run. Anthropic’s prompt caching documentation has the exact pricing for cache creation vs cache read tokens.

我在关于共享 Haiku 客户端的文章中介绍过其机制。对于约 900 token 的系统提示词和每次运行 100 个条目,缓存每晚大约能节省 90,000 个输入 token。Anthropic 的提示词缓存文档中有关于缓存创建与缓存读取 token 的确切定价。

The error path is deliberately non-throwing. Any Claude failure — rate limit, network timeout, malformed response — drops through to content = fb and increments fallback. The run continues. If 10 of 100 Claude calls fail due to transient rate limits, 90 get written with claude-haiku-4-5 and the 10 failures get fallback-template. Those 10 rows surface in the next night’s upgrade query automatically.

错误路径被刻意设计为不抛出异常。任何 Claude 的失败——速率限制、网络超时、格式错误的响应——都会回退到 content = fb 并增加 fallback 计数。运行会继续。如果 100 次 Claude 调用中有 10 次因瞬时速率限制而失败,90 次会以 claude-haiku-4-5 写入,而 10 次失败的会得到 fallback-template。这 10 行数据会自动出现在次日的升级查询中。

The upsert write

Upsert 写入

Every content row is written with INSERT ... ON CONFLICT ... DO UPDATE SET:

每一行内容都通过 INSERT ... ON CONFLICT ... DO UPDATE SET 写入:

INSERT INTO game_content (appid, summary, similar_games, good_for, avoid_if, generated_at, model_used)
VALUES (?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(appid) DO UPDATE SET 
  summary = excluded.summary, 
  similar_games = excluded.similar_games, 
  good_for = excluded.good_for, 
  avoid_if = excluded.avoid_if, 
  generated_at = excluded.generated_at, 
  model_used = excluded.model_used

The upsert makes the ETL fully idempotent: running it twice produces the same state as running it once. More importantly, it means the model_used column gets overwritten when an upgrade succeeds. A row that was fallback-template becomes claude-haiku-4-5 in-place, without any explicit “mark upgraded” step. The column just reflects what actually produced the current content.

Upsert 操作使 ETL 完全幂等:运行两次的结果与运行一次相同。更重要的是,这意味着当升级成功时,model_used 列会被覆盖。原本是 fallback-template 的行会原地变为 claude-haiku-4-5,无需任何显式的“标记为已升级”步骤。该列只是如实反映了当前内容是由什么生成的。

The compare-page ETL uses a different pattern: check-before-insert with an explicit SELECT 1 to skip already-generated pairs. Both patterns are valid. Check-before-insert is better when reprocessing is expensive (large Claude calls, multi-step generation). Upsert-overwrite is better when you always want the latest generation to win regardless of what was.

对比页面的 ETL 使用了不同的模式:先检查后插入,通过显式的 SELECT 1 跳过已生成的对。两种模式都是有效的。当重新处理成本较高(如大型 Claude 调用、多步生成)时,先检查后插入更好。而当你总是希望最新的生成结果覆盖旧结果时,Upsert-overwrite 更好。