How I built pairwise AI model compare pages with Claude Haiku and a budget cap

When I added compare pages to the Top AI Tools directory, the first question I had to answer was: how many pairs am I actually looking at? With roughly 200 models across 8 pipeline tags, the naive upper bound is 200 × 199 / 2 = 19,900 pairs. Generating content for each one with Claude Haiku would cost somewhere around $20 per run, or roughly a tenth of a cent per pair: not ruinous, but not something I wanted to run daily without thinking carefully. Here’s what I actually built, where it falls short, and what I’d do differently if starting over.

The combinatorics problem

Model compare pages exist for a specific type of query: “llama 3 vs mistral 7b”, “stable diffusion vs sdxl”, “whisper vs wav2vec2”. These are high-intent queries: the user has already narrowed the field to a shortlist and wants a concrete nudge toward a decision. Because the site is statically generated, every compare page has to be precomputed at build time, which limits how many pages I can afford to generate. The solution I landed on: group by pipeline_tag, pair the top 4 models by download count within each group, then cap total pairs with a COMPARE_LIMIT env var. Within a single pipeline like text-generation, the top 4 models give 6 pairs (4 choose 2). Across 8 active pipelines that’s at most 48 pairs. With the cap set to 50, I stay within budget and have a little room to grow.

const byPipe = new Map<string, typeof models>();
for (const m of models) {
  if (!m.pipeline_tag) continue;
  const arr = byPipe.get(m.pipeline_tag) ?? [];
  arr.push(m);
  byPipe.set(m.pipeline_tag, arr);
}

const pairs: Array<[Model, Model]> = [];
for (const [, list] of byPipe) {
  const sorted = [...list].sort((a, b) => b.downloads - a.downloads);
  const take = sorted.slice(0, Math.min(4, sorted.length));
  for (let i = 0; i < take.length; i++) {
    for (let j = i + 1; j < take.length; j++) {
      pairs.push([take[i]!, take[j]!]);
    }
  }
}
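// MAX is not defined in this excerpt; assuming it is read from COMPARE_LIMIT (default 50).
const MAX = Number(process.env.COMPARE_LIMIT ?? 50);
// slice keeps pairs in pipeline insertion order, so if the cap bites, later pipelines lose pairs first.
const chosen = pairs.slice(0, MAX);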
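
The pair counts quoted above are easy to sanity-check. The following is a minimal sketch, not part of the original script: the helper names pairCount and cappedPairCount are hypothetical, and the even split of 25 models per pipeline is only an illustrative assumption.

```typescript
// Number of unordered pairs among n items (n choose 2).
function pairCount(n: number): number {
  return n < 2 ? 0 : (n * (n - 1)) / 2;
}

// Pairs produced by taking the top `k` models in each pipeline, capped at `limit`.
// `groupSizes` is the number of models in each pipeline (hypothetical input).
function cappedPairCount(groupSizes: number[], k: number, limit: number): number {
  const total = groupSizes
    .map((size) => pairCount(Math.min(k, size)))
    .reduce((sum, n) => sum + n, 0);
  return Math.min(total, limit);
}

// Naive all-pairs count for 200 models.
const naivePairs = pairCount(200);

// 8 pipelines, assumed to hold 25 models each, top 4 per pipeline, cap of 50.
const budgetedPairs = cappedPairCount(new Array<number>(8).fill(25), 4, 50);

console.log({ naivePairs, budgetedPairs });
```

Pipelines with fewer than 4 models contribute fewer than 6 pairs, which is why 48 is a ceiling rather than an exact figure.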

The pairing happens entirely within pipelines right now, which means I’m covering “llama vs mistral” (both text-generation) but not “whisper vs gemini-vision” (cross-pipeline). Cross-pipeline comparisons are arguably more valuable for users who don’t know the landscape yet; that’s the next iteration.

The pair_slug and idempotent inserts

The slug for each compare pair is constructed deterministically: sort the two model slugs alphabetically, then join them with --vs--. So whether the ETL processes (llama-3, mistral-7b) or (mistral-7b, llama-3), the slug is always llama-3--vs--mistral-7b.

const pairSlug = [a.slug, b.slug].sort().join("--vs--");
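
To make the order-independence concrete, here is a small sketch that wraps the one-liner in a function. The name makePairSlug is hypothetical; the body is the same expression as above.

```typescript
// Hypothetical helper wrapping the one-liner from the article.
function makePairSlug(slugA: string, slugB: string): string {
  // Array.prototype.sort() without a comparator orders strings by UTF-16
  // code units, so the result does not depend on the machine's locale.
  return [slugA, slugB].sort().join("--vs--");
}

const forward = makePairSlug("llama-3", "mistral-7b");
const reversed = makePairSlug("mistral-7b", "llama-3");

console.log(forward === reversed, forward);
// prints: true llama-3--vs--mistral-7b
```

One caveat with the default sort: uppercase letters order before lowercase ones, so this assumes model slugs are already normalized to lowercase.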

This makes the entire ETL idempotent. The script runs every night. If all pairs already exist in the DB, it exits in a couple of seconds without a single Claude call. I check before inserting rather than using INSERT OR IGNORE at the SQL level, because the explicit check lets me count skipped vs generated in the same run, which I log: [compare] done — generated: 3, skipped: 47. This matters for monitoring. A run that generates 0 and skips 50 is healthy. A run that generates 0 and skips 0 (nothing in DB, nothing processed) would indicate a bug.
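
The check-then-insert loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual script: FakeCompareStore is an in-memory stand-in for the database, generate is a stand-in for the Claude call (synchronous here for brevity; the real call would be async), and runCompareEtl and looksHealthy are hypothetical names.

```typescript
type PairRow = { pairSlug: string; body: string };

// In-memory stand-in for the compare table (assumption: the real code queries a DB).
class FakeCompareStore {
  private rows = new Map<string, PairRow>();
  has(pairSlug: string): boolean {
    return this.rows.has(pairSlug);
  }
  insert(row: PairRow): void {
    this.rows.set(row.pairSlug, row);
  }
}

type RunCounts = { generated: number; skipped: number };

function runCompareEtl(
  pairSlugs: string[],
  store: FakeCompareStore,
  generate: (pairSlug: string) => string, // stand-in for the Claude call
): RunCounts {
  const counts: RunCounts = { generated: 0, skipped: 0 };
  for (const pairSlug of pairSlugs) {
    // Explicit existence check: lets the run count skips, unlike INSERT OR IGNORE.
    if (store.has(pairSlug)) {
      counts.skipped++;
      continue;
    }
    store.insert({ pairSlug, body: generate(pairSlug) });
    counts.generated++;
  }
  return counts;
}

// A run that touched nothing at all is the suspicious case.
function looksHealthy(counts: RunCounts): boolean {
  return counts.generated + counts.skipped > 0;
}
```

Running this twice over the same slugs yields all-generated on the first pass and all-skipped on the second, with no generate calls in the second pass.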

Claude Haiku with system-prompt caching

I reuse the shared Haiku client I built in week one, which accepts a cacheSystem: true option for the system prompt. Since the system prompt (the JSON schema instruction) is identical across all compare calls, the first call writes it to the cache and subsequent calls read that prefix back at a fraction of the normal input-token price, as long as they arrive before the cache entry expires and the prompt is long enough to be cacheable.
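
The internals of the shared client are not shown, so the following is only one plausible way a cacheSystem flag could map onto a request body. It assumes the Anthropic Messages API convention of marking a system text block with cache_control of type "ephemeral"; buildCompareRequest, the max_tokens value, and the placeholder model id are all hypothetical.

```typescript
// Relevant part of a Messages API request body.
interface SystemBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

interface CompareRequest {
  model: string;
  max_tokens: number;
  system: SystemBlock[];
  messages: { role: "user"; content: string }[];
}

function buildCompareRequest(
  model: string,
  systemPrompt: string,
  userPrompt: string,
  cacheSystem: boolean,
): CompareRequest {
  const systemBlock: SystemBlock = { type: "text", text: systemPrompt };
  // Marking the block asks the API to cache the prompt prefix up to this point.
  if (cacheSystem) systemBlock.cache_control = { type: "ephemeral" };
  return {
    model,
    max_tokens: 1024, // assumed output budget, not taken from the article
    system: [systemBlock],
    messages: [{ role: "user", content: userPrompt }],
  };
}

// Placeholder system prompt and model id; substitute the real values.
const demoSystemPrompt = "Return a JSON object with summary, differences, similarities, recommendation.";
const demoRequest = buildCompareRequest("<haiku-model-id>", demoSystemPrompt, "Compare A and B.", true);
console.log(JSON.stringify(demoRequest.system));
```

The cache only pays off if the system text is byte-for-byte identical between calls, which is why everything pair-specific belongs in the user message.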

The user prompt includes both model names, their authors, pipeline tags, and up to 400 characters of each model’s existing summary (which comes from the earlier content generation step):

const userPrompt = `Compare these two AI models:
A: ${a.name} (author: ${a.author ?? "unknown"}, pipeline: ${a.pipeline_tag ?? "unknown"})
Summary: ${a.summary?.slice(0, 400) ?? "(none)"}
B: ${b.name} (author: ${b.author ?? "unknown"}, pipeline: ${b.pipeline_tag ?? "unknown"})
Summary: ${b.summary?.slice(0, 400) ?? "(none)"}
Produce the JSON comparison.`;

Truncating summaries at 400 characters keeps the user prompt lean. Compare pages are about the delta between two models, not a rehash of each model individually. I already have dedicated model pages for depth; the compare page needs to answer “which one, for what”, and that takes maybe 6 sentences total. The system prompt requests a JSON object with summary, differences (array), similarities (array), and recommendation. Keeping the output shape narrow means Haiku rarely wanders off-schema.
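
The CompareData type is used throughout but never spelled out. Based on the four fields listed above, it presumably looks something like the sketch below; the exact definition and the wording of the system prompt are assumptions, and exampleSystemPrompt is purely illustrative.

```typescript
// Assumed shape of the comparison object, inferred from the four fields named in the article.
interface CompareData {
  summary: string;
  differences: string[];
  similarities: string[];
  recommendation: string;
}

// Illustrative system prompt; the author's actual wording is not shown.
const exampleSystemPrompt = [
  "You compare two AI models for a directory site.",
  "Respond with a single JSON object and nothing else, using exactly these keys:",
  '"summary" (string), "differences" (array of strings),',
  '"similarities" (array of strings), "recommendation" (string).',
].join("\n");

// An object of the expected shape, for reference.
const exampleOutput: CompareData = {
  summary: "Model A is smaller; Model B is more accurate.",
  differences: ["Parameter count", "Context length"],
  similarities: ["Same pipeline tag"],
  recommendation: "Use A on limited hardware, B otherwise.",
};

console.log(exampleSystemPrompt.length, Object.keys(exampleOutput));
```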

JSON parsing with a regex fence

Even with tight prompting, Haiku occasionally wraps the JSON in an explanatory preamble: “Here is the comparison:” followed by the actual object. Strict JSON.parse on the raw output would throw. I extract the outermost {…} block with a regex before parsing:

function parseCompare(text: string, fb: CompareData): CompareData {
  try {
    const m = text.match(/\{[\s\S]*\}/);
    if (!m) return fb;
    const p = JSON.parse(m[0]);
    return {
      summary: typeof p.summary === "string" ? p.summary : fb.summary,
      differences: Array.isArray(p.differences) ? p.differences.map(String) : fb.differences,
      similarities: Array.isArray(p.similarities) ? p.similarities.map(String) : fb.similarities,
      recommendation: typeof p.recommendation === "string" ? p.recommendation : fb.recommendation,
    };
  } catch { return fb; }
}

Each field is validated individually before being accepted. If differences comes back as a string (occasional Haiku behavior when it conflates the array with a comma-separated list), that field falls back to the template value rather than crashing the page. The fallback struct is worth writing carefully. I spent five minutes on mine and it shows:

const fb: CompareData = {
  summary: `${a.name} and ${b.name} are both ${a.pipeline_tag} models. See each entry for specifics.`,
  differences: ["See individual model pages for architecture and use cases."],
  similarities: ["Both are open-source models on HuggingFace."],
  recommendation: "Pick based on your compute budget and specific task requirements.",
};
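
To see how the parser and fallback interact, here is a self-contained sketch that repeats parseCompare from above and feeds it three kinds of output. The CompareData interface is inferred from the fields used, and the sample strings and the demoFallback object are made up for illustration.

```typescript
interface CompareData {
  summary: string;
  differences: string[];
  similarities: string[];
  recommendation: string;
}

// Same logic as the article's parseCompare.
function parseCompare(text: string, fb: CompareData): CompareData {
  try {
    // Greedy match: from the first "{" to the last "}" in the text.
    const m = text.match(/\{[\s\S]*\}/);
    if (!m) return fb;
    const p = JSON.parse(m[0]);
    return {
      summary: typeof p.summary === "string" ? p.summary : fb.summary,
      differences: Array.isArray(p.differences) ? p.differences.map(String) : fb.differences,
      similarities: Array.isArray(p.similarities) ? p.similarities.map(String) : fb.similarities,
      recommendation: typeof p.recommendation === "string" ? p.recommendation : fb.recommendation,
    };
  } catch {
    return fb;
  }
}

const demoFallback: CompareData = {
  summary: "fallback summary",
  differences: ["fallback difference"],
  similarities: ["fallback similarity"],
  recommendation: "fallback recommendation",
};

// 1. Preamble before the object: the regex strips it and parsing succeeds.
const withPreamble = parseCompare(
  'Here is the comparison: {"summary":"A is smaller.","differences":["size"],"similarities":["license"],"recommendation":"Use A."}',
  demoFallback,
);

// 2. differences arrives as a string: only that field falls back.
const stringField = parseCompare(
  '{"summary":"ok","differences":"size, speed","similarities":[],"recommendation":"B"}',
  demoFallback,
);

// 3. A stray "}" after the object: the greedy match over-captures,
//    JSON.parse throws, and the whole fallback is returned.
const trailingBrace = parseCompare('{"summary":"ok"} Hope that helps :-}', demoFallback);

console.log(withPreamble.summary, stringField.differences, trailingBrace === demoFallback);
```

The third case is the main limitation of the greedy regex: a preamble is handled, but trailing text containing a closing brace costs the whole response rather than a single field.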