Phase 2 Shipped: 5 Things I Got Wrong About Embedding-Based Routing


A follow-up to Teaching an AI to Pick Its Own Brain. In the last post, I ended with a plan: replace the Groq LLM categorizer with local multilingual-e5-large embeddings. Find similar past messages, vote on the category, skip the API call. Simple.

It took a Groq outage to actually make me ship it. On 2026-05-22, Groq went down for two hours. 503 requests fell back to medium tier silently: no errors surfaced to users, but nobody got the model they should have. That’s the kind of “resilience” that feels fine until it isn’t. So I shipped Phase 2. Here’s what I got wrong.

Wrong #1: I thought the accuracy metric was about correctness


I measured “tier accuracy” using leave-one-out cross-validation on the embedding pool. The number came back: 83.2%. Decent. But I kept asking myself: 83.2% accuracy against what ground truth? The answer: against Groq’s own past decisions. The pool is labeled by Groq. The k-NN learns Groq’s category boundaries from those labels. When I measure accuracy, I’m measuring “how often does k-NN agree with Groq?”, not “how often is the routing objectively correct?”

This is actually the right thing to measure. The goal of Phase 2 is to replace Groq with something local and fast, so the quality bar is “indistinguishable from Groq,” not “better than Groq.” But I spent a week confused about why 83% felt both good and meaningless at the same time before I understood what I was actually measuring.
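A minimal sketch of what that leave-one-out agreement number computes, assuming the pool is a list of (embedding, label) pairs and the vote is a plain majority over the k nearest neighbors by cosine similarity. The function names, the value of k, and the toy 2-D vectors are illustrative assumptions, not the production code; real embeddings would come from multilingual-e5-large.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn_predict(query, pool, k=5, skip=None):
    """Majority vote over the k pool entries most similar to `query`.

    pool: list of (embedding, label); skip: index of an entry to leave out.
    """
    scored = [
        (cosine(query, emb), label)
        for i, (emb, label) in enumerate(pool)
        if i != skip
    ]
    scored.sort(key=lambda s: s[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

def leave_one_out_agreement(pool, k=5):
    """Fraction of pool entries whose held-out prediction matches their label.

    The labels come from the original labeler, so this measures agreement
    with that labeler rather than objective correctness.
    """
    hits = sum(
        knn_predict(emb, pool, k=k, skip=i) == label
        for i, (emb, label) in enumerate(pool)
    )
    return hits / len(pool)

# Toy 2-D stand-ins for embeddings (hypothetical data).
toy_pool = [
    ([1.0, 0.0], "casual"), ([0.9, 0.1], "casual"), ([0.95, 0.05], "casual"),
    ([0.0, 1.0], "coding"), ([0.1, 0.9], "coding"), ([0.05, 0.95], "coding"),
]
agreement = leave_one_out_agreement(toy_pool, k=2)
```

Each entry is held out in turn and predicted from the rest of the pool, so the score can only ever be as good as the labels it is compared against.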

Wrong #2: I thought analysis vs research_lookup confusion was a problem


Analysis category accuracy: 59%. Terrible-looking number. The embeddings kept predicting research_lookup for analysis prompts and vice versa. I spent two days trying to fix this. Generated more synthetic data, tweaked the pool, re-ran validation. The number barely moved. Then I looked at the tier map:

CATEGORY_TIER_MAP = {
    "analysis": "medium",
    "research_lookup": "medium", # same destination ...
}

Both categories route to medium tier. The embedding can’t distinguish them, and it doesn’t need to. It’s like being unable to tell two roads apart when both lead to the same city. The confusion that actually costs something is when coding gets sent to medium instead of strong. That happens in 3% of requests. The analysis/research_lookup confusion? Zero routing impact. Lesson: measure tier accuracy, not category accuracy. They’re different things, and only tier accuracy matters for the system’s actual job.
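To make the difference concrete, here is a small sketch that scores the same predictions twice: once on categories, once after mapping both sides to tiers. The map below is an illustrative subset (the "coding" and "casual" entries are inferred from the tiers mentioned in this post, not copied from the real CATEGORY_TIER_MAP), and the sample labels are made up.

```python
# Illustrative subset of a category -> tier map (assumed, not the full real map).
TIER_MAP = {
    "analysis": "medium",
    "research_lookup": "medium",
    "coding": "strong",
    "casual": "cheap",
}

def accuracy(pairs):
    """Share of (predicted, reference) pairs that match."""
    return sum(p == r for p, r in pairs) / len(pairs)

def category_and_tier_accuracy(predicted, reference, tier_map=TIER_MAP):
    """Return (category accuracy, tier accuracy) for the same predictions."""
    cat_pairs = list(zip(predicted, reference))
    tier_pairs = [(tier_map[p], tier_map[r]) for p, r in cat_pairs]
    return accuracy(cat_pairs), accuracy(tier_pairs)

# Hypothetical labels: two analysis/research_lookup swaps, everything else right.
reference = ["analysis", "research_lookup", "analysis", "coding", "casual"]
predicted = ["research_lookup", "analysis", "analysis", "coding", "casual"]

cat_acc, tier_acc = category_and_tier_accuracy(predicted, reference)
print(f"category accuracy: {cat_acc:.0%}, tier accuracy: {tier_acc:.0%}")
```

In this toy sample the two swapped categories drag category accuracy down to 60%, while tier accuracy stays at 100% because both categories land on the same tier.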

Wrong #3: I thought synthetic data was good enough


The pool needs labeled examples to do k-NN. My first instinct: generate 60 synthetic prompts per category using templates, fill the pool fast. I did this. It looked fine until I checked the actual embedding space. Sixty templates with minor variation produce maybe 15 distinct semantic clusters. The rest are near-duplicates: the same phrasing with a different noun. A k-NN pool full of near-duplicates memorizes instead of generalizing.

What actually worked: real user messages. I filtered 342 prompts from actual chat session transcripts: things real users had genuinely asked, in multiple languages, at varying lengths, covering real tasks. That data has diversity that synthetic templates can’t fake. After mixing in LLM-generated prompts (using claude-haiku with explicit variety constraints: different languages, different lengths, different domains) for the thinner categories, the pool hit 1,309 entries and the tier accuracy became meaningful. Near-duplicate embeddings are the real enemy of pool quality. Not wrong labels.
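One way to check a pool for this problem is a greedy near-duplicate filter: keep an entry only if its embedding is not too close to anything already kept. The sketch below assumes entries are (text, embedding) pairs; the function name, the 0.95 threshold, and the toy 3-D vectors are hypothetical, and the threshold in particular would need tuning against real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def drop_near_duplicates(entries, threshold=0.95):
    """Greedy filter over (text, embedding) pairs.

    An entry is kept only if its similarity to every already-kept entry is
    below `threshold` (an assumed placeholder value). This is O(n^2), which
    is fine for a pool of a few thousand entries.
    """
    kept = []
    for text, emb in entries:
        if all(cosine(emb, kept_emb) < threshold for _, kept_emb in kept):
            kept.append((text, emb))
    return kept

# Toy 3-D stand-ins for embeddings: the first two differ only by a noun.
toy = [
    ("explain recursion in Python", [1.0, 0.0, 0.0]),
    ("explain recursion in Java", [0.99, 0.05, 0.0]),
    ("what's for dinner", [0.0, 1.0, 0.0]),
    ("translate this sentence", [0.0, 0.0, 1.0]),
]
unique = drop_near_duplicates(toy)
print(len(toy), "entries ->", len(unique), "after filtering")
```

Comparing the pool size before and after filtering gives a rough count of how many entries are actually adding new information.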

Wrong #4: I thought 30% “mislabeled” synthetic prompts were noise


When I generated coding prompts and ran them through Groq for labeling, 30% came back as analysis. My first reaction: Groq is wrong, these are clearly coding prompts, I should override the labels. I didn’t. And that was correct. Look at what those “mislabeled” prompts actually were: “explain the time complexity of this algorithm”, “what’s the difference between recursion and iteration”, “review this approach for a binary search”.

These sit right on the boundary between explaining something (analysis) and working with code (coding). Groq consistently calls them analysis. So the embedding pool correctly learns Groq’s boundary, which is the boundary the live system actually uses. The labels aren’t wrong. My intuition about where the boundary should be was off. If your label source has a consistent opinion, trust it over your instinct.

Wrong #5: I thought the disagreement would be symmetric


Embedding k-NN disagrees with Groq on tier for about 17% of requests (16.8%, the complement of the 83.2% agreement). As a share of all requests, that splits into:

Upgrade (k-NN -> stronger model): 10.0%
Downgrade (k-NN -> weaker model): 6.8%


I expected roughly 50/50. Instead, the system naturally leans toward stronger models when it’s uncertain. I didn’t engineer this. It emerges from the data: the embedding space for casual and simple_lookup prompts is very dense and clean, so cheap-tier predictions are confident. The boundaries around strong tier are fuzzier, so when the k-NN is uncertain there, it tends to pull toward stronger neighbors. For a routing system, this asymmetry is desirable. Getting a stronger-than-needed model is expensive but silent. Getting a weaker-than-needed model is cheap but potentially visible to the user.
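The upgrade/downgrade split is easy to compute once tiers are ordered. A hedged sketch, assuming the ordering cheap < medium < strong and two parallel lists of tier decisions for the same requests; the function name and the ten sample decisions are made up for illustration.

```python
# Assumed ordering of tiers from weakest to strongest.
TIER_RANK = {"cheap": 0, "medium": 1, "strong": 2}

def disagreement_split(knn_tiers, groq_tiers):
    """Share of all requests where k-NN agrees, upgrades, or downgrades.

    "upgrade" means k-NN picked a stronger tier than Groq did for the same
    request; "downgrade" means it picked a weaker one.
    """
    n = len(knn_tiers)
    pairs = list(zip(knn_tiers, groq_tiers))
    up = sum(TIER_RANK[k] > TIER_RANK[g] for k, g in pairs)
    down = sum(TIER_RANK[k] < TIER_RANK[g] for k, g in pairs)
    return {
        "agree": (n - up - down) / n,
        "upgrade": up / n,
        "downgrade": down / n,
    }

# Hypothetical decisions for ten requests.
groq = ["cheap", "cheap", "medium", "strong", "medium",
        "cheap", "strong", "medium", "cheap", "cheap"]
knn = ["cheap", "medium", "medium", "strong", "strong",
       "cheap", "medium", "medium", "cheap", "cheap"]

split = disagreement_split(knn, groq)
print(split)
```

Tracking the two directions separately matters because they fail differently: upgrades show up on the bill, downgrades show up in answer quality.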

What the Numbers Look Like After 1 Month


Real traffic distribution (messaging bot):

cheap tier: 84.9% (casual conversation)
strong tier: 8.9% (coding, reasoning)
medium tier: 6.3% (analysis, creative)


One important caveat before reading into these numbers: crab-bot runs as a messaging bot, so the primary use case is casual conversation, quick lookups, and occasional technical questions. The 84.9% cheap-tier traffic is a direct reflection of that usage pattern. If you’re routing for a developer tool, a customer support bot, or a research assistant, your distribution will look very different. A coding-heavy workload might flip cheap and strong, and your cost savings curve will shift accordingly.