Bytes Speak All Languages: Cross-Script Name Retrieval via Contrastive Learning

Why learn 8 scripts when you can learn 256 bytes?

Every time a sanctions screening system checks a name against a watchlist, it faces a silent failure mode that nobody talks about. Type “Владимир Путин” into a system indexed on “Vladimir Putin” and most name-matching approaches return nothing. The two strings share zero letters, so edit distance carries no signal, phonetic codes fail (they assume Latin input), and BM25 finds no overlapping terms to score.

This is not an obscure edge case. Immigration databases, hospital record systems, and financial compliance pipelines deal with it daily. And yet the dominant approaches are either classical (edit distance, Soundex variants) or heavyweight (fine-tuning a multilingual LLM on a few hundred manually labeled pairs).

In this post, I’ll walk you through how we trained a compact transformer encoder from scratch on raw UTF-8 bytes, with no tokenizer, no pretrained backbone, and no script detection, to solve cross-script phonetic name retrieval. We achieved 0.775 MRR (mean reciprocal rank) and 0.897 R@10 (recall at 10) across 8 non-Latin scripts, reducing the performance gap between Latin and non-Latin queries by 10x relative to the best classical baseline. The full code is on GitHub. This post covers the ideas and the engineering.
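
If MRR and R@10 are unfamiliar, here is a minimal sketch of how such retrieval metrics are commonly computed. It assumes each query has exactly one correct entity and that we already know the 1-based rank at which it came back (`None` if it was not retrieved at all). The function names are illustrative and are not taken from the project's evaluation code.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over all queries; a missed query contributes 0."""
    if not ranks:
        raise ValueError("need at least one query")
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


def recall_at_k(ranks: Sequence[Optional[int]], k: int) -> float:
    """Fraction of queries whose correct entity appears in the top k results."""
    if not ranks:
        raise ValueError("need at least one query")
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


# Four toy queries: hits at rank 1, 2 and 4, plus one miss.
toy_ranks = [1, 2, None, 4]
print(mean_reciprocal_rank(toy_ranks))  # (1 + 1/2 + 0 + 1/4) / 4 = 0.4375
print(recall_at_k(toy_ranks, k=10))     # 3 of 4 queries hit in the top 10 = 0.75
```

MRR rewards putting the right entity at the very top, while R@10 only asks whether it shows up on the first page of candidates, which is why the two numbers can differ noticeably.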

Why is this hard?

The problem sits at the intersection of three things that don’t cooperate:

  1. Scripts are disjoint symbol sets. “Schwarzenegger” and “שוורצנגר” (Hebrew) have no shared characters. Edit distance, the go-to for fuzzy matching, produces a maximum-distance score every time a script boundary is crossed. Phonetic hashing (Double Metaphone, Soundex) encodes approximate English pronunciation, so it is useless for non-Latin queries by design.

  2. Romanization is not a function. The Chinese name written as “张” maps to Zhang, Chang, and Cheung depending on dialect, romanization standard, and historical convention. The Korean “박” maps to Park, Pak, and Bak. Any approach that tries to normalize to a canonical Latin form (like ICU transliterate) will get the right answer for one convention and fail for the others.

  3. Names carry no semantic context. Dense retrieval methods like DPR and BGE-M3 are powerful for sentence-level tasks because surrounding words provide semantic grounding. For a 2-word person name there is no context to compensate for surface mismatch. Chari et al. (2025) showed that even strong multilingual retrievers degrade severely when queries are transliterated rather than written in their native script.
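
To make the first point concrete, here is a small sketch of plain Levenshtein distance applied across a script boundary. It is a textbook dynamic-programming implementation written for illustration, not the baseline code used in our experiments.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance over Unicode code points (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


pairs = [
    ("Vladimir Putin", "Владимир Путин"),
    ("Schwarzenegger", "שוורצנגר"),
]
for latin, other in pairs:
    distance = levenshtein(latin, other)
    worst_case = max(len(latin), len(other))
    # With no shared letters the distance sits at (or one below) the worst
    # case; the only character the first pair can share is the space.
    print(f"{latin!r} vs {other!r}: {distance} of a possible {worst_case}")
```

When nearly every comparison across scripts lands at the worst case, the score no longer separates the right entity from any other name of similar length.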

The insight behind our approach: under UTF-8, every Unicode character decomposes deterministically into 1 to 4 bytes from a fixed 256-symbol alphabet. “Владимир” and “Vladimir” are different byte sequences, but a model trained contrastively on enough phonetic pairs can learn to map them to nearby vectors. The vocabulary is universal by construction.
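
A quick sketch of what "bytes as vocabulary" means in practice. It uses only Python's built-in UTF-8 encoder; the helper name `to_byte_ids` is made up for this example.

```python
def to_byte_ids(name: str) -> list[int]:
    """Turn a name into a sequence of integers in the range 0..255."""
    return list(name.encode("utf-8"))


for name in ["Vladimir", "Владимир", "张", "박"]:
    ids = to_byte_ids(name)
    # Latin letters take 1 byte each, Cyrillic letters 2, CJK and Hangul 3.
    print(f"{name}: {len(name)} characters -> {len(ids)} bytes {ids}")

# Whatever the script, every id fits the same 256-symbol alphabet.
assert all(0 <= b < 256 for n in ["Vladimir", "Владимир", "张", "박"] for b in to_byte_ids(n))
```

No name can ever fall outside the vocabulary, so there is nothing to detect, normalize, or add when a new script shows up. The cost is longer sequences for non-Latin text.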

Building Training Data at Scale

You can’t train this model without data, and there is no dataset of 4 million cross-script phonetic name pairs lying around. We built one with a 4-stage LLM pipeline.

  • Stage 1: Stratified sampling from Wikidata. We started with 2 million person-name entities from Wikidata. We stratified by script-coverage bucket (0, 1-2, 3-4, 5+ non-English labels) and sampled proportionally, yielding 119,040 entities with balanced coverage.

  • Stage 2: Phonetic Latin variants (Llama-3.1-8B). For each English anchor name, we asked Llama-3.1-8B-Instruct to generate 4 phonetic spelling variants: the kinds of mishearings and misspellings real people produce.

  • Stage 3: Cross-script transliteration (Qwen3-30B). For each English name and each of its Latin variants, we generated transliterations into 8 non-Latin scripts, listed here by language: Arabic, Russian (Cyrillic), Chinese, Japanese, Hebrew, Hindi (Devanagari), Greek, and Korean.

  • Stage 4: Merge and tag. The final stage merges Wikidata ground-truth labels with LLM output, deduplicates, and tags each positive pair by type (phonetic, script, or combined).
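
The Stage 1 sampling can be sketched roughly as follows. This is a simplified illustration, not the pipeline code: entities are assumed to arrive as `(entity_id, n_non_english_labels)` tuples, and `coverage_bucket`, `stratified_sample`, and the `fraction` parameter are names invented for the example.

```python
import random
from collections import defaultdict


def coverage_bucket(n_labels: int) -> str:
    """Map a count of non-English labels to one of the four coverage buckets."""
    if n_labels == 0:
        return "0"
    if n_labels <= 2:
        return "1-2"
    if n_labels <= 4:
        return "3-4"
    return "5+"


def stratified_sample(entities, fraction: float, seed: int = 0) -> list:
    """Draw the same fraction from every bucket so none is crowded out."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    strata = defaultdict(list)
    for entity_id, n_labels in entities:
        strata[coverage_bucket(n_labels)].append(entity_id)

    sample = []
    for bucket in sorted(strata):
        members = strata[bucket]
        k = max(1, round(len(members) * fraction))  # at least one per bucket
        sample.extend(rng.sample(members, k))
    return sample


# Toy input: 800 fake entities with 0 to 7 non-English labels each.
toy_entities = [(f"Q{i}", i % 8) for i in range(800)]
picked = stratified_sample(toy_entities, fraction=0.1)
print(len(picked))
```

Sampling per bucket rather than from the full pool is what stops the many entities with zero non-English labels from swamping the ones that carry ground-truth transliterations.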

The Model

The encoder is genuinely small: 6 transformer layers, 8 attention heads, hidden dim 256, FFN dim 1024, dropout 0.1, max length 256 bytes. Total parameters: ~4M.
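
For reference, here is one way those hyperparameters and the byte-level input format could be written down. Treat it as a sketch under assumptions: the `EncoderConfig` class, the `encode_batch` helper, and the choice of 256 as a padding id (one past the largest byte value) are illustrative and may not match the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    num_layers: int = 6
    num_heads: int = 8
    hidden_dim: int = 256
    ffn_dim: int = 1024
    dropout: float = 0.1
    max_len: int = 256     # measured in bytes, not characters
    vocab_size: int = 256  # one id per possible byte value
    pad_id: int = 256      # assumption: padding gets an id outside the byte range

    @property
    def head_dim(self) -> int:
        """Per-head width; the hidden size must split evenly across heads."""
        if self.hidden_dim % self.num_heads != 0:
            raise ValueError("hidden_dim must be divisible by num_heads")
        return self.hidden_dim // self.num_heads


def encode_batch(names: list[str], cfg: EncoderConfig):
    """Convert names to padded byte-id rows plus a 1/0 attention mask."""
    rows = [list(n.encode("utf-8"))[: cfg.max_len] for n in names]
    width = max(len(r) for r in rows)
    input_ids = [r + [cfg.pad_id] * (width - len(r)) for r in rows]
    mask = [[1] * len(r) + [0] * (width - len(r)) for r in rows]
    return input_ids, mask


cfg = EncoderConfig()
ids, mask = encode_batch(["Vladimir", "Владимир"], cfg)
print(cfg.head_dim, len(ids[0]), sum(mask[0]), sum(mask[1]))
```

Note that the length limit counts bytes, so a name in a 3-byte-per-character script uses up the budget three times faster than a Latin one. Person names are short enough that 256 bytes leaves plenty of room.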