All of human cooking compressed into 2 megabytes

Abstract: We present Epicure, a family of three sibling skip-gram ingredient embeddings retrained from scratch on a multilingual recipe corpus. We aggregate 4.14M recipes from 11 sources spanning seven languages (English, Chinese, Russian, Vietnamese, Spanish, Turkish, Indonesian, German, and Indian-English) and normalise the raw ingredient strings to 1,790 canonical entries via an LLM-augmented pipeline.

Two graphs seed three Metapath2Vec variants: a 203,508-edge ingredient-ingredient NPMI graph and an 80,019-edge typed FlavorDB ingredient-compound graph with 2,247 typed compound nodes across 15 categories. The variants share architecture and hyperparameters and differ only in the random-walk schema: Cooc walks the co-occurrence graph only, Chem walks the typed compound metapaths only, and Core blends both by injecting ingredient-ingredient walks at a controlled mixing rate. Each model therefore sits at a distinct point on the chemistry-vs-recipe-context spectrum.
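
As a rough illustration of the two ingredients of this setup, the sketch below scores co-occurring ingredient pairs with the standard NPMI definition, NPMI(a, b) = PMI(a, b) / (-log p(a, b)), and then draws walks under the three schemas. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `min_npmi` threshold, the `mix` parameter, and the choice to mix at the level of whole walks are all hypothetical, and the 15 compound categories are collapsed into a single compound node type for brevity.

```python
import math
import random
from collections import Counter, defaultdict
from itertools import combinations


def npmi_edges(recipes, min_npmi=0.0):
    """Score co-occurring ingredient pairs with NPMI = PMI / -log p(a, b)."""
    n = len(recipes)
    single, pair = Counter(), Counter()
    for recipe in recipes:
        items = sorted(set(recipe))          # count each ingredient once per recipe
        single.update(items)
        pair.update(combinations(items, 2))  # unordered pairs (a, b) with a < b
    edges = {}
    for (a, b), count in pair.items():
        p_ab = count / n
        if p_ab >= 1.0:
            continue                         # NPMI is 0/0 when a pair is in every recipe
        pmi = math.log(p_ab / ((single[a] / n) * (single[b] / n)))
        score = pmi / -math.log(p_ab)
        if score > min_npmi:                 # hypothetical threshold, not from the paper
            edges[(a, b)] = score
    return edges


def adjacency(edges):
    """Turn undirected (a, b) edges into a neighbour-list dict."""
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    return dict(adj)


def cooc_walk(start, steps, cooc, rng):
    """Ingredient -> ingredient walk on the co-occurrence graph."""
    path = [start]
    for _ in range(steps):
        nbrs = cooc.get(path[-1])
        if not nbrs:
            break
        path.append(rng.choice(nbrs))
    return path


def chem_walk(start, steps, ing2comp, comp2ing, rng):
    """Ingredient -> compound -> ingredient metapath walk (compound types collapsed).

    Assumes every compound listed in ing2comp also appears as a key in comp2ing.
    """
    path = [start]
    for _ in range(steps):
        comps = ing2comp.get(path[-1])
        if not comps:
            break
        comp = rng.choice(comps)
        path += [comp, rng.choice(comp2ing[comp])]
    return path


def walk(start, steps, schema, cooc, ing2comp, comp2ing, mix=0.5, rng=random):
    """Pick the walk type by schema; 'core' injects cooc walks with probability mix."""
    if schema == "cooc" or (schema == "core" and rng.random() < mix):
        return cooc_walk(start, steps, cooc, rng)
    return chem_walk(start, steps, ing2comp, comp2ing, rng)
```

Under these assumptions, the walks produced for a given schema would form the corpus fed to a skip-gram trainer. Setting `mix` to 0 or 1 in the `core` schema reduces it to the Chem or Cooc behaviour respectively, which is one simple way to picture the three models as points on a single spectrum.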
