All of human cooking compressed into 2 megabytes
Abstract: We present Epicure, a family of three sibling skip-gram ingredient embeddings retrained from scratch on a multilingual recipe corpus. We aggregate 4.14M recipes from 11 sources spanning seven languages (English, Chinese, Russian, Vietnamese, Spanish, Turkish, Indonesian, German, and Indian-English) and normalise the raw ingredient strings to 1,790 canonical entries via an LLM-augmented pipeline.
A 203,508-edge ingredient-ingredient NPMI graph and an 80,019-edge typed FlavorDB ingredient-compound graph (2,247 typed compound nodes across 15 categories) seed three Metapath2Vec variants that share architecture and hyperparameters and differ only in the random-walk schema. Cooc walks the co-occurrence graph only, Chem walks the typed compound metapaths only, and Core blends both by injecting ingredient-ingredient walks at a controlled mixing ratio. Each model therefore sits at a distinct point on the chemistry-vs-recipe-context spectrum.
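To make the ingredient-ingredient graph concrete, here is a minimal sketch of how NPMI (normalised pointwise mutual information) edges could be computed from recipes that have already been reduced to canonical ingredient names. It uses the standard definition NPMI(a, b) = ln(p(a, b) / (p(a) p(b))) / (-ln p(a, b)). The function name `npmi_edges` and the `min_count` and `threshold` parameters are illustrative assumptions; the abstract does not state how Epicure filters or weights its edges.

```python
import math
from collections import Counter
from itertools import combinations


def npmi_edges(recipes, min_count=1, threshold=0.0):
    """Return {(a, b): npmi} for ingredient pairs scoring above `threshold`.

    `recipes` is an iterable of ingredient-name lists. `min_count` and
    `threshold` are hypothetical filters, not values taken from the paper.
    """
    recipes = list(recipes)
    n = len(recipes)
    single = Counter()  # number of recipes containing each ingredient
    pair = Counter()    # number of recipes containing each unordered pair

    for recipe in recipes:
        items = sorted(set(recipe))  # de-duplicate; sorting fixes pair order
        single.update(items)
        pair.update(combinations(items, 2))

    edges = {}
    for (a, b), c_ab in pair.items():
        if c_ab < min_count:
            continue
        p_ab = c_ab / n
        if p_ab >= 1.0:
            # The pair occurs in every recipe: NPMI is 0/0, so skip it.
            continue
        pmi = math.log(p_ab / ((single[a] / n) * (single[b] / n)))
        score = pmi / -math.log(p_ab)  # normalised to the range [-1, 1]
        if score > threshold:
            edges[(a, b)] = score
    return edges


if __name__ == "__main__":
    toy = [
        ["tomato", "basil"],
        ["tomato", "basil"],
        ["garlic", "onion"],
        ["garlic", "tomato"],
    ]
    for (a, b), score in sorted(npmi_edges(toy).items()):
        print(f"{a} - {b}: {score:.3f}")
```

With a positive threshold, only pairs that co-occur more often than chance would predict become edges. A co-occurrence-only random walk, such as the one the Cooc variant is described as using, could then traverse a graph of this kind.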