The Culture Funnel: You Can't Align What isn't in the Data

Abstract: Current cultural alignment approaches focus on inference-time interventions, assuming models already contain sufficient cultural knowledge. We argue modern LLM pipelines suffer from a cultural data funnel. Using a multidimensional tagging framework across pretraining, fine-tuning, alignment, and reasoning datasets, we show explicit cultural signals decline sharply during post-training, while geographically concentrated, task-specialized data dominates.

Multilinguality enhances the geographic diversity of cultural knowledge but does not ensure balanced representation. Our tags improve downstream cultural benchmark performance, demonstrating that advances in cultural alignment require shifting focus to training data pipelines. To facilitate future research, we release our culturally tagged dataset of 5.6M samples.
