Improving Cross-Lingual Factual Recall via Consistency-Driven Reinforcement Learning

Abstract: Large language models (LLMs) trained predominantly on English data encode substantial world knowledge, yet they often fail to express it reliably in other languages, a phenomenon known as cross-lingual factual inconsistency.

To study and address this, we introduce PolyFact, a large-scale parallel multilingual factual QA dataset containing 100K Wikidata-grounded facts across 12 typologically diverse languages.

Using PolyFact, we compare light continual pretraining (CPT), supervised fine-tuning (SFT), and reinforcement learning via Group Relative Policy Optimization (GRPO) for improving cross-lingual factual recall in Qwen-2.5-7B and OLMo-2-1124-7B.

We find that GRPO consistently outperforms SFT, improving both cross-lingual consistency and generalization to unseen languages, while CPT on parallel data yields limited additional gains.

Mechanistic analyses further show that GRPO reorganizes multilingual routing by reducing language specialization in MLP layers and attention heads, thereby promoting more shared cross-lingual representations.

We release our code, models, and dataset.
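
For readers unfamiliar with GRPO, the following is a minimal sketch of the group-relative normalization that gives the method its name: each sampled answer's reward is compared against the mean and standard deviation of the other answers sampled for the same prompt. It is an illustration only, not the training code used in this work. The function name `group_relative_advantages`, the `eps` constant, and the example rewards are assumptions introduced here; the abstract does not specify the reward used, and implementations differ on details such as population versus sample standard deviation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the group sampled for one prompt.

    Hypothetical helper: returns (r - mean) / (std + eps) for every
    reward in the group, using the population standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when all rewards in the group are equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to a single question,
# e.g. 1.0 for a correct answer and 0.0 for an incorrect one.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Answers that score above their group's average receive a positive advantage and those below receive a negative one; a group in which every answer earns the same reward yields all-zero advantages and therefore no learning signal from that prompt.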