CodeAlchemy: Synthetic Code Rewriting at Scale

Pre-training on raw code teaches syntax but provides only sparse signal for the diverse task formats that arise in real-world use. While synthetic data has proven transformative for language models, its use for code remains largely unexplored beyond limited quality improvements.

We present CodeAlchemy, a synthetic data generation framework that transforms publicly sourced code into semantically rich training data through five strategies: CodeEnhance (quality-aware rewriting), CodeQA (template-based problems), CodeDev (developer tasks), CodeDialogue (multi-turn conversations), and CodeTrace (execution traces).

We process three corpora spanning 15 languages to generate 500B+ tokens of synthetic data plus 350B reasoning tokens, orders of magnitude more than prior efforts. CodeTrace instruments and executes 1.3M+ files across 14 languages and 5K libraries, capturing control flow, state tracking, and library knowledge.
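
To make the idea of an execution trace concrete, the following is a minimal sketch of line-level instrumentation for a single Python function. It is not the CodeTrace pipeline, whose instrumentation is not described here; it only assumes Python's standard `sys.settrace` hook, and the helper `trace_execution`, the sample function `running_total`, and the event format are hypothetical names chosen for illustration.

```python
import sys
from typing import Any, Callable


def trace_execution(func: Callable[..., Any], *args: Any) -> tuple[Any, list[dict]]:
    """Run func(*args) and record control flow and local state per executed line.

    Hypothetical helper for illustration; not the CodeTrace implementation.
    """
    events: list[dict] = []
    target_code = func.__code__

    def tracer(frame, event, arg):
        # Ignore frames that do not belong to the function under study.
        if frame.f_code is not target_code:
            return None
        if event == "line":
            # A "line" event fires just before the line runs, so the snapshot
            # shows the state the line is about to operate on.
            events.append({
                "line": frame.f_lineno - target_code.co_firstlineno,
                "locals": dict(frame.f_locals),
            })
        elif event == "return":
            events.append({"return": arg})
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        # Always restore whatever tracer was installed before.
        sys.settrace(previous)
    return result, events


def running_total(values):
    total = 0
    for v in values:
        total += v
    return total


result, events = trace_execution(running_total, [1, 2, 3])
for event in events:
    print(event)
```

The sequence of line offsets records the control flow (the loop body repeats once per element), and the `locals` snapshots record how `total` evolves. A model trained or evaluated on such traces would have to predict these values rather than only the source text.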

We introduce DevEval (developer tasks) and TraceEval (execution prediction) benchmarks; frontier models like Claude Sonnet 4.5 achieve only 5.6% exact match on TraceEval, revealing critical gaps in semantic understanding.

Our 3B models achieve 83.5% on HumanEval, 63.2% on MBPP, an 8.09% win rate on DevEval, and 15.36 ROUGE-2 on TraceEval, outperforming frontier models roughly 10x their size, including the 27B Gemma-3 and the 32B Granite-4.0.
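
The two TraceEval metrics mentioned above, exact match and ROUGE-2, differ in how strictly they score a predicted trace. The sketch below illustrates that difference under stated assumptions: tokens are split on whitespace, and ROUGE-2 is computed as the F1 of clipped bigram overlap. The benchmark's actual tokenization and ROUGE variant are not specified here, and the function names are hypothetical.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> bool:
    """True only if the prediction equals the reference (ignoring outer whitespace)."""
    return prediction.strip() == reference.strip()


def bigram_counts(text: str) -> Counter:
    """Count adjacent token pairs; whitespace tokenization is an assumption."""
    tokens = text.split()
    return Counter(zip(tokens, tokens[1:]))


def rouge_2_f1(prediction: str, reference: str) -> float:
    """F1 over clipped bigram overlap between prediction and reference."""
    pred, ref = bigram_counts(prediction), bigram_counts(reference)
    # Counter intersection keeps the minimum count of each shared bigram.
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "x = 1 y = 2"
prediction = "x = 1 y = 3"
print(exact_match(prediction, reference))
print(rouge_2_f1(prediction, reference))
```

In this example a single wrong value makes exact match fail outright, while four of the five bigrams still overlap, so the bigram score stays high. This is one reason a model can post a low exact-match rate and still earn partial credit under ROUGE-2.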