CodeAlchemy: Synthetic Code Rewriting at Scale
Pre-training on raw code teaches syntax but provides only sparse signal for the diverse task formats found in real-world use. While synthetic data has proven transformative for language models, its use for code remains largely unexplored beyond limited quality improvements.
We present CodeAlchemy, a synthetic data generation framework that transforms publicly sourced code into semantically rich training data through five strategies: CodeEnhance (quality-aware rewriting), CodeQA (template-based problems), CodeDev (developer tasks), CodeDialogue (multi-turn conversations), and CodeTrace (execution traces).
We process three corpora spanning 15 languages to generate 500B+ tokens of synthetic data plus 350B reasoning tokens, orders of magnitude more than prior efforts. CodeTrace instruments and executes 1.3M+ files across 14 languages and 5K libraries, capturing control flow, state tracking, and library knowledge.
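To make the idea of an execution trace concrete, the following is a minimal sketch of line-level tracing in Python using the standard `sys.settrace` hook. It is an illustration only, not the CodeTrace pipeline: the helper `trace_execution`, the sample function `sum_even`, and the trace format (line offset plus a snapshot of local variables) are all assumptions introduced here.

```python
import sys


def trace_execution(func, *args):
    """Run func(*args), recording (line offset, locals snapshot) before each executed line.

    Hypothetical helper for illustration; not part of CodeAlchemy.
    """
    events = []
    code = func.__code__

    def tracer(frame, event, arg):
        # Only follow frames that belong to the traced function.
        if frame.f_code is not code:
            return None
        if event == "line":
            # Offset is relative to the `def` line, so it is stable wherever the code lives.
            offset = frame.f_lineno - code.co_firstlineno
            events.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        # Always restore whatever trace hook was active before.
        sys.settrace(previous)
    return result, events


def sum_even(n):
    total = 0
    for i in range(n):
        if i % 2 == 0:
            total += i
    return total


result, events = trace_execution(sum_even, 4)
for offset, local_vars in events:
    print(offset, local_vars)
```

Each recorded event pairs a control-flow step (which line is about to run) with the program state at that point, which is the kind of signal an execution-prediction task asks a model to reproduce.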
We introduce DevEval (developer tasks) and TraceEval (execution prediction) benchmarks; frontier models like Claude Sonnet 4.5 achieve only 5.6% exact match on TraceEval, revealing critical gaps in semantic understanding.
Our 3B models achieve 83.5% on HumanEval, 63.2% on MBPP, an 8.09% win rate on DevEval, and 15.36 ROUGE-2 on TraceEval, outperforming models roughly 10x their size, including the 27B Gemma-3 and the 32B Granite-4.0.
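The two TraceEval metrics reported above differ sharply in strictness. The sketch below shows one simplified way to compute them, assuming whitespace tokenization and the F1 variant of ROUGE-2; the actual tokenization and ROUGE variant used for TraceEval are not specified here, and the function names and sample strings are hypothetical.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> bool:
    """True only if the prediction equals the reference after trimming whitespace."""
    return prediction.strip() == reference.strip()


def bigrams(text: str) -> Counter:
    """Count adjacent token pairs, using whitespace tokenization (an assumption)."""
    tokens = text.split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-2: F1 over clipped bigram overlap."""
    pred, ref = bigrams(prediction), bigrams(reference)
    overlap = sum((pred & ref).values())  # multiset intersection clips repeated bigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "x = 1 y = 2"
prediction = "x = 1 y = 3"
print(exact_match(prediction, reference))
print(rouge2_f1(prediction, reference))
```

In this toy case a single wrong value makes the exact match fail while four of the five bigrams still overlap, which is why a model can score near zero on exact match yet earn partial credit under ROUGE-2.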