Exploring Autonomous Agentic Data Engineering for Model Specialization

Abstract: Large Language Models (LLMs) perform strongly on general tasks but often struggle to adapt to specialized domains without high-quality domain-specific data. Existing LLM-based data curation methods rely primarily on human-designed workflows, leaving unexamined whether LLMs can autonomously execute an end-to-end data engineering pipeline for model specialization.

We formalize Autonomous Agentic Data Engineering, a novel task designed to evaluate LLMs as autonomous data engineers that drive model specialization through end-to-end data curation. We frame data as an optimizable component and study agents that plan, generate, and iteratively optimize training data across multiple domains, guided by post-training performance improvement.
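The plan, generate, and optimize loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `optimize_data`, `toy_propose`, and `toy_eval` are hypothetical, the agent and the fine-tune-then-benchmark step are replaced by toy stand-ins, and the greedy "keep only improvements" acceptance rule is an assumption made for illustration.

```python
from typing import Callable, List, Tuple

Dataset = List[str]

def optimize_data(
    propose: Callable[[Dataset, float], Dataset],
    train_and_eval: Callable[[Dataset], float],
    initial: Dataset,
    rounds: int = 5,
) -> Tuple[Dataset, float]:
    """Greedy loop: keep the dataset with the best post-training score."""
    best_data, best_score = initial, train_and_eval(initial)
    for _ in range(rounds):
        # The agent revises the data given the current feedback (hypothetical hook).
        candidate = propose(best_data, best_score)
        # Feedback signal: performance of the student after training on the candidate.
        score = train_and_eval(candidate)
        # Assumed acceptance rule: accept a revision only if it strictly improves.
        if score > best_score:
            best_data, best_score = candidate, score
    return best_data, best_score

def toy_eval(data: Dataset) -> float:
    # Stand-in for "fine-tune the student, then benchmark it":
    # here, simply the share of examples tagged as in-domain.
    return sum(x.startswith("domain:") for x in data) / max(len(data), 1)

def toy_propose(data: Dataset, score: float) -> Dataset:
    # Stand-in for the LLM agent: rewrite the first off-domain example.
    out = list(data)
    for i, x in enumerate(out):
        if not x.startswith("domain:"):
            out[i] = "domain:" + x
            break
    return out

if __name__ == "__main__":
    data, score = optimize_data(toy_propose, toy_eval, ["a", "b", "domain:c", "d"])
    print(data, score)
```

In a realistic setting, `train_and_eval` would be by far the most expensive call, so the number of rounds is effectively a compute budget for the agent.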

Experiments show that autonomous LLM data engineers yield substantial gains: GPT-5.2 constructs a training curriculum that improves a student model by 57.29% entirely through iterative, agent-driven data adaptation. By illuminating both the potential and the bottlenecks of this approach, our study establishes autonomous data engineering as a measurable capability and charts a path toward agent-driven model specialization.
