Programmatic Context Augmentation for LLM-based Symbolic Regression


Abstract: Symbolic regression (SR), the task of discovering mathematical expressions that best describe a given dataset, remains a fundamental challenge in scientific discovery. Traditional approaches, primarily based on genetic algorithms and related evolutionary methods, have proven useful but suffer from scalability and expressivity limitations.

Recently, large language model (LLM)-based evolutionary search methods have been introduced into SR and show promise. However, existing LLM-based approaches typically rely on scalar evaluation metrics, such as mean squared error, as the sole source of feedback during the search process, thereby overlooking the rich information embedded in the dataset.

To address this limitation, we propose a novel LLM-based evolutionary search framework that incorporates programmatic context augmentation. By enabling code-based interactions with the dataset, our method can actively perform data analysis and extract informative signals beyond aggregated evaluation scores.
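To make the idea concrete, the contrast between scalar feedback and programmatic context can be sketched as follows. This is a minimal illustration, not the paper's implementation: the probe functions, feature names, and the specific diagnostics (residual mean and per-variable residual correlations) are hypothetical examples of the kind of code-based signals such a framework could surface to the LLM alongside the fitness score.

```python
import numpy as np

def scalar_feedback(y_true, y_pred):
    """Baseline setting: a single aggregate score (MSE) is the only feedback."""
    return float(np.mean((y_true - y_pred) ** 2))

def augmented_context(X, y, y_pred):
    """Hypothetical programmatic probe: code-derived diagnostics that expose
    structure the scalar score hides, to be placed in the LLM's context."""
    residuals = y - y_pred
    return {
        "mse": float(np.mean(residuals ** 2)),
        # A nonzero residual mean indicates a systematic offset or missing term.
        "residual_mean": float(np.mean(residuals)),
        # Correlation of residuals with each input variable flags which
        # variable the current candidate expression models poorly.
        "residual_corr": [
            float(np.corrcoef(X[:, j], residuals)[0, 1])
            for j in range(X.shape[1])
        ],
    }

# Toy data: the true law is y = x0**2, but the candidate guesses 0.5*x0.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(200, 2))
y = X[:, 0] ** 2
y_pred = 0.5 * X[:, 0]  # candidate expression "0.5*x0"

ctx = augmented_context(X, y, y_pred)
# The scalar score alone says only "the fit is bad"; the probe additionally
# shows a positive residual mean and a strong residual correlation with x0,
# both hinting that the x0 term is mis-specified.
```

The design intent is that each search iteration may execute such probes against the dataset and feed their outputs back as structured context, rather than a single number.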

We evaluate our framework on challenging benchmarks such as LLM-SRBench and demonstrate superior efficiency and accuracy compared to strong baselines.