Auto-FL-Research: Agentic Search for Federated Learning Algorithms

Abstract: Federated learning (FL) research often depends on many small but consequential algorithmic choices: optimizer variants, server aggregation rules, local training schedules, normalization, regularization, and model architecture. These choices are expensive to explore manually and difficult to compare fairly when candidate changes can also alter the FL training or evaluation path.

In this work, we present Auto-FL-Research (AFR), a constrained coding-agent workflow for FL algorithmic recipe search. Agents may propose and implement candidate training algorithms, including server aggregation rules, client update schedules, local objectives, and registered model variants, while task profiles fix the mutation surface, compute budget, communication contract, and final model evaluation.
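
As a rough illustration of what "fixing the mutation surface" could look like in practice, the following is a minimal sketch, not AFR's actual implementation. The `TaskProfile` class, its field names, the `surface_violations` helper, and the glob-based file check are all hypothetical; the source only states that task profiles fix the mutation surface, compute budget, communication contract, and final evaluation.

```python
from __future__ import annotations

from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical task profile: what a candidate may edit and spend."""

    name: str
    mutable_globs: tuple[str, ...]  # mutation surface, as path patterns (assumed)
    max_rounds: int                 # communication contract: FL rounds (assumed)
    max_wall_seconds: float         # compute budget per candidate (assumed)
    eval_entrypoint: str            # fixed final evaluation; never editable


def surface_violations(profile: TaskProfile, edited_files: list[str]) -> list[str]:
    """Return the edited paths that fall outside the allowed mutation surface."""
    return [
        path
        for path in edited_files
        if path == profile.eval_entrypoint
        or not any(fnmatchcase(path, glob) for glob in profile.mutable_globs)
    ]
```

Under this sketch, a candidate whose edits produce a non-empty violation list would be rejected before training, so that score differences cannot come from a changed evaluation path.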

Each campaign records candidate scores, runtime, edited files, artifacts, and failure status. We evaluate AFR on five healthcare cross-silo FLamby tasks and on grouped-client profiles for the five fixed LEAF datasets plus the LEAF synthetic task. Five-seed repeat evaluations support gains on four of the five FLamby tasks and five of the six LEAF profiles, while also exposing seed-sensitive and search-selected failure cases.
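
The per-candidate record and the repeat evaluation described above could be sketched as follows. This is an illustrative assumption rather than AFR's actual schema: `CandidateRecord`, its field names, and `repeat_summary` are hypothetical, and the paired per-seed comparison (with higher scores assumed better) is one plausible way to summarize a five-seed repeat.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from statistics import mean
from typing import Optional


@dataclass
class CandidateRecord:
    """Hypothetical campaign log entry for one candidate."""

    candidate_id: str
    score: Optional[float]          # None when the run failed before evaluation
    runtime_seconds: float
    edited_files: list[str]
    artifacts: list[str] = field(default_factory=list)
    failed: bool = False


def repeat_summary(baseline: list[float], candidate: list[float]) -> dict:
    """Compare per-seed scores pairwise; entry i of both lists uses the same seed."""
    if not baseline or len(baseline) != len(candidate):
        raise ValueError("need one baseline and one candidate score per seed")
    deltas = [c - b for b, c in zip(baseline, candidate)]
    return {
        "n_seeds": len(deltas),
        "mean_delta": mean(deltas),
        "seeds_improved": sum(d > 0 for d in deltas),
    }
```

Reporting the number of improved seeds next to the mean difference makes seed-sensitive candidates visible, since a positive mean can be driven by a single seed.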

Same-budget controls show that several gains correspond to FL-recipe changes, whereas other improvements are recovered by fixed-surface scalar controls or fail under repeat or held-out evaluation. These mixed outcomes are part of the contribution: they show how agent-generated candidates can be separated into repeated FL mechanisms, fixed-surface tuning effects, and selected single-run artifacts.
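
The three-way separation described above can be written as a simple decision rule. The sketch below is an assumption about how such a rule might look, not the paper's procedure: the `Outcome` enum, the `classify` function, its gain inputs, and the `min_gain` and `tol` thresholds are all hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    FL_MECHANISM = "repeated FL mechanism"
    TUNING_EFFECT = "fixed-surface tuning effect"
    SINGLE_RUN_ARTIFACT = "selected single-run artifact"


def classify(
    repeat_gain: float,    # mean gain over baseline across repeat seeds
    heldout_gain: float,   # gain over baseline on held-out evaluation
    control_gain: float,   # gain of the same-budget fixed-surface scalar control
    min_gain: float = 0.0, # assumed threshold for "the gain survives"
    tol: float = 0.0,      # assumed slack when comparing against the control
) -> Outcome:
    """Assign a candidate to one of the three outcome categories."""
    # A gain that disappears under repeat or held-out evaluation is an artifact
    # of selecting the best single run during search.
    if repeat_gain <= min_gain or heldout_gain <= min_gain:
        return Outcome.SINGLE_RUN_ARTIFACT
    # A gain that a scalar control recovers is attributed to tuning, not to
    # the FL-recipe change itself.
    if control_gain >= repeat_gain - tol:
        return Outcome.TUNING_EFFECT
    return Outcome.FL_MECHANISM
```

Checking repeat and held-out survival first means a candidate is credited as an FL mechanism only after both the selection effect and the tuning explanation have been ruled out.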
