DynaSchedBench: Calibrated Dynamic Scheduling Benchmarks and Observability Paradox in LLM-based Scheduling Agents

Abstract: Progress in neural combinatorial optimization for the Dynamic Flexible Job Shop Scheduling Problem (DFJSP) is currently hindered by a methodological tension: static benchmarks encourage overfitting to the benchmark, while uncalibrated generators obscure algorithmic capability with stochastic noise.

To resolve this tension, we introduce DynaSchedBench, a diagnostic framework for DFJSP that rigorously controls the instance-generation process. Instead of relying on parameter sampling, our approach uses a Sequential Event-Space Calibrator (SESC) that computes a novel Schedule Stress Index (SSI) to stratify instances by difficulty.

We demonstrate that SESC is substantially more computationally efficient than evolutionary baselines while converging reliably to the target metrics. The framework integrates modular components for instance generation, snapshot-based simulation, agents, evaluation, and visualization, thereby enabling rigorous testing of reactive and lookahead-based policies.

Leveraging this calibrated environment, we identify key limitations of LLM-based scheduling agents. Specifically, in step-wise online decision-making for dynamic scheduling, we observe an “Observability Paradox”: giving agents oracle access to full structural information can degrade policy performance, leaving them worse off than agents given only concise information.

Furthermore, despite substantial token overhead, tool-augmented and refinement strategies fail to reliably improve performance, and most LLM agents fail to consistently surpass strong dispatching baselines, behaving more like robust heuristic approximators than superior optimizers.
