Behavior-Aware Auxiliary Corrections for Off-Policy Temporal-Difference Prediction

Abstract: Temporal-difference learning with function approximation can be unstable under off-policy sampling. TDC stabilizes off-policy TD through an auxiliary covariance correction, and TDRC further regularizes this correction in a single-timescale recursion. This paper studies a behavior-aware replacement of the auxiliary covariance geometry in the linear prediction setting, which is the standard local model for understanding the feature-space dynamics of value-function approximation.

We first replace the TDC auxiliary matrix \(C\) by the behavior Bellman matrix \(A_\mu\), yielding BA-TDC, and then regularize the same behavior-aware equation to obtain BA-TDRC. This two-step construction separates the contribution of behavior-aware geometry from the contribution of regularization. The linear analysis also provides a tractable model for an auxiliary-geometry design question that arises in neural-network value approximation, where feature covariances and temporal transition matrices jointly shape the last-layer correction dynamics.

We give a finite-state mean-system formulation, prove fixed-point preservation and almost-sure convergence under a Hurwitz stability condition on the instantiated mean system, and compare deterministic mean rates through the spectral radius of the exact linear error recursion. Experiments on the two-state counterexample, Baird’s counterexample, Random Walk, and Boyan Chain show that the behavior-aware replacement can be highly beneficial by itself on some tasks, but that regularization is necessary for robust performance across harder settings.
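
To make the two diagnostics named above concrete, the following is a minimal sketch of a Hurwitz check and a spectral-radius rate computation on a mean-system matrix. Everything specific in it is an assumption for illustration and not taken from the paper: the helper names (`is_hurwitz`, `mean_rate`, `tdc_mean_matrix`), the error recursion \(e_{k+1} = (I + \alpha G) e_k\), the textbook form of the TDC mean matrix, and the scalar two-state instance (features 1 and 2, discount 0.9). The BA-TDC and BA-TDRC matrices are not reproduced here, since their exact form is not stated in this abstract.

```python
import numpy as np

def is_hurwitz(G, tol=1e-12):
    """Return True if every eigenvalue of G has strictly negative real part."""
    return bool(np.max(np.linalg.eigvals(G).real) < -tol)

def mean_rate(G, alpha):
    """Spectral radius of I + alpha*G.

    Assumed recursion: e_{k+1} = (I + alpha*G) e_k, so a value below 1
    means the deterministic mean error contracts at that rate per step.
    """
    M = np.eye(G.shape[0]) + alpha * G
    return float(np.max(np.abs(np.linalg.eigvals(M))))

def tdc_mean_matrix(A, C, eta=1.0):
    """Textbook TDC mean matrix acting on the stacked error (theta - theta*, h).

    Assumed form (not the paper's notation):
      d(theta)/dt = -A e - (C - A^T) h
      d(h)/dt     = eta * (-A e - C h)
    """
    return np.block([[-A, -(C - A.T)],
                     [-eta * A, -eta * C]])

# Hypothetical scalar instance: only the transition from a state with
# feature x = 1 to a state with feature x' = 2 is sampled.
gamma = 0.9
A = np.array([[1.0 * (1.0 - gamma * 2.0)]])  # x * (x - gamma * x')
C = np.array([[1.0]])                        # x * x

G_td = -A                      # plain off-policy TD mean matrix
G_tdc = tdc_mean_matrix(A, C)  # TDC with eta = 1

print("TD  Hurwitz:", is_hurwitz(G_td), " rate:", mean_rate(G_td, 0.1))
print("TDC Hurwitz:", is_hurwitz(G_tdc), " rate:", mean_rate(G_tdc, 0.1))
```

On this instance the plain TD matrix fails the Hurwitz test while the TDC matrix passes it, which is the kind of comparison the spectral-radius analysis formalizes. Substituting a different auxiliary matrix for `C` would be the natural place to experiment with a behavior-aware variant, under whatever definition the full paper gives.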
