Playing Connect Four with Deep Q-Learning
Solving multi-player games with function approximation.
In the previous posts, we explored how to extend Reinforcement Learning (RL) beyond the tabular setting using function approximation. While this allowed us to generalize across states, our experiments also revealed an important limitation: in simple environments like GridWorld, approximate methods can struggle to match the stability and efficiency of tabular approaches. The main reason is that learning a good representation is itself a difficult problem, one that can outweigh the benefits of generalization while the state space is still relatively small.
To truly unlock the power of function approximation, we therefore need to move to environments where tabular methods are no longer viable. This naturally leads us to multi-player games, where the state space grows combinatorially and generalization becomes essential. It also fits naturally into this post series, since so far we have not managed to learn any meaningful behavior in more complex multi-player environments. In this post, we take this step by considering the classic game of Connect Four and investigate how to learn strong policies using Deep Q-Learning.
Introduction to Approximate Solution Methods for Reinforcement Learning: From Sarsa to Deep Q-Learning
To tackle this task, we extend our framework along several important dimensions. First, we move from online updates to a batched training setup. In our earlier implementation of Sarsa, we updated the model after every transition. While faithful to the original algorithm [1], this approach is computationally inefficient: each optimizer step incurs a non-trivial cost, and modern hardware (especially GPUs) is designed to operate on batches with only marginal additional overhead.
To address this, we introduce a replay buffer. Instead of updating immediately, we store transitions as they are encountered, either up to a fixed capacity or, in our case, until one or more games have finished. We then perform a batched update over this collected experience. This not only improves computational efficiency but also stabilizes learning by reducing the variance of individual updates.
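To make this concrete, a minimal buffer could look as follows. Note that the class and method names (ReplayBuffer, push, sample) are illustrative placeholders rather than the exact ones used in the repository:

```python
import random
from collections import deque, namedtuple

# Illustrative sketch only; names and fields are placeholders.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    def __init__(self, capacity=50_000):
        # A bounded deque discards the oldest transitions once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, *transition):
        self.buffer.append(Transition(*transition))

    def sample(self, batch_size):
        # Uniform sampling decorrelates consecutive transitions within a game.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)
```

In our setup, the collected transitions are consumed by a batched update once one or more games have finished, rather than being sampled continuously during play.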
At this point, an important conceptual shift occurs. By sampling from past experience rather than strictly following the current policy, we move away from Sarsa, an on-policy method, towards Q-learning, which is off-policy. While we have not formally reintroduced Q-learning in the function approximation setting here, the extension from the tabular case is largely straightforward. This combination of replay buffers and Q-learning forms the foundation of Deep Q-Networks (DQNs), popularized by DeepMind in their seminal work on Atari games [2].
Finally, we turn to scalability. Reinforcement learning is inherently data-hungry, so increasing throughput is crucial. To this end, we implement a vectorized environment wrapper that allows us to simulate multiple games of Connect Four in parallel. Concretely, a single call to step(a) now processes a batch of actions and advances all environments simultaneously. In practice, however, achieving true parallelism in Python is non-trivial: the Global Interpreter Lock (GIL) ensures that only one thread executes Python bytecode at a time, which limits the effectiveness of multi-threading for CPU-bound workloads such as environment stepping.
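The following sketch illustrates the idea of the wrapper; make_env, the sub-environment API (reset and step returning observation, reward, and done), and the auto-reset behavior are simplifying assumptions for illustration and do not mirror the PettingZoo-based implementation exactly:

```python
import numpy as np

class VectorizedEnv:
    """Sketch of a vectorized wrapper: one step() call advances all sub-environments."""

    def __init__(self, make_env, num_envs=64):
        self.envs = [make_env() for _ in range(num_envs)]

    def reset(self):
        return np.stack([env.reset() for env in self.envs])

    def step(self, actions):
        observations, rewards, dones = [], [], []
        for env, action in zip(self.envs, actions):
            obs, reward, done = env.step(action)
            if done:  # auto-reset finished games so the batch always stays full
                obs = env.reset()
            observations.append(obs)
            rewards.append(reward)
            dones.append(done)
        return np.stack(observations), np.array(rewards), np.array(dones)
```

Note that the loop over sub-environments still runs sequentially because of the GIL; the speedup comes mainly from batching the agent's forward pass across all games rather than from parallel stepping.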
We also experimented with multi-processing, but found that the additional overhead (e.g., inter-process communication) largely offset any gains in our setting. For the interested reader, I recommend an earlier post of mine on this topic. Despite these limitations, the combination of batched updates and environment vectorization yields a substantial improvement in throughput, increasing performance to approximately 50–100 games per second.
Implementation
In this post, I deliberately avoid going into too much detail on the environment vectorization and instead focus on the RL aspects. Partly, this is because the vectorization itself is “just” an implementation detail, but also because, in all honesty, our current setup is not ideal. Much of this is due to limitations imposed by the PettingZoo environment we are using. In future posts, we will explore different environments and revisit this topic with a stronger emphasis on scalability, a crucial aspect of modern reinforcement learning.
For a more detailed discussion of how we structure multi-player environments, manage agents, and maintain an opponent pool, I refer to my earlier post on multi-player RL. The vectorized setup used here is simply an extension of that framework to multiple games running in parallel. As always, the full implementation is available on GitHub.
Revisiting Q-Learning
Let us briefly revisit Q-learning and connect it to our implementation. The core update rule is given by:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

In contrast to Sarsa, which uses the action actually taken in the next state, Q-learning uses a max operator over all possible next actions. This makes it off-policy, as the update does not depend on the behavior policy used to generate the data. In practice, this often leads to faster propagation of value information, especially in deterministic environments such as board games.
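The difference to Sarsa is easiest to see in the bootstrapped targets themselves. The following tabular snippet is purely illustrative; Q is assumed to be a NumPy array of shape (n_states, n_actions):

```python
import numpy as np

def sarsa_target(Q, r, s_next, a_next, gamma=0.99):
    # On-policy: bootstrap from the action actually taken in the next state.
    return r + gamma * Q[s_next, a_next]

def q_learning_target(Q, r, s_next, gamma=0.99):
    # Off-policy: bootstrap from the greedy action, regardless of what was played.
    return r + gamma * np.max(Q[s_next])
```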
When combined with neural networks, this approach is commonly referred to as Deep Q-Learning. Instead of maintaining a table of values, we train a neural network Qθ(s, a) to approximate the action-value function. The update is then implemented as a regression problem, minimizing the difference between the current estimate and a bootstrapped target:

$$y = r + \gamma \, (1 - \text{done}) \, \max_{a'} Q_\theta(s', a'), \qquad L(\theta) = \big( y - Q_\theta(s, a) \big)^2$$

In our implementation, this corresponds directly to the batch_update function. Given a batch of transitions (s, a, r, s', done), we first compute the predicted Q-values for the taken actions: q = self.q(batch.states, …); q_sa = q.gather(1, batch.actions.unsqueeze(1)).squeeze(1). Next, we construct the target.
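A minimal sketch of how the remainder of the update could look, assuming the batch additionally exposes next_states, rewards, and dones tensors and that gamma and self.optimizer are defined in the surrounding agent; the actual batch_update in the repository may differ in detail:

```python
import torch
import torch.nn.functional as F

# Continuation of the batch_update sketch (illustrative; see assumptions above).
with torch.no_grad():
    q_next = self.q(batch.next_states)            # Q-values for all actions in s'
    max_q_next = q_next.max(dim=1).values         # off-policy max operator
    target = batch.rewards + gamma * (1 - batch.dones.float()) * max_q_next

loss = F.mse_loss(q_sa, target)  # regression towards the bootstrapped target
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
```

Masking the bootstrap term with (1 - done) ensures that terminal transitions, such as a won or lost game, are trained towards the final reward alone.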