How PgBouncer Works
How PgBouncer Works
PgBouncer 是如何工作的
A from-the-ground-up explanation of PgBouncer: why Postgres needs a connection pooler, what each pooling mode actually does, what breaks under transaction pooling, and how to run the thing in production. About a 30-minute read. Versions reference PgBouncer 1.25.1, the current stable as of writing. 这是一篇关于 PgBouncer 的深度入门指南:解释了为什么 Postgres 需要连接池、每种池化模式的具体作用、事务池模式下会产生哪些问题,以及如何在生产环境中运行它。阅读全文约需 30 分钟。本文参考版本为撰写时的最新稳定版 PgBouncer 1.25.1。
1. The thing PgBouncer fixes
1. PgBouncer 解决的问题
If you have used Postgres for any length of time, you have probably run into some version of this story. Your app has been fine for months. Then you push a release that adds a few more workers, or traffic spikes during a launch, or your background job system starts running more parallel jobs than it used to. Suddenly the database starts refusing connections. Latency goes up. Memory usage on the database host climbs to ninety-something percent. The error logs are full of FATAL: sorry, too many clients already, and your max_connections setting, which seemed generous when you set it, is now the thing standing between you and a working application.
如果你使用 Postgres 一段时间,很可能遇到过这种情况:你的应用运行了几个月都很正常,但当你发布了一个增加工作进程的新版本,或者在上线时流量激增,又或者后台任务系统开始并行处理更多任务时,数据库突然开始拒绝连接。延迟升高,数据库主机的内存占用飙升至 90% 以上。错误日志中充斥着 FATAL: sorry, too many clients already,而你当初觉得设置得绰绰有余的 max_connections 参数,现在却成了阻碍应用正常运行的绊脚石。
The textbook fix is “add a connection pooler.” The textbook recommendation is PgBouncer. And then, if you are like most people, you set pool_mode = transaction, point your app at port 6432 instead of 5432, and walk away thinking the problem is solved. It usually is. But the thing you just deployed is doing something quite specific, and if you do not understand what, the next bug you hit will be confusing in a way that direct Postgres bugs are not.
教科书式的解决方案是“添加一个连接池”。教科书式的推荐工具是 PgBouncer。如果你像大多数人一样,设置了 pool_mode = transaction,将应用连接端口从 5432 改为 6432,然后就以为问题解决了。通常情况下确实如此,但你刚刚部署的这个组件在执行非常具体的操作。如果你不理解其原理,那么你遇到的下一个 Bug 将会让你感到困惑,这种困惑与直接使用 Postgres 时遇到的 Bug 完全不同。
Why doesn’t SET search_path stick? Why did my long-running report time out at 5 seconds when I never set a 5-second timeout? Why does this Perl script work when I connect directly but fail through PgBouncer? PgBouncer presents itself as a transparent proxy, and in one mode it almost is, but in the mode you actually want to use, it isn’t, and the gap between what looks transparent and what actually is transparent is where the bugs live. This piece is about that gap.
为什么 SET search_path 不生效?为什么我从未设置过 5 秒超时,但我的长耗时报表却在 5 秒时超时了?为什么这个 Perl 脚本直接连接时正常,通过 PgBouncer 连接却失败了?PgBouncer 自称是一个透明代理,在某种模式下它确实几乎是透明的,但在你真正想用的模式下却并非如此。这种“看起来透明”与“实际透明”之间的鸿沟,正是 Bug 滋生的地方。本文旨在探讨这一鸿沟。
We’ll start with why Postgres connections cost something in the first place, work through the three pooling modes and what each one buys and breaks, then look at how to actually run PgBouncer in production: how to size the pool, how to authenticate without storing every password twice, how to monitor it, and the long list of small things that go subtly wrong. 我们将从 Postgres 连接为何会产生开销开始,深入探讨三种池化模式及其优缺点,然后研究如何在生产环境中运行 PgBouncer:如何规划连接池大小、如何在不重复存储密码的情况下进行身份验证、如何监控它,以及那些容易被忽视的各种细微问题。
Act 1 — Why connections are expensive
第一幕:为什么连接很昂贵
The Postgres process model and why it does not love thousands of clients. Postgres 的进程模型,以及为什么它无法轻松应对数千个客户端。
2. The fork that started it all
2. 一切的起点:Fork 机制
Postgres has been around since 1986. Its connection model dates from a time when the modern web did not exist and the typical workload was a handful of analysts running queries from terminals. The model is simple. There is one supervisor process called the postmaster. When a client connects, the postmaster calls fork() and produces a new child process called a backend. That backend is yours, exclusively, for the lifetime of the connection. When you disconnect, it exits. Every query you run, every transaction, every SET, every prepared statement, every temporary table, lives inside that one process.
Postgres 诞生于 1986 年。其连接模型源于那个现代互联网尚未出现的时代,当时典型的工作负载只是少数分析师在终端上运行查询。该模型很简单:有一个名为 postmaster 的管理进程。当客户端连接时,postmaster 会调用 fork() 并产生一个新的子进程,称为 backend(后端进程)。在连接的整个生命周期内,该后端进程专属于你。当你断开连接时,它就会退出。你运行的每一个查询、每一个事务、每一个 SET 指令、每一个预处理语句、每一个临时表,都存在于这一个进程中。
This design has real virtues. Each connection is isolated, so a memory bug in one query cannot corrupt another client’s session. Cancellation is a SIGINT to a specific PID. The OS scheduler does the heavy lifting of fairness between connections. It is a very Unix design and it ages well in many ways.
这种设计有其显著优点。每个连接都是隔离的,因此一个查询中的内存错误不会破坏另一个客户端的会话。取消操作只需向特定的 PID 发送 SIGINT 信号。操作系统调度器负责处理连接间的公平性。这是一种非常 Unix 风格的设计,在许多方面都经受住了时间的考验。
It does not age well in one specific way. A process is not free. Forking it is not free. Initializing the backend’s catalog cache is not free. And keeping it around when it is doing nothing is not free either. 但在一个特定方面,它显得有些过时:进程不是免费的,Fork 进程不是免费的,初始化后端进程的目录缓存也不是免费的。即使在空闲时保持进程存在,也是有代价的。
Fig 1. The Postgres connection model. Every client gets a dedicated OS process. The processes are isolated, which is great for stability, and unshared, which is bad for scaling. 图 1. Postgres 连接模型。每个客户端都有一个专用的操作系统进程。进程之间相互隔离(有利于稳定性),但互不共享(不利于扩展)。
3. What an idle connection actually costs
3. 空闲连接的实际成本
Memory is the headline cost, and the number you usually hear is “10 MB per connection.” That number is roughly right but slightly misleading. The truth is that ps and top overcount, because Linux uses copy-on-write for forked processes and does not split out shared memory cleanly in those tools.
内存是主要的成本,通常听到的数字是“每个连接 10 MB”。这个数字大致正确,但略有误导性。事实是 ps 和 top 命令统计得过高,因为 Linux 对 Fork 出来的进程使用了写时复制(copy-on-write),而这些工具无法清晰地分离共享内存。
When you measure more carefully — Andres Freund’s writeup uses a mix of /proc/[pid]/smaps_rollup (an approximation, not an exact figure) and direct PSS accounting — the incremental private memory of an idle Postgres backend is closer to 1.5 to 2 MB on a fresh connection, growing to 5 to 15 MB once the connection has done some work and accumulated catalog cache, prepared plans, and per-process state. The AWS database team’s RDS-side measurements landed at 1.5 to 14.5 MB depending on workload, which lines up.
当你进行更仔细的测量时(Andres Freund 的文章结合使用了 /proc/[pid]/smaps_rollup 和直接的 PSS 统计),一个空闲的 Postgres 后端进程在刚建立连接时,增量私有内存更接近 1.5 到 2 MB;一旦连接执行了一些工作并积累了目录缓存、预处理计划和进程状态,内存占用会增长到 5 到 15 MB。AWS 数据库团队在 RDS 端的测量结果为 1.5 到 14.5 MB(取决于工作负载),这与上述结论一致。
So it is not “10 MB per connection” exactly. It is more like “a few MB per connection, more if it does anything interesting, multiplied by however many connections you have, even if they are idle.” On a host with 32 GB of RAM, where you also want most of that going to shared_buffers and the OS page cache, you cannot afford 2,000 idle connections. You probably cannot afford 1,000.
所以,并不是严格的“每个连接 10 MB”。更准确的说法是:“每个连接占用几 MB,如果执行了复杂操作则更多,乘以你的连接总数,即使它们处于空闲状态。”在一台 32 GB 内存的主机上,你还需要将大部分内存留给 shared_buffers 和操作系统页面缓存,你根本负担不起 2,000 个空闲连接,甚至可能连 1,000 个都负担不起。
Memory is not even the whole story. Each backend process is also a thing the OS scheduler has to consider. Each backend has a slot in Postgres’ shared procarray, which is scanned during snapshot acquisition: a process that does nothing all day still gets walked over every time another transaction takes a snapshot. There’s a known cliff somewhere in the low thousands of connections where Postgres spends measurably more CPU just managing the procarray than running queries. The exact number depends on hardware and version, but as a rule of thumb, past a few hundred connections per CPU core, you are buying overhead and not throughput.
内存还不是全部。每个后端进程都是操作系统调度器必须考虑的对象。每个后端在 Postgres 的共享 procarray 中都有一个槽位,在获取快照时会被扫描:即使一个进程整天什么都不做,每当另一个事务获取快照时,它仍然会被遍历。在连接数达到几千时,存在一个已知的性能悬崖,此时 Postgres 花在管理 procarray 上的 CPU 开销比运行查询还要多。确切的数字取决于硬件和版本,但经验法则是:当每个 CPU 核心的连接数超过几百个时,你增加的只是开销,而不是吞吐量。
Then there is connection setup cost. Opening a fresh Postgres connection involves a TCP handshake, a TLS handshake (in any sane production setup), a fork on the server, a SCRAM exchange, sending startup parameters, loading the role’s config, and setting the search path. On a healthy LAN this takes a few milliseconds. On a TLS connection across regions, it can be 30 to 100 ms.
此外还有连接建立成本。建立一个新的 Postgres 连接涉及 TCP 握手、TLS 握手(在任何合理的生产环境中)、服务器端的 Fork 操作、SCRAM 认证交换、发送启动参数、加载角色配置以及设置 search_path。在健康的局域网中,这需要几毫秒;而在跨区域的 TLS 连接中,可能需要 30 到 100 毫秒。