Kernel Asynchronous Reads in PostgreSQL 19 (io_uring)
In the previous post, I ran a query that benefits from Asynchronous Sequential Scan. With the worker I/O method, the OS-level read calls remain synchronous (pread64(), preadv()): PostgreSQL's I/O workers issue them on behalf of the backends, and PostgreSQL itself manages the asynchronous I/O queues. Linux also provides asynchronous buffered I/O in the kernel, which PostgreSQL can use directly through the io_uring system calls.
In this post, I run the same query with the io_uring I/O method instead of worker. I am running inside a Docker container, where Secure Computing Mode (seccomp) blocks the io_uring system calls, so I started the container with seccomp disabled:
docker run -d --name pg19 \
-p 5432:5432 \
-e POSTGRES_PASSWORD=xxx \
--security-opt seccomp=unconfined \
postgres:19beta1 \
-c io_method=io_uring
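To check from inside a container whether the io_uring system calls are permitted, one option is to issue a deliberately invalid io_uring_setup call and look at the error code. The Python sketch below is not part of the original test. It assumes syscall number 425 (io_uring_setup on x86_64 and aarch64 Linux), and the names classify and probe are made up for illustration. EPERM typically means the call was blocked (by a seccomp profile or by the kernel.io_uring_disabled sysctl), while EFAULT or EINVAL means the kernel accepted the call and only rejected the bogus arguments.

```python
import ctypes
import errno
import os

# Assumption: syscall number of io_uring_setup on x86_64 and aarch64 Linux.
SYS_IO_URING_SETUP = 425


def classify(err: int) -> str:
    """Map the errno of an invalid io_uring_setup call to a verdict."""
    if err == errno.EPERM:
        return "blocked (seccomp profile or kernel.io_uring_disabled)"
    if err == errno.ENOSYS:
        return "not supported by this kernel"
    if err in (errno.EFAULT, errno.EINVAL):
        # The kernel got as far as validating arguments: the call is permitted.
        return "allowed"
    return f"unclear ({errno.errorcode.get(err, err)})"


def probe() -> str:
    """Call io_uring_setup(0, NULL), which cannot succeed, and classify errno."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    rc = libc.syscall(
        ctypes.c_long(SYS_IO_URING_SETUP),
        ctypes.c_long(0),        # entries=0 is invalid
        ctypes.c_void_p(None),   # params=NULL is invalid
    )
    if rc >= 0:  # not expected; close the ring descriptor defensively
        os.close(rc)
        return "allowed"
    return classify(ctypes.get_errno())


if __name__ == "__main__":
    print("io_uring:", probe())
```

Run inside the container, this should report "blocked" under a seccomp profile that denies io_uring and "allowed" when started with seccomp=unconfined, assuming the profile denies the call with EPERM.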
I connected (PGUSER=postgres PGPASSWORD=xxx PGHOST=localhost psql) and checked the configuration:
postgres=# \dconfig io_*
     List of configuration parameters
         Parameter         |  Value
---------------------------+----------
 io_combine_limit          | 128kB
 io_max_combine_limit      | 128kB
 io_max_concurrency        | 64
 io_max_workers            | 8
 io_method                 | io_uring
 io_min_workers            | 2
 io_worker_idle_timeout    | 1min
 io_worker_launch_interval | 100ms
(8 rows)
This is the same configuration as in the previous post, except for io_method. I run the same query that benefits from I/O combining (io_combine_limit), not the one that reads large TOASTed documents:
postgres=# explain (analyze, buffers, io, costs off) select count(*),avg(length(data)) from smalldocs;
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Finalize Aggregate (actual time=941.539..943.440 rows=1.00 loops=1)
   Buffers: shared hit=15019 read=131281 dirtied=801 written=432
   ->  Gather (actual time=941.398..943.428 rows=3.00 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=15019 read=131281 dirtied=801 written=432
         ->  Partial Aggregate (actual time=939.501..939.502 rows=1.00 loops=3)
               Buffers: shared hit=15019 read=131281 dirtied=801 written=432
               ->  Parallel Seq Scan on smalldocs (actual time=0.033..155.375 rows=341333.33 loops=3)
                     Prefetch: avg=74.32 max=91 capacity=94
                     I/O: count=8247 waits=54 size=15.92 in-progress=4.97
                     Buffers: shared hit=15019 read=131281 dirtied=801 written=432
                     Worker 0: Prefetch: avg=74.41 max=91 capacity=94 I/O: count=2695 waits=30 size=15.93 in-progress=4.98
                     Worker 1: Prefetch: avg=73.99 max=91 capacity=94 I/O: count=2760 waits=13 size=15.88 in-progress=4.95
 Planning:
   Buffers: shared hit=5
 Planning Time: 0.094 ms
 Execution Time: 943.470 ms
(20 rows)
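The numbers on the I/O line can be cross-checked against the Buffers line with a little arithmetic. The sketch below assumes that size is the average number of blocks per I/O and that the block size is the default 8kB. Neither is stated in the output itself, so treat both as assumptions. Under them, count × size comes to about 131,292 blocks, within rounding of the 131,281 blocks reported as read, and the average I/O is just under the 16 blocks allowed by io_combine_limit=128kB.

```python
# Values copied from the \dconfig output and the EXPLAIN plan above.
BLOCK_SIZE_KB = 8            # assumption: default PostgreSQL block size
io_combine_limit_kb = 128    # io_combine_limit

io_count = 8247              # I/O: count
avg_io_size_blocks = 15.92   # I/O: size (assumed to be in blocks)
io_waits = 54                # I/O: waits
blocks_read = 131281         # Buffers: shared read

# Largest I/O that combining can produce, in blocks.
max_blocks_per_io = io_combine_limit_kb // BLOCK_SIZE_KB

# If size is in blocks, count * size should approximate the blocks read.
estimated_blocks = io_count * avg_io_size_blocks
relative_gap = abs(estimated_blocks - blocks_read) / blocks_read

# Share of I/Os for which the scan had to wait.
wait_ratio = io_waits / io_count

print(f"combine limit:    {max_blocks_per_io} blocks per I/O")
print(f"estimated blocks: {estimated_blocks:.0f} (reported: {blocks_read})")
print(f"relative gap:     {relative_gap:.4%}")
print(f"I/Os waited on:   {wait_ratio:.2%}")
```

Fewer than 1% of the I/Os involved a wait, which suggests the prefetch distance was large enough to keep reads ahead of the scan for almost the entire run.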
This plan is similar to the previous one because I/O combining and prefetching, shown in the Prefetch and I/O lines, work the same way with both the worker and io_uring methods. The difference is that I no longer see any "postgres: io worker" processes, because the kernel now performs the asynchronous reads. I used strace on the PostgreSQL backend and the parallel workers:
# echo 3 | sudo tee /proc/sys/vm/drop_caches && strace -fyye trace=io_uring_setup,io_uring_enter,io_uring_register -s 0 -qq \
-p $(pgrep -fd, "postgres: ") -T -o /dev/stdout
(Trace output omitted for brevity…)
The syscall is io_uring_enter(fd, to_submit, min_complete, flags, sig, sigsz), so:
io_uring_enter(4, 1, 0, 0, NULL, 8) = 1 means: "Kernel, here is one new I/O request from my submission queue. I do not want to wait." The return value confirms that the kernel consumed one submission.
io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 0 means: "I am not submitting anything. Wait until at least one completion is available." The return value is zero because this call made no new submissions. The short elapsed time (reported by strace -T) shows that the completion became available quickly.
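The same reading can be applied mechanically to every line of a trace. Here is a minimal Python sketch that decodes one io_uring_enter line as printed by strace. The function name describe and the wording of its output are illustrative choices, and the regular expression only covers the argument layout shown above.

```python
import re

# Matches lines such as: io_uring_enter(4, 1, 0, 0, NULL, 8) = 1
# The fd may carry a strace -yy annotation like 4<anon_inode:[io_uring]>.
CALL_RE = re.compile(
    r"io_uring_enter\((?P<fd>\d+)[^,]*, (?P<to_submit>\d+), "
    r"(?P<min_complete>\d+), (?P<flags>[^,]+), .*\)\s*=\s*(?P<ret>-?\d+)"
)


def describe(line: str) -> str:
    """Summarize what one io_uring_enter call asked for and what it returned."""
    m = CALL_RE.search(line)
    if m is None:
        raise ValueError(f"not an io_uring_enter line: {line!r}")
    to_submit = int(m["to_submit"])
    min_complete = int(m["min_complete"])
    ret = int(m["ret"])
    flags = m["flags"].split("|")
    # The call only waits when GETEVENTS is set and min_complete is non-zero.
    waits = "IORING_ENTER_GETEVENTS" in flags and min_complete > 0

    parts = [
        f"submit {to_submit}" if to_submit else "submit nothing",
        f"wait for >= {min_complete} completion(s)" if waits else "do not wait",
        f"kernel consumed {ret} submission(s)" if ret >= 0 else "call failed",
    ]
    return "; ".join(parts)


for sample in (
    "io_uring_enter(4, 1, 0, 0, NULL, 8) = 1",
    "io_uring_enter(4, 0, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 0",
):
    print(describe(sample))
```

Feeding the strace output through a function like this makes it easy to count how many calls only submit and how many actually wait.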
The io_uring trace reveals that PostgreSQL does not block on each read as it issues it. Instead, the backend keeps submitting requests via io_uring_enter(..., 1, 0, ...) and retrieves completed requests from the completion queue, calling io_uring_enter with IORING_ENTER_GETEVENTS only when it has to wait for a completion.
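To make the submit-then-reap pattern concrete, here is a toy Python model of the two queues. It is a sketch for illustration, not PostgreSQL's implementation or the real kernel interface: ToyRing and its methods are hypothetical names, and "the kernel" is simulated by an explicit kernel_completes() call. The one real property it relies on is that the io_uring completion queue lives in memory shared with the process, so completions that have already arrived can be picked up without a system call.

```python
from collections import deque


class ToyRing:
    """Toy stand-in for an io_uring instance (not the real kernel interface)."""

    def __init__(self):
        self.sq = deque()         # submission queue: prepared requests
        self.in_flight = deque()  # consumed by the "kernel", not finished yet
        self.cq = deque()         # completion queue: finished requests
        self.enter_calls = 0      # number of simulated io_uring_enter calls

    def prepare(self, request):
        """Queue a request in user space; no syscall involved."""
        self.sq.append(request)

    def kernel_completes(self, n):
        """Simulate the kernel finishing up to n reads in the background."""
        for _ in range(min(n, len(self.in_flight))):
            self.cq.append(self.in_flight.popleft())

    def enter(self, to_submit, min_complete):
        """Mimic io_uring_enter: return the number of submissions consumed."""
        self.enter_calls += 1
        consumed = 0
        while self.sq and consumed < to_submit:
            self.in_flight.append(self.sq.popleft())
            consumed += 1
        # "Block" until enough completions exist by finishing in-flight reads.
        while len(self.cq) < min_complete and self.in_flight:
            self.kernel_completes(1)
        return consumed

    def reap(self):
        """Take one completion from the queue without any syscall."""
        return self.cq.popleft() if self.cq else None


ring = ToyRing()
for first_block in (0, 16, 32):
    ring.prepare(f"read blocks {first_block}..{first_block + 15}")
    assert ring.enter(to_submit=1, min_complete=0) == 1  # submit, do not wait

ring.kernel_completes(2)            # two reads finish while the backend works
done = [ring.reap(), ring.reap()]   # reaped without entering the kernel
assert ring.reap() is None          # the third read is still in flight

assert ring.enter(to_submit=0, min_complete=1) == 0  # wait for one completion
done.append(ring.reap())
print(done, "enter calls:", ring.enter_calls)
```

In this model, three reads cost three submitting calls and a single waiting call. This mirrors the two shapes of io_uring_enter seen in the trace.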