Premature Optimization is Fun Sometimes (2025)

有时，过早优化也很有趣 (2025)

A colleague of mine was recently discussing a connectivity monitoring system he is working on with me. It’s nothing fancy, just sending ICMP Echo Requests to a couple of different servers, and monitoring latency and dropped packet averages over 1-minute, 5-minute, and 15-minute periods. 最近，我的一位同事和我讨论了他正在开发的一个连接监控系统。这并不是什么复杂的东西，只是向几台不同的服务器发送 ICMP 回显请求（Echo Requests），并监控 1 分钟、5 分钟和 15 分钟周期内的延迟和丢包率。

Up came the topic of how this data should be stored, the natural thought was a 512 entry ring buffer, containing entries like the following: 谈到这些数据该如何存储时，很自然地想到了一个包含 512 个条目的环形缓冲区（ring buffer），其中包含如下条目：

struct ping_timestamp {
    uint64_t sent_ns;     // when the packed was sent (unit: ns)
    uint64_t received_ns; // when the packet was received (unit: ns)
    in_addr_t source_addr; // source address
    uint16_t seq_no;      // echo request sequence number
    bool received;        // has the request been received?
};

And backed by the following array: 并由以下数组支持：

struct ping_timestamp pings_rb[512];
// ...
printf("%zu\n", sizeof(pings_rb)); // 12288

12 KiB. Pretty wasteful, right? We can certainly do better. Do we need to keep fields for both sent and received? What we’re really interested is the latency. We need to know when a packet was sent, only up until we know when it was received, at that point, the data we want to keep is received - sent, so why don’t we make it a tagged union? 12 KiB。相当浪费，对吧？我们肯定可以做得更好。我们需要同时保留发送和接收的时间字段吗？我们真正感兴趣的是延迟。我们只需要知道数据包何时发送，直到我们知道它何时接收为止；在那一刻，我们想要保留的数据就是“接收时间 - 发送时间”，所以为什么不把它做成一个带标签的联合体（tagged union）呢？

struct ping_timestamp_2 {
    union {
        uint64_t sent_ts;    // unit: 100μs
        uint64_t elapsed_ts; // unit: 100μs
    };
    in_addr_t source_addr;
    uint16_t seq_no;
    bool received;
};
// ...
printf("%zu\n", sizeof(pings_rb)); // 8192

Not bad, we’ve shaved off an entire page. We can still do better. Nanosecond precision? In our case, ping times are measured in the tens or hundreds, or even thousands of milliseconds. We don’t need to keep all of those extra bits around. If we change the unit from nanosecond to 100 microsecond increments (0.1ms), then 43-bits is sufficient for us to keep track of pings for up to 20-years. 20-years is a bit excessive still, but it doesn’t hurt to be at least a little bit future proof. And received? 8-bit for a true/false value seems altogether too much. The answer: bitfields. 还不错，我们省下了一整个内存页。我们还可以做得更好。纳秒精度？在我们的案例中，ping 时间是以几十、几百甚至几千毫秒来衡量的。我们不需要保留所有那些多余的位。如果我们把单位从纳秒改为 100 微秒（0.1ms），那么 43 位就足以让我们记录长达 20 年的 ping 数据。20 年确实有点过长，但至少在未来兼容性上做点准备也没坏处。至于“是否接收”？用 8 位来表示真/假值似乎太浪费了。答案是：位域（bitfields）。

struct ping_timestamp_3 {
    uint64_t sent_or_elapsed_ts: 43;
    uint64_t received: 1;
    uint64_t seq_no: 16;
    in_addr_t source_addr;
};
// ...
printf("%zu\n", sizeof(pings_rb)); // 8192

Wait, what? Why haven’t we saved any space? The answer is struct padding. 等等，什么情况？为什么我们没有节省任何空间？答案是结构体填充（struct padding）。

The layout of ping_timestamp_2 looks like this: ping_timestamp_2 的布局如下所示：

0      8      16     24     32     40     48     56     64...
+------+------+------+------+------+------+------+------+
| sent/elapsed                                          |
+------+------+------+------+------+------+------+------+
...64    72     80     88     96     104    112    120    128
+------+------+------+------+------+------+------+------+
| source_address      | seq_no | recv?| pad           |
+-----------------------------------------------------+

Where the padding byte at the end is to ensure alignment requirements. ping_timestamp_3 on the other end, looks like this: 末尾的填充字节是为了确保对齐要求。而 ping_timestamp_3 的布局如下：

0      8      16     24     32     40     R?     48     56     64...
+------+------+------+------+------+------+------+------+
| sent/elapsed               || seq_no | P           |
+------+------+------+------+------+------+------+------+
...64    72     80     88     96     104    112    120    128
+------+------+------+------+------+------+------+------+
| source_address      | padding                             |
+-----------------------------------------------------+

So our optimization there didn’t actually save any space. We’re wasting 36-bits of padding. Is there any way we can somehow do better? We keep track of the source address due to frequent changes while our product is in operation (on a mobile data network). When the address changes, we also reset the sequence number for reasons that aren’t relevant to the current topic. We have seen, in the past, packets with differing source addresses but identical sequence numbers due to be processed by our application at the same time (the joys of asynchronous programming), so the source address serves to disambiguate these changes. 所以我们那里的优化实际上并没有节省任何空间。我们浪费了 36 位的填充空间。有什么办法能做得更好吗？我们之所以要追踪源地址，是因为我们的产品在运行过程中（在移动数据网络上）地址会频繁变化。当地址改变时，我们也会重置序列号，原因与当前主题无关。过去我们曾遇到过这种情况：由于异步编程的“乐趣”，不同源地址但序列号相同的数据包同时被我们的应用程序处理，因此源地址可以用来消除这些歧义。

But there’s another way to disambiguate. An ICMP echo request has a 16-bit long identifier field to allow applications to identify which echo request packets were sent by them. Its value is completely arbitrary. On Linux iputils ping sets it to getpid() & 0xFFFF; on OpenBSD a random number is used instead. Although it’s 16-bits long, we don’t actually need to use the full 16-bits. There’s 4 free bits left in the first 8-bytes of our ping_timestamp_3. Our thought was to use a rolling 4-bit counter, that is increased whenever our source address changes (this is monitored elsewhere in the application), allowing us to uniquely identify which source address the packet came from. 但还有另一种消除歧义的方法。ICMP 回显请求有一个 16 位的标识符字段，允许应用程序识别哪些回显请求数据包是由它们发送的。它的值完全是任意的。在 Linux 上，iputils ping 将其设置为 getpid() & 0xFFFF；而在 OpenBSD 上则使用随机数。虽然它有 16 位长，但我们实际上并不需要使用完整的 16 位。在 ping_timestamp_3 的前 8 个字节中还有 4 个空闲位。我们的想法是使用一个 4 位的滚动计数器，每当源地址改变时（这在应用程序的其他地方进行监控），计数器就会增加，从而让我们能够唯一地识别数据包来自哪个源地址。

Our final struct looks like this: 我们最终的结构体如下所示：

struct ping_timestamp {
    uint64_t elapsed_or_sent_ts : 43;
    uint64_t received : 1;
    uint64_t counter: 4;
    uint64_t seq_no: 16;
};
// ...
printf("%zu\n", sizeof(pings_rb)); // 4096

Much better. A whole 8-kilobytes of savings, and down to a single page of data. You may have noticed that I have changed the order of the fields slightly. This is to line seq_no up on a 16-bit boundary, so that loading it is a single ldrh instruction rather than require a shift. Similarly, reading from elapsed_or_sent_ts only requires a mask. In the end, this was a completely pointless exercise. Our application isn’t remotely memory constrained. But it was fun. 好多了。整整节省了 8 KB，数据量缩减到了一个内存页。你可能已经注意到我稍微改变了字段的顺序。这是为了让 seq_no 对齐在 16 位边界上，这样加载它只需要一条 ldrh 指令，而不需要移位操作。同样，读取 elapsed_or_sent_ts 也只需要一个掩码。最后，这完全是一次毫无意义的练习。我们的应用程序根本不存在内存限制问题。但它很有趣。

Addendum 2025-06-21

附录 2025-06-21

I realized there’s a way to “optimize” this slightly further. By switching the order of the received/counter fields, accessing the received bit only requires a shift instruction rather than a shift and a mask: 我意识到还有一种方法可以进一步“优化”。通过交换 received 和 counter 字段的顺序，访问 received 位只需要一条移位指令，而不需要移位加掩码：

struct ping_timestamp {
    uint64_t elapsed_or_sent_ts : 43;
    uint64_t counter: 4;
    uint64_t received : 1;
    uint64_t seq_no: 16;
};

Addendum 2025-06-22

附录 2025-06-22

There’s a slight “issue” with the above code: received is now much cheaper to access, at the expense of counter needing to mask out the received bit. But we can fix this! We only ever read counter when received is true, i.e. 1. If received were zero, and we could tell the compiler to assume that it were zero, then no mask would be necessary. The solution? Flip the meaning of the received bit. 上面的代码有一个小“问题”：现在访问 received 的代价更低了，但代价是 counter 需要屏蔽掉 received 位。但我们可以解决这个问题！我们只在 received 为真（即 1）时才读取 counter。如果 received 为零，并且我们可以告诉编译器假设它为零，那么就不需要掩码了。解决方案是什么？反转 received 位的含义。

struct ping_timestamp {
    uint64_t elapsed_or_sent_ts : 43;
    // ...