Everything in C is undefined behavior

If he had been a programmer, Cardinal Richelieu would have said “Give me six lines written by the hand of the most expert C programmer in the world, and I will find enough in them to trigger undefined behavior”. Nobody can write correct C, or C++. And I say that as someone who’s written C and C++ on an almost daily basis for about 30 years. I listen to C++ podcasts. I watch C++ conference talks. I enjoy reading and writing C++. C++ has served us well, but it’s 2026, and the environment of 1985 (C++) or 1972 (C) is not the environment of today.

I’m definitely not the first to say this. I remember reading a post by someone prominent about a decade ago saying that a good case can be made that use of C++ is a SOX violation. And while I was not on board with the rest of their rant (nor their confusion about “its” vs “it’s”), I never disagreed about that point. With time I found it to be more and more true. WAY more things are undefined behavior (UB) than you’d expect. Everyone knows that double-free, use after free, accessing outside the bounds of an object (e.g. an array), and accessing uninitialized memory are UB. After all, C & C++ are not memory safe languages. And yet we as an industry seem to be unable to stop making even those mistakes over and over.

But there’s more. More subtle. More illogical. It’s not about optimizations. Some people seem to think that as long as they don’t compile with optimizations turned on, undefined behavior can’t hurt them. They believe that the compiler is somehow being deliberately hostile, going “AHA! UB! I can do whatever I want here!”, and that without optimizations turned on it won’t. This is incorrect. UB doesn’t mean that the compiler can take advantage of your sloppiness. UB means that the compiler can assume that your code is valid. It means that the intention of your code, oh so obvious when read by a human, doesn’t even have a way to be expressed between compiler stages or modules.

UB means that the compiler doesn’t even have to implement some special cases in its code generation, because they “can’t happen”. The compiler, and really the underlying hardware too, is playing a game of telephone with your UB intentions. It may end up with what you wanted, but there’s no guarantee for now or in the future.

UB is everywhere

The following is not an attempt at enumerating all the UB in the world. It’s merely making the case that UB is everywhere, and if nobody can do it right, how is it even fair to blame the programmer? My point is that ALL nontrivial C and C++ code has UB.

Accessing an object which is not correctly aligned

As an example of this, take this code:

    int foo(const int* p) { return *p; }

If this function is called with a pointer that is not correctly aligned (where “correctly aligned” probably means an address that’s a multiple of sizeof(int), but who knows), this is UB (C23 6.3.2.3). On Linux Alpha, in some cases this would merely trap to the kernel, which would software emulate what you intended. In other cases it would (probably) crash your program with a SIGBUS. On SPARC it would cause a SIGBUS.

Sure, on x86/amd64 (henceforth just “x86”) this is likely fine. Hell, it’s probably even an atomic read. x86 is famously extremely forgiving about cache coherency subtleties. So here we have three cases: kernel gave a helping hand (Alpha for some loads), crash (other Alpha loads, and SPARC), not a problem (x86). What about ARM, RISC-V, and others? What about future architectures? A future architecture could even have special int-pointer registers that do not populate the lowest bits, because such pointers cannot exist. Even if it works, maybe the compiler one day changes from using one load instruction to another, and suddenly that’s no longer fixed up by the kernel. Because the compiler is not obligated to generate assembly instructions that work on unaligned pointers. Because it’s UB.

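One commonly used way to stay out of this particular UB is to never dereference a possibly misaligned int* at all, and instead copy the bytes into a properly aligned local. What follows is a minimal sketch, not a definitive fix, assuming the bytes really do hold an int in host byte order. The name load_int_unaligned is made up for this example.

```cpp
#include <cstring>

// Hypothetical helper, not from the original post: reads an int from an
// address that may not be suitably aligned for int.
// Assumes at least sizeof(int) readable bytes at `src`, holding an int
// in host byte order.
int load_int_unaligned(const void* src) {
  int value;
  // memcpy operates on bytes, so it places no alignment requirement on `src`.
  std::memcpy(&value, src, sizeof value);
  return value;
}
```

Compilers commonly turn a fixed-size memcpy like this into a single load on targets that tolerate unaligned access, but that is an optimization, not a promise.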

Or how about this:

    void set_it(std::atomic<int>* p) { p->store(123); }
    int get_it(std::atomic<int>* p) { return p->load(); }

Is this operation atomic when the object is not correctly aligned? That’s the wrong question to ask. Mu, unask the question. It’s UB. (But also yes, in practice this can easily be an atomicity problem.) If you want to get even more convinced, you can try thinking about what happens if an object you thought you were reading atomically spans pages. But don’t think too much about it, or you may conclude that “it’s fine”. It’s not. It’s UB.

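One way a misaligned std::atomic<int>* can come about is by casting into a raw byte buffer. A minimal sketch of the boring alternative, assuming you control where the atomic lives: declare it as an actual std::atomic<int> object and let the compiler place it. Shared and roundtrip are names invented for this example.

```cpp
#include <atomic>

// Hypothetical example, not from the original post. Because `counter` is
// declared as a std::atomic<int> object, the compiler inserts whatever
// padding is needed after `tag` to align it suitably for that type.
struct Shared {
  char tag;                     // one-byte member placed before the atomic
  std::atomic<int> counter{0};
};

void set_it(std::atomic<int>* p) { p->store(123); }
int get_it(std::atomic<int>* p) { return p->load(); }

int roundtrip() {
  Shared s;
  set_it(&s.counter);  // pointer to a real atomic object, so it is aligned
  return get_it(&s.counter);
}
```

This only covers alignment. Every other way to end up with an invalid pointer is still available.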

Actually, it was UB even before that

Don’t blame the foo() function above. The act of dereferencing the pointer wasn’t the problem. Merely creating the pointer was enough to be a problem. Example:

    bool parse_packet(const uint8_t* bytes) {
      const int* magic_intp = (const int*)bytes;  // UB!
      int magic_raw = foo(magic_intp);            // Probably crashes on SPARC.
      int magic = ntohl(magic_raw);               // this is fine, at least.
      […]
    }

That cast is the problem, not foo(). It’s perfectly valid for the compiler to assign specific meaning, such as garbage collection or security tagging bits, to the lower bits of an int*.

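A minimal sketch of one way to get at the magic value without ever creating an int*, assuming the wire format stores it as a 32-bit big-endian integer (which the ntohl() call above suggests). The names read_be32, has_magic and kMagic are invented for this example, and kMagic is an arbitrary value.

```cpp
#include <cstdint>

// Hypothetical helper, not from the original post: assembles a 32-bit
// big-endian (network order) value from four bytes.
// Assumes at least 4 readable bytes at `bytes`.
std::uint32_t read_be32(const std::uint8_t* bytes) {
  // Each byte is widened to uint32_t before shifting. Without the cast the
  // byte would be promoted to (signed) int, and shifting a value >= 0x80
  // left by 24 could overflow it.
  return (static_cast<std::uint32_t>(bytes[0]) << 24) |
         (static_cast<std::uint32_t>(bytes[1]) << 16) |
         (static_cast<std::uint32_t>(bytes[2]) << 8) |
         static_cast<std::uint32_t>(bytes[3]);
}

// Made-up magic number, for illustration only.
constexpr std::uint32_t kMagic = 0xCAFEF00Du;

bool has_magic(const std::uint8_t* bytes) {
  return read_be32(bytes) == kMagic;
}
```

Reading byte by byte sidesteps both the alignment question and the host byte order, so no ntohl() is needed in this version.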

isxdigit() on char input

    bool bar(char ch) { return isxdigit(ch); }

isxdigit() is a simple function that takes a character and returns nonzero if it’s a hex digit: 0-9, a-f, or A-F. It can also take the value EOF. Uh, ok. What value is EOF? Per C23 7.4p1 we know it’s an int, and we can infer that it’s not representable by unsigned char. isxdigit() therefore takes an int, not a char. All values of char fit inside int, so we should be fine. Converting from char to int preserves the value, so per section 6.3.1.3 we’re fine, right? No. Because if bar() is called with a value outside 0-127, and on your architecture char is signed (implementation defined, per 6.2.5, paragraph 20 in C23), then the integer value ends up negative. A negative value that isn’t EOF is neither of the two things 7.4p1 allows, so the call is UB. And the following is a valid implementation of isxdigit() that would cause a read of who-knows-what memory. It could even be I/O mapped memory, triggering things that are more than merely getting a random value or a crash. It could cause the motor to start.
