Why Don’t Lowercase Letters Come Right After Uppercase Letters in ASCII?
Something finally clicked for me. If you look at an ASCII table, you will notice that after the uppercase Z, there are a few other characters before lowercase a:
| Decimal | Binary | Symbol |
|---|---|---|
| 88 | 01011000 | X |
| 89 | 01011001 | Y |
| 90 | 01011010 | Z |
| 91 | 01011011 | [ |
| 92 | 01011100 | \ |
| 93 | 01011101 | ] |
| 94 | 01011110 | ^ |
| 95 | 01011111 | _ |
| 96 | 01100000 | ` |
| 97 | 01100001 | a |
It made sense to me that characters are represented by numbers, since numbers are the only thing computers really know how to store and manipulate. So you need some kind of encoding that maps numbers to characters. ASCII was one of the earliest character encoding schemes, but it uses only 7 bits, which means it can represent just $2^7 = 128$ code points. That is not nearly enough for all the characters humans use, especially once you consider languages like Chinese, which has tens of thousands of characters. Nowadays we use Unicode as the standard character set, with encodings such as UTF-8 and UTF-16. The nice thing about Unicode is that its first 128 code points are the same as ASCII.
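To make the ASCII/Unicode overlap concrete, here is a minimal Python sketch. It assumes Python 3, where `ord` returns a Unicode code point, and the helper name `utf8_matches_ascii` is made up for this illustration.

```python
def utf8_matches_ascii(ch: str) -> bool:
    """True if ch is below code point 128 and its UTF-8 encoding is that single byte."""
    code_point = ord(ch)
    return code_point < 128 and ch.encode("utf-8") == bytes([code_point])


# Every one of the 128 ASCII code points is a single, identical byte in UTF-8.
print(all(utf8_matches_ascii(chr(c)) for c in range(128)))  # True

# A character outside ASCII does not fit in that single byte.
print(utf8_matches_ascii("中"))       # False
print(len("中".encode("utf-8")))      # 3
```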
With that context, I always found it strange that the designers of ASCII put 6 characters after uppercase Z before starting the lowercase letters. Then it hit me: the English alphabet has 26 letters, and with the 6 additional characters before lowercase starts, 26 + 6 = 32. If you know anything about computers, powers of 2 tend to stick out. Let’s take a look at the binary representations of some uppercase letters next to their lowercase counterparts.
| Decimal | Binary | Symbol |
|---|---|---|
| 65 | 01000001 | A |
| 97 | 01100001 | a |
| 66 | 01000010 | B |
| 98 | 01100010 | b |
| 67 | 01000011 | C |
| 99 | 01100011 | c |
Do you see it? Bit 5 (counting from the right, starting at 0) is the only bit that differs between an uppercase letter and its lowercase counterpart. That makes sense once you look at the place value of that bit in decimal:
$$ 32 = 2^5 $$
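The table only shows three letter pairs, so here is a small Python sketch that checks the same thing for all 26. The constant name `CASE_BIT` is my own label, not anything from the ASCII standard.

```python
import string

CASE_BIT = 1 << 5  # 32, the place value of bit 5

for upper, lower in zip(string.ascii_uppercase, string.ascii_lowercase):
    # XOR keeps only the bits where the two codes differ.
    assert ord(upper) ^ ord(lower) == CASE_BIT

print(format(ord("A"), "08b"))  # 01000001
print(format(ord("a"), "08b"))  # 01100001
```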
The number 32! Because of this, you can do some interesting bitwise operations. For example, to convert a letter to uppercase, you can do a bitwise AND with the bitwise NOT of 32:
Step 1: Bitwise NOT of 32 to create a mask
~ 0 0 1 0 0 0 0 0 (32)
-------------------
1 1 0 1 1 1 1 1 (mask)
Step 2: Bitwise AND ‘a’ with the mask
0 1 1 0 0 0 0 1 (97 = 'a')
& 1 1 0 1 1 1 1 1 (mask)
-------------------
0 1 0 0 0 0 0 1 (65 = 'A')
If you do this with an existing uppercase letter, it stays the same:
0 1 0 0 0 0 0 1 (65 = 'A')
& 1 1 0 1 1 1 1 1 (mask)
-------------------
0 1 0 0 0 0 0 1 (65 = 'A')
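The two steps above can be sketched in Python as follows. The function name `to_upper_ascii` is hypothetical, and the sketch assumes the input is a single ASCII letter.

```python
CASE_BIT = 32  # 0b00100000


def to_upper_ascii(ch: str) -> str:
    """Clear bit 5. Only meaningful for the ASCII letters A-Z and a-z."""
    # Python integers are not fixed at 8 bits, so ~CASE_BIT is -33,
    # which still acts as "all ones except bit 5" in a bitwise AND.
    return chr(ord(ch) & ~CASE_BIT)


print(to_upper_ascii("a"))  # A
print(to_upper_ascii("A"))  # A
```

The mask does not know what a letter is. It clears bit 5 of whatever you give it, so `to_upper_ascii("{")` returns `"["`. Check that the input is a letter first if that matters.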
If you want to lowercase a letter, you can do a bitwise OR with 32:
0 1 0 0 0 0 0 1 (65 = 'A')
| 0 0 1 0 0 0 0 0 (32)
-------------------
0 1 1 0 0 0 0 1 (97 = 'a')
Once again, doing this with an existing lowercase letter will keep it the same:
0 1 1 0 0 0 0 1 (97 = 'a')
| 0 0 1 0 0 0 0 0 (32)
-------------------
0 1 1 0 0 0 0 1 (97 = 'a')
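The same idea in Python, again as a sketch that assumes a single ASCII letter as input (`to_lower_ascii` is a made-up name):

```python
CASE_BIT = 32  # 0b00100000


def to_lower_ascii(ch: str) -> str:
    """Set bit 5. Only meaningful for the ASCII letters A-Z and a-z."""
    return chr(ord(ch) | CASE_BIT)


print(to_lower_ascii("A"))  # a
print(to_lower_ascii("a"))  # a
```

As with the uppercase trick, non-letters are not protected: `to_lower_ascii("@")` returns a backtick, because 64 | 32 = 96.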
If you want to flip the case, you can use a bitwise XOR with 32:
0 1 1 0 0 0 0 1 (97 = 'a')
^ 0 0 1 0 0 0 0 0 (32)
-------------------
0 1 0 0 0 0 0 1 (65 = 'A')

0 1 0 0 0 0 0 1 (65 = 'A')
^ 0 0 1 0 0 0 0 0 (32)
-------------------
0 1 1 0 0 0 0 1 (97 = 'a')
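Here is a short Python sketch of the XOR flip applied to a whole word. `flip_case_ascii` is a hypothetical helper, and the sketch assumes every character in the word is an ASCII letter.

```python
CASE_BIT = 32  # 0b00100000


def flip_case_ascii(ch: str) -> str:
    """Toggle bit 5. Only meaningful for the ASCII letters A-Z and a-z."""
    return chr(ord(ch) ^ CASE_BIT)


word = "Hello"
print("".join(flip_case_ascii(c) for c in word))  # hELLO
```

Flipping twice gets you back where you started, since XOR with the same value undoes itself.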
One last party trick: if you want a letter’s position in the alphabet, you can do a bitwise AND with 31:
0 1 0 0 0 0 0 1 (65 = 'A')
& 0 0 0 1 1 1 1 1 (31)
-------------------
0 0 0 0 0 0 0 1 (1)

0 1 1 1 1 0 1 0 (122 = 'z')
& 0 0 0 1 1 1 1 1 (31)
-------------------
0 0 0 1 1 0 1 0 (26)
This works because ANDing with 31 clears the top three bits and keeps only the lower five. In ASCII, the lower five bits of a letter match its position in the alphabet: A/a ends in 00001, B/b ends in 00010, and so on up to Z/z, which ends in 11010. Another way to think about it is that, for ASCII character codes, c & 31 is equivalent to c % 32, because 32 is a power of two. Masking with 31, which is binary 00011111, keeps only what is “left over” after whole groups of 32 are removed.
| Symbol | Decimal | Decimal % 32 |
|---|---|---|
| A | 65 | 1 |
| B | 66 | 2 |
| … | … | … |
| Z | 90 | 26 |
| a | 97 | 1 |
| b | 98 | 2 |
| … | … | … |
| z | 122 | 26 |
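A quick Python sketch to confirm both views of the trick, the mask and the modulo. The name `alphabet_index` is my own, and the function assumes its input is a single ASCII letter.

```python
import string


def alphabet_index(ch: str) -> int:
    """1-based position in the alphabet, for ASCII letters only."""
    return ord(ch) & 31  # keep the lower five bits


print(alphabet_index("A"))  # 1
print(alphabet_index("z"))  # 26

# Upper and lower case give the same index for every letter.
for position, (upper, lower) in enumerate(
    zip(string.ascii_uppercase, string.ascii_lowercase), start=1
):
    assert alphabet_index(upper) == alphabet_index(lower) == position

# For every ASCII code, masking with 31 matches the remainder modulo 32.
assert all(code & 31 == code % 32 for code in range(128))
```

Note that the index starts at 1, not 0, so subtract 1 if you want to use it to index into a 26-element list.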
Now you know why the designers of ASCII put those extra six characters before proceeding to lowercase.