The Joy of Typing
The Joy of Typing: A Practical Guide to Modern Type Annotations in Python for Data Science
In data science, as in life, it’s important to know what you’re working with. At first glance, Python’s dynamic type system makes this difficult. A type is a promise about the values an object can hold and the operations that apply to it: an integer can be multiplied or compared, a string concatenated, a dictionary indexed by key.
Many languages check these promises before the program runs. Rust and Go catch type mismatches at compile time and refuse to produce a runnable binary if the checks fail; TypeScript runs its checks during a separate compile step. Python, by default, performs no such check ahead of time: the interpreter only notices a type problem when an operation actually fails, so the consequences play out at runtime.
In Python, a name binds only to a value. The name itself carries no commitment about the value’s type, and the next assignment can replace the value with one of a completely different kind. A function will accept whatever you pass it and return whatever its body produces; if the type of either is not what you intended, the interpreter will not say so.
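A minimal sketch of this behaviour follows. The function `double` and the variable names are hypothetical, chosen only for illustration; the point is that neither the rebinding nor the call draws any complaint, and the error appears one step later.

```python
def double(x):
    # No annotations: the function accepts anything that supports `* 2`.
    return x * 2


value = 21      # the name is bound to an int...
value = "21"    # ...and can be rebound to a str without complaint

# The call succeeds, but it performs string repetition, not arithmetic.
result = double(value)

# Only a later operation that str does not support raises an exception.
try:
    total = result + 1
except TypeError as exc:
    error = str(exc)

print(repr(result))
print(error)
```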
The mismatch only surfaces as an exception later, if at all, when code downstream performs an operation the actual type doesn’t support: arithmetic on a string, a method call on the wrong kind of object, a comparison that quietly evaluates to something nonsensical. This leniency is often a strength: it suits rapid prototyping and the kind of exploratory, notebook-driven work where the shape of a value is something you discover as you go.
But in machine learning and data science workflows, where pipelines are long and a single unexpected type can silently break a downstream step or produce meaningless results, the same flexibility becomes a serious liability. Modern Python’s response to this is type annotations. The annotation syntax itself dates back to Python 3.0 (PEP 3107); PEP 484, shipped with Python 3.5, standardized its use for type hints and added the typing module. Annotations are syntax for specifying the types you intend.
You attach type information to a function’s parameters with colons and to its return value with an arrow:
def scale_data(x: float) -> float:
    return x * 2
The annotation is not enforced at runtime. Calling scale_data("123") raises no error in the interpreter; the function dutifully concatenates the string with itself and returns “123123”. What catches the mismatch is a separate piece of software, called a static type checker, which reads the annotations and verifies them before the code runs:
scale_data(x="123") # Type error! Expected float, got str
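The runtime half of this can be confirmed directly. The sketch below reuses `scale_data` from above and shows that the interpreter stores the hints on the function object but never compares arguments against them; the variable names `hints` and `result` are illustrative only.

```python
def scale_data(x: float) -> float:
    return x * 2


# The hints are kept as plain metadata on the function object.
hints = scale_data.__annotations__

# A static type checker would flag this call; the interpreter runs it anyway.
result = scale_data("123")

print(hints)
print(repr(result))
```

Nothing in this script fails. The mismatch only becomes visible when a checker such as mypy or pyright analyses the file.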
Static checkers integrate directly with the editor, flagging mismatches as you write. Alongside established tools like mypy and pyright, a newer generation of Rust-based checkers (Astral’s ty, Meta’s Pyrefly, and the now open-source Zuban) is pushing performance much further, making full-project analysis feasible even on large codebases.
This model is deliberately separate from Python’s runtime. Type hints are optional, and checking happens ahead of execution rather than during it. As PEP 484 puts it: “Python will remain a dynamically typed language, and the authors have no desire to ever make type hints mandatory, even by convention.”
The reason is historical as much as philosophical. Python grew up as a dynamically typed language, and by the time PEP 484 arrived there were decades of untyped code in the wild. Making hints mandatory would have broken that code overnight. A type checker does not execute your program or enforce type correctness while it runs. Instead, it analyses the source code statically, identifying places where your code contradicts its own declared intent.
Some of these mismatches would eventually raise exceptions; others would silently produce the wrong result. Either way, they become visible immediately. A mismatched argument that might otherwise surface hours into a pipeline run is caught at the point of writing. Annotations make a function’s expectations explicit: they document its inputs and outputs, reduce the need to inspect its body, and force decisions about edge cases before runtime. Once you’re used to it, adding type annotations can be highly satisfying, and even fun!
Making structure explicit
Dictionaries are the workhorse of Python data work. Rows from a dataset, configuration objects, API responses: all are routinely represented as dicts with known keys and value types. TypedDict (PEP 589) provides a lightweight way to write such a schema down:
from typing import TypedDict

class SensorReading(TypedDict):
    timestamp: float
    temperature: float
    pressure: float
    location: str

def process_reading(reading: SensorReading) -> float:
    return reading["temperature"] * 1.8 + 32
    # return reading["temp"]  # Type error: no such key
At runtime, a SensorReading is just a regular dict, so it adds no performance overhead. But your type checker now knows the schema, which means typos in key names get caught immediately rather than surfacing as KeyErrors in production. The PEP highlights JSON objects as the canonical use case. This points to a deeper reason TypedDict matters in data work: it lets you describe the shape of data you do not own, such as the responses that come back from an API, the rows that arrive from a CSV, or the documents you pull from a database, without having to wrap them in a class first.
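The JSON case can be sketched as follows. The payload string is made up to stand in for an API response, and `to_fahrenheit` is a hypothetical helper. Note that `typing.cast` only informs the type checker; it validates nothing at runtime, so malformed data would still need a separate validation step.

```python
import json
from typing import TypedDict, cast


class SensorReading(TypedDict):
    timestamp: float
    temperature: float
    pressure: float
    location: str


def to_fahrenheit(reading: SensorReading) -> float:
    return reading["temperature"] * 1.8 + 32


# Hypothetical payload standing in for an API response.
payload = (
    '{"timestamp": 1700000000.0, "temperature": 20.0, '
    '"pressure": 1013.2, "location": "lab-3"}'
)

# json.loads returns untyped data; cast() tells the checker which shape
# to assume. No runtime check or conversion takes place.
reading = cast(SensorReading, json.loads(payload))

print(type(reading) is dict)  # the value is still a plain dict
print(to_fahrenheit(reading))
```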
PEP 655 (Python 3.11) added NotRequired for keys that may be absent, and PEP 705 (Python 3.13) added ReadOnly for items that must not be reassigned, both useful for nested structures from APIs or database queries. TypedDict is structurally typed rather than closed: by default a dict can carry extra keys you didn’t list and still satisfy the type, which is a deliberate choice for interoperability but occasionally surprising. PEP 728, accepted in 2025 and targeting Python 3.15, lets you declare a TypedDict with closed=True, which makes any unlisted key a type error.