I Reduced My Pandas Runtime by 95% — Here’s What I Was Doing Wrong

Most slow Pandas code “works”, until it doesn’t. Learn how to spot hidden bottlenecks, avoid costly row-wise operations, and know when Pandas is no longer enough.

I’ve been learning Pandas for some time now. Nothing too crazy though. Just basic data cleaning, exploratory data analysis, and some essential functions. I’ve also explored things like method chaining for cleaner, more organized code, and operations that silently break your Pandas workflow, both of which I’ve written about before.

I never really thought about runtime. Honestly, if my code ran without errors and gave me the output I needed, I was happy. Even if it took a few minutes for all my notebook cells to finish, I didn’t care. No errors meant no problems, right?

Then I came across the concept of vectorization. And something clicked. I went down the rabbit hole, as I usually do. The more I read, the more I realized that “no errors” and “efficient code” are two very different things. Your Pandas code can be completely correct and still be quietly terrible at scale.

So this article is me documenting what I found: the mistakes that slow Pandas code down, why they happen, how to fix them, and when Pandas itself might be the bottleneck. If you’ve ever run a notebook and just assumed the wait time was normal, this one’s for you.

Why “Working Code” Isn’t Good Enough

There’s a reason this took me a while to think about. Pandas is designed to be forgiving. You can write code in a dozen different ways and most of them will work. You get your output, your dataframe looks right, and you move on. But that flexibility comes with a hidden cost.

Unlike SQL or production-grade data systems, Pandas doesn’t force you to think about efficiency. It doesn’t warn you when you’re doing something expensive. It just… does it. Slowly, sometimes. But it does it.

Think about it this way. SQL has a query optimizer. It looks at what you’re asking for and figures out the most efficient way to get it. Pandas doesn’t have that. It trusts you to write efficient code. And if you don’t know what efficient looks like, you’ll never know you’re missing it.

The result is that a lot of Pandas code in the wild is what I’d call politely inefficient. It works on small datasets. It works on medium datasets with a little patience. But the moment you throw real-world data at it, something that’s a few hundred thousand rows or more, the cracks start to show.

What used to take seconds now takes minutes. What took minutes becomes unusable. And the frustrating part is that nothing looks wrong. No errors. No warnings. Just a slow notebook and a spinning cursor. That’s the trap. Pandas optimizes for convenience, not speed. And convenience is great, until it isn’t.

So the first shift is a mindset one: working code and efficient code are not the same thing. Once that clicks, everything else follows.

Profiling: Stop Guessing, Start Measuring

Here’s something I noticed while going down this rabbit hole. Most people, when they feel like their code is slow, do one of two things. They either rewrite the whole thing from scratch hoping something improves, or they just accept it and wait. Neither of those is the right move. The right move is to measure first. You can’t optimize what you haven’t identified. And more often than not, the part of your code you think is slow isn’t actually the problem.

Between Jupyter and Pandas, you get a few simple tools to start with.

%timeit — Know How Long Things Actually Take

%timeit is a Jupyter magic command that runs a line of code multiple times and reports the mean execution time along with its standard deviation. It’s the simplest way to compare two approaches and know, concretely, which one is faster.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'sales': np.random.randint(100, 10000, size=100_000),
    'discount': np.random.uniform(0.0, 0.5, size=100_000)
})

# Approach A
%timeit df.apply(lambda row: row['sales'] * row['discount'], axis=1)

# Approach B
%timeit df['sales'] * df['discount']

On a dataset of 100,000 rows, the difference is not subtle:

  • Approach A: 1.91 s ± 228 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • Approach B: 316 μs ± 14 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Same output. Completely different cost: going by the timings above, the vectorized version is roughly 6,000 times faster. That’s the kind of thing you’d never notice by just running the cell once and moving on.
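%timeit only exists inside IPython and Jupyter. If you want the same comparison in a plain Python script, the standard library’s timeit module can stand in for it. What follows is a minimal sketch, not a benchmark: the names df_small, row_wise, and vectorized are my own, and I’ve assumed a smaller 10,000-row frame so the slow path finishes quickly.

```python
import timeit

import numpy as np
import pandas as pd

# Assumption: a smaller frame than the article's 100,000 rows, to keep the slow path short.
rng = np.random.default_rng(seed=0)
n_rows = 10_000
df_small = pd.DataFrame({
    "sales": rng.integers(100, 10_000, size=n_rows),
    "discount": rng.uniform(0.0, 0.5, size=n_rows),
})


def row_wise():
    # Approach A: call a Python function once per row.
    return df_small.apply(lambda row: row["sales"] * row["discount"], axis=1)


def vectorized():
    # Approach B: multiply whole columns at once.
    return df_small["sales"] * df_small["discount"]


# Confirm both approaches produce the same values before comparing speed.
same_output = np.allclose(row_wise().to_numpy(), vectorized().to_numpy())

# Best-of-three wall-clock time for a single call of each approach, in seconds.
t_row_wise = min(timeit.repeat(row_wise, number=1, repeat=3))
t_vectorized = min(timeit.repeat(vectorized, number=1, repeat=3))

print(f"same output: {same_output}")
print(f"row-wise:    {t_row_wise:.4f} s")
print(f"vectorized:  {t_vectorized:.6f} s")
```

The exact numbers will vary by machine and library version, so treat the ratio between the two as the interesting part, not the absolute times.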

df.info() and df.memory_usage() — Know What You’re Carrying

Speed isn’t just about computation. Memory plays a huge role too. A dataframe that’s bloated with the wrong data types will slow everything down before you’ve even written a single transformation.

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   sales     100000 non-null int32  
 1   discount  100000 non-null float64
dtypes: float64(1), int32(1)
memory usage: 1.1 MB

To check the memory usage of each column:

df.memory_usage(deep=True)

Output:

Index        132
sales     400000
discount  800000
dtype: int64

Here, we can see that discount is taking up twice the space of sales. This is because discount is stored as a “heavier” number type (float64, 8 bytes per value) while sales is stored in a “lighter” type (int32, 4 bytes per value). The integer type np.random.randint gives you depends on your platform and NumPy version, so if your sales column comes out as int64, both columns will show 800,000 bytes. This becomes especially important when you’re working with string columns or object types that are secretly eating memory. We’ll come back to this in the next section.
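To make the “heavier” versus “lighter” idea concrete, here is a small sketch that sets the dtypes explicitly so the result doesn’t depend on platform defaults. The demo frame and the region column are hypothetical additions of mine, used only to show how an object column compares with its category equivalent.

```python
import numpy as np
import pandas as pd

n_rows = 100_000
rng = np.random.default_rng(seed=0)

# Dtypes are set explicitly so the result does not depend on platform defaults.
demo = pd.DataFrame({
    "sales_int64": rng.integers(100, 10_000, size=n_rows, dtype=np.int64),
    "discount": rng.uniform(0.0, 0.5, size=n_rows),
})

# Values stay below 10,000, so int32 holds them without overflow.
demo["sales_int32"] = demo["sales_int64"].astype(np.int32)

# Hypothetical low-cardinality text column, stored as generic Python objects.
regions = rng.choice(["north", "south", "east", "west"], size=n_rows)
demo["region"] = pd.Series(regions, dtype=object)

# The same values stored as a category: small integer codes plus a lookup table.
demo["region_cat"] = demo["region"].astype("category")

# Bytes per column; deep=True also counts the Python strings behind object columns.
mem = demo.memory_usage(deep=True, index=False)
print(mem)
```

The numeric columns come out at exactly 4 or 8 bytes per row. The object column is the one to watch, since every value is a separate Python string, and that is why deep=True matters when you measure it.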

The Profiling Mindset

The tools themselves are simple. The shift is in how you approach your code. Before you optimize anything, ask: where is the time actually going? Measure the slow parts. Compare alternatives. Let the numbers tell you what to fix. Because what feels slow and what is slow are often two different things entirely.
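One way to answer “where is the time actually going?” across a whole workflow is to time each step separately and sort the results. This is a rough sketch under my own assumptions: timed is a hypothetical helper built on time.perf_counter, and the three steps are stand-ins for whatever your notebook actually does.

```python
import time
from contextlib import contextmanager

import numpy as np
import pandas as pd

timings = {}  # step label -> elapsed seconds


@contextmanager
def timed(label):
    # Hypothetical helper: record how long the wrapped block takes.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


rng = np.random.default_rng(seed=0)
n_rows = 10_000  # assumption: kept small so the slow step finishes quickly

with timed("build frame"):
    frame = pd.DataFrame({
        "sales": rng.integers(100, 10_000, size=n_rows),
        "discount": rng.uniform(0.0, 0.5, size=n_rows),
    })

with timed("row-wise"):
    frame["amount_slow"] = frame.apply(
        lambda row: row["sales"] * row["discount"], axis=1
    )

with timed("vectorized"):
    frame["amount_fast"] = frame["sales"] * frame["discount"]

# Slowest step first, so the thing worth fixing is at the top.
for label, seconds in sorted(timings.items(), key=lambda item: item[1], reverse=True):
    print(f"{label:<12} {seconds:.4f} s")
```

A single pass like this is noisier than %timeit, so I’d use it to find the suspicious step and then measure that step properly.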

Mistake #1: Row-wise Operations (The Silent Killer)

If there’s one thing I kept seeing come up again and again while researching this topic, it was this: people looping through Pandas dataframes row by…