Correlation Doesn’t Mean Causation! But What Does It Mean?

Data Science

What does correlation tell us?

Sara A. Metwalli | Apr 28, 2026 | 6 min read

Even before any of us got into data science, there was a phrase we had all heard, young and old: “Correlation doesn’t imply causation.” It is a catchy phrase. You have probably said it once or twice, and may even have nodded confidently when someone else said it.

The phrase comes up most often with datasets that have nothing to do with each other, where implying causation is funny and intriguing. Here are two very interesting facts:

  • Countries that eat more pizza tend to have higher math scores.
  • The more sunglasses sold, the more shark attacks occur.

Now, if that were all the information you had… what should you conclude? Does eating pizza make you better at math? Will buying a new pair of sunglasses cause a shark attack? Though it is funny to think about, the answer to both questions is “probably not”. And yet, these are examples of something very real: correlation.

The question worth asking now is: if correlation doesn’t equal causation, then what does it mean? That’s where things get fuzzy. We tend to treat correlation as a vague idea, as if it meant “They’re kind of related” or “They move together somehow”. But correlation isn’t just a feeling; it’s a precise mathematical measurement of how two variables move together. Instead of just repeating the warning, let’s actually understand the concept. Once you do, those weird examples stop being surprising and start making sense. So, let’s get into it!

What is correlation?

When people say two things are “correlated,” they usually mean one of three things:

  1. “Those two things seem related.”
  2. “Those two things move together.”
  3. “There’s some connection between those two things.”

On the surface, none of these three is wrong, but each misses some nuance. Correlation is not a vibe. It’s a measurement! And like any measurement, it answers a very specific question.

Taking a step back, imagine you collect data on how many hours students studied and their exam scores. You draw it as a scatter plot. Each point represents one student: the x-axis is how long they studied, and the y-axis is their score. When you look at this plot, you notice that the points tend to rise from left to right. So you conclude, “As study time increases, scores tend to increase too”, which is what we call a positive correlation.

But is that just a trend, or is the data telling you something more? In this example, the relationship you just plotted is: when one variable is above its average, the other tends to be above its average too. That’s the key idea most people miss: correlation isn’t about raw values, it’s about how variables move relative to their averages. So, the question correlation answers is: do two variables move together in a consistent way?

That question has one of three answers:

  • Up + up → positive correlation
  • Up + down → negative correlation
  • No consistent pattern → no correlation
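
To make the "relative to their averages" idea concrete, here is a minimal sketch. The study-hours and score values are made up for illustration and are not data from the article; the sketch only checks, student by student, whether both values sit on the same side of their averages.

```python
import numpy as np

# Hypothetical data (assumed for illustration): hours studied and exam scores.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
scores = np.array([52.0, 55.0, 61.0, 60.0, 68.0, 72.0, 75.0, 81.0])

# How far each value sits from its own variable's average.
hours_dev = hours - hours.mean()
scores_dev = scores - scores.mean()

# A student supports an "up + up" pattern when both deviations share a sign.
same_side = np.sign(hours_dev) == np.sign(scores_dev)
agreement = same_side.mean()

print(f"Share of students on the same side of both averages: {agreement:.2f}")
```

A share near 1 points toward a positive correlation, a share near 0 toward a negative one, and a share near 0.5 toward no consistent pattern. This is only a rough intuition check, not the correlation coefficient itself.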

The Math Behind Correlation

Let’s try to make thinking about correlation more intuitive. We will do that using the Pearson correlation coefficient, which we can define as:

$$r = \frac{\operatorname{cov}(X, Y)}{\sigma_{X} \cdot \sigma_{Y}}$$

Okay, I know that equation isn’t what anyone thinks of when I say “intuitive”… But stick with me and let’s unpack it without turning it into a lecture.

Step 1: Covariance (AKA Do They Move Together?)

Covariance looks at how two variables move relative to their averages. If the two variables tend to be above their averages at the same time (and below them at the same time), we get positive covariance; if one tends to be above its average when the other is below, we get negative covariance. Basically, covariance answers: “Are these variables aligned in how they deviate from their averages?”
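
As a rough sketch of this step, the snippet below computes the sample covariance by hand and compares it with NumPy's `np.cov`. The data values are hypothetical, and the `n - 1` divisor is an assumption chosen to match `np.cov`'s default.

```python
import numpy as np

# Hypothetical data (assumed for illustration): hours studied and exam scores.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
scores = np.array([52.0, 55.0, 61.0, 60.0, 68.0, 72.0, 75.0, 81.0])

# Deviations from each variable's average.
hours_dev = hours - hours.mean()
scores_dev = scores - scores.mean()

# Each product is positive when both deviations share a sign, negative otherwise.
products = hours_dev * scores_dev

# Sample covariance: average the products using n - 1 (np.cov's default).
cov_manual = products.sum() / (len(hours) - 1)
cov_numpy = np.cov(hours, scores)[0, 1]

print(f"manual covariance: {cov_manual:.3f}")
print(f"np.cov covariance: {cov_numpy:.3f}")
```

The two numbers should agree, and the positive sign reflects that study hours and scores deviate from their averages in the same direction.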

Step 2: Normalize It

Covariance alone is hard to interpret because it depends on scale: change the units of a variable and the covariance changes with it. To overcome that, we divide by the standard deviations $\sigma_{X}$ and $\sigma_{Y}$. This rescales everything into a clean range: -1 to 1. That gives us common ground for comparing relationships between variables measured in different units.
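
The scale problem can be sketched in a few lines. Below, the same hypothetical study time is expressed in hours and in minutes; `pearson_r` is an illustrative helper written for this example, not a library function.

```python
import numpy as np

def pearson_r(x, y):
    """Covariance divided by the product of the standard deviations."""
    cov_xy = np.cov(x, y)[0, 1]                        # sample covariance (n - 1)
    return cov_xy / (x.std(ddof=1) * y.std(ddof=1))    # matching n - 1 std devs

# Hypothetical data (assumed for illustration).
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
scores = np.array([52.0, 55.0, 61.0, 60.0, 68.0, 72.0, 75.0, 81.0])
minutes = hours * 60.0  # same information, different unit

cov_hours = np.cov(hours, scores)[0, 1]
cov_minutes = np.cov(minutes, scores)[0, 1]
r_hours = pearson_r(hours, scores)
r_minutes = pearson_r(minutes, scores)

print(f"covariance (hours):   {cov_hours:.2f}")
print(f"covariance (minutes): {cov_minutes:.2f}")   # 60 times larger
print(f"r (hours):   {r_hours:.4f}")
print(f"r (minutes): {r_minutes:.4f}")              # unchanged
```

Switching units multiplies the covariance by 60 but leaves r untouched, which is the point of the normalization. The helper should also agree with `np.corrcoef(hours, scores)[0, 1]`.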

After these two steps, we can now calculate the Pearson coefficient! If we get:

  • +1 → perfect positive relationship.
  • 0 → no linear relationship.
  • -1 → perfect negative relationship.

This coefficient simply measures how consistently the two variables move together: not how big they are, but how well they are aligned.
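
One way to see the "not how big, but how well aligned" point is to compare exact straight-line relationships with very different slopes. The values and slopes below are arbitrary choices for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Three exact linear relationships with arbitrary (assumed) slopes and offsets.
r_gentle = np.corrcoef(x, 0.1 * x + 7.0)[0, 1]      # barely rising line
r_steep = np.corrcoef(x, 1000.0 * x - 3.0)[0, 1]    # very steep rising line
r_falling = np.corrcoef(x, -2.0 * x + 10.0)[0, 1]   # falling line

print(f"gentle slope:  r = {r_gentle:.3f}")
print(f"steep slope:   r = {r_steep:.3f}")
print(f"falling slope: r = {r_falling:.3f}")
```

Both rising lines give r of 1 (up to floating-point rounding) regardless of steepness, and the falling line gives -1. The coefficient reflects how tightly the points follow a line, not how steep that line is.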

What Different Correlations Look Like

  • Left: strong positive correlation → clear upward pattern
  • Middle: no correlation → random scatter
  • Right: strong negative correlation → downward pattern
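
The three patterns can be reproduced with synthetic data. This is a sketch under assumed settings: the seed, sample size, and noise level are arbitrary and are not the values behind the article's figure.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
n = 1000

x = rng.normal(size=n)
noise = rng.normal(size=n)

y_positive = x + 0.3 * noise     # follows x closely: upward pattern
y_none = rng.normal(size=n)      # generated independently of x: random scatter
y_negative = -x + 0.3 * noise    # mirrors x: downward pattern

r_positive = np.corrcoef(x, y_positive)[0, 1]
r_none = np.corrcoef(x, y_none)[0, 1]
r_negative = np.corrcoef(x, y_negative)[0, 1]

print(f"strong positive: r = {r_positive:.2f}")
print(f"no correlation:  r = {r_none:.2f}")
print(f"strong negative: r = {r_negative:.2f}")
```

Increasing the `0.3` noise factor spreads the points further from the line and pulls the first and third coefficients toward 0.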

Correlation measures consistency of movement, not just whether two variables are related.

What Correlation Actually Tells You

Correlation tells you that these variables move together in a structured way. It tells you there is a pattern here worth paying attention to. But it does NOT tell you why or how they move together, or whether one causes the other.

The classic example is that ice cream sales and drowning incidents are correlated. If we plot ice cream sales against drowning incidents, we see a clear upward relationship between the two variables. So more ice cream sales lead to more drownings? That conclusion is misleading, because the real driver is temperature: hot weather means more ice cream sales, more people going to the beach, and more swimming. So, though we can clearly see that the correlation is real, the explanation is hidden.
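
A small simulation can illustrate the hidden-driver idea. Everything below is assumed for illustration: the temperature range, the coefficients, and the noise levels are invented, and neither simulated variable depends on the other. The `residuals` helper is hypothetical and simply removes a straight-line temperature effect.

```python
import numpy as np

rng = np.random.default_rng(42)
n_days = 1000

# Assumed model: temperature drives both variables; they do not affect each other.
temperature = rng.uniform(10.0, 35.0, size=n_days)
ice_cream = 20.0 * temperature + rng.normal(0.0, 40.0, size=n_days)
drownings = 0.3 * temperature + rng.normal(0.0, 1.0, size=n_days)

r_raw = np.corrcoef(ice_cream, drownings)[0, 1]

def residuals(y, x):
    """What is left of y after subtracting a fitted straight line in x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlate what remains once the temperature effect is removed from both.
r_adjusted = np.corrcoef(
    residuals(ice_cream, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"raw correlation:                  {r_raw:.2f}")
print(f"after accounting for temperature: {r_adjusted:.2f}")
```

In this simulated setup the raw correlation is strong, while the adjusted one sits near zero, because temperature was the only link between the two series. Real data rarely separates this cleanly.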

Correlation and Nonlinearity

Now consider this relationship: $y = x^2$

This is clearly a strong relationship: as x moves away from zero in either direction, y increases! But if you compute the correlation: np.corrcoef(x, y)[0, 1]

You’ll get something close to 0, at least when the x values are spread symmetrically around zero. That is because correlation only measures how well a straight line fits the relationship. This is a crucial limitation: if the relationship is curved, correlation may miss it, even when a strong relationship exists. So, instead of thinking “Correlation = relationship”, think “Correlation = linear relationship”.
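
Here is a minimal sketch of that computation. The x ranges are assumptions chosen for illustration: one is symmetric around zero, the other covers only non-negative values.

```python
import numpy as np

# Symmetric range: the left and right halves of the parabola cancel out.
x_symmetric = np.linspace(-5.0, 5.0, 101)
y_symmetric = x_symmetric ** 2
r_symmetric = np.corrcoef(x_symmetric, y_symmetric)[0, 1]

# One-sided range: only the rising half of the parabola is visible.
x_one_sided = np.linspace(0.0, 5.0, 101)
y_one_sided = x_one_sided ** 2
r_one_sided = np.corrcoef(x_one_sided, y_one_sided)[0, 1]

print(f"x in [-5, 5]: r = {r_symmetric:.3f}")
print(f"x in [0, 5]:  r = {r_one_sided:.3f}")
```

On the symmetric range the coefficient is essentially zero even though y is fully determined by x. On the one-sided range the curve looks roughly like a rising line, so the coefficient is high. The same formula therefore gives very different answers depending on which part of a curve the data covers.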