Why Gradient Descent Became Stochastic

In this blog post, we will discuss not only how gradient descent and stochastic gradient descent are used, but also why. We already know about linear regression, and I recently wrote about it in the context of vectors and projections. Now we will try to understand gradient descent through a linear regression problem.

But before that, I want to briefly recall what we already know about linear regression and the math behind it, so that anyone just starting out can follow along. If you already know the basic math behind linear regression, you can skip straight to the section titled “Why Do We Need Gradient Descent?”.

Let’s say we have started our machine learning journey, and the first thing we did was implement a linear regression model in Python. We implemented it successfully and got the best values for the slope and intercept. Now we have a question: what is actually happening behind this algorithm? We want to understand the math behind it.

Linear Regression Recap

For that, let’s consider a dataset of $n$ points $(x_i, y_i)$. When we look into the math behind the algorithm, we come across these formulas for the slope and intercept:

\[ \beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]

\[ \beta_0 = \bar{y} - \beta_1\bar{x} \]

Here $\bar{x}$ and $\bar{y}$ are the means of the $x_i$ and $y_i$. Using these formulas, we calculate the slope $\beta_1$ and the intercept $\beta_0$. The simple linear regression equation is:

\[ \hat{y} = \beta_0+\beta_1x \]
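To make the two formulas concrete, here is a minimal sketch in plain Python. The function name `fit_simple_linear_regression` and the small dataset are made up for illustration and are not from the original implementation; the data is chosen to lie exactly on a line so the result is easy to check by hand.

```python
from statistics import mean


def fit_simple_linear_regression(xs, ys):
    """Return (beta_0, beta_1) from the closed-form formulas above."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Numerator: sum of (x_i - x_bar) * (y_i - y_bar)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: sum of (x_i - x_bar)^2
    denominator = sum((x - x_bar) ** 2 for x in xs)
    beta_1 = numerator / denominator
    beta_0 = y_bar - beta_1 * x_bar
    return beta_0, beta_1


# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

beta_0, beta_1 = fit_simple_linear_regression(xs, ys)
print(beta_0, beta_1)  # intercept 1.0 and slope 2.0 for this data
```

Because the points sit exactly on a line, the formulas recover that line's intercept and slope. With noisy data they return the line that minimizes the squared error instead.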

We got the values using the formulas, but we are not satisfied and want to go deeper. Our goal now is to learn where these formulas come from. To understand that, picture a 3D bowl-shaped surface. We get this surface when we plot the mean squared error (MSE) against every possible combination of $\beta_0$ and $\beta_1$.

Looking at the surface, we see that we want the mean squared error to be as low as possible, and that it reaches its minimum at the bottom of the bowl, where the gradient is zero. We already know that finding the slope of a curve requires differentiation. So we differentiate the loss function, since the bowl is its 3D representation. Because the loss depends on two variables, $\beta_0$ and $\beta_1$, we take the partial derivative with respect to each, set both to zero, and solve to get the formulas for the slope and intercept.
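The bowl picture can be explored numerically without plotting anything. The sketch below evaluates the MSE on a coarse grid of candidate $(\beta_0, \beta_1)$ pairs and picks the lowest point. The dataset, the grid range, and the 0.1 step size are all assumptions chosen for illustration.

```python
def mse(beta_0, beta_1, xs, ys):
    """Mean squared error of the line y = beta_0 + beta_1 * x."""
    n = len(xs)
    return sum((y - (beta_0 + beta_1 * x)) ** 2 for x, y in zip(xs, ys)) / n


# Hypothetical noisy data scattered around y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.9, 5.2, 6.8, 9.1, 11.0]

# Candidate values from -5.0 to 5.0 in steps of 0.1 for both parameters.
grid = [i / 10 for i in range(-50, 51)]

# Evaluate the "bowl" at every grid point and keep the lowest one.
best_mse, best_b0, best_b1 = min(
    (mse(b0, b1, xs, ys), b0, b1) for b0 in grid for b1 in grid
)
print(best_b0, best_b1, best_mse)
```

A grid search like this only finds the bottom of the bowl up to the grid resolution, and its cost grows quickly with the number of parameters. Setting the derivatives to zero gives the exact minimum instead.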

Deriving the Formulas for Slope and Intercept

Start with the mean squared error (MSE) loss function:

\[ MSE(\beta_0,\beta_1) = \frac{1}{n} \sum_{i=1}^{n} (y_i-(\beta_0+\beta_1x_i))^2 \]

Distribute the minus sign in the inner expression:

\[ = \frac{1}{n} \sum_{i=1}^{n} (y_i-\beta_0-\beta_1x_i)^2 \]

Now take the partial derivative with respect to $\beta_0$:

\[ \frac{\partial MSE}{\partial \beta_0} = \frac{\partial}{\partial \beta_0} \left( \frac{1}{n} \sum_{i=1}^{n} (y_i-\beta_0-\beta_1x_i)^2 \right) \]

Move the constant factor $\frac{1}{n}$ outside the derivative:

\[ = \frac{1}{n} \frac{\partial}{\partial \beta_0} \sum_{i=1}^{n} (y_i-\beta_0-\beta_1x_i)^2 \]

Move the derivative inside the summation:

\[ = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial \beta_0} (y_i-\beta_0-\beta_1x_i)^2 \]

Apply the chain rule:

\[ = \frac{1}{n} \sum_{i=1}^{n} 2(y_i-\beta_0-\beta_1x_i) \cdot \frac{\partial}{\partial \beta_0} (y_i-\beta_0-\beta_1x_i) \]

Differentiate each term of the inner expression with respect to $\beta_0$, treating $y_i$, $x_i$ and $\beta_1$ as constants:

\[ \frac{\partial}{\partial \beta_0}(y_i)=0, \quad \frac{\partial}{\partial \beta_0}(-\beta_0)=-1, \quad \frac{\partial}{\partial \beta_0}(-\beta_1x_i)=0 \]

So the inner derivative becomes:

\[ \frac{\partial}{\partial \beta_0} (y_i-\beta_0-\beta_1x_i) = -1 \]

Substitute back:

\[ \frac{\partial MSE}{\partial \beta_0} = \frac{1}{n} \sum_{i=1}^{n} 2(y_i-\beta_0-\beta_1x_i)(-1) \]

Simplify:

\[ = -\frac{2}{n} \sum_{i=1}^{n} (y_i-\beta_0-\beta_1x_i) \]
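One way to gain confidence in this simplified derivative is to compare it against a numerical approximation. The sketch below does that with a central finite difference. The dataset, the evaluation point $(0.5, 1.5)$, and the step size `h` are arbitrary choices made for illustration.

```python
def mse(beta_0, beta_1, xs, ys):
    """Mean squared error of the line y = beta_0 + beta_1 * x."""
    n = len(xs)
    return sum((y - beta_0 - beta_1 * x) ** 2 for x, y in zip(xs, ys)) / n


def d_mse_d_beta_0(beta_0, beta_1, xs, ys):
    """Analytic partial derivative: -(2/n) * sum(y_i - beta_0 - beta_1 * x_i)."""
    n = len(xs)
    return -2.0 / n * sum(y - beta_0 - beta_1 * x for x, y in zip(xs, ys))


def finite_difference(beta_0, beta_1, xs, ys, h=1e-6):
    """Central-difference approximation of the same partial derivative."""
    upper = mse(beta_0 + h, beta_1, xs, ys)
    lower = mse(beta_0 - h, beta_1, xs, ys)
    return (upper - lower) / (2 * h)


# Hypothetical data and an arbitrary point on the loss surface.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.9, 5.2, 6.8, 9.1, 11.0]

analytic = d_mse_d_beta_0(0.5, 1.5, xs, ys)
numeric = finite_difference(0.5, 1.5, xs, ys)
print(analytic, numeric)  # the two values should agree closely
```

The two numbers agree up to floating-point rounding, which is what we expect if the chain-rule steps were carried out correctly.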

Set the derivative equal to zero:

\[ -\frac{2}{n} \sum_{i=1}^{n} (y_i-\beta_0-\beta_1x_i) = 0 \]

Multiply both sides by $-\frac{n}{2}$:

\[ \sum_{i=1}^{n} (y_i-\beta_0-\beta_1x_i) = 0 \]

Expand the sum term by term, noting that $\sum_{i=1}^{n}\beta_0 = n\beta_0$:

\[ \sum_{i=1}^{n}y_i - n\beta_0 - \beta_1\sum_{i=1}^{n}x_i = 0 \]

Rearrange:

\[ n\beta_0 = \sum_{i=1}^{n}y_i - \beta_1\sum_{i=1}^{n}x_i \]

Divide by $n$:

\[ \beta_0 = \frac{1}{n}\sum_{i=1}^{n}y_i - \beta_1 \frac{1}{n}\sum_{i=1}^{n}x_i \]

Using the means ($\bar{x} = \frac{1}{n}\sum x_i$ and $\bar{y} = \frac{1}{n}\sum y_i$), the final intercept formula is:

\[ \beta_0 = \bar{y} - \beta_1\bar{x} \]
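The condition we solved, $\sum (y_i-\beta_0-\beta_1x_i) = 0$, involves only $\beta_0$ once the slope is fixed. So the intercept formula should make the residuals sum to zero for any slope we plug in, not just the best one. The sketch below checks this on made-up data with a few arbitrary slopes; the helper name `intercept_for_slope` is hypothetical.

```python
from statistics import mean


def intercept_for_slope(beta_1, xs, ys):
    """Intercept from the derived formula: beta_0 = y_bar - beta_1 * x_bar."""
    return mean(ys) - beta_1 * mean(xs)


def residual_sum(beta_0, beta_1, xs, ys):
    """Sum of residuals y_i - beta_0 - beta_1 * x_i."""
    return sum(y - beta_0 - beta_1 * x for x, y in zip(xs, ys))


# Hypothetical data and a few arbitrary candidate slopes.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.9, 5.2, 6.8, 9.1, 11.0]
slopes = [-1.0, 0.0, 2.01, 5.0]

sums = []
for beta_1 in slopes:
    beta_0 = intercept_for_slope(beta_1, xs, ys)
    sums.append(residual_sum(beta_0, beta_1, xs, ys))

print(sums)  # each entry should be zero up to floating-point rounding
```

This also shows why the intercept formula alone is not enough: it fixes $\beta_0$ for a given slope, but the slope itself still has to come from the second condition, the partial derivative with respect to $\beta_1$.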