Linear Regression From Scratch PT3: Stochastic Gradient Descent

Aminah Mardiyyah Rufai · Published in Nerd For Tech · 6 min read · May 25, 2021

Welcome back! This is part 3 of the Linear Regression From Scratch series. In this article, we will explore the famous Stochastic Gradient Descent (SGD) algorithm. I will attempt to explain the intuition behind how this very popular algorithm works, and what makes it different from the plain Batch Gradient Descent algorithm explored in the previous part. If you have not seen the earlier articles in this series, find the links embedded here: Part 1 and Part 2. Do read them, as this article builds on what was discussed there.

In the previous article, we discussed why the numerical approach is preferred over the analytical approach (the closed-form solution) for solving a linear regression task. In particular, we explored one popular numerical method: gradient descent. As explained, it is the most popular optimization technique used in machine learning (including sub-fields like deep learning), and it has three variants:

  • Batch Gradient Descent (BGD)
  • Stochastic Gradient Descent (SGD)
  • Mini-Batch Gradient Descent (MBGD), also called Mini-Batch Stochastic Gradient Descent (MBSGD)

The focus of this article will be on Stochastic Gradient Descent (SGD).

An important point to note: the idea and assumptions behind all three variants are the same (an iterative process to reach a global minimum, under an assumption of convexity); they differ only slightly in how they go about it.

THE STOCHASTIC GRADIENT DESCENT: INTUITION

Recall that in Batch Gradient Descent (BGD), all training examples (X) are used to compute the gradient and update the weights in a single iteration step.

For Stochastic Gradient Descent (SGD), the difference lies in how the gradient computation is carried out. Rather than using all training examples at once, SGD randomly selects a single training example for the gradient computation. What do I mean?

Remember the Pseudocode for Gradient Descent?

  • Initialize a random guess for the weights (start with some random guess for theta)
  • Compute the gradient using all training samples (X)
  • Update the weights using the update rule
  • Repeat until convergence

For SGD, this becomes:

  1. Initialize a random guess for the weights (start with some random guess for theta).
  2. Randomly select a sample x(i) from the training data.
  3. Compute its gradient (i.e. the gradient computed from x(i) alone).
  4. Update the weights using the update rule.
  5. Repeat steps 2–4 across the training samples (from x(1) to x(n)) until convergence.
Update rule for SGD: θ := θ − α ∇J(θ; x(i), y(i))

where x(i) is a single training observation/sample, y(i) is the label/target mapped to x(i), α is the learning rate, and ∇J(θ; x(i), y(i)) is the gradient of the loss computed on that single sample.
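To make the update rule concrete, here is a minimal sketch of a single SGD step in NumPy on a toy sample (the values and variable names are purely illustrative, and the constant factor from the squared-error derivative is folded into the learning rate):

```python
import numpy as np

# one toy sample with a bias term and two features, plus its target
x_i = np.array([[1.0, 2.0, 3.0]])   # shape (1, 3): [bias, feature1, feature2]
y_i = np.array([[14.0]])            # shape (1, 1)

theta = np.zeros((3, 1))            # current guess for the weights
alpha = 0.01                        # learning rate

# gradient of the squared error for this single sample
gradient = x_i.T @ (x_i @ theta - y_i)

# SGD update rule: move theta a small step against the gradient
theta = theta - alpha * gradient
print(theta.ravel())                # [0.14 0.28 0.42]
```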

Fun Fact: Stochastic == Random. Training examples are chosen randomly.

SGD has the advantage of being fast and requiring less memory. However, it never really converges the way we might expect; rather, it settles for an approximation of the optimal solution. You will notice it takes a noisier path down the slope and oscillates around the global minimum (but never quite hits the spot). The computational cost of each update is constant, O(1), regardless of the dataset size.

Source: ResearchGate

“The stochastic (random) nature of SGD means that some noise will be introduced into the weight setting, but the technique is overall still very effective and is widely used.” — ResearchGate (link to the publication below)

SGD is also referred to as an “unbiased estimator”: the expectation of its single-sample gradient is the same as the full-batch gradient used in BGD, although it comes with a higher variance (which is exactly where the noise comes from).

“An estimator is a function that tries to estimate some useful quality of a population from its sample data. Bias is a very important property of an estimator in machine learning. SGD is unbiased because the expectation of its gradient is the same as that of BGD.

I understand these terms might sound a little complex, so I recommend you check the links below; I have attached a blog post that explains these concepts.”
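As a quick, hedged sanity check of the unbiasedness claim, the snippet below compares the full-batch gradient with the average of all single-sample gradients on some random data (all names are illustrative); the two agree up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))      # 100 samples, 3 features
y = rng.normal(size=(100, 1))
theta = rng.normal(size=(3, 1))    # an arbitrary weight vector

# full-batch gradient of the squared error (scaled by 1/N, constants ignored)
batch_grad = X.T @ (X @ theta - y) / len(X)

# the 100 single-sample gradients that SGD would draw from
sgd_grads = [X[i:i+1].T @ (X[i:i+1] @ theta - y[i:i+1]) for i in range(len(X))]
mean_sgd_grad = np.mean(sgd_grads, axis=0)

# averaging the single-sample gradients recovers the batch gradient
print(np.allclose(batch_grad, mean_sgd_grad))   # True
```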

Advantages of Stochastic Gradient Descent over Batch Gradient Descent

  • Batch Gradient Descent is slow: it requires computing the gradient over all training samples at every iteration. Stochastic Gradient Descent, on the other hand, is faster because it only uses one sample at a time.
  • Stochastic Gradient Descent is an unbiased estimator of the full-batch gradient.
  • The cost of a single SGD update is O(1) in the number of samples, whereas a BGD update costs O(N).

CODE IMPLEMENTATION OF STOCHASTIC GRADIENT DESCENT

As mentioned earlier, the only slight change needed to the Batch Gradient Descent code from the previous article is to iterate over the training data (X), taking one observation (row) at a time (more like slicing through the data horizontally), and use that single row for the gradient computation rather than the entire dataset.

Let’s see how that works.

We will skip the data preprocessing steps and proceed directly to the model class.

Before that, let’s get a quick overview of two important terms with respect to SGD. These are:

  • Epoch
  • Number of iterations (n_iters)

An epoch is defined as a single pass over the entire dataset (all training data), while an iteration is a pass over a single training sample. For Batch Gradient Descent, the two terms can sometimes be used interchangeably, since all training data is processed in a single step. For SGD, however, the terms differ (as you will also see in many deep learning applications): an epoch covers the entire dataset, while an iteration is each step we take as we slice through the data horizontally (one observation at a time).
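To make the distinction concrete, here is a small illustrative count (the numbers are arbitrary):

```python
n_samples = 1000   # size of the training set
n_epochs = 50      # passes over the whole dataset

# Batch Gradient Descent: one weight update per epoch
bgd_updates = n_epochs               # 50 updates in total

# SGD: one weight update per sample, i.e. n_samples iterations per epoch
sgd_updates = n_epochs * n_samples   # 50,000 updates in total

print(bgd_updates, sgd_updates)      # 50 50000
```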

You will get a better intuition of this with code.

The only function that will reflect the change is the fit/train function.

Train function
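The original snippet was shared as an image, so here is a minimal sketch of what the modified class and its fit/train method might look like, based on the structure from the earlier parts (the predict and MSE functions are omitted, and the attribute names lr, n_epochs, theta and losses are my own illustrative choices):

```python
import numpy as np

class LinearRegressionSGD:
    def __init__(self, lr=0.001, n_epochs=50):
        self.lr = lr                 # learning rate
        self.n_epochs = n_epochs     # passes over the full training set
        self.theta = None            # weights
        self.losses = []             # MSE recorded after each epoch

    def fit(self, X_train, y_train):
        # add a bias column of ones, as in the earlier parts
        X = np.c_[np.ones((X_train.shape[0], 1)), X_train]
        y = y_train.reshape(-1, 1)
        n_samples, n_features = X.shape

        # step 1: start with a random guess for theta
        self.theta = np.random.randn(n_features, 1)

        for epoch in range(self.n_epochs):
            # second loop: one iteration per row, i.e. n_iters = X_train.shape[0]
            for _ in range(n_samples):
                i = np.random.randint(n_samples)   # step 2: pick one sample at random
                x_i = X[i].reshape(1, -1)          # reshape the row slice into a (1, n_features) matrix
                y_i = y[i].reshape(1, -1)

                # step 3: gradient of the squared error for this single sample
                gradient = x_i.T @ (x_i @ self.theta - y_i)

                # step 4: update rule
                self.theta -= self.lr * gradient

            # track the MSE over the full training set once per epoch
            self.losses.append(float(np.mean((X @ self.theta - y) ** 2)))
```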

As shown in the code snippet above, a second loop was introduced, iterating over the horizontal slices of the data (X_train.shape[0] gives the number of rows/observations). Notice the line where this slice was reshaped? Can you guess why?

Exactly! Slicing out a single row gives us a vector rather than a matrix, and the reshape turns it back into a proper “row vector” (a 1 × n matrix). Without reshaping, the code returns an error (remember the rule for matrix multiplication). This modified ‘X’ replaces the initial X in the train function only.
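Here is a tiny illustration of the shape issue, with toy data (names illustrative):

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(4, 3)   # 4 observations, 3 features
theta = np.ones((3, 1))

row = X[0]                         # shape (3,): a flat 1-D array, not a row vector
row_matrix = X[0].reshape(1, -1)   # shape (1, 3): a proper row vector

print(row.shape, row_matrix.shape)   # (3,) (1, 3)
print(row.T.shape)                   # still (3,): transposing a 1-D array does nothing,
                                     # which is what breaks the gradient computation
print((row_matrix @ theta).shape)    # (1, 1), as expected for x(i) @ theta
```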

Also, the value of “X_train.shape[0]” is the number of iterations here (n_iters), so you could just as well define n_iters as a separate variable and loop over it.

Observe that the same slicing was also applied to ‘Y’ (as indicated in the formula).

Every other function, namely the predict and MSE functions, remains the same.

Run the code by initializing the model and see how the loss plot differs from BGD.
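As a hedged usage sketch, assuming the class layout from the snippet above and an already-preprocessed X_train and y_train:

```python
import matplotlib.pyplot as plt

model = LinearRegressionSGD(lr=0.001, n_epochs=50)   # note the smaller learning rate
model.fit(X_train, y_train)

plt.plot(model.losses)
plt.xlabel("Epoch")
plt.ylabel("MSE")
plt.title("SGD training loss")
plt.show()
```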

Did you notice a smaller learning rate was used as compared to BGD? Can you see the oscillatory and noisy movement?

Generally, it is advisable to use a smaller learning rate for SGD, because each update depends on a single data point. The gradient estimate is very noisy (as shown in the loss plot above), so the weights have a greater tendency to overshoot as they oscillate around the global minimum.

Great Job!

CONCLUSION

I hope you enjoyed reading this and gained some understanding of how SGD works. SGD is especially popular in the deep learning sub-field of machine learning, where it (and its variants) is among the most widely used optimizers.

In the next part of the series, we will talk about Mini-Batch Stochastic Gradient Descent (the coolest of the lot 😄).

“We keep improving as we grow as long as we try. We make steady incremental progress, as we move along. Keep moving forward!!”

Social Media Connect:

LinkedIn: https://www.linkedin.com/in/aminah-mardiyyah-rufa-i/

Twitter: @diyyah92
