LINEAR REGRESSION FROM SCRATCH PT1

Aminah Mardiyyah Rufai
8 min read · Mar 24, 2021

Linear Regression is considered the most natural learning algorithm for modelling data, primarily because it is easy to interpret and models many natural problems efficiently. It belongs to the family of “Linear Models/Predictors” in machine learning, one of the most useful hypothesis spaces. Although there are various ways of implementing this algorithm (including using the Sklearn library in Python), it is just as important to understand the basic intuition behind it and how libraries such as Sklearn work behind the scenes.

[Image: example of a simple linear regression fit. Source: Wikipedia]

Assumptions of the Linear Regression Model

  • It assumes a linear relationship between the dependent and independent variables.
  • It assumes the variables follow a normal distribution and, like most machine learning models, that the observations are Independent and Identically Distributed (i.i.d.).
  • It assumes no or little multicollinearity among the features.
  • It assumes no autocorrelation in the residuals.
  • It assumes homoscedasticity (constant variance) of the residuals.

THE INTUITION

Generally, in machine learning, there are three types of learning problems:

  • Supervised Learning
  • Unsupervised Learning
  • Reinforcement Learning

Under supervised methods, there are two sub-classes:

  • Regression (the output/target/predicted value is continuous)
  • Classification (the output/target/predicted value is discrete)

I have an article on “Machine Learning Buzzwords” (link embedded here); you might want to check it out for a better understanding of some of these terms.

Linear regression is classified as a regression algorithm because its output is a continuous value. It is also considered a parametric learning algorithm, as the number of its parameters is fixed and independent of the size of the data.

There are two major approaches used in modelling a Linear regression function. These approaches are:
- The Analytical approach
- The Numerical approach

The Analytical Approach, also called the closed-form solution, uses the Normal Equation (the Ordinary Least Squares (OLS) approach), with no update rule for the weights (theta) and no step size or learning rate required. It models a linear relationship between the dependent and independent variables and measures the error between the predicted and actual values using a set criterion (Root Mean Squared Error (RMSE), Mean Squared Error (MSE), or Mean Absolute Error (MAE)). In simpler terms, for the analytical approach the weight ⍬ is computed just once from the OLS equation, rather than being refined iteratively.
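For concreteness, with a design matrix $X$, target vector $y$ and weights $\theta$ (the notation used throughout), the model and the MSE criterion can be written as:

$$\hat{y} = X\theta, \qquad J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^{2}$$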

The Numerical Approach uses iterative methods to repeatedly update the weights (⍬) until convergence to a (global) minimum is achieved, i.e. until the weights that minimize the cost/error are reached. There are several methods in this category, the most popular being the Gradient Descent family of algorithms. Others include:

  • The Newton-Raphson method (a second-order method that uses the Hessian matrix), among others.
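As a quick preview of what the next part covers, the basic gradient descent update takes the form

$$\theta := \theta - \alpha \, \nabla_{\theta} J(\theta)$$

where $\alpha$ is the learning rate and $J(\theta)$ is the cost function defined above.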

This article focuses on the Analytical approach and a demo in code using Python.

A BRIEF SUMMARY OF THE MATHEMATICAL CONCEPT: ANALYTICAL APPROACH(CLOSED FORM SOLUTION)

[Image: derivation of the closed-form (Normal Equation) solution. Source: Towards AI]
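The resulting closed-form solution for the weights, using the same notation as above, is:

$$\theta = (X^{T}X)^{-1}X^{T}y$$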

Note, however, that the matrix $X^{T}X$ may not be invertible, and computing its inverse is computationally expensive ($O(n^{3})$ in the number of features).

Another way of deriving the error criterion is by using Maximum Likelihood Estimation (MLE).


Maximum Likelihood Estimation (MLE) is a way of estimating the best parameters of a model given the data, by maximizing the likelihood (probability) of the observed data under the model.
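As a sketch of the connection (not a full derivation): if the noise is assumed to be Gaussian, maximizing the likelihood is equivalent to minimizing the squared error, so MLE leads to the same solution as OLS:

$$\hat{\theta}_{MLE} = \arg\max_{\theta} \prod_{i=1}^{m} p\left(y^{(i)} \mid x^{(i)}; \theta\right) = \arg\min_{\theta} \sum_{i=1}^{m} \left(y^{(i)} - \theta^{T}x^{(i)}\right)^{2}$$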

IMPLEMENTATION IN CODE(PYTHON)

  • Import the necessary libraries (NumPy, Pandas, and Matplotlib if desired, though not necessary).
  • Read the data (the data used for this demonstration was obtained from the UCI Machine Learning Repository; find the link Here).

Here, I created a Python function with the data path, header, and sep as arguments. The reason is that we sometimes encounter data that does not display or read correctly when loaded with the Pandas read function's defaults. When a file has no header row, Pandas assumes the first row contains the headers, which results in (n-1) samples/data points, i.e. one observation omitted, and incorrect column names. There are also cases where the data is read in such a way that the columns are concatenated together and appear as a single column rather than individual columns, which poses a problem for further analysis. I recommend referencing the Pandas documentation for the arguments that can be passed to the read function to read the data correctly: Pandas Documentation
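A minimal sketch of such a read function, assuming a plain-text/CSV file read with pandas.read_csv (the file path in the usage example is hypothetical):

```python
import pandas as pd

def read_data(path, header=None, sep=","):
    """Read a raw data file into a DataFrame.

    header=None tells Pandas the file has no header row, so the first
    observation is not mistaken for column names; sep controls how the
    columns are separated (for example "," or ";").
    """
    df = pd.read_csv(path, header=header, sep=sep)
    return df

# Usage (hypothetical path):
# df = read_data("data/my_dataset.csv", header=None, sep=",")
```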

Tip: try omitting the ‘header’ and ‘sep’ arguments and reading the data without them to better understand the difference.

Next, the actual column headers, as obtained from the UCI page, are used to replace the defaults. See the sketch below.

Please note that while replacing the column names, it is very important to follow the original order of the columns and input the names accordingly.
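A minimal sketch of this renaming step (the names below are placeholders; use the actual names listed on the UCI page, in the same order as the columns appear in the file):

```python
# Placeholder column names; replace with the actual names from the UCI page,
# keeping the same order as the columns in the file.
column_names = ["feature_1", "feature_2", "feature_3", "target"]
df.columns = column_names
```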

  • Data Preprocessing steps(Data preparation before modelling)

For the data preprocessing steps, the shape of the data was checked (this could also have been included in the read function as a return value if desired), and the Pandas describe function was used to get a better description of the data. Here, I found that one of the columns had a single value across all observations. For simplicity, that column was dropped.

Next, the data was split into features (X) and target (Y).
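A minimal sketch of these steps (the column names are placeholders):

```python
print(df.shape)       # number of rows and columns
print(df.describe())  # summary statistics per column

# Drop the column that holds a single constant value in all observations
# ("constant_col" is a placeholder name).
df = df.drop(columns=["constant_col"])

# Split into features (X) and target (Y); "target" is a placeholder name.
X = df.drop(columns=["target"]).values
Y = df["target"].values
```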

For a linear regression problem, feature scaling is most often performed as part of the preprocessing steps, and a bias term is added to the independent variable X.

Some of the reasons for feature scaling are:

  1. Among the assumptions of linear regression is that the features are identically distributed; very commonly a Gaussian/Normal distribution is assumed. To help keep this in check, feature scaling is performed, either with the Min-Max scaling method (shrinking the data points to the range [0, 1]) or by normalising (shrinking the data points to the range [-1, 1]).
  2. No preference rule on features: sometimes there are extreme values in the dataset (very large or very small values). Scaling prevents a bias or tendency of the algorithm to make skewed predictions (giving more importance to certain features than others), which we want to avoid in use-cases such as this one.
  3. IT WORKS: centering or scaling the data points for a linear regression algorithm has been shown to give a better line of fit, as the data points are centered on, or closer to, the origin.

For the bias term, one primary reason for adding it is that it acts as the model's intercept: it compensates for the constant offset between the actual and predicted target values, so the fitted line is not forced through the origin.

The code implementation for this is as shown below:

I added comments in the functions to show what each one does.
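A minimal sketch of these functions, assuming Min-Max scaling and a prepended column of ones for the bias term:

```python
import numpy as np

def min_max_scale(X):
    """Scale each feature to the [0, 1] range (Min-Max scaling)."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

def add_bias(X):
    """Prepend a column of ones so the model can learn an intercept (bias) term."""
    ones = np.ones((X.shape[0], 1))
    return np.concatenate([ones, X], axis=1)

X_scaled = min_max_scale(X)
X_biased = add_bias(X_scaled)
```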

The next step is to split the data into a train set and a test set. If you are familiar with the Sklearn library, you may have come across or used this a couple of times. Here, however, we will do it from scratch, which will also help in understanding how the Sklearn version works.

To split the data, we will need the following:

  • Shuffled data: this ensures random sampling, so that the distribution of the data is similar across the train and test splits; otherwise the split may be biased. Put simply, we want each feature to be well represented in both the train and test sets, and we want to avoid cases where certain values or patterns appear only in the train set and never in the test set; how would such a model perform?
  • A test_size (or split ratio): this is simply the ratio by which the data is split. In most cases it is 80:20 (i.e. 80% train set, 20% test set), 75:25, or 70:30, depending on the size of the data and individual choice. After splitting we have X_train and X_test (with as many columns as there are features, i.e. all columns except the target), where X_train is the training portion of the data without the labels/target column and X_test is the test portion without labels, on which the model will make predictions; similarly, Y_train is the training portion containing only the target column (1-D in shape) and Y_test is the test portion against which the model will be evaluated.

View the code implementation below. Notice that a split ratio of 80:20 was selected, and slicing was used for splitting.
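A minimal sketch of such a split, assuming NumPy arrays and an 80:20 ratio:

```python
import numpy as np

def train_test_split(X, Y, test_size=0.2, seed=42):
    """Shuffle the rows, then slice them into train and test portions."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))            # shuffled row indices
    split_point = int(len(X) * (1 - test_size))  # e.g. 80% of the rows
    train_idx, test_idx = indices[:split_point], indices[split_point:]
    return X[train_idx], X[test_idx], Y[train_idx], Y[test_idx]

X_train, X_test, Y_train, Y_test = train_test_split(X_biased, Y, test_size=0.2)
```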

  • Modelling:

Finally, we get to build the model. We will implement it as a Python class, a ‘LinearRegression’ class. See the sketch below:
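A minimal sketch of such a class, consistent with the description that follows (the fit method uses the closed-form Normal Equation via NumPy's linalg module):

```python
import numpy as np

class LinearRegression:
    """Closed-form (Normal Equation) linear regression."""

    def fit(self, X, y):
        # X_term is (X^T X)^{-1}; Y_term is X^T y.
        X_term = np.linalg.inv(X.T.dot(X))
        Y_term = X.T.dot(y)
        self.theta = X_term.dot(Y_term)  # theta = (X^T X)^{-1} X^T y
        return self

    def predict(self, X):
        return X.dot(self.theta)

model = LinearRegression().fit(X_train, Y_train)
predictions = model.predict(X_test)
```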

Notice that in the fit method, an X-term and a Y-term were created. This is just for simplicity, to avoid writing one long line of code.

The X-term is simply $(X^{T}X)^{-1}$,

and the Y-term is this part of the equation: $X^{T}y$.

This way, it can easily be substituted into the equation. Luckily, NumPy has a built-in module that makes it easy to compute theta for the closed form: ‘linalg’.

I highly recommend you check out the documentation referenced here: https://numpy.org/doc/stable/reference/routines.linalg.html .

It is great for Linear Algebra and has multiple functionalities.

I hope you found this insightful. Here, I discussed linear regression using the closed-form solution and a simple implementation from scratch using NumPy. WATCH OUT for the next part of the series, on the optimization techniques used in linear regression (the Numerical Approach).

Thank you for reading!!

Find references and resources below. Link to code repository on Github can also be found below:
