Linear Regression Made Simple: A Step-by-Step Tutorial

Utsav Desai
13 min read · Feb 12, 2023


What is Linear Regression?

Linear Regression is a supervised learning algorithm in machine learning, which is widely used for solving regression problems. Regression is a type of machine learning problem where the goal is to predict a continuous output variable based on one or more input variables.

In Linear Regression, the goal is to find the best-fitting linear equation to describe the relationship between the input variables (also known as predictors or features) and the output variable (also known as the response variable).

The equation for a simple linear regression model can be written as follows:

y = b0 + b1 * x

Here, y is the dependent variable (the variable we are trying to predict), x is the independent variable (the predictor or feature), b0 is the intercept term (the value of y when x is zero), and b1 is the slope coefficient (the change in y for a unit change in x).

The goal of Linear Regression is to find the best values for b0 and b1 such that the line best fits the data points, minimizing the errors or the difference between the predicted values and the actual values.
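As a quick illustration, the coefficients b0 and b1 can be estimated from data in a few lines of NumPy; the data below is synthetic and chosen only for illustration:

import numpy as np

# Synthetic data: y is roughly 2 + 3 * x plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 + 3 * x + rng.normal(scale=1.0, size=x.shape)

# np.polyfit with degree 1 fits a straight line and returns [b1, b0] (slope first)
b1, b0 = np.polyfit(x, y, deg=1)
print(f"b0 (intercept) ≈ {b0:.2f}, b1 (slope) ≈ {b1:.2f}")

# Predict y for a new x using the fitted line
print("prediction at x = 7.5:", b0 + b1 * 7.5)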

Types of Linear Regression

There are two main types of Linear Regression models: Simple Linear Regression and Multiple Linear Regression.

Simple Linear Regression: In simple linear regression, there is only one independent variable (also known as the predictor or feature) and one dependent variable (also known as the response variable). The goal of simple linear regression is to find the best-fitting line to describe the relationship between the independent and dependent variable. The equation for a simple linear regression model can be written as:

Y = b0 + b1 * X

Here, Y is the dependent variable, X is the independent variable, b0 is the intercept term, and b1 is the slope coefficient.

Multiple Linear Regression: In multiple linear regression, there are multiple independent variables and one dependent variable. The goal of multiple linear regression is to find the best-fitting linear equation (geometrically a plane or hyperplane rather than a single line) to describe the relationship between the independent variables and the dependent variable. The equation for a multiple linear regression model can be written as:

Y = b0 + b1 * X1 + b2 * X2 + … + bn * Xn

Here, Y is the dependent variable, X1, X2, …, Xn are the independent variables, b0 is the intercept term, and b1, b2, …, bn are the slope coefficients.
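As a sketch, these coefficients can be estimated with NumPy's least-squares solver; the data below is made up so that Y = 1 + 2*X1 + 1*X2 exactly:

import numpy as np

# Made-up data constructed so that y = 1 + 2*x1 + 1*x2 exactly
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([5.0, 6.0, 11.0, 12.0, 16.0])

# Prepend a column of ones so the intercept b0 is estimated alongside b1 and b2
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares via np.linalg.lstsq; coeffs = [b0, b1, b2]
coeffs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("b0, b1, b2 =", coeffs)   # approximately [1. 2. 1.]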

In both types of linear regression, the goal is to find the best values for the intercept and slope coefficients that minimize the difference between the predicted values and the actual values. Linear regression is widely used in many real-world applications, such as finance, marketing, and healthcare, for predicting outcomes such as stock prices, customer behavior, and patient outcomes.

Linear Regression Line

In machine learning, a regression line can show two types of relationships between the input variables (also known as predictors or features) and the output variable (also known as the response variable) in a linear regression model.

  • Positive Relationship: A positive relationship exists between the input variables and the output variable when the slope of the regression line is positive. In other words, as the values of the input variables increase, the value of the output variable also increases. This can be seen as an upward slope on a scatter plot of the data.
  • Negative Relationship: A negative relationship exists between the input variables and the output variable when the slope of the regression line is negative. In other words, as the values of the input variables increase, the value of the output variable decreases. This can be seen as a downward slope on a scatter plot of the data.

Finding the best fit line

In machine learning, finding the best-fitting line is crucial in linear regression, as it determines the accuracy of the predictions made by the model. The best-fitting line is the line that has the smallest difference between the predicted values and the actual values.

To find the best-fitting line in a linear regression model, we use a process called “ordinary least squares (OLS) regression”. This process involves calculating the sum of the squared differences between the predicted values and the actual values for each data point, and then finding the line that minimizes this sum of squared errors.

The best-fitting line is found by minimizing the residual sum of squares (RSS), which is the sum of the squared differences between the predicted values and the actual values. This is achieved by adjusting the values of the intercept and slope coefficients, denoted here as c and m, respectively.

Once the values of c and m are determined, we can use the linear regression equation to make predictions for new data points. The equation for a simple linear regression model can be written as:

y = c + m * x

Here, y is the dependent variable (the variable we are trying to predict), x is the independent variable (the predictor or feature), c is the intercept term (the value of y when x is zero), and m is the slope coefficient (the change in y for a unit change in x).

In multiple linear regression, the equation would have more independent variables, and the slope coefficients for each variable would be included in the equation.

Overall, finding the best-fitting line in a linear regression model is critical for accurate predictions and is achieved by minimizing the residual sum of squares using the OLS regression method.
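For simple linear regression, OLS even has a closed-form solution; here is a minimal sketch with made-up data:

import numpy as np

# Made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.4, 10.1])

# Closed-form OLS estimates:
#   m = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)**2)
#   c = y_mean - m * x_mean
x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
c = y_mean - m * x_mean
print(f"c (intercept) ≈ {c:.2f}, m (slope) ≈ {m:.2f}")   # c ≈ 0.19, m ≈ 2.01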

Cost function

The cost function, also known as the loss function or objective function, is a measure of how well the model is performing. In linear regression, the cost function is used to calculate the difference between the predicted values and the actual values, also known as the residuals or errors.

The goal of linear regression is to minimize the cost function, which is achieved by finding the best-fitting line that minimizes the sum of the squared differences between the predicted values and the actual values. The most commonly used cost function in linear regression is the Mean Squared Error (MSE), which is calculated as the average of the squared differences between the predicted and actual values:

MSE = (1/n) * Σ(yi - ŷi)²

Here, n is the number of data points, yi is the actual value, and ŷi is the predicted value.
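A minimal implementation of this cost function, assuming equal-length arrays of actual and predicted values:

import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error: average of the squared residuals
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

print(mse([3, 5, 7], [2.5, 5.5, 8]))   # residuals 0.5, -0.5, -1.0 -> MSE = 0.5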

The goal of the model is to find the values of the intercept and slope coefficients, b0 and b1, that minimize the MSE. This is typically done using an optimization algorithm, such as gradient descent, which iteratively adjusts the values of the coefficients to minimize the cost function.

Other cost functions that can be used in linear regression include Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Huber Loss, each with its own advantages and disadvantages. The choice of cost function depends on the specific problem and the requirements of the model.

Gradient Descent

Gradient Descent is an optimization algorithm used to minimize the cost function in linear regression models. The goal of gradient descent is to find the values of the intercept and slope coefficients, b0 and b1, that minimize the cost function by iteratively adjusting the values of these coefficients.

The basic idea behind gradient descent is to calculate the gradient of the cost function, which is the direction of steepest descent, and update the values of the coefficients in the opposite direction of the gradient. The learning rate, also known as the step size, determines how large of a step is taken in each iteration.

The algorithm works as follows:

  1. Initialize the values of b0 and b1 to random values.
  2. Calculate the predicted values for the given input data using the current values of b0 and b1.
  3. Calculate the cost function using the predicted values and the actual values.
  4. Calculate the gradient of the cost function with respect to b0 and b1.
  5. Update the values of b0 and b1 by taking a step in the opposite direction of the gradient, with a step size determined by the learning rate.
  6. Repeat steps 2–5 until the cost function is minimized or a maximum number of iterations is reached.

The choice of the learning rate is critical in gradient descent, as a learning rate that is too large can cause the algorithm to overshoot the minimum and a learning rate that is too small can cause the algorithm to converge slowly.

There are different variations of gradient descent, including batch gradient descent, stochastic gradient descent, and mini-batch gradient descent, each with its own advantages and disadvantages. The choice of gradient descent algorithm depends on the specific problem and the requirements of the model.
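Here is a minimal sketch of batch gradient descent for simple linear regression, assuming synthetic data, a hand-picked learning rate, and a fixed number of iterations:

import numpy as np

# Synthetic data: y is roughly 4 + 3 * x plus noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 2, size=100)
y = 4 + 3 * x + rng.normal(scale=0.5, size=x.shape)

b0, b1 = 0.0, 0.0      # step 1: initialize the coefficients (zeros here for simplicity)
lr = 0.1               # learning rate (step size)
n = len(x)

for _ in range(2000):                      # step 6: repeat for a fixed number of iterations
    y_pred = b0 + b1 * x                   # step 2: predictions with current coefficients
    error = y_pred - y                     # step 3 uses these residuals (MSE cost)
    grad_b0 = (2 / n) * np.sum(error)      # step 4: dMSE/db0
    grad_b1 = (2 / n) * np.sum(error * x)  # step 4: dMSE/db1
    b0 -= lr * grad_b0                     # step 5: move opposite the gradient
    b1 -= lr * grad_b1

print(f"b0 ≈ {b0:.2f}, b1 ≈ {b1:.2f}")     # should approach 4 and 3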

Model Performance

Model performance in linear regression can be evaluated using various metrics that measure how well the model fits the data and how well it generalizes to new data. Here are some common metrics used in linear regression:

Example: Suppose we have a linear regression model that predicts housing prices based on the size of the house. We have 10 data points with the following actual prices (y) and predicted prices (y_hat):

y = [100, 150, 200, 250, 300, 350, 400, 450, 500, 550]

y_hat = [110, 140, 180, 240, 290, 320, 380, 420, 480, 520]

y_mean = sum(y) / len(y) = 325

1. Sum of Squares Regression (SSR): SSR is calculated by taking the sum of the squared differences between the predicted values and the mean of the dependent variable. It represents the variability in the dependent variable that is explained by the independent variables. Mathematically, it can be expressed as:

SSR = sum((y_hat - y_mean)²) = 184050

2. Sum of Squares Error (SSE): SSE is calculated by taking the sum of the squared differences between the predicted values and the actual values of the dependent variable. Mathematically, it can be expressed as:

SSE = sum((y - y_hat)²) = 4300

3. Mean Squared Error (MSE): This metric measures the average squared difference between the predicted and actual values. It is calculated as the sum of squared residuals divided by the number of data points. A lower MSE indicates better performance. Mathematically, it can be expressed as:

MSE = (1/n) * sum((y - y_hat)²) = 4300 / 10 = 430

4. Root Mean Squared Error (RMSE): This metric measures the square root of the MSE and is useful for interpreting the errors in the same units as the dependent variable. Mathematically, it can be expressed as:

RMSE = sqrt(MSE) = sqrt(430) ≈ 20.74

5. Mean Absolute Error (MAE): This metric measures the average absolute difference between the predicted and actual values. It is less sensitive to outliers than MSE, but still provides a measure of the model’s accuracy. Mathematically, it can be expressed as:

MAE = (1/n) * sum(abs(y - y_hat)) = 190 / 10 = 19

6. Sum Of Squares Total (SST): SST is calculated by taking the sum of the squared differences between the actual values of the dependent variable and its mean value. Mathematically, it can be expressed as:

SST = sum((y - y_mean)²) = 206250

7. R-Squared (R²): This metric measures the proportion of variance in the dependent variable that is explained by the model. For an OLS model evaluated on its own training data it ranges from 0 to 1, with a higher value indicating better performance. The standard definition (and the one used by scikit-learn's r2_score) is:

R² = 1 - SSE / SST = 1 - 4300 / 206250 ≈ 0.979

When the predictions come from an OLS fit with an intercept, SST = SSR + SSE, so R² can equivalently be written as SSR / SST. The predictions in this example are hand-picked rather than a least-squares fit, so that identity does not hold exactly here (SSR / SST = 184050 / 206250 ≈ 0.892).
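These values are easy to verify in Python; a quick check using NumPy and scikit-learn's metric functions on the example above:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y = np.array([100, 150, 200, 250, 300, 350, 400, 450, 500, 550])
y_hat = np.array([110, 140, 180, 240, 290, 320, 380, 420, 480, 520])
y_mean = y.mean()                                        # 325.0

print("SSR :", np.sum((y_hat - y_mean) ** 2))            # 184050
print("SSE :", np.sum((y - y_hat) ** 2))                 # 4300
print("SST :", np.sum((y - y_mean) ** 2))                # 206250
print("MSE :", mean_squared_error(y, y_hat))             # 430.0
print("RMSE:", np.sqrt(mean_squared_error(y, y_hat)))    # ~20.74
print("MAE :", mean_absolute_error(y, y_hat))            # 19.0
print("R²  :", r2_score(y, y_hat))                       # ~0.979 (i.e. 1 - SSE/SST)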

Multiple Linear Regression using sklearn

Multiple linear regression is a type of regression analysis that models the relationship between multiple independent variables and a single dependent variable. In machine learning, you can implement multiple linear regression using the scikit-learn library in Python. Here is an example of how to do that:

First, you need to import the necessary modules:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import pandas as pd

Then, you need to load your dataset into a pandas DataFrame:

df = pd.read_csv('your_dataset.csv')

Next, you need to separate the independent variables (X) from the dependent variable (y):

X = df[['independent_variable_1', 'independent_variable_2', ...]]
y = df['dependent_variable']

After that, you need to split the dataset into training and testing sets:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Now, you can create a LinearRegression object and fit it to the training data:

regressor = LinearRegression()
regressor.fit(X_train, y_train)

Finally, you can use the trained model to make predictions on the testing data and evaluate its performance using the coefficient of determination (R-squared):

y_pred = regressor.predict(X_test)
score = r2_score(y_test, y_pred)
print('R-squared:', score)
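You can also inspect the fitted intercept and slope coefficients directly on the model:

print('Intercept (b0):', regressor.intercept_)
print('Coefficients (b1, ..., bn):', regressor.coef_)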

That’s it! You have now implemented multiple linear regression using scikit-learn in Python. Note that this is a basic example and you may need to perform additional steps such as data preprocessing, feature scaling, and regularization depending on your specific use case.

Polynomial Regression

Polynomial regression is a type of regression analysis that models the relationship between a dependent variable and one or more independent variables as an nth degree polynomial function. In other words, it extends the linear regression model by adding polynomial terms to the equation. This allows for a more flexible model that can capture nonlinear relationships between the variables.

In polynomial regression, the equation takes the form:

y = b0 + b1*x + b2*x² + … + bn*x^n + e

where y is the dependent variable, x is the independent variable, b0, b1, b2, …, bn are the regression coefficients, n is the degree of the polynomial, and e is the error term.

To implement polynomial regression in machine learning, you can use libraries like scikit-learn in Python. Here is an example of how to do that:

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Load the dataset
X = [[1], [2], [3], [4], [5], [6]]
y = [2, 4, 5, 4, 5, 7]

# Create a polynomial feature object with degree 2
poly = PolynomialFeatures(degree=2)

# Transform the input data to include polynomial terms up to degree 2
X_poly = poly.fit_transform(X)

# Create a linear regression object and fit it to the transformed data
regressor = LinearRegression()
regressor.fit(X_poly, y)

# Make a prediction for a new input value
X_new = poly.transform([[7]])
y_new = regressor.predict(X_new)

# Print the predicted value
print(y_new)

In this example, we first load a dataset consisting of a single independent variable and a single dependent variable. Then, we create a PolynomialFeatures object with degree 2, which transforms the input data to include polynomial terms up to degree 2. We then fit a linear regression model to the transformed data using the LinearRegression class. Finally, we make a prediction for a new input value (7) and print the predicted value.

Assumptions of Linear Regression

The following are the fundamental assumptions of Linear Regression, which can be used to answer the question of whether we can use a linear regression algorithm on a particular dataset.

A linear relationship between features and the target variable: Linear Regression assumes that the relationship between the independent features and the target is linear; it cannot capture other kinds of relationships directly. You may need to transform the data to make the relationship linear (e.g., a log transform for an exponential relationship).

Little or No Multicollinearity between features: Multicollinearity exists when the independent variables are found to be moderately or highly correlated. In a model with correlated variables, it becomes a tough task to figure out the true relationship of predictors with the target variable. In other words, it becomes difficult to find out which variable is actually contributing to predict the response variable.

Little or No Autocorrelation in residuals: The presence of correlation in error terms drastically reduces the model’s accuracy. This usually occurs in time series models where the next instant is dependent on the previous instant. If the error terms are correlated, the estimated standard errors tend to underestimate the true standard error.

No Heteroscedasticity: Non-constant variance in the error terms is called heteroscedasticity. It often arises in the presence of outliers, which receive too much weight and disproportionately influence the model's fit.

Normal distribution of error terms: If the error terms are not normally distributed, confidence intervals may become too wide or too narrow, making it difficult to draw reliable inferences from the least-squares coefficient estimates. Non-normal residuals usually point to a few unusual data points that should be examined closely to build a better model.
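Several of these assumptions can be checked numerically. Here is a brief sketch, assuming the statsmodels and SciPy libraries are available and using synthetic data purely for illustration:

import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

# Synthetic data for illustration: two moderately correlated features
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.5 * x1 + rng.normal(size=200)
X = np.column_stack([x1, x2])
y = 3 + 2 * x1 - x2 + rng.normal(scale=0.5, size=200)

residuals = y - LinearRegression().fit(X, y).predict(X)
X_const = sm.add_constant(X)   # intercept column needed for VIF and Breusch-Pagan

# Multicollinearity: variance inflation factors (values above roughly 5-10 are a warning sign)
print("VIF:", [variance_inflation_factor(X_const, i) for i in range(1, X_const.shape[1])])

# Autocorrelation of residuals: Durbin-Watson statistic (values near 2 indicate little autocorrelation)
print("Durbin-Watson:", durbin_watson(residuals))

# Heteroscedasticity: Breusch-Pagan test (a small p-value suggests non-constant variance)
print("Breusch-Pagan p-value:", het_breuschpagan(residuals, X_const)[1])

# Normality of residuals: Shapiro-Wilk test (a small p-value suggests non-normal errors)
stat, p_value = stats.shapiro(residuals)
print("Shapiro-Wilk p-value:", p_value)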
