Regression

sai krishna
6 min read · Jul 3, 2021

===⇒Episode-1⇐===

Concept of Regression in Machine Learning

Hello Everyone!!!

“GitHub has recently launched an AI pair programmer called GitHub Copilot that helps you write better code.” GitHub Copilot draws context from the code you’re working on, suggesting whole lines or entire functions. It helps you quickly discover alternative ways to solve problems, write tests, and explore new APIs without having to tediously tailor a search for answers on the internet. As you type, it adapts to the way you write code, helping you complete your work faster.

Here I have come up with a vital concept in Machine Learning that is used predominantly across industry sectors. As I said in the earlier episode of this blog series, I will be explaining the concepts of Supervised Learning. Here comes the first topic: Regression is a statistical method for determining the relationship between dependent and independent variables, and it is used in finance, weather forecasting, and other disciplines. I am going to explain the concepts of Linear Regression, Multiple Linear Regression, Polynomial Regression, and Multivariate Regression.

Linear Regression

Linear Regression is usually the first machine learning algorithm that every data scientist comes across. It is a simple model, but everyone needs to master it, as it lays the foundation for other machine learning algorithms.

What?

A simple Linear Regression is a supervised learning algorithm used to predict a dependent variable with the help of an independent variable using the best-fit line. It works on the principle of ordinary least squares (OLS), i.e., minimizing the mean squared error (MSE). In statistics, OLS is a method to estimate the unknown parameters of the linear regression function; its goal is to minimize the sum of squared differences between the observed values of the dependent variable in the given data set and those predicted by the linear regression function.
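
To make the OLS/MSE objective concrete, here is a minimal sketch of computing the mean squared error for a candidate line (my own illustration; the toy numbers are made up, not from this post):

import numpy as np
def mse(y_true, y_pred):
    # Mean squared error: the average of the squared residuals
    return np.mean((y_true - y_pred) ** 2)
# Toy data and a candidate line y = b0 + b1*x (values are hypothetical)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.2, 2.8, 3.6, 4.1])
b0, b1 = 1.5, 0.65
print(mse(y, b0 + b1 * x))  # OLS chooses the b0, b1 that minimize this quantity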

Basic concepts and mathematics

There are two kinds of variables in a linear regression model:

  • The input or predictor variable is the variable (or variables) that helps predict the value of the output variable. It is commonly referred to as X.
  • The output variable is the variable that we want to predict. It is commonly referred to as Y.

For instance, suppose we want to predict a car’s service cost (dependent variable) based on the kilometers driven (independent variable). This can easily be found from the data using the Linear Regression algorithm.

Let’s look at the theoretical aspect: let the n observations be (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ) pairs of predictors and responses.

Actual LR equation: y = b0 + b1*x + e
Fitted model: Yₑ = b0 + b1*x
Estimated parameters:
b1 = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
b0 = ȳ − b1*x̄

where

x = independent variable
y = dependent variable
b0 = intercept
b1 = slope, gradient, or coefficient

When?

  1. It is used to predict future values of continuous data.
  2. The data should be linear, with a single independent variable and a single dependent variable (a quick check is sketched below).
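
As a quick, informal way to check the linearity assumption before fitting (a sketch of my own, with made-up numbers), you can inspect the Pearson correlation coefficient:

import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
r = np.corrcoef(x, y)[0, 1]  # Pearson correlation between x and y
print(f'correlation = {r:.3f}')  # values near +1 or -1 suggest a linear trend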

How?

Fundamental process of Machine Learning model
# Import necessary libraries
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
# Generate 'random' data
np.random.seed(0)
# Array of 100 values with mean = 1.5, stddev = 2.5
x = 2.5 * np.random.randn(100) + 1.5
# Generate 100 residual terms
res = 0.5 * np.random.randn(100)
# Actual values of Y
Y = 2 + 0.3 * x + res
# Create a pandas DataFrame to store our x and Y values
df = pd.DataFrame({'x': x, 'Y': Y})
# Show the first five rows of our DataFrame
df.head()
Random Data generated from the above code

To estimate Y using the OLS method, we need to calculate xmean and ymean, the covariance of x and Y (xycov), and the variance of x (xvar) before we can determine the values of b0 and b1.

# Calculate the means of x and Y
xmean = np.mean(x)
ymean = np.mean(Y)
# Calculate the terms needed for the numerator and denominator of b1
df['xycov'] = (df['x'] - xmean) * (df['Y'] - ymean)
df['xvar'] = (df['x'] - xmean)**2
# Calculate b1 (slope) and b0 (intercept)
b1 = df['xycov'].sum() / df['xvar'].sum()
b0 = ymean - (b1 * xmean)
print(f'b0 = {b0}')
print(f'b1 = {b1}')
Estimated parameters

Great, we now have estimates for b0 and b1! Our model can be written as Yₑ = 1.99 + 0.284x, and we can make predictions:

ypred = b0 + b1 * x
# Plot regression against actual data
plt.figure(figsize=(12, 6))
plt.plot(x, ypred) # regression line
plt.plot(x, Y, 'ro') # scatter plot showing actual data
plt.title('Actual vs Predicted')
plt.xlabel('x')
plt.ylabel('Y')
plt.show()
Actual vs predicted plot

The blue line is our line of best fit, Yₑ = 1.99 + 0.284x. We can see from this graph that there is a positive linear relationship between x and Y. Using our model, we can predict Y for any value of x!
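
As a sanity check (my addition, assuming scikit-learn is installed), the same parameters can be recovered with scikit-learn’s LinearRegression:

# Verify the hand-rolled OLS estimates against scikit-learn
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(df[['x']], df['Y'])            # scikit-learn expects a 2-D feature array
print(model.intercept_, model.coef_[0])  # should match b0 and b1 from above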

Multiple Linear Regression

What?

Multiple Linear Regression is also a supervised algorithm, used to predict the dependent variable with the help of two or more independent variables using the best fit (geometrically, a plane or hyperplane rather than a line). When there is only one feature, it is called univariate (simple) Linear Regression; when there are multiple features, it is called Multiple Linear Regression.

Formula: Ypred = b0 + b1*x1 + b2*x2 + … + bn*xn + e

For instance, we want to predict the price of a house (dependent variable) with the help of the room size, number of rooms, and location of the house (independent variables).
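
A minimal sketch of fitting such a model on made-up data with scikit-learn (the two features and their coefficients below are hypothetical, purely for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression
rng = np.random.default_rng(0)
X = rng.random((100, 2))   # two hypothetical features, e.g. room size and no. of rooms
y = 50 + 30 * X[:, 0] + 10 * X[:, 1] + rng.normal(0, 1, 100)  # made-up price
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # recovers b0 and the b1, b2 coefficients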

When?

  1. Multiple Linear Regression is also used to predict future values of continuous data.
  2. The data should be linear, with multiple independent variables and a single dependent variable.

Polynomial Regression

What?

Polynomial Regression is a special case that deals with non-linear data: it fits the data by modeling the relationship between the target variable y and the independent variable x as an nth-degree polynomial equation.

Formula: Ypred = b0 + b1*x + b2*(x**2) + … + bn*(x**n) + e
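
A minimal sketch using numpy.polyfit on made-up quadratic data (my own illustration; the coefficients are arbitrary):

import numpy as np
np.random.seed(0)
x = np.linspace(-3, 3, 50)
y = 2 + 0.5 * x + 1.5 * x**2 + np.random.randn(50)  # noisy quadratic data
b2, b1, b0 = np.polyfit(x, y, deg=2)  # polyfit returns the highest-degree coefficient first
ypred = b0 + b1 * x + b2 * x**2
print(b0, b1, b2)  # should be close to 2, 0.5, and 1.5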

When?

  1. Polynomial Regression is used to predict future values of continuous non-linear data.
  2. The data should be non-linear, with a single dependent variable; although there is a single underlying independent variable, its polynomial terms act as multiple features.

Multivariate Regression

What?

Multivariate Regression is a special case that models more than one target variable at once. The outcome variables should be at least moderately correlated for the multivariate regression analysis to make sense.

Example: A researcher has collected data on three psychological variables, four academic variables (standardized test scores), and the type of educational program for 500 high school students. She is interested in how the set of psychological variables relates to the academic variables and the type of program the student is in.
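
A minimal sketch of a multivariate (multi-output) fit with scikit-learn on synthetic data loosely shaped like that example (my addition; all numbers are made up for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression
rng = np.random.default_rng(0)
X = rng.random((500, 4))                  # e.g. four academic variables (hypothetical)
B = rng.random((4, 3))                    # made-up true coefficients for three targets
Y = X @ B + rng.normal(0, 0.1, (500, 3))  # e.g. three psychological variables
model = LinearRegression().fit(X, Y)      # one fit, three response variables
print(model.coef_.shape)                  # (3, 4): a coefficient row per target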

When?

  1. Multivariate Regression is used to predict multiple response variables with the help of multiple predictors.

There are many more important small concepts running behind each algorithm, and it’s hard to cover every algorithm in full detail in a single episode. I will try to cover those principal concepts separately, outside this series. In this episode of the blog series, I have covered the basic intuition behind supervised learning’s regression algorithms, and the implementation of those algorithms is here. I am not going to add any references because most of the technical content has been taken from my master’s curriculum material. See you next week with episode #02, coming up with an interesting and frequently used algorithm. Until then, Happy Learning!

You can also email me directly or find me on LinkedIn. I’d love to hear from you if I can help you or your team with machine learning.

“Don’t dream to join a tech giant company, dream to make a company gigantic.” — Sai


sai krishna

A Machine Learning technology researcher commencing a new blog series to make ML concepts simple and easy for everyone.