# Regression

===⇒Episode-1⇐===

Hello Everyone!!!

“GitHub has launched a new AI pair programmer called GitHub Copilot that helps you write better code.” GitHub Copilot draws context from the code you’re working on, suggesting whole lines or entire functions. It helps you quickly discover alternative ways to solve problems, write tests, and explore new APIs without tediously tailoring a search for answers on the internet. As you type, it adapts to the way you write code — to help you complete your work faster.

Here I have come up with a vital concept in Machine Learning that is used predominantly across industries. As I said in the earlier episode of this blog series, I will be explaining the concepts of Supervised Learning. Here comes the first topic: **Regression** is a statistical method to determine the relationship between dependent and independent variables, and it is used in finance, weather forecasting, and other disciplines. I am going to explain the concepts of Linear Regression, Multi-linear Regression, Polynomial Regression, and Multivariate Regression.

**Linear Regression**

Linear Regression is usually the first machine learning algorithm that every data scientist comes across. It is a simple model, but everyone needs to master it, as it lays the foundation for other machine learning algorithms.

*What?*

Simple Linear Regression is a supervised learning algorithm used to predict a dependent variable with the help of an independent variable using the best-fit line. It works on the principle of ordinary least squares (OLS), which minimizes the mean squared error (MSE). In statistics, OLS is a method to estimate the unknown parameters of a linear regression function; its goal is to minimize the sum of the squared differences between the observed dependent variable in the given data set and the values predicted by the linear regression function.
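This minimization idea can be checked numerically. Below is a small sketch (not from the original post, with made-up toy data): the OLS estimates give a smaller sum of squared errors than any nearby line.

```python
import numpy as np

# Toy data: y is roughly 2 + 3x plus noise (made-up numbers for illustration)
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 50)
y = 2 + 3 * x + rng.normal(0, 1, 50)

def sse(b0, b1):
    """Sum of squared errors that OLS minimizes."""
    return np.sum((y - (b0 + b1 * x)) ** 2)

# Closed-form OLS estimates
b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_hat = y.mean() - b1_hat * x.mean()

# Perturbing either parameter away from the OLS solution increases the error
print(sse(b0_hat, b1_hat) < sse(b0_hat, b1_hat + 0.1))   # True
print(sse(b0_hat, b1_hat) < sse(b0_hat + 0.5, b1_hat))   # True
```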

**Basic concepts and mathematics**

There are two kinds of variables in a linear regression model:

- The **input** or **predictor variable** is the variable (or variables) that helps predict the value of the output variable. It is commonly referred to as *X*.
- The **output variable** is the variable that we want to predict. It is commonly referred to as *Y*.

For instance, we may want to predict the car service cost (dependent variable) based on the kilometers driven (independent variable). This can easily be found out with the help of the data using the Linear Regression algorithm.

Let’s see the theoretical aspect: let the n observations be (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ) pairs of predictors and responses. The model fits the line

y = b0 + b1 * x

where

x = independent variable

y = dependent variable

b0 = intercept

b1 = slope or gradient or coefficient

*When?*

- It is used to predict future values of continuous data.
- The data should be linear, with a single independent variable and a single dependent variable.

*How?*

```python
# Import necessary libraries
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Generate 'random' data
np.random.seed(0)

# Array of 100 values with mean = 1.5, stddev = 2.5
x = 2.5 * np.random.randn(100) + 1.5

# Generate 100 residual terms
res = 0.5 * np.random.randn(100)

# Actual values of Y
Y = 2 + 0.3 * x + res

# Create pandas dataframe to store our x and Y values
df = pd.DataFrame({'x': x, 'Y': Y})

# Show the first five rows of our dataframe
df.head()
```

To estimate *Y* using the OLS method, we need to calculate `xmean` and `ymean`, the covariance of *x* and *Y* (`xycov`), and the variance of *x* (`xvar`) before we can determine the values for `b0` and `b1`.

```python
# Calculate the mean of x and Y
xmean = np.mean(x)
ymean = np.mean(Y)

# Calculate the terms needed for the numerator and denominator of b1
df['xycov'] = (df['x'] - xmean) * (df['Y'] - ymean)
df['xvar'] = (df['x'] - xmean)**2

# Calculate b1 and b0
b1 = df['xycov'].sum() / df['xvar'].sum()
b0 = ymean - (b1 * xmean)

print(f'b0 = {b0}')
print(f'b1 = {b1}')
```
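As a sanity check (an addition beyond the original post), the same estimates can be obtained with NumPy's `polyfit`, which solves the same least-squares problem:

```python
import numpy as np

# Recreate the same data as above
np.random.seed(0)
x = 2.5 * np.random.randn(100) + 1.5
res = 0.5 * np.random.randn(100)
Y = 2 + 0.3 * x + res

# Manual OLS estimates, as computed above
xmean, ymean = np.mean(x), np.mean(Y)
b1 = np.sum((x - xmean) * (Y - ymean)) / np.sum((x - xmean) ** 2)
b0 = ymean - b1 * xmean

# np.polyfit with degree 1 fits the same straight line by least squares
slope, intercept = np.polyfit(x, Y, 1)
print(np.isclose(b1, slope), np.isclose(b0, intercept))  # True True
```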

Great, we now have estimates for `b0` and `b1`! Our model can be written as *Yₑ = 1.99 + 0.284x*, and we can make predictions:

`ypred = b0 + b1 * x`

```python
# Make predictions with the fitted model
ypred = b0 + b1 * x

# Plot regression against actual data
plt.figure(figsize=(12, 6))
plt.plot(x, ypred)      # regression line
plt.plot(x, Y, 'ro')    # scatter plot showing actual data
plt.title('Actual vs Predicted')
plt.xlabel('x')
plt.ylabel('Y')
plt.show()
```

The blue line is our line of best fit, *Yₑ = 1.99 + 0.284x*. We can see from this graph that there is a positive linear relationship between x and Y. Using our model, we can predict Y from any value of x!
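To quantify how well the line fits (a small addition beyond the original post), we can compute the coefficient of determination R², the fraction of the variance in Y explained by the model:

```python
import numpy as np

# Recreate the same data and fit as above
np.random.seed(0)
x = 2.5 * np.random.randn(100) + 1.5
res = 0.5 * np.random.randn(100)
Y = 2 + 0.3 * x + res

b1 = np.sum((x - x.mean()) * (Y - Y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = Y.mean() - b1 * x.mean()
ypred = b0 + b1 * x

# R² = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((Y - ypred) ** 2)
ss_tot = np.sum((Y - Y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))
```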

**Multi-linear Regression**

*What?*

Multi-Linear Regression is also a supervised algorithm, used to predict the dependent variable with the help of two or more independent variables using the best-fit line. When there is only one feature, it is called *Univariate* (simple) Linear Regression, and when there are multiple features, it is called *Multiple* Linear Regression.

Formula: Ypred = b0 + b1 * x1 + b2 * x2 + e

For instance, we may want to predict the price of a house (dependent variable) with the help of the room size, the number of rooms, and the location of the house (independent variables).
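A minimal sketch of fitting such a model with NumPy's least-squares solver (the house-price numbers below are made up purely for illustration):

```python
import numpy as np

# Hypothetical data: room size (sq m) and number of rooms -> price
X = np.array([
    [60.0, 2],
    [80.0, 3],
    [100.0, 3],
    [120.0, 4],
    [150.0, 5],
])
y = np.array([30.0, 42.0, 50.0, 62.0, 78.0])

# Add a column of ones so lstsq also estimates the intercept b0
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
b0, b1, b2 = coef

# Predict the price of a 90 sq m, 3-room house
pred = b0 + b1 * 90 + b2 * 3
print(pred)
```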

*When?*

- Multi-Linear Regression is also used to predict future values of continuous data.
- The data should be linear, with multiple independent variables and a single dependent variable.

**Polynomial Regression**

*What?*

Polynomial Regression is a special case that deals with non-linear data: it fits non-linear data by modeling the relationship between the target variable y and the independent variable x as an nth-degree polynomial equation.

Formula: Ypred = b0 + b1 * x + b2 * (x**2) + e

*When?*

- Polynomial Regression is used to predict future values of continuous non-linear data.
- The data should be non-linear, with a single independent variable (expanded into polynomial terms) and a single dependent variable.
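A short sketch using NumPy's `polyfit` with degree 2 to fit the quadratic formula above (the data here is synthetic, generated from known coefficients purely for illustration):

```python
import numpy as np

# Synthetic quadratic data: y = 1.0 + 0.5x + 2.0x² plus noise
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 60)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(0, 0.3, x.size)

# Fit Ypred = b0 + b1*x + b2*x**2 (polyfit returns the highest degree first)
b2, b1, b0 = np.polyfit(x, y, 2)
print(b0, b1, b2)  # estimates should be close to 1.0, 0.5, 2.0
```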

**Multivariate Regression**

*What?*

Multivariate Regression is a special case that estimates a predictive model with more than one target variable. The outcome variables should be at least moderately correlated for the multivariate regression analysis to make sense.

Example: A researcher has collected data on three psychological variables, four academic variables (standardized test scores), and the type of educational program the student is in for 500 high school students. She is interested in how the set of psychological variables is related to the academic variables and the type of program the student is in.

*When?*

- Multivariate Regression is used to predict multiple response variables with the help of one or more predictor variables.
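Echoing the researcher example above, here is a sketch (with synthetic data and made-up coefficients) of fitting several response variables at once: `np.linalg.lstsq` accepts a matrix of targets and fits all of them in one solve.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))            # four "academic" predictor variables

# Each column of B_true holds the coefficients for one of
# three "psychological" response variables (made-up values)
B_true = np.array([[1.0, 0.5, -0.2],
                   [0.3, 1.2, 0.4],
                   [-0.5, 0.1, 0.9],
                   [0.2, -0.3, 0.6]])
Y = X @ B_true + rng.normal(0, 0.1, size=(n, 3))

# One least-squares solve fits all three responses simultaneously
B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(B_hat.shape)  # (4, 3): one coefficient column per response
```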

There are many more small but important concepts running behind each algorithm, and it’s hard to cover each one in complete detail in a single episode. I will try to cover those principal concepts separately, outside this series. In this episode of the blog series, I have covered the basic intuition behind supervised learning’s regression algorithms along with their implementation. I am not adding any references because most of the technical content has been taken from my master’s curriculum material. See you next week with episode #02, coming up with an interesting and frequently used algorithm. Until then, Happy Learning!

You can also email me directly or find me on LinkedIn. I’d love to hear from you if I can help you or your team with machine learning.

“Don’t dream to join a tech giant company, dream to make a company gigantic.” — Sai