LINEAR REGRESSION

Gunjan Agicha
3 min read · Sep 24, 2019


We try to find the best-fit line/plane that represents the relationship between the i/p and the o/p. This best-fit plane is built by learning the parameters of a function that maps i/p to o/p.

E(y) = B0 + B1*x

Hence, if we find the parameters B0 and B1, we can predict our output. So machine learning essentially tries to estimate the parameters that minimize the cost function.

Cost function: Mean Squared Error (MSE)

MSE = (1/n) * Σ (y_i − ŷ_i)²

Why do we square the errors? Squaring penalizes predictions that are further away from the actual value more heavily, and it keeps positive and negative errors from canceling each other out.
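
A quick sketch of the idea (the numbers here are made up for illustration): an error of 4 costs sixteen times as much as an error of 1.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 11.0])  # errors: -1, 0, 4
print(mse(y_true, y_pred))           # (1 + 0 + 16) / 3 ≈ 5.67
```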

Aim: find a linear model that minimizes the MSE.
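
A minimal sketch of minimizing the MSE with the closed-form least-squares solution, on synthetic data (the true parameters and noise level are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=50)  # true B0 = 2, B1 = 3, plus noise

# Closed-form least squares: B1 = cov(x, y) / var(x), B0 = mean(y) - B1 * mean(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)  # should come out close to 2 and 3
```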

Assumption: there is a linear relationship between the i/p variables and the o/p.

Correlation coefficient: measures how strongly two variables are linearly related, ranging from −1 to +1.

So for a linear model, a high correlation between the dependent and independent variables means they are strongly linearly related.
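
For example, with NumPy the correlation coefficient can be read off the correlation matrix (x and y here are made-up 1-D arrays of equal length):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # roughly y = 2x

r = np.corrcoef(x, y)[0, 1]
print(r)  # close to +1: a strong positive linear relationship
```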

How to make sense of the equation:

For every 1-unit increase in x, we expect y to increase by B1. For example, if B0 = 2 and B1 = 0.5, then moving from x = 4 to x = 5 raises the expected y from 4 to 4.5.

Multicollinearity: when there are many independent variables, these variables might be related to each other; this is called multicollinearity. It makes the individual coefficient estimates unstable and hard to interpret.
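
One common check for multicollinearity is the variance inflation factor (VIF). A sketch using statsmodels on made-up predictors (a rule of thumb: a VIF above roughly 5-10 signals a problem):

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=100)  # x2 is almost a copy of x1
x3 = rng.normal(size=100)                        # independent of the others

X = np.column_stack([x1, x2, x3])
for i in range(X.shape[1]):
    print(f"VIF for x{i + 1}: {variance_inflation_factor(X, i):.2f}")
# x1 and x2 should show large VIFs; x3 should stay near 1
```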

Null Hypothesis (H0): the currently accepted value of a parameter.

Null hypothesis for regression: there is no relationship between x and y (i.e., B1 = 0).

Regression statistics (a code sketch computing these follows the list):

1. r: correlation coefficient; describes the strength of the relationship between each independent variable and the dependent variable.

The correlation coefficient is given by the Pearson formula:

r = Σ (x_i − x̄)(y_i − ȳ) / √( Σ (x_i − x̄)² · Σ (y_i − ȳ)² )

or, equivalently,

r = cov(x, y) / (σ_x * σ_y)

2. r²: coefficient of determination

r² represents the percent of the total variation in y that is explained by the line of best fit. For example, if r = 0.922, then r² = 0.850, which means that 85% of the total variation in y can be explained by the linear relationship between x and y (as described by the regression equation). The other 15% of the total variation in y remains unexplained.

3. p-value: we use the p-value to see which terms are significant and should be kept in the model. If a term's p-value is less than 0.05, you can reject the null hypothesis for it.

4. standard error: how far away the points are, on average, from the predicted line.

5. t-stat: coefficient / standard error. The larger the absolute t-stat, the less likely the coefficient is zero by chance.

6. Confidence: how confident we are in our decision. If you are more than 95% sure, that's considered good.

Significance: 1 − confidence. So significance < 5% is good.
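
All of the statistics above can be read off a fitted model. A sketch with statsmodels OLS on synthetic data (the data-generating parameters are made up for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(0, 2, size=100)

X = sm.add_constant(x)               # adds the intercept column for B0
results = sm.OLS(y, X).fit()

print(results.rsquared)              # r²: coefficient of determination
print(results.pvalues)               # p-values for B0 and B1
print(results.bse)                   # standard errors of the coefficients
print(results.tvalues)               # t-stats: coefficient / standard error
print(results.conf_int(alpha=0.05))  # 95% confidence intervals
```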

Cross-validation: helps evaluate machine learning models and choose the one with the minimum error. We hold out part of the training data as validation data, on which each model's error is calculated. The best model is then run on the actual test data to determine the test accuracy. This also helps make sure that the model does not overfit.

Types:

a) Hold out: a single train/validation split; prone to sampling bias.

b) k-fold: divide the training data into k sets, train the model on k − 1 sets, and use the remaining set as the validation set; rotate so each set serves as the validation set once, then take the average of all the errors. This greatly reduces selection bias (see the sketch after this list).

c) Leave-one-out CV: a special case of k-fold validation where k = the number of data points. One data point is taken as the validation sample in each of the n cycles, and the errors are then averaged. The con is that it takes a lot of time.

d) Bootstrap: randomly draw N samples from the training set with replacement, train the model on this set, and test it on the left-out data points. Average the error over the test sets.
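
A sketch of k-fold cross-validation with scikit-learn (5 folds; note that scikit-learn reports MSE as a negative score, and the data here is synthetic):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(0, 2, size=100)

# 5-fold CV: train on 4 folds, validate on the 5th, rotate, then average
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_squared_error")
print(-scores.mean())  # average validation MSE across the 5 folds
```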


Gunjan Agicha

MSCS @UTD, Data Science Intern @Autodesk. If you want to connect, send a request here: https://www.linkedin.com/in/gagicha/