Understanding residuals in statistics

A residual is the difference between the actual (observed) value of a dependent variable y and the value of y predicted by a model.

Residual = actual Y - predicted Y

ri = yi - ŷi

When you have a negative residual, it means the predicted value is too high. A positive residual means your predicted value is too low. An ordinary least squares regression picks the line that minimises the sum of the squared residuals.
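To make the sign convention concrete, here is a minimal Python sketch. The function names `residual` and `interpret` and the sample numbers are made up for illustration.

```python
def residual(actual, predicted):
    """Residual = actual (observed) value minus predicted value."""
    return actual - predicted


def interpret(r):
    """Describe what the sign of a residual says about the prediction."""
    if r > 0:
        return "prediction too low"
    if r < 0:
        return "prediction too high"
    return "prediction exact"


# Illustrative numbers only (not from a real dataset).
print(interpret(residual(actual=70, predicted=65)))  # positive residual
print(interpret(residual(actual=60, predicted=62)))  # negative residual
```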

Concept of residuals

As you might know, the goal of a linear regression is to predict the value of a dependent variable based on the value of an independent variable.

The regression line gives a predicted value for each observation in the dataset, but it’s very unlikely that the line will pass exactly through every value you’re observing.

Residuals are the differences between the predicted and the observed values.

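When we plot the observed values together with the fitted regression line, each residual is the vertical distance between an observed point (actual y) and the line (predicted y).

There is one residual per observation, so the example below, which has 8 observations, has 8 residual values. For a least squares line with an intercept, the sum and mean of the residuals are zero, or very close to zero once the coefficients have been rounded.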
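The claim about the sum of the residuals can be checked with a short sketch. It assumes the standard closed-form least squares formulas for a line with an intercept. The helper name `fit_line` and the four data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for a simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up toy data, used only to illustrate the property.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print(len(residuals))              # one residual per observation
print(abs(sum(residuals)) < 1e-9)  # sum is zero up to floating-point error
```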

Example of calculating residuals

Let’s see an example of predicting a person’s height based on their age. Here’s a case of simple linear regression to calculate the residuals.

Suppose you have a dataset of 8 individuals with their respective ages (in years) and heights (in inches):

Person  Age  Height
1       20   60
2       25   67
3       30   57
4       35   63
5       40   72
6       45   73
7       50   65
8       55   72

Using statistical software, we found the following linear regression equation, where x is the person’s age:

Predicted Y = 54.42 + 0.31(x)
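If you don’t have statistical software at hand, the coefficients can be reproduced with a few lines of plain Python. This is a sketch that assumes an ordinary least squares fit, computed with the closed-form formulas for slope and intercept. The variable names are my own. The unrounded intercept comes out at roughly 54.43, which agrees with the equation above to within rounding in the last digit.

```python
# Ages (years) and heights (inches) of the 8 individuals in the example.
ages = [20, 25, 30, 35, 40, 45, 50, 55]
heights = [60, 67, 57, 63, 72, 73, 65, 72]

n = len(ages)
mean_age = sum(ages) / n
mean_height = sum(heights) / n

# Closed-form least squares estimates for a simple linear regression.
slope = (
    sum((x - mean_age) * (y - mean_height) for x, y in zip(ages, heights))
    / sum((x - mean_age) ** 2 for x in ages)
)
intercept = mean_height - slope * mean_age

print(f"Predicted Y = {intercept:.2f} + {slope:.2f}(x)")
```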

Knowing this, we can start calculating the residual values. For example, to calculate the residual of Person 1:

Predicted Y = 54.42 + 0.31(20) = 60.62

Residual = 60 - 60.62 = -0.62

You can then calculate the rest of the residual values using the same equation:

Person  Age  Height  Predicted Height  Residual
1       20   60      60.62             -0.62
2       25   67      62.17              4.83
3       30   57      63.72             -6.72
4       35   63      65.27             -2.27
5       40   72      66.82              5.18
6       45   73      68.37              4.63
7       50   65      69.92             -4.92
8       55   72      71.47              0.53
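The same table can be generated programmatically. The sketch below plugs each age into the rounded equation from this example and subtracts the prediction from the observed height. The function name `predict_height` and the `rows` list are my own choices.

```python
# Ages (years) and heights (inches) of the 8 individuals in the example.
ages = [20, 25, 30, 35, 40, 45, 50, 55]
heights = [60, 67, 57, 63, 72, 73, 65, 72]


def predict_height(age):
    """Rounded regression equation used in this example."""
    return 54.42 + 0.31 * age


rows = []
print("Person  Age  Height  Predicted  Residual")
for person, (age, height) in enumerate(zip(ages, heights), start=1):
    predicted = predict_height(age)
    residual = height - predicted  # actual minus predicted
    rows.append((person, age, height, round(predicted, 2), round(residual, 2)))
    print(f"{person:<7} {age:<4} {height:<7} {predicted:<10.2f} {residual:.2f}")
```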

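The sum of the residuals is 0.64 and the average (mean) is 0.08. They are not exactly zero only because the coefficients in the equation were rounded to two decimal places.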
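As a quick arithmetic check, the sum and mean can be recomputed from the residual column of the table above. This is only a sketch, and the variable names are my own.

```python
# Residuals copied from the table in this example.
residuals = [-0.62, 4.83, -6.72, -2.27, 5.18, 4.63, -4.92, 0.53]

total = sum(residuals)
mean = total / len(residuals)

print(f"sum = {total:.2f}, mean = {mean:.2f}")
```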

Summary

A residual is the difference between a predicted value of a dependent variable and the actual observed value of that variable.

Residuals provide valuable diagnostic information about the regression model’s goodness of fit, assumptions, and potential areas for improvement.

They help assess the reliability and validity of the regression analysis, enabling researchers and analysts to make informed decisions based on the model’s performance and suitability for the data at hand.

With this tutorial, you’ve learned how to calculate the residual for each observation in a linear regression.

I hope this tutorial helps. Happy analyzing!
