A residual is the difference between the actual value of a dependent variable y and its predicted value.
Residual = actual Y - predicted Y
ri = yi - ŷi
A negative residual means the predicted value is too high. A positive residual means the predicted value is too low. An ordinary least-squares linear regression picks the line that minimises the sum of the squared residuals.
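To make the sign convention concrete, here is a minimal sketch. The function names `residual` and `describe` and the sample numbers are made up for illustration; they are not part of any library.

```python
def residual(actual, predicted):
    """Return the residual: the observed value minus the predicted value."""
    return actual - predicted


def describe(r):
    """Translate the sign of a residual into plain words."""
    if r < 0:
        return "prediction too high"
    if r > 0:
        return "prediction too low"
    return "prediction exact"


# Hypothetical values, chosen only to show both signs.
print(residual(70, 68), describe(residual(70, 68)))  # positive residual
print(residual(60, 63), describe(residual(60, 63)))  # negative residual
```

Keeping the order as actual minus predicted matches the formula ri = yi - ŷi, so a prediction that overshoots always gives a negative number.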
Concept of Residuals
As you might know, the goal of a linear regression is to predict the value of a dependent variable based on the value of an independent variable.
The regression line gives a prediction for each observation in the dataset, but it’s very unlikely that the line will match every value you observe.
Residuals are the differences between the predicted and the observed values.
When we plot the observed values together with the fitted regression line, each residual is the vertical distance between the predicted y and the actual y for that observation.
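You get one residual per observation, so the example in this tutorial, with 8 observations, has 8 residual values. For a least-squares line that includes an intercept, the residuals sum to exactly zero, so their mean is zero too. In practice you’ll usually see values near zero, because the coefficients are rounded.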
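If you’re curious why the sum comes out at zero, here is a short sketch of the reasoning for a least-squares line with an intercept. The symbols b_0 (intercept), b_1 (slope) and S (sum of squared residuals) are notation assumed for this sketch; r_i = y_i - ŷ_i is the residual as defined earlier.

```latex
\begin{aligned}
S(b_0, b_1) &= \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2 \\
\frac{\partial S}{\partial b_0}
  &= -2 \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)
   = -2 \sum_{i=1}^{n} r_i = 0
\quad\Longrightarrow\quad
\sum_{i=1}^{n} r_i = 0
\end{aligned}
```

Setting the derivative with respect to the intercept to zero is one of the conditions the best-fitting line has to satisfy, and that condition is the same as saying the residuals cancel out.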
Example of calculating residuals
Let’s see an example of predicting a person’s height based on their age. Here’s a case of simple linear regression to calculate the residuals.
Suppose you have a dataset of 8 individuals with their respective ages (in years) and heights (in inches):
Person | Age | Height |
---|---|---|
1 | 20 | 60 |
2 | 25 | 67 |
3 | 30 | 57 |
4 | 35 | 63 |
5 | 40 | 72 |
6 | 45 | 73 |
7 | 50 | 65 |
8 | 55 | 72 |
Using statistical software, we found the following linear regression equation, where x is the age in years:
Predicted Y = 54.42 + 0.31(x)
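If you don’t have statistical software at hand, the same line can be fitted with the textbook least-squares formulas. This is a sketch in plain Python, not the tool used for this tutorial; the function name `fit_line` is made up for illustration.

```python
# Data from the table above: ages in years, heights in inches.
ages = [20, 25, 30, 35, 40, 45, 50, 55]
heights = [60, 67, 57, 63, 72, 73, 65, 72]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(ages, heights)
print(f"Predicted Y = {intercept:.2f} + {slope:.2f}(x)")
```

The fit gives an intercept of about 54.43 and a slope of about 0.31, which agrees with the equation above to within rounding in the last digit.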
Knowing this, we can start calculating the residual values. For example, to calculate the residual of Person 1:
Predicted Y = 54.42 + 0.31(20) = 60.62
Residual = 60 - 60.62 = -0.62
You can then calculate the rest of the residual values using the same equation:
Person | Age | Height | Predicted Height | Residuals |
---|---|---|---|---|
1 | 20 | 60 | 60.62 | -0.62 |
2 | 25 | 67 | 62.17 | 4.83 |
3 | 30 | 57 | 63.72 | -6.72 |
4 | 35 | 63 | 65.27 | -2.27 |
5 | 40 | 72 | 66.82 | 5.18 |
6 | 45 | 73 | 68.37 | 4.63 |
7 | 50 | 65 | 69.92 | -4.92 |
8 | 55 | 72 | 71.47 | 0.53 |
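The sum of the residuals is 0.64 and the average (mean) is 0.08. These are close to zero rather than exactly zero because the regression coefficients were rounded to two decimal places.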
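The whole table can be reproduced with a few lines of Python. This is a minimal sketch that uses the rounded coefficients 54.42 and 0.31 from the equation above; the variable names are made up for illustration.

```python
# Data from the table above: ages in years, heights in inches.
ages = [20, 25, 30, 35, 40, 45, 50, 55]
heights = [60, 67, 57, 63, 72, 73, 65, 72]

# Rounded coefficients taken from the regression equation in this tutorial.
INTERCEPT, SLOPE = 54.42, 0.31

rows = []
for person, (age, height) in enumerate(zip(ages, heights), start=1):
    predicted = INTERCEPT + SLOPE * age
    resid = height - predicted  # residual = actual - predicted
    rows.append((person, age, height, round(predicted, 2), round(resid, 2)))

print("Person | Age | Height | Predicted Height | Residual")
for row in rows:
    print(" | ".join(str(value) for value in row))

residuals = [row[-1] for row in rows]
total = round(sum(residuals), 2)
mean = round(total / len(residuals), 2)
print("Sum:", total, "Mean:", mean)  # sum 0.64, mean 0.08
```

If you swap in unrounded coefficients, the sum and mean shrink to essentially zero, which is a quick way to check that a fitted line really is the least-squares one.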
Summary
A residual is the difference between a predicted value of a dependent variable and the actual observed value of that variable.
Residuals provide valuable diagnostic information about the regression model’s goodness of fit, assumptions, and potential areas for improvement.
They help assess the reliability and validity of the regression analysis, enabling researchers and analysts to make informed decisions based on the model’s performance and suitability for the data at hand.
With this tutorial, you’ve learned how to calculate the residual for each observation in a linear regression.
I hope this tutorial helps. Happy analyzing!