Linear regression in R: Interpreting the summary

When performing a linear regression in R, the program outputs a lot of relevant information when you call summary(). In this post we'll go through all the figures and discuss how to interpret them. An Excel sheet containing all the calculations is available here.

To get started, we perform a regression on the Boston data set, which is part of the MASS package:

> library(MASS)
> names(Boston)
[1] "crim" "zn" "indus" "chas" "nox" "rm" "age" 
[8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
> ?Boston

We will try to predict the median value of owner-occupied homes (medv) based only on the crime rate (crim):

> summary(lm(medv ~ crim, data = Boston))

Call:
lm(formula = medv ~ crim, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max
-16.957  -5.449  -2.007   2.512  29.800

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.03311    0.40914   58.74

Based on this output, let's go through all the values and see what they tell us. First let's have a look at a plot of this data:

There's definitely a trend here: as expected, a higher crime rate corresponds to lower house prices, but only up to a point. Beyond a crime rate of about 40, the median house prices remain roughly constant. There is also a more serious problem with our regression: it predicts negative house prices for areas with very high crime rates. We will see how to address these two issues in subsequent posts.
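
The plot and the negative predictions can be reproduced with a few lines of R. This is a minimal sketch, assuming the same model as above; the names fit, zero_crossing and n_negative are chosen here for illustration.

```r
library(MASS)  # provides the Boston data set

fit <- lm(medv ~ crim, data = Boston)

# Scatter plot of the raw data with the fitted regression line on top
plot(Boston$crim, Boston$medv,
     xlab = "Per capita crime rate (crim)",
     ylab = "Median home value in $1000s (medv)")
abline(fit, col = "red")

# Crime rate at which the fitted line crosses zero: -intercept / slope
zero_crossing <- unname(-coef(fit)[1] / coef(fit)[2])

# Number of observations for which the model predicts a negative price
n_negative <- sum(fitted(fit) < 0)

print(zero_crossing)
print(n_negative)
```

Any town with a crime rate above zero_crossing gets a negative predicted medv.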

Call
This line shows the call made to calculate the regression. In this case it is not really helpful, but when the call is more complicated and includes higher-order and interaction terms, it is useful to have it stored somewhere.

Residuals
The residuals are defined as the difference between the actual and the predicted values. If our prediction showed no bias, these residuals would be distributed symmetrically around zero (just as many predictions would be too low as too high). The R output gives a good first impression of this distribution. A good rule of thumb is that the median should be close to zero and the absolute values of the 1Q and 3Q values should be approximately identical. In this case, the residuals are not distributed symmetrically around zero, which indicates problems with the fit.
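
As a quick check, the five numbers in the residuals block can be recomputed by hand. The following is a sketch under the assumption that the model is fitted exactly as above; res and five_num are illustrative names.

```r
library(MASS)  # provides the Boston data set

fit <- lm(medv ~ crim, data = Boston)

# Residual = actual value minus predicted value
res <- Boston$medv - fitted(fit)

# Minimum, first quartile, median, third quartile and maximum
five_num <- quantile(res)
print(round(five_num, 3))
```

The built-in accessor residuals(fit) returns the same values as res.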

Coefficients
These are the estimated coefficients for the intercept and the predictor variables, fitted using least squares, along with their standard errors. If the absolute value of an estimate is not much larger than its standard error, there's a good chance that the real coefficient is actually zero. The t value measures exactly that. It is defined as

 t value = Estimate / Std.Error.

As a rule of thumb, the absolute value of the t value should be larger than 2, and the bigger it is, the better. The p value is the probability, assuming the real coefficient is zero, of observing a t value at least as large in absolute value as the one we got. If this probability is very small, the data provide strong evidence of a relationship between the predictor and the dependent variable.
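
Both columns can be recomputed from the estimates and standard errors. This is a minimal sketch, assuming the p value is the usual two-sided one from a t distribution with the residual degrees of freedom; t_manual and p_manual are illustrative names.

```r
library(MASS)  # provides the Boston data set

fit <- lm(medv ~ crim, data = Boston)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))

# t value = Estimate / Std. Error
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# Two-sided p value: probability of a t value at least this extreme
# if the real coefficient were zero
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)

print(cbind(t_manual, p_manual))
```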

Signif. codes
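The symbols show the significance level. As a general rule of thumb, the p value should be at most 0.05. R marks this by attaching one to three stars to the row of the predictor variable: one (*) for a p value below 0.05, two (**) for a p value below 0.01 and three (***) for a p value below 0.001. A dot (.) marks a p value between 0.05 and 0.1.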
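
The mapping from p values to symbols can be sketched with cut(). This is an illustration using the cutpoints from the legend, not necessarily the exact code R runs internally; stars is an illustrative name.

```r
library(MASS)  # provides the Boston data set

fit <- lm(medv ~ crim, data = Boston)
p_values <- coef(summary(fit))[, "Pr(>|t|)"]

# Assign each p value the symbol of the interval it falls into
stars <- cut(p_values,
             breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
             labels = c("***", "**", "*", ".", " "),
             include.lowest = TRUE)

print(data.frame(p_value = p_values, signif = stars))
```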

Residual standard error and degrees of freedom
The degrees of freedom are the number of observations minus the number of parameters. In our case there are 506 observations and 2 parameters (one for the intercept and one for the crim variable), which leaves 504 degrees of freedom. The more degrees of freedom, the less likely we are to overfit.
The residual standard error is defined as

 RSE = sqrt(RSS/df),

where df is the number of degrees of freedom and RSS is the Residual Sum of Squares (the sum of the squares of the residuals). The residual standard error gives an indication of "how wrong" the prediction is, on average, in the units of the dependent variable.
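
The formula can be verified against R's own result. This is a small sketch; n, p, df, rss and rse are illustrative names.

```r
library(MASS)  # provides the Boston data set

fit <- lm(medv ~ crim, data = Boston)

n  <- nrow(Boston)            # number of observations
p  <- length(coef(fit))       # number of parameters (intercept and crim)
df <- n - p                   # residual degrees of freedom

rss <- sum(residuals(fit)^2)  # Residual Sum of Squares
rse <- sqrt(rss / df)         # residual standard error

print(c(df = df, rse = rse))
```

The same two numbers are available as df.residual(fit) and summary(fit)$sigma.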

(Adjusted) R-squared
The R-squared gives the proportion of the total variation which is explained by the model, i.e. R^2 = 1 - RSS/TSS. RSS is the Residual Sum of Squares, as above. TSS is the Total Sum of Squares, that is the sum of the squared differences between the dependent variable and its mean. TSS measures the variability present in the data.
Adjusted R^2 also incorporates the number of parameters:

 Adj.R^2 = 1 - (RSS/df) / (TSS/(N-1)),

where N is the number of observations (in our case 506) and df is the number of degrees of freedom, as above. If there is no danger of overfitting, the Adjusted R^2 should be very close to the R^2.
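
Both quantities follow directly from the two sums of squares. This is a minimal sketch; rss, tss, r2 and adj_r2 are illustrative names.

```r
library(MASS)  # provides the Boston data set

fit <- lm(medv ~ crim, data = Boston)

n  <- nrow(Boston)       # number of observations
df <- df.residual(fit)   # residual degrees of freedom

rss <- sum(residuals(fit)^2)                     # Residual Sum of Squares
tss <- sum((Boston$medv - mean(Boston$medv))^2)  # Total Sum of Squares

r2     <- 1 - rss / tss
adj_r2 <- 1 - (rss / df) / (tss / (n - 1))

print(c(r2 = r2, adj_r2 = adj_r2))
```

R stores the same values in summary(fit)$r.squared and summary(fit)$adj.r.squared.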

F-statistic
The F-statistic tests the hypothesis that all coefficients except the intercept are zero. This is more useful in the case of multiple regression, where it gives us a general indication of whether the complete model is any good. The F-statistic is calculated with the following formula:

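 F = ((TSS-RSS)/k) / (RSS/(N - k - 1)),

where k is the number of predictor variables (in our case 1) and N is the number of observations.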
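
The F-statistic and its p value can be recomputed in the same way. This is a sketch, taking k to be the number of predictors (the number of coefficients minus the intercept); f_stat and f_p are illustrative names.

```r
library(MASS)  # provides the Boston data set

fit <- lm(medv ~ crim, data = Boston)

n <- nrow(Boston)            # number of observations
k <- length(coef(fit)) - 1   # number of predictors, excluding the intercept

rss <- sum(residuals(fit)^2)                     # Residual Sum of Squares
tss <- sum((Boston$medv - mean(Boston$medv))^2)  # Total Sum of Squares

# Explained variation per predictor, relative to residual variation per
# degree of freedom
f_stat <- ((tss - rss) / k) / (rss / (n - k - 1))

# p value from the F distribution with k and n - k - 1 degrees of freedom
f_p <- pf(f_stat, k, n - k - 1, lower.tail = FALSE)

print(c(f_stat = f_stat, f_p = f_p))
```

The value can be compared with summary(fit)$fstatistic, which holds the statistic together with both degrees of freedom.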
