Linear Regression in R: Annotated Output
This post shows how to run a linear regression model in R, and how to interpret the output produced, line by line. For more background on linear regression in general, see this post on ordinary linear regression.
```r
# This is the code to generate the made-up data
n <- 100
mutation_present <- factor(sample(c("Y", "N"), size = n, replace = TRUE))
hazards <- runif(n = n, min = 1, max = 10)
asthma_sx <- rnorm(n = n, mean = 0, sd = 5) + 50 +
  ifelse(mutation_present == "N",
         .5 * (hazards - mean(hazards)),
         3 + 3 * (hazards - mean(hazards)))
asthma <- data.frame(asthma_sx, mutation_present, hazards)
```
For this example, we’ll be using some made-up data about asthma symptoms. The data include three variables:
asthma_sx which gives a rating of severity of symptoms,
mutation_present which indicates whether each patient has a genetic mutation associated with asthma (yes or no), and
hazards which is a rating of the presence of environmental hazards in the patient's home.
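Because the data are randomly generated, your exact values will differ from the ones shown in this post, but you can always preview the first few cases with head (this snippet assumes the asthma data frame created above):

```r
# peek at the first five cases
head(asthma, n = 5)
```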
Estimate the model:
To estimate a linear model in R, use the function lm. It's in the stats package, which is included by default in base R and loaded at the beginning of any R session, so you don't need to install anything or run library(stats) to be able to use its functions; they're all available to you by default when you open R or RStudio.
Prep your variables for use in the model
Important note: The lm function pays attention to the types of variables you include (e.g. numeric vs. factor) as they are encoded in your data frame. Factors will automatically be dummy-coded with the first level as the reference group. If you have categorical variables in your model, check first that they are correctly labeled as factors in your data with the str function:

```r
str(asthma)
```

```
'data.frame':	100 obs. of  3 variables:
 $ asthma_sx       : num  48.5 64.5 44.7 63.8 46.4 ...
 $ mutation_present: Factor w/ 2 levels "N","Y": 1 2 1 2 2 2 2 1 2 1 ...
 $ hazards         : num  2.13 9.16 4.15 9.64 4.04 ...
```
Looks good. I see mutation_present is showing up as a factor, with levels "N" and "Y". Since "N" is the first level, that will be the reference value in the dummy coding.
It is a good idea to center any continuous predictors before running your model, to make the tests of your regression coefficients more informative (depending on the other predictors in your model it may also reduce multicollinearity, win-win!). Our only continuous predictor here is hazards. We'll center it now.
```r
# center hazards by subtracting the mean
asthma$hazards <- asthma$hazards - mean(asthma$hazards, na.rm = TRUE)
```
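To see what centering does, here is a minimal sketch with a toy vector (not the asthma data): subtracting the mean shifts the values so their mean is zero, without changing their spread.

```r
x <- c(2, 5, 9)
x_centered <- x - mean(x, na.rm = TRUE)

mean(x_centered)                  # 0 (up to floating-point error)
all.equal(sd(x), sd(x_centered))  # TRUE: the spread is unchanged
```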
Run the model
To run a regression model in R, provide the model formula and the data frame as arguments to the lm function. The formula follows the pattern outcome ~ predictor1 + predictor2 + predictor3 .... In this case, we want to run a model that includes the main effects of hazards and mutation_present as well as their interaction. You can write this out as hazards + mutation_present + hazards:mutation_present, or you can use the shorthand * to specify both main effects and their interaction: hazards * mutation_present. The order of predictors in the model does not matter.
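If you want to convince yourself that the two specifications are identical, you can fit both and compare the coefficients (assuming the asthma data frame from above):

```r
# the long-hand and short-hand formulas produce the same fitted model
m_long  <- lm(asthma_sx ~ hazards + mutation_present + hazards:mutation_present,
              data = asthma)
m_short <- lm(asthma_sx ~ hazards * mutation_present, data = asthma)

all.equal(coef(m_long), coef(m_short))  # TRUE
```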
If you just run the lm function itself, R will give you only the bare coefficient estimates as output:

```r
lm(asthma_sx ~ hazards * mutation_present, data = asthma)
```

```
Call:
lm(formula = asthma_sx ~ hazards * mutation_present, data = asthma)

Coefficients:
              (Intercept)                    hazards  
                  48.9800                     0.6995  
        mutation_presentY  hazards:mutation_presentY  
                   3.6679                     2.3404  
```
In most cases, you will want additional information, such as the significance tests for each coefficient and overall model fit statistics. To get that information, and other useful diagnostics, save the output from lm as an object, which we can then use as input to other functions.

```r
# save a model object
model <- lm(asthma_sx ~ hazards * mutation_present, data = asthma)
```
The information most people want from their model can be obtained from the summary function:

```r
summary(model)
```

```
Call:
lm(formula = asthma_sx ~ hazards * mutation_present, data = asthma)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.4616  -2.9869  -0.1139   3.2278  10.0346 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                48.9800     0.7107  68.914  < 2e-16 ***
hazards                     0.6995     0.2789   2.508  0.01382 *  
mutation_presentY           3.6679     1.0259   3.575  0.00055 ***
hazards:mutation_presentY   2.3404     0.4040   5.794 8.72e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.123 on 96 degrees of freedom
Multiple R-squared:  0.5746,	Adjusted R-squared:  0.5613 
F-statistic: 43.22 on 3 and 96 DF,  p-value: < 2.2e-16
```
We’ll step through this output line by line now and talk about what each piece means.
The first lines of the summary output just repeat your model back to you. This is a good place to double-check that you ran the model you intended to, but it doesn't give you any new information.
The next bit of output shows the quartiles for your model’s residuals. In a well-fitting linear model, the residuals should be normally distributed around 0. This is hard to judge from just the quartiles, but you might notice if, for example, the negative (or positive) end of the residuals extends much farther than the positive (negative) end, indicating negative (positive) skew and/or outliers, and a potential problem with your model specification. The residuals here look roughly symmetrical. Good.
The next part of the summary output is the coefficients table. Each row is a coefficient from your model, and the columns show the estimate (the same as the coefficient estimates R gives us when we just run lm without summarizing), the standard error of the estimate, the t-value (which is just the estimate divided by its SE), and the p-value associated with that t-value (the probability of getting a t-value more extreme than the observed t-value if the null hypothesis were true). For very small p-values, it will just print "< 2e-16". Finally, there is a significance code for each coefficient using asterisks to indicate how small the p-value is. The key for the significance codes is printed beneath the coefficients table: Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1. A p-value between 0 and 0.001 will get ***, a p-value between 0.001 and 0.01 will get **, etc. The significance codes are redundant with the information in the Pr(>|t|) column. They just provide a quick visual summary.
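As a check on how the Pr(>|t|) column is computed, you can recover a printed p-value from its t-value and the residual degrees of freedom using the t distribution (here plugging in the hazards row from the output above):

```r
# two-sided p-value for t = 2.508 on 96 residual degrees of freedom
t_val <- 2.508
2 * pt(abs(t_val), df = 96, lower.tail = FALSE)  # ~0.0138, matching the table
```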
The first coefficient will always be the intercept. The interpretation of the intercept depends on what predictors are in the model and how they are coded, but it is always the estimate of the outcome when all predictors = 0. In this case, that means we would predict an asthma severity score of 48.98 for a patient with a hazards score of 0 (since we centered hazards, that would be a patient with average environmental hazards) and without the genetic mutation (mutation_present = "N"). The significance test tells us that estimate is significantly different from zero.

In this situation, as is often the case, the estimate for the intercept is not particularly meaningful: the ratings of asthma severity run from 31.89 to 73.64, so pretty much any estimate would be significantly different from zero.¹
The next coefficient in our model is hazards. Because we have a dummy-coded categorical variable in the model, this is the estimate for the effect of environmental hazards for the reference group, i.e. those without the genetic mutation. We can interpret it as follows in APA style: For each unit increase in the presence of environmental hazards, there is an estimated 0.7 unit increase in patients' asthma symptom severity for the group without the genetic predisposition, t(96)=2.51, p=.014.
The next coefficient is mutation_presentY. Because lm uses dummy coding for factors by default, R prints the name of the variable (mutation_present) with the level being tested (Y) against the reference group. Here we only have two levels in our factor, but if there were more, we would see a coefficient for each level other than the reference group.
Because there is an interaction between hazards and mutation_present in our model, the estimate for mutation_present is for patients with hazards = 0, which is patients with an average amount of environmental hazards since we centered hazards before estimating the model. We can interpret this coefficient as follows: For patients with an average amount of environmental hazards, there is a 3.67 unit increase in asthma symptoms associated with having the genetic mutation, t(96)=3.58, p<.001.
The final coefficient in the coefficients table is the interaction between hazards and mutation_present. Again, because mutation_present is dummy-coded, R indicates that this is the estimate for the interaction relative to the reference group. In this case, that means there is an estimated 2.34 unit increase in the slope of hazards for patients who have the genetic mutation relative to those who do not.² In other words, the effect of environmental hazards on severity of asthma symptoms is significantly stronger for patients who have the genetic mutation relative to those who do not, b=2.34, t(96)=5.79, p<.001.
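One practical consequence: the slope of hazards for the mutation group is the sum of the hazards main effect and the interaction term. Using the estimates from the output above:

```r
# slope of hazards for patients with the mutation (mutation_present == "Y"):
# main effect + interaction
0.6995 + 2.3404  # 3.0399 units of symptom severity per unit of hazards
```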
Residual standard error
The first of the final three lines of summary output gives something called the "residual standard error", but that label is actually a quirk in R: it should really be "residual standard deviation", but it has been mislabeled for so long that changing it now would confuse users. The residual standard deviation is the standard deviation of the model's residuals; roughly speaking, it is the typical distance between the model's predicted asthma severity score for a patient and that patient's actual score. It gives you a rough sense of how accurately your model is able to predict your outcome.
The residual degrees of freedom (96, in this case) is the total sample size minus the number of estimated coefficients for the model including the intercept. Here we have 100 patients in the data and we estimate 4 coefficients in the model, so the residual df is 100 - 4 = 96. This df is what is used for the t-tests for each coefficient.
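You can reproduce both numbers from the model object directly (a sketch assuming the model object saved above):

```r
df.residual(model)              # 96: the residual degrees of freedom
rss <- sum(residuals(model)^2)  # residual sum of squares
sqrt(rss / df.residual(model))  # ~5.123, the "residual standard error"
```

The helper sigma(model) returns the same value in one step.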
The second to last line in the summary output gives the multiple R-squared for the model. It tells you how well your model as a whole is able to explain variance in the outcome variable. It can be interpreted as a percentage of variance: here, the model explains 57.5% of the variance in asthma symptom severity. R also prints "adjusted R-squared", which is similar but applies a penalty for models with a lot of predictors relative to the number of observations. For more information on the distinction between R-squared and adjusted R-squared, see this post.
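R-squared can also be computed by hand as 1 minus the ratio of unexplained variation to total variation (again assuming the model object and asthma data frame from above):

```r
rss <- sum(residuals(model)^2)                             # unexplained variation
tss <- sum((asthma$asthma_sx - mean(asthma$asthma_sx))^2)  # total variation
1 - rss / tss  # ~0.5746, the multiple R-squared
```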
The final line of the summary output gives the information for the F-test for the overall model. This tests whether R-squared is significantly different from zero — in other words, does your model explain a significant amount of variance in the outcome? If any of your coefficients are significant, this overall F-test will usually be significant as well. When you have a model with only one predictor, the F-test here and the t-test for that predictor are equivalent. You can report R-squared and its F-test like this: The model explains a significant amount of variance in asthma symptom severity, R2=0.57, F(3,96)=43.22, p<.001.
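The F-statistic is itself a function of R-squared, the number of predictors, and the sample size; plugging in the values from the output above recovers the printed statistic:

```r
# F = (R2 / k) / ((1 - R2) / (n - k - 1)), with k predictors and n observations
R2 <- 0.5746
k  <- 3    # hazards, mutation_presentY, and their interaction
n  <- 100
(R2 / k) / ((1 - R2) / (n - k - 1))  # ~43.2, on 3 and 96 DF
```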
Bonus: Other things you can do with the model object
There are several other useful things you can do with a model object saved from the
lm function in R. Try playing around with the following:
```r
plot(model)              # useful diagnostic plots to assess your model for problems, especially outliers
hist(residuals(model))   # a histogram of your model's residuals
predict(model)           # get the predicted values for each observation in your data
predict(model, new_data) # get the predictions your model makes on a new dataset called "new_data"
```
¹ If you want to make your model intercept more informative, you can center your outcome before running the model. Then the test of whether the intercept is significantly different from zero tells you whether the predicted outcome for a patient with 0 on all of your predictor variables differs significantly from the mean of your outcome.
² Interactions are notoriously tricky to interpret, so don't worry if you don't understand the impact of an interaction just by looking at the coefficient estimates — few people do. Instead, try plotting the lines of best fit for different sub-groups of the data to help you understand how the estimates change at different levels of your predictors.
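One way to plot those lines of best fit is with the ggplot2 package (a sketch, assuming ggplot2 is installed and the asthma data frame from above is available):

```r
library(ggplot2)

# scatterplot with a separate regression line for each mutation group
ggplot(asthma, aes(x = hazards, y = asthma_sx, color = mutation_present)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```

With an interaction in the model, the two lines will have visibly different slopes; without one, they would be parallel.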