Linear Regression in R: Annotated Output
This post shows how to run a linear regression model in R, and how to interpret the output produced, line by line. For more background on linear regression in general, see this post on ordinary linear regression.
The data
# This is the code to generate the made-up data
n <- 100
mutation_present <- factor(sample(c("Y", "N"), size = n, replace = TRUE))
hazards <- runif(n = n, min = 1, max = 10)
asthma_sx <- rnorm(n = n, mean = 0, sd = 5) + 50 +
  ifelse(mutation_present == "N", .5*(hazards - mean(hazards)), 3 + 3*(hazards - mean(hazards)))
asthma <- data.frame(asthma_sx, mutation_present, hazards)
For this example, we’ll be using some made-up data about asthma symptoms. The data include three variables: asthma_sx, which gives a rating of severity of symptoms; mutation_present, which indicates whether each patient has a genetic mutation associated with asthma (yes or no); and hazards, which is a rating of the presence of environmental hazards in the patient’s home. Here are the first five cases in the data:
asthma_sx   mutation_present   hazards
 48.53436   N                  2.130526
 64.50491   Y                  9.157618
 44.69682   N                  4.154580
 63.81352   Y                  9.642411
 46.37521   Y                  4.044607
Estimate the model: lm
To estimate a linear model in R, use the function lm. It’s in the stats package, which is included by default in base R and loaded at the beginning of any R session — so you don’t need to install anything or run library(stats) to be able to use its functions; they’re all available to you by default when you open R or RStudio.
Prep your variables for use in the model
Important note: The lm function pays attention to the types of variables you include (e.g. numeric vs. factor) as they are encoded in your data frame. Factors will automatically be dummy-coded with the first level as the reference group. If you have categorical variables in your model, check first that they are correctly labeled as factors in your data with the str command:
str(asthma)
'data.frame': 100 obs. of 3 variables:
$ asthma_sx : num 48.5 64.5 44.7 63.8 46.4 ...
$ mutation_present: Factor w/ 2 levels "N","Y": 1 2 1 2 2 2 2 1 2 1 ...
$ hazards : num 2.13 9.16 4.15 9.64 4.04 ...
Looks good. I see mutation_present is showing up as a factor, with levels “N” and “Y”. Since “N” is the first level, that will be the reference value in the dummy coding.
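By default, R orders factor levels alphabetically, which is why “N” comes first here. If you ever want a different reference group, one way (a quick sketch using base R's relevel function) is to reorder the levels before fitting the model:

```r
# make "Y" the reference level instead of "N"
asthma$mutation_present <- relevel(asthma$mutation_present, ref = "Y")
levels(asthma$mutation_present)  # "Y" "N" -- "Y" is now the reference

# switch back to "N" as the reference for the rest of this post
asthma$mutation_present <- relevel(asthma$mutation_present, ref = "N")
```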
It is a good idea to center any continuous predictors before running your model, to make the tests of your regression coefficients more informative (depending on the other predictors in your model, it may also reduce multicollinearity — win-win!). Our only continuous predictor here is hazards. We’ll center it now.
# center hazards by subtracting the mean
asthma$hazards <- asthma$hazards - mean(asthma$hazards, na.rm = TRUE)
Run the model
To run a regression model in R, provide the model formula and the data frame as arguments to the lm function. The formula follows the pattern outcome ~ predictor1 + predictor2 + predictor3... In this case, we want to run a model that includes the main effects of hazards and mutation_present as well as their interaction. You can write this out as hazards + mutation_present + hazards:mutation_present, or you can use the shorthand * to specify both main effects and their interaction: hazards * mutation_present. The order of predictors in the model does not matter.
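One way to convince yourself that the two formulas are equivalent is to fit the model both ways and compare the coefficients:

```r
# fit the interaction model written out in full and with the * shorthand
m_long  <- lm(asthma_sx ~ hazards + mutation_present + hazards:mutation_present, data = asthma)
m_short <- lm(asthma_sx ~ hazards * mutation_present, data = asthma)

# the coefficient estimates are identical
all.equal(coef(m_long), coef(m_short))  # TRUE
```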
If you just run the lm function itself, R will give you only the bare coefficient estimates as output:

lm(asthma_sx ~ hazards * mutation_present, data = asthma)
Call:
lm(formula = asthma_sx ~ hazards * mutation_present, data = asthma)
Coefficients:
(Intercept) hazards
48.9800 0.6995
mutation_presentY hazards:mutation_presentY
3.6679 2.3404
In most cases, you will want additional information, such as the significance tests for each coefficient and overall model fit statistics. To get that information, and other useful diagnostics, save the output from lm as an object, which we can then use as input to other functions.
# save a model object
model <- lm(asthma_sx ~ hazards * mutation_present, data = asthma)
Annotated output
The information most people want from their model can be obtained from the summary function:
summary(model)
Call:
lm(formula = asthma_sx ~ hazards * mutation_present, data = asthma)
Residuals:
     Min       1Q   Median       3Q      Max
-17.4616  -2.9869  -0.1139   3.2278  10.0346
Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)                48.9800     0.7107  68.914  < 2e-16 ***
hazards                     0.6995     0.2789   2.508  0.01382 *
mutation_presentY           3.6679     1.0259   3.575  0.00055 ***
hazards:mutation_presentY   2.3404     0.4040   5.794 8.72e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.123 on 96 degrees of freedom
Multiple R-squared: 0.5746,	Adjusted R-squared: 0.5613
F-statistic: 43.22 on 3 and 96 DF,  p-value: < 2.2e-16
We’ll step through this output line by line now and talk about what each piece means.
Call
The first lines of the summary output just repeat your model back to you. This is a good place to double-check that you ran the model you intended to, but it doesn’t give you any new information.
Residuals
The next bit of output shows the quartiles for your model’s residuals. In a well-fitting linear model, the residuals should be normally distributed around 0. This is hard to judge from just the quartiles, but you might notice if, for example, the negative (or positive) end of the residuals extends much farther than the positive (negative) end, indicating negative (positive) skew and/or outliers, and a potential problem with your model specification. The residuals here look roughly symmetrical. Good.
Coefficients
The next part of the summary output is the coefficients table. Each row is a coefficient from your model, and the columns show the estimate (the same as the coefficient estimates R gives us when we just run lm without summarizing), the standard error of the estimate, the t-value (which is just the estimate divided by its SE), and the p-value associated with that t-value (the probability of getting a t-value more extreme than the observed one if the null hypothesis were true). For very small p-values, it will just print “< 2e-16”. Finally, there is a significance code for each coefficient using asterisks to indicate how small the p-value is. The key for the significance codes is printed beneath the coefficients table: Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1. A p-value between 0 and 0.001 will get ***, a p-value between 0.001 and 0.01 will get **, etc. The significance codes are redundant with the information in the Pr(>|t|) column; they just provide a quick visual summary.
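If you want to see where the t-values and p-values come from, you can reproduce them yourself from the estimates and standard errors, using the model object saved above:

```r
coefs <- summary(model)$coefficients

# t value = estimate / standard error
t_vals <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# two-sided p-value from the t distribution with the residual df
p_vals <- 2 * pt(abs(t_vals), df = df.residual(model), lower.tail = FALSE)

all.equal(t_vals, coefs[, "t value"])   # TRUE
all.equal(p_vals, coefs[, "Pr(>|t|)"])  # TRUE
```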
Intercept
The first coefficient will always be the intercept. The interpretation of the intercept depends on what predictors are in the model and how they are coded, but it is always the estimated outcome when all predictors = 0. In this case, that means we would predict an asthma severity score of 48.98 for a patient with a hazard score of 0 (since we centered hazards, that would be a patient with average environmental hazards), and without the genetic mutation (mutation_present = “N”). The significance test tells us that estimate is significantly different from zero.
In this situation, as is often the case, the estimate for the intercept is not particularly meaningful — the ratings of asthma severity run from 31.89 to 73.64, so pretty much any estimate would be significantly different from zero^{1}.
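You can verify this interpretation by asking the model for a prediction at exactly those values; the prediction for a patient with centered hazards of 0 and no mutation should match the intercept:

```r
# a hypothetical patient with average hazards (centered value 0) and no mutation
new_patient <- data.frame(hazards = 0, mutation_present = "N")

predict(model, newdata = new_patient)  # same value as the intercept
coef(model)["(Intercept)"]
```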
Main effects
The next coefficient in our model is hazards. Because we have a dummy-coded categorical variable in the model, this is the estimate for the effect of environmental hazards for the reference group, i.e. those without the genetic mutation. We can interpret it as follows in APA style: For each unit increase in the presence of environmental hazards, there is an estimated 0.7 unit increase in patients’ asthma symptom severity for the group without the genetic predisposition, t(96)=2.51, p=.014.
The next coefficient is mutation_presentY. Because lm uses dummy coding for factors by default, R prints the name of the variable (mutation_present) with the level being tested (Y) against the reference group. Here we only have two levels in our factor, but if there were more, we would see a coefficient for each level other than the reference group.
Because there is an interaction between mutation_present and hazards in our model, the estimate for mutation_present is for patients with hazards = 0, which is patients with an average amount of environmental hazards, since we centered hazards before estimating the model. We can interpret this coefficient as follows: For patients with an average amount of environmental hazards, there is a 3.67 unit increase in asthma symptoms associated with having the genetic mutation, t(96)=3.58, p<.001.
Interaction
The final coefficient in the coefficients table is the interaction between hazards and mutation_present. Again, because mutation_present is dummy-coded, R indicates that this is the estimate for the interaction relative to the reference group. In this case, that means there is an estimated 2.34 unit increase in the slope of hazards for patients who have the genetic mutation relative to those who do not^{2}. In other words, the effect of environmental hazards on severity of asthma symptoms is significantly stronger for patients who have the genetic mutation relative to those who do not, b=2.34, t(96)=5.79, p<.001.
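A helpful way to unpack the interaction is to compute the slope of hazards for each group directly from the coefficients; the slope for the mutation group is the main effect plus the interaction term:

```r
b <- coef(model)

# slope of hazards for the reference group (no mutation)
slope_N <- unname(b["hazards"])

# slope of hazards for the mutation group = main effect + interaction
slope_Y <- unname(b["hazards"] + b["hazards:mutation_presentY"])

c(N = slope_N, Y = slope_Y)
```

These match the slopes you would get by fitting separate regressions within each group, which is one way to sanity-check your reading of the interaction.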
Residual standard error
The first of the final three lines of summary output gives something called the “residual standard error”, but that’s actually a quirk in R; it should really be labeled “residual standard deviation” instead, but it’s been mislabeled so long it would confuse users to change it now. The residual standard deviation is the standard deviation of the model’s residuals — roughly, the typical distance between the model’s predicted asthma severity score for a patient and that patient’s actual score. It gives you a rough sense of how accurately your model is able to predict your outcome.

The residual degrees of freedom (96, in this case) is the total sample size minus the number of estimated coefficients in the model, including the intercept. Here we have 100 patients in the data and we estimate 4 coefficients in the model, so the residual df is 100 - 4 = 96. This df is what is used for the t-tests for each coefficient.
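You can reproduce both numbers from the model object:

```r
# residual standard error = sqrt(residual sum of squares / residual df)
rse <- sqrt(sum(residuals(model)^2) / df.residual(model))

rse                 # matches "Residual standard error" in the summary output
sigma(model)        # R's built-in accessor returns the same value
df.residual(model)  # 100 observations - 4 coefficients = 96
```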
Multiple Rsquared
The second to last line in the summary output gives the multiple R-squared for the model. It tells you how well your model as a whole is able to predict variance in the outcome variable. It can be interpreted as a percentage of variance — here, the model explains 57.5% of the variance in asthma symptom severity. R also prints “adjusted R-squared”, which is similar but applies a penalty for models with a lot of predictors relative to the number of observations. For more information on the distinction between R-squared and adjusted R-squared, see this post.
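Both versions can be computed by hand from the model’s residuals, which makes the relationship between them concrete:

```r
rss <- sum(residuals(model)^2)                             # residual sum of squares
tss <- sum((asthma$asthma_sx - mean(asthma$asthma_sx))^2)  # total sum of squares

r2 <- 1 - rss / tss
all.equal(r2, summary(model)$r.squared)  # TRUE

# adjusted R-squared penalizes for the number of predictors (p) relative to n
n <- nobs(model)
p <- length(coef(model)) - 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
all.equal(adj_r2, summary(model)$adj.r.squared)  # TRUE
```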
Fstatistic
The final line of the summary output gives the information for the F-test for the overall model. This tests whether R-squared is significantly different from zero — in other words, does your model explain a significant amount of variance in the outcome? Note that if any of your coefficients are significant, this overall F-test will almost always be significant as well. When you have a model with only one predictor, the F-test here and the t-test for that predictor are equivalent. You can report R-squared and its F-test like this: The model explains a significant amount of variance in asthma symptom severity, R^{2}=0.57, F(3,96)=43.22, p<.001.
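The F-statistic and its degrees of freedom are stored in the summary object, and the p-value (which is not stored) can be recomputed from the F distribution:

```r
fstat <- summary(model)$fstatistic
fstat  # value, numdf (model df), dendf (residual df)

# p-value for the overall F-test
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)
```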
Bonus: Other things you can do with the model object
There are several other useful things you can do with a model object saved from the lm function in R. Try playing around with the following:
plot(model) # useful diagnostic plots to assess your model for problems, especially outliers
hist(residuals(model)) # a histogram of your model's residuals
predict(model) # get the predicted values for each observation in your data
predict(model, newdata = new_data) # get the predictions your model makes on a new dataset called "new_data"

If you want to make your model intercept more informative, you can center your outcome before running the model — then testing whether the intercept estimate is significantly different from zero tells you whether the intercept is significantly different from the mean of your outcome for a patient with 0 for all of your predictor variables. ↩

Interactions are notoriously tricky to interpret, so don’t worry if you don’t understand the impact of an interaction just by looking at the coefficient estimates — few people do. Instead, try plotting the lines of best fit for different subgroups of the data to help you understand how the estimates change for the different levels of your predictors. ↩