 Rose Hartman

# Linear Regression in R: Annotated Output

This post shows how to run a linear regression model in R, and how to interpret the output produced, line by line. For more background on linear regression in general, see this post on ordinary linear regression.

## The data

# This is the code to generate the made-up data
n <- 100
mutation_present <- factor(sample(c("Y", "N"), size = n, replace = TRUE))
hazards <- runif(n = n, min = 1, max = 10)
asthma_sx <- rnorm(n = n, mean = 0, sd = 5) + 50 +
ifelse(mutation_present == "N", .5*(hazards - mean(hazards)), 3 + 3*(hazards - mean(hazards)))

asthma <- data.frame(asthma_sx, mutation_present, hazards)


For this example, we’ll be using some made-up data about asthma symptoms. The data include three variables: asthma_sx which gives a rating of severity of symptoms, mutation_present which indicates whether each patient has a genetic mutation associated with asthma (yes or no), and hazards which is a rating of the presence of environmental hazards in the patients home. Here are the first five cases in the data:

asthma_sx mutation_present hazards
48.53436 N 2.130526
64.50491 Y 9.157618
44.69682 N 4.154580
63.81352 Y 9.642411
46.37521 Y 4.044607

## Estimate the model: lm

To estimate a linear model in R, use the function lm. It’s in the stats package, which is included by default in base R and loaded at the beginning of any R session — so you don’t need to install anything or run library(stats) to be able to use its functions; they’re all available to you by default when you open R or RStudio.

### Prep your variables for use in the model

Important note: The lm function pays attention to the types of variables you include (e.g. numeric vs. factor) as they are encoded in your data frame. Factors will automatically be dummy-coded with the first level as the reference group. If you have categorical variables in your model, check first that they are correctly labeled as factors in your data with the str command:

str(asthma)

'data.frame':   100 obs. of  3 variables:
$asthma_sx : num 48.5 64.5 44.7 63.8 46.4 ...$ mutation_present: Factor w/ 2 levels "N","Y": 1 2 1 2 2 2 2 1 2 1 ...
$hazards : num 2.13 9.16 4.15 9.64 4.04 ...  Looks good. I see mutation_present is showing up as a factor, with levels “N” and “Y”. Since “N” is the first level, that will be the reference value in the dummy coding. It is a good idea to center any continuous predictors before running your model, to make the tests of your regression coefficients more informative (depending on the other predictors in your model it may also reduce multicollinearity, win-win!). Our only continuous predictor here is hazards. We’ll center it now. # center hazards by subtracting the mean asthma$hazards <- asthma$hazards - mean(asthma$hazards, na.rm = TRUE)


### Run the model

To run a regression model in R, provide the model formula and the data frame as arguments to the lm function. The formula is of the pattern outcome ~ predictor1 + predictor2 + predictor3.... In this case, we want to run a model that includes the main effects of hazards and mutation_present as well as their interaction. You can write this out as hazards + mutation_present + hazards:mutation_present, or you can use the shorthand * to specify both main effects and their interaction: hazards * mutation_present. The order of predictors in the model does not matter.

If you just run the lm function itself, R will give you only the bare coefficient estimates as output:

lm(asthma_sx ~ hazards * mutation_present, data = asthma)

Call:
lm(formula = asthma_sx ~ hazards * mutation_present, data = asthma)

Coefficients:
(Intercept)                    hazards
48.9800                     0.6995
mutation_presentY  hazards:mutation_presentY
3.6679                     2.3404


In most cases, you will want additional information, such as the significance tests for each coefficient and overall model fit statistics. To get that information, and other useful diagnostics, save the output from lm as an object, which we can then use as input to other functions.

# save a model object
model <- lm(asthma_sx ~ hazards * mutation_present, data = asthma)


## Annotated output

The information most people want from their model can be obtained from the summary function:

summary(model)

Call:
lm(formula = asthma_sx ~ hazards * mutation_present, data = asthma)

Residuals:
Min       1Q   Median       3Q      Max
-17.4616  -2.9869  -0.1139   3.2278  10.0346

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)                48.9800     0.7107  68.914  < 2e-16 ***
hazards                     0.6995     0.2789   2.508  0.01382 *
mutation_presentY           3.6679     1.0259   3.575  0.00055 ***
hazards:mutation_presentY   2.3404     0.4040   5.794 8.72e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.123 on 96 degrees of freedom
Multiple R-squared:  0.5746,    Adjusted R-squared:  0.5613
F-statistic: 43.22 on 3 and 96 DF,  p-value: < 2.2e-16


We’ll step through this output line by line now and talk about what each piece means.

### Call

The first lines of the summary output just repeat your model back to you. This is a good place to double-check that you ran the model you intended to, but it doesn’t give you any new information.

### Residuals

The next bit of output shows the quartiles for your model’s residuals. In a well-fitting linear model, the residuals should be normally distributed around 0. This is hard to judge from just the quartiles, but you might notice if, for example, the negative (or positive) end of the residuals extends much farther than the positive (negative) end, indicating negative (positive) skew and/or outliers, and a potential problem with your model specification. The residuals here look roughly symmetrical. Good.

### Coefficients

The next part of the summary output is the coefficients table. Each row is a coefficient from your model, and the columns show the estimate (the same as the coefficient estimates R gives us when we just run lm without summarizing), the standard error of the estimate, the t-value (which is just the estimate divided by its SE), and the p-value associated with that t-value (the probability of getting a t-value more extreme than the observed t-value if the null hypothesis were true). For very small p-values, it will just print “< 2e-16”. Finally, there is a significance code for each coefficient using asterisks to indicate how small the p-value is. The key for the significance codes is printed beneath the coefficients table: Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1. A p-value between 0 and 0.001 will get ***, a p-value between 0.001 and 0.01 will get **, etc. The significance codes are redundant with the information in the Pr(>|t|) column. They just provide a quick visual summary.

#### Intercept

The first coefficient will always be the intercept. The interpretation of the intercept depends on what predictors are in the model and how they are coded, but it is always the outcome estimate when all predictors = 0. In this case, that would mean we would predict an asthma severity score of 48.98 for a patient with a hazard score of 0 (since we centered hazards, that would be a patient with average environmental hazards), and without the genetic mutation (mutation_present = “No”). The significance test tells us that estimate is significantly different from zero.

In this situation, as is often the case, the estimate for the intercept is not particularly meaningful — the ratings of asthma severity run from 31.89 to 73.64, so pretty much any estimate would be significantly different from zero1.

#### Main effects

The next coefficient in our model is hazards. Because we have a dummy-coded categorical variable in the model, this is the estimate for the effect of environmental hazards for the reference group, i.e. those without the genetic mutation. We can interpret it as follows in APA style: For each unit increase in the presence of environmental hazards, there is an estimated 0.7 unit increase in patients’ asthma symptom severity for the group without the genetic predisposition, t(96)=2.51, p=.014.

The next coefficient is mutation_presentY. Because lm uses dummy coding for factors by default, R prints the name of the variable (mutation_present) with the level being tested (Y) against the reference group. Here we only have two levels in our factor, but if there were more, we would see a coefficient for each level other than the reference group.

Because there is an interaction between mutation_present and hazards in our model, the estimate for mutation_present is for patients with hazards=0, which is patients with an average amount of environmental hazards since we centered hazards before estimating the model. We can interpret this coefficient as follows: For patients with an average amount of environmental hazards, there is a 3.67 unit increase in asthma symptoms associated with having the genetic mutation, t(96)=3.58, p<.001.

#### Interaction

The final coefficient in the coefficients table is the interaction between hazards and mutation_present. Again, because mutation_present is dummy-coded, R indicates that this is the estimate for the interaction relative to the reference group. In this case, that means there is an estimated 2.34 unit increase in the slope of hazards for patients who have the genetic mutation relative to those who do not2. In other words, the effect of environmental hazards on severity of asthma symptoms is significantly stronger for patients who have the genetic mutation relative to those who do not, b=2.34, t(96)=5.79, p<.001.

## Residual standard error

The first of the final three lines of summary output gives something called the “residual standard error”, but that’s actually a quirk in R; it should really be labeled “residual standard deviation” instead, but it’s been mislabeled so long it would confuse users to change it now. The residual standard deviation is the standard deviation of the model’s residuals — it is the average distance between the model’s predicted asthma severity score for a patient and that patient’s actual score. It gives you a rough sense of how accurately your model is able to predict your outcome.

The residual degrees of freedom (96, in this case) is the total sample size minus the number of estimated coefficients for the model including the intercept. Here we have 100 patients in the data and we estimate 4 coefficients in the model, so the residual df is 100 - 4 = 96. This df is what is used for the t-tests for each coefficient.

## Multiple R-squared

The second to last line in the summary output gives the multiple R-squared for the model. It tells you how well your model as a whole is able to predict variance in the outcome variable. It can be interpreted as a percentage of variance — here, the model explains 57.5% of the variance in asthma symptom severity. R also prints “adjusted R-squared”, which is similar but applies a penalty for models with a lot of predictors relative to the number of observations. For more information on the distinction between R-squared and adjusted R-squared, see this post.

## F-statistic

The final line of the summary output gives the information for the F-test for the overall model. This tests whether R-squared is significantly different from zero — in other words, does your model explain a significant amount of variance in the outcome? Note that if any of your coefficients are significant, than this overall F-test will also be significant. When you have a model with only one predictor, the F-test here and the t-test for that predictor are equivalent. You can report R-squared and its F-test like this: The model explains a significant amount of variance in asthma symptom severity, R2=0.57, F(3,96)=43.22, p<.001.

## Bonus: Other things you can do with the model object

There are several other useful things you can do with a model object saved from the lm function in R. Try playing around with the following:

plot(model) # useful diagnostic plots to assess your model for problems, especially outliers

hist(residuals(model)) # a histogram of your model's residuals

predict(model) # get the predicted values for each observation in your data

predict(model, new_data) # get the predictions your model makes on a new dataset called "new_data"

1. If you want to make your model intercept more informative, you can center your outcome before running the model — then testing whether the intercept estimate is significantly different from zero tells you whether the intercept is significantly different from the mean of your outcome for a patient with 0 for all of your predictor variables.

2. Interactions are notoriously tricky to interpret, so don’t worry if you don’t understand the impact of an interaction just by looking at the coefficient estimates — few people do. Instead, try plotting the lines of best fit for different sub-groups of the data to help you understand how the estimates change for different the levels of your predictors.