Joy Payton

## Linear regression in R

Linear regression is the process of creating a model of how one or more explanatory or independent variables change the value of an outcome or dependent variable, when the outcome variable is not dichotomous (2-valued). When the outcome is dichotomous (e.g. “Male” / “Female”, “Survived” / “Died”, etc.), a logistic regression is more appropriate.

When we first learn linear regression we typically learn ordinary regression (or “ordinary least squares”), where we assert that our outcome variable must vary according to a linear combination of explanatory variables. We end up, in ordinary linear regression, with a straight line through our data. This line has a formula that’s very reminiscent of the line equations we learned in Algebra I as teenagers:

Where $\alpha$ (sometimes $\beta_0$) is the y-intercept, and the other $\beta$ are the coefficients corresponding to each predictor variable. You may also see this formula with an “error term” $\epsilon$ added, which is the “noise” our model can’t account for. This is pretty similar to $y=mx+b$, right?

There’s another type of linear regression with greater flexibility, called GLM, or generalized linear modeling, which we’ll address in a future article. The benefits of GLM include getting a prediction curve that isn’t necessarily a straight line and prediction that makes sense for non-normally distributed outcomes. GLM is very powerful, and deserves its own treatment. For now, we’ll stick with ordinary linear regression, for one predictor variable (simple linear regression) and for multiple predictor variables (multiple linear regression). We’ll address this in two main sections: Simple Linear Regression and Multiple Regression.

## Simple Linear Regression

Let’s take simple linear regression first. What we’re trying to do is reduce the sum of squares of all our residuals (i.e. create a line that passes through our data points as closely as possible). At the end of the day, this is an optimization project that calls for calculus and uses the correlation coefficient $r$. To find out more about the math that calculates the best line (“best” here means with the lowest sum of squared errors), check out the Khan Academy treatment of the subject. We’ll end up with a slope ($r\cdot\frac{s_y}{s_x}$, where s = sample standard deviation) and a point $(\bar{x},\bar{y}$), which, from Algebra I, we know we can put into the formula for a line (the point-slope formula). Usually, you won’t bother calculating all of this yourself, but it’s handy to recognize the relationship between some basic descriptive statistics and how your model is formed.

In R, it’s fairly simple to create a linear model using lm and formula notation, where we model an output ~ input, where we can interpret ~ as “as a function of” . Let’s take the mtcars dataset, which ships with base R, and do a bit of linear modeling:

You should get the following output:

> my_model

Call:
lm(formula = mpg ~ wt, data = mtcars)

Coefficients:
(Intercept)           wt
37.285       -5.344

> summary(my_model)

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
Min      1Q  Median      3Q     Max
-4.5432 -2.3647 -0.1252  1.4096  6.8727

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,	Adjusted R-squared:  0.7446
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10


We can see that summary() gives us a bit more detail than just the slope and intercept: it gives us the statistical significance of these values, as well as our R-squared value, which is a measure of how much of the behavior of the outcome variable is explained by our model. In our case, we have an R-squared value of 0.7528, or, if we’re adjusting for the number of variables used (a good idea, especially since we’ll start doing multiple regression), 0.7446. So we have between 74% and 75% of the variation in mpg (miles per gallon fuel economy) explained by just wt (weight in thousands of pounds). Not too bad! Additionally we see that both our intercept and the coefficient for wt are highly statistically significant. Let’s plot both the points and our model (a straight line with a y-intercept of 37.2851 and a slope of -5.3445):

Or, you could also access the slope and intercept directly, like this:

Either should give you something like the following:

That’s a fairly good model. Visually, we see that it cuts through the points without any clear pattern of residuals (the points are scattered randomly above and below the line). Statistically, we see a good R-squared and convincing p-values. I’d feel comfortable saying this model does a good job of predicting MPG, given automotive weight.

## Multiple Linear Regression

Since mtcars has a lot of variables, let’s not stop at just one explanatory variable.

### Main Effects

Let’s say we want to predict mpg as a function of both weight and transmission type. Since am is really a factor variable (0 = automatic, 1 = manual), we’ll put in as.factor(). This may seem like overkill, as lm will figure out that this is a factor on its own, but it’s always a helpful reminder both for R and for us that we’re dealing with a factor variable.

This is a model that looks at main effects only (weight’s influence alone, considering nothing else, and transmission type alone, considering nothing else).

First of all, does it matter what order we put our explanatory variables? Let’s check, and try both orders. Note that we separate effects with a plus sign (+).

You should get the following output:

There’s no difference in the intercept, coefficients, or statistical metrics. Variable order does not matter in lm.

The three coefficients represent the y-intercept, a tiny amount to subtract if the transmission is a manual (am = 1), and the coefficient for the weight variable. Our formula would be something like:

*(Only if car is manual)

We’d have the same slope, but different y-intercepts (but barely!)

We could plot it like this. Note that the lines are very hard to distinguish as they almost coincide:

### Aside: Variable Order

This aside is optional reading! The reason there's no difference in the lm output is because R is using Type III sum of squares in its calculations for lm, in which sequence does not matter. Compare, however, changing the order with base R's anova: You'll get the following output: Different p-values! Why? Because in anova, R is using Type I (sequential) sum of squares. This is important to keep in mind! If you want Type II or III sum of squares, consider using the car package and its Anovafunction (note the capitalization!): There is an adequate explanation of how effects are calculated in the various types of sum of squares here.

### Interaction Effects

Fine, so it looks like transmission type alone doesn’t have much predictive power (it has a very high p value). But I do think there’s something going on with transmission type. Let me model mpg as a function of the interaction between weight and transmission type. In other words, I think that transmission type influences the relationship between weight and mpg.

We use the colon (:) to indicate interaction:

We get the following:

We see that there does seem to be a steeper drop in MPG for additional weight when am = 1 (the car has a manual transmission). Let’s plot the interaction. Our lines have the same y-intercept, but two different slopes:

### Main Effects + Interaction Effects

What if I want to calculate the influence of weight, transmission type, and the transmission effect on weight’s effect? We can calculate this as mpg ~ weight + as.factor(am) + wt:as.factor(am), or, much more simply, using a “cross”, indicated in our formula by *:

Giving the following output:

The output from this summary is trickier to interpret (but promises to pay off, as we see a very good R-squared). We know that am has two values, 0 and 1, and we see that two of our coefficients list am of 1, so those two coefficients are for when am = 1. But how do I use these coefficients to calculate a prediction or draw a line?

We’ll go one at a time:

• (Intercept) is the base y-intercept, assuming am of 0.
• wt is the slope of a line, assuming am of 0.
• as.factor(am)1 is the additional intercept added when am = 1.
• wt:as.factor(am)1 is the additional slope added when am = 1.

So, we have two lines:

• When am = 0: $mpg=-3.7859\cdot{wt}+31.4161$
• When am = 1: $mpg=-9.084268\cdot{wt}+46.29448$

Let’s plot!

Much better!

## Conclusions

• You can use lm to come up with a fast linear model for a single predictor or multiple predictors.
• You can combine multiple predictors with “+” (add a main effect), “:” (add an interaction effect), or “*” (add main effects and an interaction effect).
• Check your solution both visually (graph it) and statistically (read the output of summary()).

So far, we’ve assumed you knew ahead of time what variables to use to construct your model. But sometimes you might have 50 variables available to you, with no clear hypothesis about which ones to use. You want to use enough information to get a close prediction, but not so much that the model is nearly impossible to use in the real world. In an upcoming article, we’ll talk about how to get a linear model in these conditions, by choosing the most predictive sets of variables.