Multivariate Regression

The bivariate OLS tutorial covers most of the details of model building and output, so this tutorial is comparatively short. It addresses some additional details about multivariate OLS models.

We'll begin by generating some fake data involving a few covariates. We'll then generate two outcomes, one that is a simple linear function of the covariates and one that involves an interaction.

set.seed(50)
n <- 200
x1 <- rbinom(n, 1, 0.5)
x2 <- rnorm(n)
x3 <- rnorm(n, 0, 4)
y1 <- x1 + x2 + x3 + rnorm(n)
y2 <- x1 + x2 + x3 + 2 * x1 * x2 + rnorm(n)

Now we can see how to model each of these processes.

Regression formulae for multiple covariates

As covered in the formulae tutorial, we can easily represent a multivariate model using a formula just like we did for a bivariate model. For example, a bivariate model might look like:

y1 ~ x1
## y1 ~ x1
## <environment: 0x000000001c3d67b0>

And a multivariate model would look like:

y1 ~ x1 + x2 + x3
## y1 ~ x1 + x2 + x3
## <environment: 0x000000001c3d67b0>

To include the interaction we use the * operator: in a formula, x1 * x2 expands to x1 + x2 + x1:x2, adding both main effects along with their interaction (the : operator alone would add only the interaction term). Listing the main effects explicitly alongside x1 * x2 is therefore redundant, but it specifies the same model:

y1 ~ x1 * x2 + x3
## y1 ~ x1 * x2 + x3
## <environment: 0x000000001c3d67b0>
y1 ~ x1 + x2 + x1 * x2 + x3
## y1 ~ x1 + x2 + x1 * x2 + x3
## <environment: 0x000000001c3d67b0>

The order of variables in a regression formula doesn't matter. Generally, R will print the regression results in the order in which the variables are listed in the formula, but there are exceptions. For example, all interactions are listed after the main effects, as we'll see below.
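We can see both the expansion of * and this ordering by inspecting the term labels that R constructs from a formula, using the built-in terms function:

attr(terms(y1 ~ x1 * x2 + x3), "term.labels")
## [1] "x1"    "x2"    "x3"    "x1:x2"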

Regression estimation

Estimating a multivariate model is just like estimating a bivariate model:

lm(y1 ~ x1 + x2 + x3)
## 
## Call:
## lm(formula = y1 ~ x1 + x2 + x3)
## 
## Coefficients:
## (Intercept)           x1           x2           x3  
##       0.111        0.924        1.012        0.993
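Here the variables live in the global environment, so lm finds them there. In most real analyses the variables would instead be columns of a data frame passed through lm's data argument; a minimal sketch (dat and m1_df are just illustrative names):

dat <- data.frame(y1, x1, x2, x3)
m1_df <- lm(y1 ~ x1 + x2 + x3, data = dat)

This produces identical estimates but keeps the model tied to a specific dataset.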

We can do the same with an interaction model. Note that we are modeling y1, which was generated without an interaction, so we should expect the interaction coefficient to be small:

lm(y1 ~ x1 * x2 + x3)
## 
## Call:
## lm(formula = y1 ~ x1 * x2 + x3)
## 
## Coefficients:
## (Intercept)           x1           x2           x3        x1:x2  
##       0.116        0.917        0.892        0.992        0.228

By default, the estimated coefficients are printed to the console. If we want to do anything else with the model, we need to store it as an object so that we can apply further functions to it:

m1 <- lm(y1 ~ x1 + x2 + x3)
m2 <- lm(y1 ~ x1 * x2 + x3)

Extracting coefficients

To obtain just the coefficients themselves, we can use the coef function applied to the model object:

coef(m1)
## (Intercept)          x1          x2          x3 
##      0.1106      0.9236      1.0119      0.9932
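Because the result is a named vector, individual estimates can be pulled out by name, and confidence intervals for the coefficients are available through confint:

coef(m1)["x2"]
confint(m1, "x2")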

Similarly, we can use the residuals function to extract the model residuals. We'll just list the first 15 here:

residuals(m1)[1:15]
##        1        2        3        4        5        6        7        8 
##  0.11464 -0.51139 -0.53711 -0.39099 -0.05157  0.46566 -0.11753 -1.25015 
##        9       10       11       12       13       14       15 
## -1.03919 -0.32588 -0.97016  0.89039  0.30830 -1.58698  0.83643
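The complementary fitted function returns the model's predicted value for each observation; fitted values and residuals add up to the observed outcome:

fitted(m1)[1:5]
all.equal(fitted(m1) + residuals(m1), y1, check.attributes = FALSE)
## [1] TRUE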

The model objects also include all of the data used to estimate the model in a sub-object called model. Let's look at its first few rows:

head(m1$model)
##        y1 x1        x2      x3
## 1   4.012  1 -1.524406   4.436
## 2 -10.837  0  0.043114 -10.551
## 3  -2.991  0  0.084210  -2.668
## 4  -7.166  1  0.354005  -8.223
## 5   4.687  1  1.104567   2.604
## 6   4.333  0  0.004345   3.778

There are lots of other things stored in a model object that don't concern us right now, but that you could see with str(m1) or str(m2).
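For a quicker overview than str, names simply lists the components of the model object:

names(m1)
##  [1] "coefficients"  "residuals"     "effects"       "rank"
##  [5] "fitted.values" "assign"        "qr"            "df.residual"
##  [9] "xlevels"       "call"          "terms"         "model"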

Regression summaries

Much more information about a model becomes available when we use the summary function:

summary(m1)
## 
## Call:
## lm(formula = y1 ~ x1 + x2 + x3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5808 -0.6384 -0.0632  0.5966  2.8379 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.1106     0.0976    1.13     0.26    
## x1            0.9236     0.1431    6.45  8.4e-10 ***
## x2            1.0119     0.0716   14.13  < 2e-16 ***
## x3            0.9932     0.0168   59.01  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.01 on 196 degrees of freedom
## Multiple R-squared:  0.951,  Adjusted R-squared:  0.951 
## F-statistic: 1.28e+03 on 3 and 196 DF,  p-value: <2e-16

Indeed, as with a bivariate model, a complete representation of the regression results is printed to the console, including coefficients, standard errors, t-statistics, p-values, some summary statistics about the regression residuals, and various model fit statistics. The summary object itself can be saved and its elements extracted from it:

s1 <- summary(m1)

A look at the structure of s1 shows that there is considerable detail stored in the summary object:

str(s1)
## List of 11
##  $ call         : language lm(formula = y1 ~ x1 + x2 + x3)
##  $ terms        :Classes 'terms', 'formula' length 3 y1 ~ x1 + x2 + x3
##   .. ..- attr(*, "variables")= language list(y1, x1, x2, x3)
##   .. ..- attr(*, "factors")= int [1:4, 1:3] 0 1 0 0 0 0 1 0 0 0 ...
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:4] "y1" "x1" "x2" "x3"
##   .. .. .. ..$ : chr [1:3] "x1" "x2" "x3"
##   .. ..- attr(*, "term.labels")= chr [1:3] "x1" "x2" "x3"
##   .. ..- attr(*, "order")= int [1:3] 1 1 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: 0x000000001c3d67b0> 
##   .. ..- attr(*, "predvars")= language list(y1, x1, x2, x3)
##   .. ..- attr(*, "dataClasses")= Named chr [1:4] "numeric" "numeric" "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:4] "y1" "x1" "x2" "x3"
##  $ residuals    : Named num [1:200] 0.1146 -0.5114 -0.5371 -0.391 -0.0516 ...
##   ..- attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...
##  $ coefficients : num [1:4, 1:4] 0.1106 0.9236 1.0119 0.9932 0.0976 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "(Intercept)" "x1" "x2" "x3"
##   .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
##  $ aliased      : Named logi [1:4] FALSE FALSE FALSE FALSE
##   ..- attr(*, "names")= chr [1:4] "(Intercept)" "x1" "x2" "x3"
##  $ sigma        : num 1.01
##  $ df           : int [1:3] 4 196 4
##  $ r.squared    : num 0.951
##  $ adj.r.squared: num 0.951
##  $ fstatistic   : Named num [1:3] 1278 3 196
##   ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
##  $ cov.unscaled : num [1:4, 1:4] 0.009406 -0.009434 -0.000255 0.000116 -0.009434 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "(Intercept)" "x1" "x2" "x3"
##   .. ..$ : chr [1:4] "(Intercept)" "x1" "x2" "x3"
##  - attr(*, "class")= chr "summary.lm"

This includes all of the details that were printed to the console, each of which we can extract separately, such as the coefficients:

coef(s1)
##             Estimate Std. Error t value   Pr(>|t|)
## (Intercept)   0.1106    0.09757   1.133  2.584e-01
## x1            0.9236    0.14311   6.454  8.374e-10
## x2            1.0119    0.07159  14.135  9.787e-32
## x3            0.9932    0.01683  59.008 9.537e-127
s1$coefficients
##             Estimate Std. Error t value   Pr(>|t|)
## (Intercept)   0.1106    0.09757   1.133  2.584e-01
## x1            0.9236    0.14311   6.454  8.374e-10
## x2            1.0119    0.07159  14.135  9.787e-32
## x3            0.9932    0.01683  59.008 9.537e-127
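Because this is a matrix, individual columns such as the standard errors or p-values can be extracted by name:

coef(s1)[, "Std. Error"]
## (Intercept)          x1          x2          x3 
##     0.09757     0.14311     0.07159     0.01683
coef(s1)[, "Pr(>|t|)"]
## (Intercept)          x1          x2          x3 
##   2.584e-01   8.374e-10   9.787e-32  9.537e-127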

Model fit statistics:

s1$sigma
## [1] 1.006
s1$r.squared
## [1] 0.9514
s1$adj.r.squared
## [1] 0.9506
s1$fstatistic
## value numdf dendf 
##  1278     3   196
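Note that s1$fstatistic stores the statistic and its degrees of freedom but not the p-value that summary prints; the p-value can be recovered with pf:

pf(s1$fstatistic["value"], s1$fstatistic["numdf"], s1$fstatistic["dendf"],
   lower.tail = FALSE)

This reproduces the p-value reported in the printed summary (here vanishingly small).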

And so forth. Being able to extract these details is useful when we want to output our results to another format, such as Word, LaTeX, or something else.

Interaction Model Output

The output from a model that includes interactions is essentially the same as for a model without any interactions, but note that the interaction coefficients are printed at the end of the output:

coef(m2)
## (Intercept)          x1          x2          x3       x1:x2 
##      0.1161      0.9172      0.8923      0.9925      0.2278
s2 <- summary(m2)
s2
## 
## Call:
## lm(formula = y1 ~ x1 * x2 + x3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6387 -0.6031 -0.0666  0.6153  2.8076 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.1161     0.0973    1.19     0.23    
## x1            0.9172     0.1426    6.43  9.5e-10 ***
## x2            0.8923     0.1035    8.62  2.2e-15 ***
## x3            0.9925     0.0168   59.17  < 2e-16 ***
## x1:x2         0.2278     0.1428    1.59     0.11    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1 on 195 degrees of freedom
## Multiple R-squared:  0.952,  Adjusted R-squared:  0.951 
## F-statistic:  967 on 4 and 195 DF,  p-value: <2e-16
coef(s2)
##             Estimate Std. Error t value   Pr(>|t|)
## (Intercept)   0.1161    0.09725   1.194  2.340e-01
## x1            0.9172    0.14261   6.431  9.536e-10
## x2            0.8923    0.10346   8.625  2.225e-15
## x3            0.9925    0.01677  59.171 1.549e-126
## x1:x2         0.2278    0.14282   1.595  1.123e-01
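One useful way to read these estimates: the coefficient on x2 by itself is the slope of x2 when x1 == 0, while the slope when x1 == 1 adds the interaction coefficient:

coef(m2)["x2"]                       # slope of x2 when x1 == 0
coef(m2)["x2"] + coef(m2)["x1:x2"]   # slope of x2 when x1 == 1 (0.8923 + 0.2278)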

Plots of Multivariate OLS Models

As with bivariate models, we can easily plot our observed data pairwise:

plot(y1 ~ x2)


And we can overlay a predicted-value line on that kind of plot using the intercept and the x2 coefficient (which implicitly holds x1 and x3 at zero):

plot(y1 ~ x2)
abline(m1$coef[1], m1$coef[3])

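A more flexible approach, sketched here, uses predict with a grid of x2 values while holding the other covariates at their means (nd is just an illustrative name for the prediction data):

nd <- data.frame(x1 = mean(x1),
                 x2 = seq(min(x2), max(x2), length.out = 100),
                 x3 = mean(x3))
plot(y1 ~ x2)
lines(nd$x2, predict(m1, newdata = nd))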

We can also plot the model residuals against the included covariates:

layout(matrix(1:2, nrow = 1))
plot(m1$residuals ~ x1)
plot(m1$residuals ~ x2)

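R additionally supplies standard diagnostic plots (residuals vs. fitted values, a normal Q-Q plot, scale-location, and residuals vs. leverage) by calling plot directly on the model object:

layout(matrix(1:4, nrow = 2))
plot(m1)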

For more details on plotting regressions, see the section of tutorials on Regression Plotting.