The bivariate OLS tutorial covers most of the details of model building and output, so this tutorial is comparatively short. It addresses some additional details about multivariate OLS models.
We'll begin by generating some fake data involving a few covariates. We'll then generate two outcomes, one that is a simple linear function of the covariates and one that involves an interaction.
set.seed(50)  # for reproducibility
n <- 200
x1 <- rbinom(n, 1, 0.5)  # binary covariate
x2 <- rnorm(n)  # standard normal covariate
x3 <- rnorm(n, 0, 4)  # normal covariate with larger variance
y1 <- x1 + x2 + x3 + rnorm(n)  # outcome that is additive in the covariates
y2 <- x1 + x2 + x3 + 2 * x1 * x2 + rnorm(n)  # outcome with an x1-x2 interaction
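Before modeling, it can help to glance at the simulated data; a quick check (output not shown) might be:
# Summarize all five simulated variables at once
summary(data.frame(x1, x2, x3, y1, y2))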
Now we can see how to model each of these processes.
As covered in the formulae tutorial, we can easily represent a multivariate model using a formula just like we did for a bivariate model. For example, a bivariate model might look like:
y1 ~ x1
## y1 ~ x1
## <environment: 0x000000001c3d67b0>
And a multivariate model would look like:
y1 ~ x1 + x2 + x3
## y1 ~ x1 + x2 + x3
## <environment: 0x000000001c3d67b0>
To include the interaction we use the * operator, which expands to the main effects plus their interaction. Alternatively, we can use the : operator, which specifies only the interaction term itself; listing the main effects explicitly alongside it produces the same model:
y1 ~ x1 * x2 + x3
## y1 ~ x1 * x2 + x3
## <environment: 0x000000001c3d67b0>
y1 ~ x1 + x2 + x1:x2 + x3
## y1 ~ x1 + x2 + x1:x2 + x3
## <environment: 0x000000001c3d67b0>
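As a quick check of that equivalence (a minimal sketch; output not shown), we can compare the columns of the design matrices the two formulae generate:
# Both formulae should expand to the same design-matrix columns
colnames(model.matrix(y1 ~ x1 * x2 + x3))
colnames(model.matrix(y1 ~ x1 + x2 + x1:x2 + x3))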
The order of variables in a regression formula doesn't matter. Generally, R will print the regression results in the order that the variables are listed in the formula, but there are exceptions. For example, all interactions are listed after main effects, as we'll see below.
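For example (an illustrative check; output not shown), reordering the right-hand side changes the printed order but not the estimates themselves:
# Same model as y1 ~ x1 + x2 + x3, just listed differently
coef(lm(y1 ~ x3 + x2 + x1))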
Estimating a multivariate model is just like a bivariate model:
lm(y1 ~ x1 + x2 + x3)
##
## Call:
## lm(formula = y1 ~ x1 + x2 + x3)
##
## Coefficients:
## (Intercept) x1 x2 x3
## 0.111 0.924 1.012 0.993
We can do the same with an interaction model:
lm(y1 ~ x1 * x2 + x3)
##
## Call:
## lm(formula = y1 ~ x1 * x2 + x3)
##
## Coefficients:
## (Intercept) x1 x2 x3 x1:x2
## 0.116 0.917 0.892 0.992 0.228
By default, only the call and the estimated coefficients are printed to the console. If we want to do anything else with the model, we need to store it as an object; we can then perform further procedures on it:
m1 <- lm(y1 ~ x1 + x2 + x3)  # additive model
m2 <- lm(y1 ~ x1 * x2 + x3)  # interaction model
To obtain just the coefficients themselves, we can apply the coef function to the model object:
coef(m1)
## (Intercept) x1 x2 x3
## 0.1106 0.9236 1.0119 0.9932
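Because coef returns a named vector, individual coefficients can be pulled out by name or position, for example:
coef(m1)["x2"]  # the coefficient on x2, by name
coef(m1)[1]  # the intercept, by position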
Similarly, we can use the residuals function to see the model residuals. We'll just list the first 15 here:
residuals(m1)[1:15]
## 1 2 3 4 5 6 7 8
## 0.11464 -0.51139 -0.53711 -0.39099 -0.05157 0.46566 -0.11753 -1.25015
## 9 10 11 12 13 14 15
## -1.03919 -0.32588 -0.97016 0.89039 0.30830 -1.58698 0.83643
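Relatedly, the fitted function returns the model's predicted value for each observation; residuals are simply the observed outcomes minus these fitted values. A quick check (output not shown):
# Should return TRUE: residuals equal observed minus fitted values
all.equal(residuals(m1), y1 - fitted(m1))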
The model objects also include all of the data used to estimate the model in a sub-object called model. Let's look at its first few rows:
head(m1$model)
## y1 x1 x2 x3
## 1 4.012 1 -1.524406 4.436
## 2 -10.837 0 0.043114 -10.551
## 3 -2.991 0 0.084210 -2.668
## 4 -7.166 1 0.354005 -8.223
## 5 4.687 1 1.104567 2.604
## 6 4.333 0 0.004345 3.778
There are lots of other things stored in a model object that don't concern us right now, but that you could see with str(m1) or str(m2).
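If the full structure is more than you need, names gives a compact list of the top-level components (output not shown):
# Top-level components of the fitted-model object,
# e.g., "coefficients", "residuals", "fitted.values", "model", ...
names(m1)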
Much more information about a model becomes available when we use the summary function:
summary(m1)
##
## Call:
## lm(formula = y1 ~ x1 + x2 + x3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5808 -0.6384 -0.0632 0.5966 2.8379
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1106 0.0976 1.13 0.26
## x1 0.9236 0.1431 6.45 8.4e-10 ***
## x2 1.0119 0.0716 14.13 < 2e-16 ***
## x3 0.9932 0.0168 59.01 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.01 on 196 degrees of freedom
## Multiple R-squared: 0.951, Adjusted R-squared: 0.951
## F-statistic: 1.28e+03 on 3 and 196 DF, p-value: <2e-16
Indeed, as with a bivariate model, a complete representation of the regression results is printed to the console, including coefficients, standard errors, t-statistics, p-values, some summary statistics about the regression residuals, and various model fit statistics. The summary object itself can be saved and its components extracted:
s1 <- summary(m1)
A look at the structure of s1 shows that there is considerable detail stored in the summary object:
str(s1)
## List of 11
## $ call : language lm(formula = y1 ~ x1 + x2 + x3)
## $ terms :Classes 'terms', 'formula' length 3 y1 ~ x1 + x2 + x3
## .. ..- attr(*, "variables")= language list(y1, x1, x2, x3)
## .. ..- attr(*, "factors")= int [1:4, 1:3] 0 1 0 0 0 0 1 0 0 0 ...
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:4] "y1" "x1" "x2" "x3"
## .. .. .. ..$ : chr [1:3] "x1" "x2" "x3"
## .. ..- attr(*, "term.labels")= chr [1:3] "x1" "x2" "x3"
## .. ..- attr(*, "order")= int [1:3] 1 1 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: 0x000000001c3d67b0>
## .. ..- attr(*, "predvars")= language list(y1, x1, x2, x3)
## .. ..- attr(*, "dataClasses")= Named chr [1:4] "numeric" "numeric" "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:4] "y1" "x1" "x2" "x3"
## $ residuals : Named num [1:200] 0.1146 -0.5114 -0.5371 -0.391 -0.0516 ...
## ..- attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...
## $ coefficients : num [1:4, 1:4] 0.1106 0.9236 1.0119 0.9932 0.0976 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:4] "(Intercept)" "x1" "x2" "x3"
## .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
## $ aliased : Named logi [1:4] FALSE FALSE FALSE FALSE
## ..- attr(*, "names")= chr [1:4] "(Intercept)" "x1" "x2" "x3"
## $ sigma : num 1.01
## $ df : int [1:3] 4 196 4
## $ r.squared : num 0.951
## $ adj.r.squared: num 0.951
## $ fstatistic : Named num [1:3] 1278 3 196
## ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
## $ cov.unscaled : num [1:4, 1:4] 0.009406 -0.009434 -0.000255 0.000116 -0.009434 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:4] "(Intercept)" "x1" "x2" "x3"
## .. ..$ : chr [1:4] "(Intercept)" "x1" "x2" "x3"
## - attr(*, "class")= chr "summary.lm"
This includes all of the details that were printed to the console, each of which we can extract separately, such as the coefficients:
coef(s1)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1106 0.09757 1.133 2.584e-01
## x1 0.9236 0.14311 6.454 8.374e-10
## x2 1.0119 0.07159 14.135 9.787e-32
## x3 0.9932 0.01683 59.008 9.537e-127
s1$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1106 0.09757 1.133 2.584e-01
## x1 0.9236 0.14311 6.454 8.374e-10
## x2 1.0119 0.07159 14.135 9.787e-32
## x3 0.9932 0.01683 59.008 9.537e-127
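Because this coefficient summary is a matrix, individual cells can be extracted by row and column name, for example:
coef(s1)["x2", "Std. Error"]  # standard error for x2
coef(s1)["x3", "Pr(>|t|)"]  # p-value for x3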
Model fit statistics:
s1$sigma
## [1] 1.006
s1$r.squared
## [1] 0.9514
s1$adj.r.squared
## [1] 0.9506
s1$fstatistic
## value numdf dendf
## 1278 3 196
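Note that the stored fstatistic does not include the p-value that summary prints, but we can recompute it from the stored value and degrees of freedom (a minimal sketch; output not shown):
# Upper-tail F probability reproduces the model p-value from the summary
pf(s1$fstatistic["value"], s1$fstatistic["numdf"], s1$fstatistic["dendf"], lower.tail = FALSE)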
And so forth. Being able to extract these details becomes useful when we want to output our results to another format, such as Word, LaTeX, or something else.
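As a minimal sketch of that kind of export (the file name here is purely illustrative), we could write the coefficient table to a CSV file that Word or a spreadsheet program can open:
# Hypothetical file name; writes the coefficient matrix as a table
write.csv(coef(s1), file = "m1-coefficients.csv")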
The output from a model that includes interactions is essentially the same as for a model without any interactions, but note that the interaction coefficients are printed at the end of the output:
coef(m2)
## (Intercept) x1 x2 x3 x1:x2
## 0.1161 0.9172 0.8923 0.9925 0.2278
s2 <- summary(m2)
s2
##
## Call:
## lm(formula = y1 ~ x1 * x2 + x3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6387 -0.6031 -0.0666 0.6153 2.8076
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1161 0.0973 1.19 0.23
## x1 0.9172 0.1426 6.43 9.5e-10 ***
## x2 0.8923 0.1035 8.62 2.2e-15 ***
## x3 0.9925 0.0168 59.17 < 2e-16 ***
## x1:x2 0.2278 0.1428 1.59 0.11
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1 on 195 degrees of freedom
## Multiple R-squared: 0.952, Adjusted R-squared: 0.951
## F-statistic: 967 on 4 and 195 DF, p-value: <2e-16
coef(s2)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1161 0.09725 1.194 2.340e-01
## x1 0.9172 0.14261 6.431 9.536e-10
## x2 0.8923 0.10346 8.625 2.225e-15
## x3 0.9925 0.01677 59.171 1.549e-126
## x1:x2 0.2278 0.14282 1.595 1.123e-01
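Remember that in an interaction model the coefficient on x2 alone is the effect of x2 when x1 is 0; when x1 is 1, the effect of x2 combines two coefficients. A small illustrative calculation:
coef(m2)["x2"]  # effect of x2 when x1 == 0
coef(m2)["x2"] + coef(m2)["x1:x2"]  # effect of x2 when x1 == 1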
As with bivariate models, we can easily plot our observed data pairwise:
plot(y1 ~ x2)
And we can overlay predicted values of the outcome on that kind of plot:
plot(y1 ~ x2)
abline(m1$coef[1], m1$coef[3])  # intercept and the coefficient on x2
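Note that this abline call draws a line from the intercept and the x2 coefficient alone, implicitly setting x1 and x3 to zero. A more careful overlay (a sketch using predict, holding the other covariates at their means) might look like:
plot(y1 ~ x2)
# Predictions across the range of x2, with x1 and x3 held at their means
nd <- data.frame(x1 = mean(x1), x2 = seq(min(x2), max(x2), length.out = 100), x3 = mean(x3))
lines(nd$x2, predict(m1, newdata = nd))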
Or we can plot the model residuals against the included variables:
layout(matrix(1:2, nrow = 1))
plot(m1$residuals ~ x1)
plot(m1$residuals ~ x2)
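R also supplies a standard set of diagnostic plots for any lm object, which we can draw directly:
layout(matrix(1:4, nrow = 2))
plot(m1)  # residuals vs. fitted, Q-Q, scale-location, and residuals vs. leverage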
For more details on plotting regressions, see the section of tutorials on Regression Plotting.