Regression-related plotting

R's graphical capabilities are very strong. This is particularly helpful when deal with regression.

If we have a regression model, there are a number of ways we can plot the relationships between variables. We can also use plots for checking model specification and assumptions.

Let's start with a basic bivariate regression:

set.seed(1)
x <- rnorm(1000)
y <- 1 + x + rnorm(1000)
ols1 <- lm(y ~ x)

The easiest plot we can draw is the relationship between y and x.

plot(y ~ x)

plot of chunk unnamed-chunk-2

We can also add a line representing this relationship to the plot. To get the coefficients from the model, we can use coef:

coef(ols1)

## (Intercept)           x 
##      0.9838      1.0064

We can then use those coefficients in the line-plotting function abline:

plot(y ~ x)
abline(a = coef(ols1)[1], b = coef(ols1)[2])

plot of chunk unnamed-chunk-4

We can specify “graphical parameters” in both plot and abline to change the look. For example, we could change the color:

plot(y ~ x, col = "gray")
abline(a = coef(ols1)[1], b = coef(ols1)[2], col = "red")

plot of chunk unnamed-chunk-5

We can also use plot to extract several diagnostics for our model. Almost all of these help us to identify outliers or other irregularities. If we type:

plot(ols1)

plot of chunk unnamed-chunk-6

We are given a series of plots describing the model. We can also see two other plots that are not displayed by default. To obtain a given plot, we use the which parameter inside plot:

plot(ols1, which = 4)

plot of chunk unnamed-chunk-7

(1) A residual plot (2) A Quantile-Quantile plot to check the distribution of our residuals (3) A scale-location plot (4) Cook's distance, to identify potential outliers (5) A residual versus leverage plot, to identify potential outliers (6) Cook's distance versus leverage plot

Besides the default plot(ols1, which=1) to get residuals, we can also plot residuals manually:

plot(ols1$residuals ~ x)

plot of chunk unnamed-chunk-8

We might want to do this to check whether another variable should be in our model:

x2 <- rnorm(1000)
plot(ols1$residuals ~ x2)

plot of chunk unnamed-chunk-9

Obviously, in this case x2 doesn't belong in the model. Let's see a case where the plot would help us:

y2 <- x + x2 + rnorm(1000)
ols2 <- lm(y2 ~ x)
plot(ols2$residuals ~ x2)

plot of chunk unnamed-chunk-10

Clearly, x2 is strongly related to our residuals, so it belongs in the model.

We can also use residuals plots to check for nonlinear relationships (i.e., functional form):

y3 <- x + (x^2) + rnorm(1000)
ols3 <- lm(y3 ~ x)
plot(ols3$residuals ~ x)

plot of chunk unnamed-chunk-11

Even though x is in our model, it is not in the correct form. Let's try fixing that and see what happens to our plot:

ols3b <- lm(y3 ~ x + I(x^2))

Note: We need to use the I() operator inside formulae in order to have R generate the x^2 variable! Note (continued): This saves us from having to defined a new variable: xsq <- x^2 and then running the model.

plot(ols3b$residuals ~ x)

plot of chunk unnamed-chunk-13

Clearly, the model now incorporates x in the correct functional form.

Of course, if we had plotted our data originally:

plot(y3 ~ x)

plot of chunk unnamed-chunk-14

We would have seen the non-linear relationship and could have skipped the incorrect model entirely.

Residual plots can also show heteroskedasticity

x3 <- runif(1000, 1, 10)
y4 <- (3 * x3) + rnorm(1000, 0, x3)
ols4 <- lm(y4 ~ x3)
plot(ols4$residuals ~ x3)

plot of chunk unnamed-chunk-15

Here we see that x3 is correctly specified in the model. There is no relationship between x3 and y4. But, the variance of the residuals is much higher at higher levels of x3. We might need to rely on a different estimate of our regression SEs than the default provided by R. And, again, this is a problem we could have identified by plotting our original data:

plot(y4 ~ x3)

plot of chunk unnamed-chunk-16

Multivariate OLS plotting

If our model has more than one independent variable, these plotting tools all still work.

set.seed(1)
x5 <- rnorm(1000)
z5 <- runif(1000, 1, 5)
y5 <- x5 + z5 + rnorm(1000)
ols5 <- lm(y5 ~ x5 + z5)

We can see all six of our diagnostic plots:

plot(ols5, 1:6)

plot of chunk unnamed-chunk-18

We can plot our outcome against the input variables:

plot(y5 ~ x5)

plot of chunk unnamed-chunk-19

plot(y5 ~ z5)

plot of chunk unnamed-chunk-19

We can see residual plots:

plot(ols5$residuals ~ x5)

plot of chunk unnamed-chunk-20

plot(ols5$residuals ~ z5)

plot of chunk unnamed-chunk-20

We might also want to check for colinearity between our input variables. We could do this with cor:

cor(x5, z5)

## [1] 0.03504

Or we could see it visually with a scatterplot:

plot(x5, z5)

plot of chunk unnamed-chunk-22

In either case, there's no relationship.

We can also plot our effects from our model against our input data:

coef(ols5)

## (Intercept)          x5          z5 
##     -0.0639      1.0183      1.0182

Lets plot the two input variables together using layout:

layout(matrix(1:2, ncol = 2))
plot(y5 ~ x5, col = "gray")
abline(a = coef(ols5)[1] + mean(z5), b = coef(ols5)["x5"], col = "red")
plot(y5 ~ z5, col = "gray")
abline(a = coef(ols5)[1] + mean(x5), b = coef(ols5)["z5"], col = "red")

plot of chunk unnamed-chunk-24

Note: We add the expected value of the other input variable so that lines are drawn correctly. If we plot each bivariate relationship separately, we'll see how we get the lines of best fit:

ols5a <- lm(y5 ~ x5)
ols5b <- lm(y5 ~ z5)
layout(matrix(1:2, ncol = 2))
plot(y5 ~ x5, col = "gray")
abline(a = coef(ols5a)[1], b = coef(ols5a)["x5"], col = "red")
plot(y5 ~ z5, col = "gray")
abline(a = coef(ols5b)[1], b = coef(ols5b)["z5"], col = "red")

plot of chunk unnamed-chunk-25

If we regress the residuals from ols5a on z5 we'll see some magic happen. The estimated coefficient for z5 is almost identical to that from our full y5 ~ x5 + z5 model:

tmpz <- lm(ols5a$residuals ~ z5)
coef(tmpz)["z5"]

##    z5 
## 1.017

coef(ols5)["z5"]

##    z5 
## 1.018

The same pattern works if we repeat this process for our x5 input variable:

tmpx <- lm(ols5b$residuals ~ x5)
coef(tmpx)["x5"]

##    x5 
## 1.017

coef(ols5)["x5"]

##    x5 
## 1.018

In other words, the coefficients from our full model ols5 reflect the regression of each input variable… on the residuals of y unexplained by the contribution of the other input variable(s). Let's see this visually by drawing the bivariate regression lines in blue. And then overlapping these with the full model estimates in red:

layout(matrix(1:2, ncol = 2))
plot(y5 ~ x5, col = "gray")
abline(a = coef(ols5a)[1], b = coef(ols5a)["x5"], col = "blue")
abline(a = coef(ols5)[1] + mean(z5), b = coef(ols5)["x5"], col = "red")
plot(y5 ~ z5, col = "gray")
coef(lm(ols5a$residuals ~ z5))["z5"]

##    z5 
## 1.017

abline(a = coef(ols5b)[1], b = coef(ols5b)["z5"], col = "blue")
abline(a = coef(ols5)[1] + mean(x5), b = coef(ols5)["z5"], col = "red")

plot of chunk unnamed-chunk-28

In that example, x5 and z5 were uncorrelated, so there was no bias from excluding one variable. Let's look at a situation where we find omitted variable bias due to correlation between input variables.

set.seed(1)
x6 <- rnorm(1000)
z6 <- x6 + rnorm(1000, 0, 1.5)
y6 <- x6 + z6 + rnorm(1000)

We can see from a plot and correlation and that our two input variables are correlated:

cor(x6, z6)

## [1] 0.5565

plot(x6, z6)

plot of chunk unnamed-chunk-30

Let's estimate some models:

ols6 <- lm(y6 ~ x6 + z6)
ols6a <- lm(y6 ~ x6)
ols6b <- lm(y6 ~ z6)

And then let's compare the bivariate estimates (blue) to the multivariate estimates (red):

layout(matrix(1:2, ncol = 2))
plot(y6 ~ x6, col = "gray")
abline(a = coef(ols6a)[1], b = coef(ols6a)["x6"], col = "blue")
abline(a = coef(ols6)[1] + mean(z6), b = coef(ols6)["x6"], col = "red")
plot(y6 ~ z6, col = "gray")
coef(lm(ols6a$residuals ~ z6))["z6"]

##     z6 
## 0.7004

abline(a = coef(ols6b)[1], b = coef(ols6b)["z6"], col = "blue")
abline(a = coef(ols6)[1] + mean(x6), b = coef(ols6)["z6"], col = "red")

plot of chunk unnamed-chunk-32

As we can see, the estimates from our bivariate models overestimate the impact of each input. We could of course see this in the raw coefficients, as well:

coef(ols6)

## (Intercept)          x6          z6 
##     0.01624     1.03437     1.01468

coef(ols6a)

## (Intercept)          x6 
##   -0.008398    2.058839

coef(ols6b)

## (Intercept)          z6 
##     0.01563     1.33198

These plots show, however, that omitted variable bias can be dangerous even when it seems our estimates are correct. The blue lines seem to fit the data, but those simple plots (and regressions) fail to account for correlations between inputs.

And the problem is that you can't predict omitted variable bias a priori. Let's repeat that last analysis but simply change the data generating process slightly:

set.seed(1)
x6 <- rnorm(1000)
z6 <- x6 + rnorm(1000, 0, 1.5)
y6 <- x6 - z6 + rnorm(1000)  #' this is the only differences from the previous example
cor(x6, z6)

## [1] 0.5565

ols6 <- lm(y6 ~ x6 + z6)
ols6a <- lm(y6 ~ x6)
ols6b <- lm(y6 ~ z6)
layout(matrix(1:2, ncol = 2))
plot(y6 ~ x6, col = "gray")
abline(a = coef(ols6a)[1], b = coef(ols6a)["x6"], col = "blue")
abline(a = coef(ols6)[1] + mean(z6), b = coef(ols6)["x6"], col = "red")
plot(y6 ~ z6, col = "gray")
coef(lm(ols6a$residuals ~ z6))["z6"]

##      z6 
## -0.6802

abline(a = coef(ols6b)[1], b = coef(ols6b)["z6"], col = "blue")
abline(a = coef(ols6)[1] + mean(x6), b = coef(ols6)["z6"], col = "red")

plot of chunk unnamed-chunk-34

The blue lines seem to fit the data, but they're biased estimates.