Bivariate Regression

Regression on a binary covariate

The easiest way to understand bivariate regression is to see that, when the covariate is binary, it is equivalent to a two-sample t-test. Imagine we have a binary variable (like male/female or treatment/control):

set.seed(1)
bin <- rbinom(1000, 1, 0.5)

Then we have an outcome that is influenced by that group:

out <- 2 * bin + rnorm(1000)

We can use by to calculate the treatment group means:

by(out, bin, mean)
## bin: 0
## [1] -0.01588
## -------------------------------------------------------- 
## bin: 1
## [1] 1.966

This translates to a difference of:

diff(by(out, bin, mean))
## [1] 1.982

A two-sample t-test shows us whether there is a significant difference between the two groups:

t.test(out ~ bin)
## 
##  Welch Two Sample t-test
## 
## data:  out by bin
## t = -30.3, df = 992.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.111 -1.854
## sample estimates:
## mean in group 0 mean in group 1 
##        -0.01588         1.96624

If we run a linear regression, we find that this difference in means is exactly the regression slope:

lm(out ~ bin)
## 
## Call:
## lm(formula = out ~ bin)
## 
## Coefficients:
## (Intercept)          bin  
##     -0.0159       1.9821

And the t-statistic (and its significance) for the regression slope matches that from the t-test:

summary(lm(out ~ bin))$coef[2, ]
##   Estimate Std. Error    t value   Pr(>|t|) 
##  1.982e+00  6.544e-02  3.029e+01 1.949e-143
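
As an aside, the match is exact for the classical pooled-variance t-test, which makes the same equal-variance assumption as the regression; the Welch test shown above differs only trivially here because the two groups have similar variances. A quick check (a sketch, output not shown):

# the pooled-variance t statistic matches the regression t statistic up to sign
tt <- t.test(out ~ bin, var.equal = TRUE)
unname(abs(tt$statistic))                       # |t| from the pooled t-test
abs(summary(lm(out ~ bin))$coef[2, "t value"])  # |t| for the regression slope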

This is easy to see visually in a plot of the regression:

plot(out ~ bin, col = "gray")
points(0:1, by(out, bin, mean), col = "blue", bg = "blue", pch = 23)
abline(coef(lm(out ~ bin)), col = "blue")

[Plot: out against bin, with the two group means (blue diamonds) and the fitted regression line]

Regression on a continuous covariate

A regression involving a continuous covariate is similar, but rather than representing the difference in means between two groups (as it does when the covariate is binary), the regression line represents the conditional mean of the outcome at each value of the covariate. We can see this in some simple fake data:

set.seed(1)
x <- runif(1000, 0, 10)
y <- 3 * x + rnorm(1000, 0, 5)

Here, we'll cut our covariate into five levels and estimate the density of the outcome y in each of those levels:

x1 <- ifelse(x < 2, 1, ifelse(x >= 2 & x < 4, 2, ifelse(x >= 4 & x < 6, 3, ifelse(x >= 
    6 & x < 8, 4, ifelse(x >= 8 & x < 10, 5, NA)))))
d1 <- density(y[x1 == 1])
d2 <- density(y[x1 == 2])
d3 <- density(y[x1 == 3])
d4 <- density(y[x1 == 4])
d5 <- density(y[x1 == 5])
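
As an aside, the nested ifelse() above can be written more compactly with cut(). Note that cut() uses right-closed intervals by default, so the two versions could differ only for observations falling exactly on a bin boundary (a sketch; x1_alt is just an illustrative name):

# essentially equivalent binning via cut()
x1_alt <- cut(x, breaks = c(0, 2, 4, 6, 8, 10), labels = FALSE)
table(x1, x1_alt)  # should agree except possibly at exact boundary values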

We'll then use those values to show how the regression models the mean of y conditional on x. Let's start with the model:

m1 <- lm(y ~ x)

It is also worth highlighting that, in a bivariate regression model, the regression slope is simply a rescaled version of the correlation coefficient. We can see this by calculating the correlation between x and y and scaling it by the ratio of their standard deviations, sd(y)/sd(x) (computed below as sqrt(cov(y, y)/cov(x, x))). You'll see that this is exactly the slope coefficient reported by R:

cor(y, x)
## [1] 0.8593
slope <- cor(y, x) * sqrt(cov(y, y)/cov(x, x))  # manually calculate the coefficient as the rescaled correlation
coef(m1)[2]  # coefficient on x
##     x 
## 3.011
slope
## [1] 3.011
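
Equivalently, the bivariate OLS slope is the covariance of x and y divided by the variance of x, which gives the same number:

# standard identity: slope = cov(x, y) / var(x)
cov(x, y)/var(x)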

But let's plot the data to get a better understanding of what it looks like:

plot(x, y, col = "gray")
# add the regression equation:
abline(coef(m1), col = "blue")
# add the conditional densities:
abline(v = c(1, 3, 5, 7, 9), col = "gray", lty = 2)
points(1 + d1$y * 10, d1$x, type = "l", col = "black")
points(3 + d2$y * 10, d2$x, type = "l", col = "black")
points(5 + d3$y * 10, d3$x, type = "l", col = "black")
points(7 + d4$y * 10, d4$x, type = "l", col = "black")
points(9 + d5$y * 10, d5$x, type = "l", col = "black")
# add points representing conditional means:
points(1, mean(y[x1 == 1]), col = "red", pch = 15)
points(3, mean(y[x1 == 2]), col = "red", pch = 15)
points(5, mean(y[x1 == 3]), col = "red", pch = 15)
points(7, mean(y[x1 == 4]), col = "red", pch = 15)
points(9, mean(y[x1 == 5]), col = "red", pch = 15)

[Plot: y against x, showing the fitted regression line, the conditional densities of y at x = 1, 3, 5, 7, 9, and the corresponding conditional means]

As is clear, the regression line travels through the conditional means of y at each level of x. We can also see in the densities that y is approximately normally distributed at each value of x (because we made our data that way). These data thus nicely satisfy the assumptions for linear regression.
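
We can also check this numerically by predicting from the model at the bin midpoints and comparing those predictions to the binned conditional means; they should agree up to sampling noise within each bin (a sketch, output not shown):

# fitted values at the bin midpoints versus the binned conditional means
predict(m1, newdata = data.frame(x = c(1, 3, 5, 7, 9)))
sapply(1:5, function(i) mean(y[x1 == i]))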

Of course, real data rarely satisfy those assumptions so nicely. We can modify our fake data to have less desirable properties and see how that affects our inference. Let's put a discontinuity in our y values by simply increasing y by 10 for all values of x greater than 6:

y2 <- y
y2[x > 6] <- y[x > 6] + 10

We can build a new model for these data:

m2 <- lm(y2 ~ x)

Let's estimate the conditional densities, as we did above, but for the new data:

e1 <- density(y2[x1 == 1])
e2 <- density(y2[x1 == 2])
e3 <- density(y2[x1 == 3])
e4 <- density(y2[x1 == 4])
e5 <- density(y2[x1 == 5])

And then let's look at how that model fits the new data:

plot(x, y2, col = "gray")
# add the regression equation:
abline(coef(m2), col = "blue")
# add the conditional densities:
abline(v = c(1, 3, 5, 7, 9), col = "gray", lty = 2)
points(1 + e1$y * 10, e1$x, type = "l", col = "black")
points(3 + e2$y * 10, e2$x, type = "l", col = "black")
points(5 + e3$y * 10, e3$x, type = "l", col = "black")
points(7 + e4$y * 10, e4$x, type = "l", col = "black")
points(9 + e5$y * 10, e5$x, type = "l", col = "black")
# add points representing conditional means:
points(1, mean(y2[x1 == 1]), col = "red", pch = 15)
points(3, mean(y2[x1 == 2]), col = "red", pch = 15)
points(5, mean(y2[x1 == 3]), col = "red", pch = 15)
points(7, mean(y2[x1 == 4]), col = "red", pch = 15)
points(9, mean(y2[x1 == 5]), col = "red", pch = 15)

[Plot: y2 against x, showing the single fitted regression line, the conditional densities, and the conditional means]

As should be clear in the plot, the line no longer goes through the conditional means (see, especially, the third density curve) because the outcome y2 is not a linear function of x. To obtain a better fit, we can estimate two separate lines, one on each side of the discontinuity:

m3a <- lm(y2[x <= 6] ~ x[x <= 6])
m3b <- lm(y2[x > 6] ~ x[x > 6])
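
Equivalently, we could fit both lines in a single model by including an indicator for the discontinuity and its interaction with x. This is a sketch (m3c is just an illustrative name):

# one-model alternative: the intercept and slope are allowed to change at x = 6
m3c <- lm(y2 ~ x * I(x > 6))
# the first two coefficients describe the x <= 6 line; adding the third to the
# intercept and the fourth to the slope gives the x > 6 line
coef(m3c)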

Now let's redraw the data, drawing the fitted line for x <= 6 in red and the fitted line for x > 6 in blue:

plot(x, y2, col = "gray")
segments(0, coef(m3a)[1], 6, coef(m3a)[1] + 6 * coef(m3a)[2], col = "red")
segments(6, coef(m3b)[1] + (6 * coef(m3b)[2]), 10, coef(m3b)[1] + 10 * coef(m3b)[2], 
    col = "blue")
# redraw the densities:
abline(v = c(1, 3, 5, 7, 9), col = "gray", lty = 2)
points(1 + e1$y * 10, e1$x, type = "l", col = "black")
points(3 + e2$y * 10, e2$x, type = "l", col = "black")
points(5 + e3$y * 10, e3$x, type = "l", col = "black")
points(7 + e4$y * 10, e4$x, type = "l", col = "black")
points(9 + e5$y * 10, e5$x, type = "l", col = "black")
# redraw points representing conditional means:
points(1, mean(y2[x1 == 1]), col = "red", pch = 15)
points(3, mean(y2[x1 == 2]), col = "red", pch = 15)
points(5, mean(y2[x1 == 3]), col = "red", pch = 15)
points(7, mean(y2[x1 == 4]), col = "blue", pch = 15)
points(9, mean(y2[x1 == 5]), col = "blue", pch = 15)

[Plot: y2 against x, showing separate fitted lines for x <= 6 (red) and x > 6 (blue), the conditional densities, and the conditional means]

Our two new models, m3a and m3b, fit the data better because they satisfy the requirement that the regression line travel through the conditional means of the outcome. Thus, regardless of the form of our covariate(s), a regression model only provides valid inference if the regression line travels through the conditional mean of the outcome at every value of the covariate.

Regression on a discrete covariate

Binary and continuous covariates are easy to model, but we often have data that are neither binary nor continuous but instead categorical. Building regression models with these kinds of variables gives us many options to consider. In particular, developing a model that passes through the conditional means of the outcome can be more complicated because the relationship between the outcome and a categorical covariate (if treated as continuous) is unlikely to be linear. We then have to decide how best to model the data. Let's start with some fake data to illustrate this:

a <- sample(1:5, 500, TRUE)
b <- numeric(length = 500)
b[a == 1] <- a[a == 1] + rnorm(sum(a == 1))
b[a == 2] <- 2 * a[a == 2] + rnorm(sum(a == 2))
b[a == 3] <- 2 * a[a == 3] + rnorm(sum(a == 3))
b[a == 4] <- 0.5 * a[a == 4] + rnorm(sum(a == 4))
b[a == 5] <- 2 * a[a == 5] + rnorm(sum(a == 5))

Let's treat a as a continuous covariate, assume the a-b relationship is linear, and build the corresponding linear regression model:

n1 <- lm(b ~ a)

We can see the relationship in the data by plotting b as a function of a:

plot(a, b, col = "gray")
abline(coef(n1), col = "blue")
# draw points representing conditional means:
points(1, mean(b[a == 1]), col = "red", pch = 15)
points(2, mean(b[a == 2]), col = "red", pch = 15)
points(3, mean(b[a == 3]), col = "red", pch = 15)
points(4, mean(b[a == 4]), col = "red", pch = 15)
points(5, mean(b[a == 5]), col = "red", pch = 15)

[Plot: b against a, showing the fitted regression line and the conditional means of b at each value of a]

Clearly, the regression line misses the conditional mean values of b at all values of a. Our model is therefore not very good. To correct for this, we can either (1) attempt to transform our variables to force a straight-line relationship (which probably isn't possible in this case, but might be if the relationship were curvilinear) or (2) convert the a covariate to a factor and thus model the relationship as a series of indicator (or “dummy”) variables.

Discrete covariate as factor

When we treat a discrete covariate as a factor, R automatically transforms the variable into a series of indicator variables during the estimation of the regression. Let's compare our original model to this new model:
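
We can peek at the indicator coding R constructs behind the scenes by looking at the design matrix (a quick sketch, output not shown):

# the design matrix has an intercept plus one dummy column per non-baseline level
head(model.matrix(~ factor(a)))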

# our original model (treating `a` as continuous):
summary(n1)
## 
## Call:
## lm(formula = b ~ a)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.392 -0.836  0.683  1.697  4.605 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.4679     0.2569   -1.82    0.069 .  
## a             1.7095     0.0759   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.48 on 498 degrees of freedom
## Multiple R-squared:  0.505,  Adjusted R-squared:  0.504 
## F-statistic:  507 on 1 and 498 DF,  p-value: <2e-16
# our new model:
n2 <- lm(b ~ factor(a))
summary(n2)
## 
## Call:
## lm(formula = b ~ factor(a))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9332 -0.6578 -0.0772  0.7627  2.9459 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.063      0.104   10.21  < 2e-16 ***
## factor(a)2     2.772      0.154   18.03  < 2e-16 ***
## factor(a)3     4.881      0.150   32.47  < 2e-16 ***
## factor(a)4     0.848      0.152    5.56  4.4e-08 ***
## factor(a)5     8.946      0.143   62.35  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.07 on 495 degrees of freedom
## Multiple R-squared:  0.909,  Adjusted R-squared:  0.908 
## F-statistic: 1.23e+03 on 4 and 495 DF,  p-value: <2e-16

Obviously, the regression output is quite different for the two models. For n1, we see the slope of the line we drew in the plot above. For n2, we instead see an intercept (the mean of b when a==1) and coefficients giving the difference in b between each other level of a and that baseline (i.e., dummy coefficients). R defaults to taking the lowest factor level as the baseline, but we can change this by reordering the levels of the factor:
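
To recover a conditional mean from this parameterization, we simply add the relevant dummy coefficient to the intercept (a sketch, output not shown):

# the mean of b when a == 2, reconstructed from the dummy parameterization
unname(coef(n2)[1] + coef(n2)["factor(a)2"])
mean(b[a == 2])  # should match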

# a==5 as baseline:
summary(lm(b ~ factor(a, levels = 5:1)))
## 
## Call:
## lm(formula = b ~ factor(a, levels = 5:1))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9332 -0.6578 -0.0772  0.7627  2.9459 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               10.0087     0.0987   101.4   <2e-16 ***
## factor(a, levels = 5:1)4  -8.0978     0.1487   -54.5   <2e-16 ***
## factor(a, levels = 5:1)3  -4.0642     0.1466   -27.7   <2e-16 ***
## factor(a, levels = 5:1)2  -6.1733     0.1501   -41.1   <2e-16 ***
## factor(a, levels = 5:1)1  -8.9456     0.1435   -62.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.07 on 495 degrees of freedom
## Multiple R-squared:  0.909,  Adjusted R-squared:  0.908 
## F-statistic: 1.23e+03 on 4 and 495 DF,  p-value: <2e-16
# a==4 as baseline:
summary(lm(b ~ factor(a, levels = c(4, 1, 2, 3, 5))))
## 
## Call:
## lm(formula = b ~ factor(a, levels = c(4, 1, 2, 3, 5)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9332 -0.6578 -0.0772  0.7627  2.9459 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              1.911      0.111   17.17  < 2e-16
## factor(a, levels = c(4, 1, 2, 3, 5))1   -0.848      0.152   -5.56  4.4e-08
## factor(a, levels = c(4, 1, 2, 3, 5))2    1.925      0.159   12.13  < 2e-16
## factor(a, levels = c(4, 1, 2, 3, 5))3    4.034      0.155   25.97  < 2e-16
## factor(a, levels = c(4, 1, 2, 3, 5))5    8.098      0.149   54.45  < 2e-16
##                                          
## (Intercept)                           ***
## factor(a, levels = c(4, 1, 2, 3, 5))1 ***
## factor(a, levels = c(4, 1, 2, 3, 5))2 ***
## factor(a, levels = c(4, 1, 2, 3, 5))3 ***
## factor(a, levels = c(4, 1, 2, 3, 5))5 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.07 on 495 degrees of freedom
## Multiple R-squared:  0.909,  Adjusted R-squared:  0.908 
## F-statistic: 1.23e+03 on 4 and 495 DF,  p-value: <2e-16
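
An equivalent, and arguably tidier, way to change the baseline is relevel(), which reorders the factor levels so that the chosen level comes first (a sketch, output not shown):

# the same model with a == 4 as the baseline, via relevel()
summary(lm(b ~ relevel(factor(a), ref = "4")))$coef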

Another approach is to model the regression without an intercept:

# a==1 as baseline with no intercept:
n3 <- lm(b ~ 0 + factor(a))
summary(n3)
## 
## Call:
## lm(formula = b ~ 0 + factor(a))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9332 -0.6578 -0.0772  0.7627  2.9459 
## 
## Coefficients:
##            Estimate Std. Error t value Pr(>|t|)    
## factor(a)1   1.0631     0.1042    10.2   <2e-16 ***
## factor(a)2   3.8354     0.1131    33.9   <2e-16 ***
## factor(a)3   5.9445     0.1084    54.9   <2e-16 ***
## factor(a)4   1.9109     0.1113    17.2   <2e-16 ***
## factor(a)5  10.0087     0.0987   101.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.07 on 495 degrees of freedom
## Multiple R-squared:  0.968,  Adjusted R-squared:  0.967 
## F-statistic: 2.97e+03 on 5 and 495 DF,  p-value: <2e-16

In this model, the coefficients are exactly the conditional means of b at each value of a:

coef(n3)  # coefficients
## factor(a)1 factor(a)2 factor(a)3 factor(a)4 factor(a)5 
##      1.063      3.835      5.944      1.911     10.009
sapply(1:5, function(x) mean(b[a == x]))  # conditional means
## [1]  1.063  3.835  5.944  1.911 10.009
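
The same conditional means can also be computed in one line with tapply():

# conditional means of b at each level of a
tapply(b, a, mean)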

All of these parameterizations produce the same substantive inference, but one or another may simplify interpretation in a particular situation.
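
One way to see that these are merely different parameterizations of the same model is that their fitted values are identical (a sketch, output not shown):

# the dummy models with and without an intercept give the same fitted values
all.equal(unname(fitted(n2)), unname(fitted(n3)))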

Variable transformations

Sometimes, rather than forcing the categorical variable to be a set of indicators through the use of factor, we can treat the covariate as continuous once we transform it or the outcome in some way. Let's start with some fake data (based on our previous example):

c <- a^3
d <- 2 * a + rnorm(length(a))

These data have a curvilinear relationship that is not well represented by a linear regression line:

plot(c, d, col = "gray")
# add points representing conditional means:
for (i in unique(c)) points(i, mean(d[c == i]), col = "red", pch = 15)
abline(lm(d ~ c))

[Plot: d against c, showing the conditional means of d and the fitted regression line]

As before, we can model this by treating the covariate c as a factor, and we find that the model gives us the conditional means of d:

coef(lm(d ~ 0 + factor(c)))  # coefficients
##   factor(c)1   factor(c)8  factor(c)27  factor(c)64 factor(c)125 
##        1.906        3.988        6.078        7.917       10.188
sapply(sort(unique(c)), function(x) mean(d[c == x]))  # conditional means
## [1]  1.906  3.988  6.078  7.917 10.188

We can also obtain the same substantive inference by transforming the variable(s) to produce a linear fit. In this case, we know (because we made up the data) that d is linear in the cube root of c. If we make a new version of the covariate, c2, that is the cube root of c, we should be able to force a linear fit:

c2 <- c^(1/3)
plot(c2, d, col = "gray")
# add points representing conditional means:
for (i in unique(c2)) points(i, mean(d[c2 == i]), col = "red", pch = 15)
abline(lm(d ~ c2))

[Plot: d against c2 (the cube root of c), showing the conditional means and the fitted regression line]
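
Note that, rather than creating c2 as a separate variable, the same transformation can be written directly inside the model formula using I() (a sketch, output not shown):

# cube-root transformation applied inside the formula
coef(lm(d ~ I(c^(1/3))))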

We could instead transform the outcome d by cubing it:

d2 <- d^3
plot(c, d2, col = "gray")
# add points representing conditional means:
for (i in unique(c)) points(i, mean(d2[c == i]), col = "red", pch = 15)
abline(lm(d2 ~ c))

[Plot: d2 (d cubed) against c, showing the conditional means and the fitted regression line]

Again, the plot shows that this transformation also produces an approximately linear fit. Thus we can reasonably model the relationship between a discrete covariate and a continuous outcome in a number of ways that satisfy the basic requirement that the regression line pass through the conditional means of the outcome.