OLS as Regression on Means

One of the things that I find most difficult about regression is visualizing what is actually happening when a regression model is fit. One way to better understand that process is to recognize that a regression is simply a curve through the conditional mean values of an outcome at each value of one or more predictors. Thus, we can actually estimate (i.e., “figure out”) the regression line simply by determining the conditional mean of our outcome at each value of our input. This is easiest to see in a bivariate regression, so let's create some data and build the model:

set.seed(100)
x <- sample(0:50, 10000, TRUE)
# the data-generating process includes a squared term (just for fun):
y <- 2 + x + (x^2) + rnorm(10000, 0, 300)
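For reference, the true parameters in this simulation are an intercept of 2, coefficients of 1 on both x and x^2, and noise with a standard deviation of 300. A quick, purely optional sanity check of the noise level (nothing later depends on it):

sd(y - (2 + x + x^2))  # should come out roughly 300, per rnorm(10000, 0, 300)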

Now let's calculate the conditional means of x, x^2, and y at each value of x:

condmeans_x <- by(x, x, mean)    # mean of x at each value of x (trivially, x itself)
condmeans_x2 <- by(x^2, x, mean) # mean of x^2 at each value of x
condmeans_y <- by(y, x, mean)    # mean of y at each value of x
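Each of these objects holds one value per distinct value of x (51 values, for x = 0 through 50). If you want to peek at what by() produced before fitting anything, a small illustrative snippet (not part of the analysis itself) is:

# pair each value of x with the mean of y observed at that value
head(cbind(x = as.numeric(names(condmeans_y)),
           mean_y = as.numeric(condmeans_y)))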

If we run the regression on the original data (assuming we know the data-generating process), we'll get the following:

lm1 <- lm(y ~ x + I(x^2))
summary(lm1)
## 
## Call:
## lm(formula = y ~ x + I(x^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1229.5  -200.0    -0.7   200.6  1196.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.0643     8.7796    0.92     0.36    
## x            -0.2293     0.8087   -0.28     0.78    
## I(x^2)        1.0259     0.0156   65.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 300 on 9997 degrees of freedom
## Multiple R-squared:  0.871,  Adjusted R-squared:  0.871 
## F-statistic: 3.37e+04 on 2 and 9997 DF,  p-value: <2e-16

If we run the regression instead on just the conditional means (i.e., one value of y at each value of x), we will get the following:

lm2 <- lm(condmeans_y ~ condmeans_x + condmeans_x2)
summary(lm2)
## 
## Call:
## lm(formula = condmeans_y ~ condmeans_x + condmeans_x2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -64.57 -17.20   0.85  17.21  50.86 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.2157     9.7430    0.74     0.46    
## condmeans_x   -0.1806     0.9012   -0.20     0.84    
## condmeans_x2   1.0250     0.0174   58.80   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.1 on 48 degrees of freedom
## Multiple R-squared:  0.999,  Adjusted R-squared:  0.999 
## F-statistic: 2.65e+04 on 2 and 48 DF,  p-value: <2e-16
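A quick way to see how close the two fits are is to line the coefficients up side by side (a small check using only the objects created above; the exact numbers will vary with the simulated noise):

round(cbind(full_data = coef(lm1), cond_means = coef(lm2)), 3)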

The results from the two models look very similar. Aside from some minor variation, they provide identical substantive inference about the process at hand. We can see this if we plot the original data (in gray) and overlay it with the conditional means of y (in red):

plot(y ~ x, col = rgb(0, 0, 0, 0.05), pch = 16)
points(condmeans_y ~ condmeans_x, col = "red", pch = 15)

[Figure: scatterplot of y against x (gray points) with the conditional means of y overlaid in red]

We can add predicted output lines (one for each model) to this plot to see the similarity of the two models even more clearly. Indeed, the lines overlay each other perfectly:

plot(y ~ x, col = rgb(0, 0, 0, 0.05), pch = 16)
points(condmeans_y ~ condmeans_x, col = "red", pch = 15)
lines(0:50, predict(lm1, data.frame(x = 0:50)), col = "green", lwd = 2)
lines(0:50, predict(lm2, data.frame(condmeans_x = 0:50, condmeans_x2 = (0:50)^2)), col = "blue", lwd = 2)

[Figure: the same scatterplot with the fitted curves from lm1 (green) and lm2 (blue) overlaid on the conditional means]
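If you prefer a numeric check to a visual one, the two prediction curves can also be compared directly. This is just an illustration (the data frame newx is introduced here for convenience, and the exact gap depends on the simulated noise), but the largest difference between the curves should be small relative to the scale of y:

newx <- data.frame(x = 0:50, condmeans_x = 0:50, condmeans_x2 = (0:50)^2)
max(abs(predict(lm1, newx) - predict(lm2, newx)))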

So, if you ever struggle to think about what regression is doing, just remember that it is simply drawing a (potentially multidimensional) curve through the conditional means of the outcome at every value of the covariate(s).