Heteroskedasticity-Consistent SEs for OLS

We often need to analyze data that fails to satisfy the assumptions of the statistical techniques we use. One common violation in OLS regression is of the assumption of homoskedasticity, which requires that the error term have constant variance across all values of the independent variable(s). When this assumption fails, the standard errors from our OLS regression estimates are inconsistent, but we can calculate heteroskedasticity-consistent standard errors relatively easily. Unlike in Stata, where these SEs are simply an option to the regular OLS regression command, in R they are not built into the base packages but instead come in an add-on package called sandwich, which we need to install and load:

install.packages("sandwich", repos = "http://cran.r-project.org")
## Warning: package 'sandwich' is in use and will not be installed
library(sandwich)

To see the sandwich package in action, let's generate some heteroskedastic data:

set.seed(1)
x <- runif(500, 0, 1)
# y has conditional mean 5 * x and conditional SD 5 * x, so the
# error variance grows with x: heteroskedasticity by construction
y <- 5 * rnorm(500, x, x)

A simple plot of y against x (and the associated regression line) will reveal any heteroskedasticity:

plot(y ~ x, col = "gray", pch = 19)
abline(lm(y ~ x), col = "blue")

[Figure: scatterplot of y against x with the fitted regression line; the spread of y widens as x increases]

Clearly, the variance of y and thus of the error term in an OLS model of y~x will increase as x increases.
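
If the plot leaves any doubt, a quick (and admittedly rough) numeric check is to compare the spread of the residuals in the lower and upper halves of x. This is just a sanity check, not a formal test:

res <- resid(lm(y ~ x))
# The residuals should be far more dispersed where x is large
sd(res[x < 0.5])
sd(res[x >= 0.5])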

Now let's run the OLS model and see the results:

ols <- lm(y ~ x)
s <- summary(ols)
s
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12.17  -1.27  -0.15   1.31   9.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.259      0.273    0.95     0.34    
## x              4.241      0.479    8.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.03 on 498 degrees of freedom
## Multiple R-squared:  0.136,  Adjusted R-squared:  0.134 
## F-statistic: 78.5 on 1 and 498 DF,  p-value: <2e-16

It may be particularly helpful to look just at the coefficient matrix from the summary object:

s$coef
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)   0.2592     0.2732  0.9488 3.432e-01
## x             4.2414     0.4786  8.8615 1.402e-17

The second column shows the SEs. These SEs are themselves generated from the variance-covariance matrix for the coefficients, which we can see with:

vcov(ols)
##             (Intercept)       x
## (Intercept)     0.07463 -0.1135
## x              -0.11355  0.2291
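
For reference, this classical estimate is just the squared residual standard error times the unscaled covariance matrix, i.e., sigma^2 * (X'X)^-1. The summary object stores both pieces, so we should be able to reproduce vcov(ols) by hand:

# Classical vcov formula: sigma^2 * (X'X)^{-1}
s$sigma^2 * s$cov.unscaled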

The variance estimates for the coefficients are on the diagonal:

diag(vcov(ols))
## (Intercept)           x 
##     0.07463     0.22909

To convert these to SEs, we simply take the square root:

sqrt(diag(vcov(ols)))
## (Intercept)           x 
##      0.2732      0.4786

Now that we know where the regular SEs come from, let's get the heteroskedasticity-consistent SEs for this model from sandwich. They come from the vcovHC function, and the resulting object is the variance-covariance matrix for the coefficients:

vcovHC(ols)
##             (Intercept)        x
## (Intercept)     0.03335 -0.08751
## x              -0.08751  0.29242

This is, again, a variance-covariance matrix for the coefficients, so to get the SEs we take the square root of the diagonal, as we did above:

sqrt(diag(vcovHC(ols)))
## (Intercept)           x 
##      0.1826      0.5408

We can then compare the SE estimates from the standard formula to those from the heteroskedasticity-consistent formula:

sqrt(diag(vcov(ols)))
## (Intercept)           x 
##      0.2732      0.4786
sqrt(diag(vcovHC(ols)))
## (Intercept)           x 
##      0.1826      0.5408
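
A side note: vcovHC actually implements a whole family of estimators, chosen via its type argument, and defaults to "HC3". If, say, we wanted to match the SEs produced by Stata's robust option, we would (as I understand it) ask for "HC1" instead:

sqrt(diag(vcovHC(ols, type = "HC0")))  # White's original estimator
sqrt(diag(vcovHC(ols, type = "HC1")))  # the variant Stata's , robust uses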

One annoying thing about not having the heteroskedasticity-consistent formula built in is that when we call summary on ols, it prints the default SEs rather than the ones we actually want. But remember: everything in R is an object, so we can overwrite the default SEs with the heteroskedasticity-consistent SEs quite easily. To do that, let's first look at the structure of our summary object s:

str(s)
## List of 11
##  $ call         : language lm(formula = y ~ x)
##  $ terms        :Classes 'terms', 'formula' length 3 y ~ x
##   .. ..- attr(*, "variables")= language list(y, x)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "y" "x"
##   .. .. .. ..$ : chr "x"
##   .. ..- attr(*, "term.labels")= chr "x"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: 0x000000001c3d67b0> 
##   .. ..- attr(*, "predvars")= language list(y, x)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "y" "x"
##  $ residuals    : Named num [1:500] 0.1231 0.7807 -0.0241 -0.6949 0.5952 ...
##   ..- attr(*, "names")= chr [1:500] "1" "2" "3" "4" ...
##  $ coefficients : num [1:2, 1:4] 0.259 4.241 0.273 0.479 0.949 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "(Intercept)" "x"
##   .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
##  $ aliased      : Named logi [1:2] FALSE FALSE
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "x"
##  $ sigma        : num 3.03
##  $ df           : int [1:3] 2 498 2
##  $ r.squared    : num 0.136
##  $ adj.r.squared: num 0.134
##  $ fstatistic   : Named num [1:3] 78.5 1 498
##   ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
##  $ cov.unscaled : num [1:2, 1:2] 0.00813 -0.01237 -0.01237 0.02497
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "(Intercept)" "x"
##   .. ..$ : chr [1:2] "(Intercept)" "x"
##  - attr(*, "class")= chr "summary.lm"

s is a list, one element of which is coefficients (which we saw above when we first ran our OLS model). The s$coefficients object is a matrix with four columns, the second of which contains the default standard errors. If we replace those with the heteroskedasticity-consistent SEs, then whenever we print s in the future, it will show the SEs we actually want. Let's see the effect by comparing the current output of s to the output after we replace the SEs:

s
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12.17  -1.27  -0.15   1.31   9.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.259      0.273    0.95     0.34    
## x              4.241      0.479    8.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.03 on 498 degrees of freedom
## Multiple R-squared:  0.136,  Adjusted R-squared:  0.134 
## F-statistic: 78.5 on 1 and 498 DF,  p-value: <2e-16
s$coefficients[, 2] <- sqrt(diag(vcovHC(ols)))
s
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12.17  -1.27  -0.15   1.31   9.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.259      0.183    0.95     0.34    
## x              4.241      0.541    8.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.03 on 498 degrees of freedom
## Multiple R-squared:  0.136,  Adjusted R-squared:  0.134 
## F-statistic: 78.5 on 1 and 498 DF,  p-value: <2e-16

The summary output now reflects the correct SEs. But remember: if we call summary(ols) again, we'll see the original SEs; we need to call our s object to see the updated version.
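
One caveat to this trick: we only replaced the second column, so the t values and p values printed above still reflect the original SEs (notice that 4.241/0.541 is about 7.8, not the printed 8.86). If we want the whole table to be internally consistent, we can recompute those columns from the new SEs too. A minimal sketch, using only base R and the objects we already have:

hcse <- sqrt(diag(vcovHC(ols)))
s$coefficients[, 2] <- hcse
# Recompute t values and (two-sided) p values from the new SEs
s$coefficients[, 3] <- s$coefficients[, 1] / hcse
s$coefficients[, 4] <- 2 * pt(abs(s$coefficients[, 3]), df = ols$df.residual,
                              lower.tail = FALSE)
s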