The correlation coefficient speaks to the degree to which the relationship between two variables can be summarized by a straight line. To see this, let's simulate some data:
set.seed(1)
n <- 1000
x1 <- rnorm(n, -1, 10)  # first predictor, with a large variance
x2 <- rnorm(n, 3, 2)    # second predictor, independent of x1
y <- 5 * x1 + x2 + rnorm(n, 1, 2)  # outcome driven by both, but dominated by x1
To obtain the correlation of two variables, we simply supply them as the two arguments to cor:
cor(x1, x2)
## [1] 0.006401
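Under the hood, the Pearson correlation is just the covariance rescaled by the product of the two standard deviations. As a quick check (a sketch using base R's cov and sd), we can reproduce cor by hand:
# Pearson correlation by definition: covariance divided by the
# product of the standard deviations; this matches cor(x1, x2)
cov(x1, x2) / (sd(x1) * sd(x2))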
If we want to test the significance of the correlation, we need to use the cor.test function:
cor.test(x1, x2)
##
## Pearson's product-moment correlation
##
## data: x1 and x2
## t = 0.2022, df = 998, p-value = 0.8398
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.05561 0.06837
## sample estimates:
## cor
## 0.006401
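cor.test returns an "htest" object, so if we need the pieces programmatically rather than reading the printout, we can store the result (in a variable named ct here, for illustration) and extract its components:
ct <- cor.test(x1, x2)
ct$estimate  # the sample correlation
ct$p.value   # the p-value
ct$conf.int  # the 95 percent confidence interval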
To obtain a correlation matrix, we have to supply an input matrix:
cor(cbind(x1, x2, y))
## x1 x2 y
## x1 1.000000 0.006401 0.99838
## x2 0.006401 1.000000 0.04731
## y 0.998376 0.047312 1.00000
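With more variables, the full-precision output quickly becomes hard to scan; one common convenience is to round the matrix:
# round the correlation matrix to two decimal places for readability
round(cor(cbind(x1, x2, y)), 2)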
Correlation, however, only measures linear association. Consider two new variables:
a <- rnorm(n)
b <- a^2 + rnorm(n)  # b is a quadratic function of a, plus noise
If we plot b against a, we see a strong (but non-linear) relationship:
plot(b ~ a)
Yet the correlation between the two variables is low:
cor(a, b)
## [1] -0.01712
cor.test(a, b)
##
## Pearson's product-moment correlation
##
## data: a and b
## t = -0.541, df = 998, p-value = 0.5886
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07903 0.04492
## sample estimates:
## cor
## -0.01712
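Note that a rank-based correlation does not rescue us here, either. Spearman's rho captures monotonic relationships, and because b falls and then rises as a increases, it should also be near zero (a sketch; output omitted):
# Spearman's rank correlation measures monotonic association only,
# and the relationship between a and b is non-monotonic
cor(a, b, method = "spearman")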
If we can identify the functional form of the relationship, however, we can model it properly. Clearly a linear fit is inappropriate:
plot(b ~ a, col = "gray")
curve((x), col = "red", add = TRUE)
But what about b ~ a^2 (of course, we know this is the correct form):
plot(b ~ a, col = "gray")
curve((x^2), col = "blue", add = TRUE)
The correlation between b and a^2 is thus much higher:
cor(a^2, b)
## [1] 0.843
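This ties correlation back to regression: in a simple linear regression, R-squared is exactly the squared correlation between predictor and outcome, so the following two quantities should agree (a sketch):
# the squared correlation equals R-squared from the simple regression
cor(a^2, b)^2
summary(lm(b ~ I(a^2)))$r.squared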
We can see this visually by plotting b against the transformed a variable:
plot(b ~ I(a^2), col = "gray")
If we now overlay a linear relationship, we see how well the transformed data are represented by a line:
plot(b ~ I(a^2), col = "gray")
curve((x), col = "blue", add = TRUE)
Now let's view the two plots side-by-side to see how the transformation works:
layout(matrix(1:2, nrow = 1))
plot(b ~ a, col = "gray")
curve((x^2), col = "blue", add = TRUE)
plot(b ~ I(a^2), col = "gray")
curve((x), col = "blue", add = TRUE)
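One housekeeping note: layout changes the graphics device globally, so once we are done with the side-by-side display we can restore the default single-panel arrangement:
# reset to a single plotting panel
layout(1)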
An approach that is sometimes used to examine the effects of variables involves “partial correlations.”
A partial correlation measures the strength of the linear relationship between two variables, controlling for the influence of one or more covariates.
For example, consider a new variable z built from x1. The correlation of y and z is:
z <- x1 + rnorm(n, 0, 2)
cor(y, z)
## [1] 0.9813
This correlation might be inflated or deflated due to the common antecedent variable x1 in both y and z.
Thus we may want to remove the variation due to x1 from both y and z via linear regression:
part1 <- lm(y ~ x1)
part2 <- lm(z ~ x1)
The correlation of the residuals of those two models is thus the partial correlation:
cor(residuals(part1), residuals(part2))
## [1] 0.03828
As we can see, the correlation between these variables is actually much lower once we account for the variation attributable to x1.
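For reference, the same first-order partial correlation can be computed directly from the three pairwise correlations, without fitting any models, via the textbook formula (r_yz - r_yx * r_zx) / sqrt((1 - r_yx^2) * (1 - r_zx^2)). A sketch:
# partial correlation of y and z, controlling for x1, computed from
# the pairwise correlations; this should agree with the
# residual-based value above
r_yz <- cor(y, z)
r_yx <- cor(y, x1)
r_zx <- cor(z, x1)
(r_yz - r_yx * r_zx) / sqrt((1 - r_yx^2) * (1 - r_zx^2))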